Introduction

This document provides an overview and some exploratory plots of the Brazil Live Births “Sistema de Informações sobre Nascidos Vivos” (SINASC) data and the Brazil Census data, both available from Datasus. According to the website, SINASC “was officially deployed in 1990 with the objective of collecting data on births throughout the country and providing birth data for all levels of the Health System.”

The SINASC data is publicly available to download from the Datasus FTP site. For the results shown in this document, data was downloaded for births from 2001 to 2015, resulting in 44.5 million birth records. The final section of this document goes into more detail about how the data was processed and the variables that are available in the data.

Given the quality, completeness, and size of this data, there are many research questions that could potentially be studied using it. Ultimately, we will be trying to understand the risk factors for low birth weight, but this document mostly contains introductory exploratory visualizations to help illustrate what is in the data.

SINASC Exploration

In this document, we will look at some visualizations of the variables in the data, ranging from summaries at the population level to the state level and the municipality level.

Population Summaries

Yearly Births

The annual number of births is around 3 million, having dipped to under 2.9 million before rebounding in 2014.

Daily Births

This plot shows the number of daily births broken down by birth month and colored according to whether the birth was on a weekend or weekday. Breaking things up in this way helps highlight some interesting phenomena.

First, we can see that there is a clear annual cycle for births, with the most popular months being March, April, and May. Also, we see that there are significantly more births on weekdays vs. weekends.

How can the number of births vary so much over different temporal breakdowns? The weekday/weekend phenomenon is easier to try to explain, and this has to do with the fact that a large proportion of births in Brazil are scheduled cesareans (as we’ll see below), and doctors will most likely schedule these for weekdays rather than weekends.
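The weekday/weekend breakdown described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the R code used to produce the plot; the function name and the toy input dates are hypothetical.

```python
from collections import Counter
from datetime import date


def daily_birth_counts(birth_dates):
    """Count births per calendar day and tag each day as weekend or weekday."""
    counts = Counter(birth_dates)
    rows = []
    for day, n_births in sorted(counts.items()):
        # date.weekday(): Monday is 0, Saturday is 5, Sunday is 6
        day_type = "weekend" if day.weekday() >= 5 else "weekday"
        rows.append(
            {"date": day, "month": day.month, "n_births": n_births, "day_type": day_type}
        )
    return rows


# Hypothetical toy input: one birth date per record (not real SINASC data)
births = [date(2015, 3, 6), date(2015, 3, 6), date(2015, 3, 7), date(2015, 3, 9)]
rows = daily_birth_counts(births)
```

Grouping the resulting rows by month and day type gives the quantities that a plot like the one above displays.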

Yearly Births by Delivery Type

The only two categories for delivery type are vaginal and cesarean. As can be seen, cesarean births have become dramatically more prevalent than vaginal births, with the crossover point in 2009.

Another way to look at this is to present the two delivery types as percentages of births over time.

Birth Weight vs. Year

This plot and all of the following summary plots examine the relationship between birth weight and other variables in the data. The dots represent the median birth weight value and the lines extend from the dots to the first and third quartiles, such that the span of the line represents the middle 50% of observations in the data to give an indication of variability.
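The median and quartile summary behind these plots could be computed along the following lines. This is a sketch in Python rather than the original R code; the function name and toy records are hypothetical, and the "inclusive" quartile convention is an assumption that may differ slightly from the one used for the plots.

```python
from collections import defaultdict
from statistics import median, quantiles


def summarize_by_group(records, group_key, value_key="brthwt_g"):
    """Return first quartile, median, and third quartile of a value per group."""
    groups = defaultdict(list)
    for rec in records:
        value = rec.get(value_key)
        if value is not None:  # skip missing birth weights
            groups[rec[group_key]].append(value)
    summary = {}
    for level, values in groups.items():
        # Assumption: "inclusive" interpolation; needs at least two values per group
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        summary[level] = {"q1": q1, "median": median(values), "q3": q3}
    return summary


# Hypothetical toy records (not real SINASC data)
toy_records = [
    {"deliv_type": "Vaginal", "brthwt_g": w} for w in (2500, 3000, 3200, 3400, 3900)
] + [{"deliv_type": "Cesarean", "brthwt_g": w} for w in (3000, 3100, None, 3500)]
summary = summarize_by_group(toy_records, "deliv_type")
```

Here the dot in each plot corresponds to "median" and the line runs from "q1" to "q3".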

This plot shows, as aggregated across the entire country, that the distribution of birth weight has not changed over time.

Birth Weight vs. Gestational Age at Birth

As we would expect, pre-term birth is strongly associated with lower birth weight. The horizontal dashed line at 2500 grams indicates the threshold for low birth weight.

Birth Weight vs. Pregnancy Type

Carrying more than one child has an obvious association with birth weight.

Birth Weight vs. Delivery Type

Birth Weight vs. Sex

Birth Weight vs. Race

Birth Weight vs. Mother’s Education

Birth Weight vs. Marital Status

Birth Weight vs. Mother’s Age

There is an interesting association between the mother’s age and birth weight. This plot is also interesting in that it shows that the span of a childbearing mother’s age is quite large.

Birth Weight vs. Number of Prenatal Visits

Although the association is not extremely strong, there is some indication that a higher number of prenatal visits is associated with higher birth weights.

By State

The previous plots looked at the data as a whole population. It is interesting to look on a geographical level to see how the variables vary across geography and time. In this section we will look at the geographical granularity of the 27 states of Brazil.

Most of the plots in this section show the states represented as a regular grid, with the grid laid out so as to mirror the geography as closely as possible.

Percent Change in Births Over Time

This plot shows the percentage change in number of births over time and by state, using the number of births in 2001 as the baseline for the percent change calculation. For example, in Alagoas, there were 20% fewer births in 2015 than in 2001, whereas in Espírito Santo, the number of births in 2015 was unchanged from 2001, although it dipped in the intervening years.

Each panel of the display is colored according to the 2015 percent change value to help draw attention to states that have had a net increase or decrease in number of births. This plot does not take population into account.
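The percent change calculation is simple enough to state as code. The sketch below is in Python rather than the original R, and the function name and yearly counts are made up for illustration.

```python
def percent_change_from_baseline(births_by_year, baseline_year=2001):
    """Percent change in births for each year relative to the baseline year."""
    base = births_by_year[baseline_year]
    return {
        year: 100.0 * (n_births - base) / base
        for year, n_births in births_by_year.items()
    }


# Hypothetical yearly counts for one state (not real SINASC data)
toy_counts = {2001: 50000, 2008: 45000, 2015: 40000}
change = percent_change_from_baseline(toy_counts)
```

With these toy counts, the 2015 value is -20, which corresponds to "20% fewer births in 2015 than in 2001".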

Percentage of Children Born With Low Birth Weight

This plot shows the yearly percentage of children born with low birth weight for each state. While we saw earlier that the overall distribution of birth weight for the entire population does not change over time, when looking at the percentage of low birth weight children over time by state, we see that this statistic is not necessarily constant.

The percentage of low birth weight children ranges from 5% to 10%. The panels of the display are colored based on the 2015 low birth weight percentage to help draw attention to regions where this percentage is higher or lower than in others. In general, the states in the southern half of the country have a higher prevalence of low birth weight than those in the northern half.

Some states are showing an increasing rate of low birth weight births. For example, Ceará saw the largest 2001-2015 increase of 1.6%.
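The low birth weight percentage by state and year can be sketched as below. This is an illustration in Python, not the original R code. It assumes that "low" means strictly less than 2500 grams, and it uses the m_state_code, birth_year, and brthwt_g variable names from the data preparation section; the toy records are invented.

```python
from collections import defaultdict

LOW_BIRTH_WEIGHT_G = 2500  # threshold in grams; strict "<" is an assumption


def pct_low_birth_weight(records):
    """Percent of births under the threshold for each (state, year) pair."""
    totals = defaultdict(int)
    low = defaultdict(int)
    for rec in records:
        weight = rec["brthwt_g"]
        if weight is None:
            continue  # records with missing weight are excluded from the denominator
        key = (rec["m_state_code"], rec["birth_year"])
        totals[key] += 1
        if weight < LOW_BIRTH_WEIGHT_G:
            low[key] += 1
    return {key: 100.0 * low[key] / totals[key] for key in totals}


# Hypothetical toy records for one state and year (not real SINASC data)
toy_records = [
    {"m_state_code": 23, "birth_year": 2015, "brthwt_g": w}
    for w in (2400, 3100, 3300, None, 2900)
]
pct = pct_low_birth_weight(toy_records)
```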

Distribution of Delivery Type

This plot shows the breakdown of delivery type as a percentage of births each year by state. From this plot it is clear that cesarean birth is increasing in all states, but it is more prevalent in some states than others. For example, many states in the northwest still have vaginal births as the majority.

Distribution of Mother’s Education

This plot shows the education of mothers giving birth has been improving over time across all states.

Distribution of Race

This plot shows the distribution of race of the children born over time and by state. From this we see that across nearly all states, the multiracial share of the population is growing relative to the other groups. We also see that the children born in the southern states are predominantly white, and that only a few states (in the northwest) have a significant population of indigenous children being born.

Distribution of Marital Status

From this we can observe that births to single mothers are more common than births to mothers of any other marital status, and are on the rise across all states. In a couple of states, such as Minas Gerais and Espírito Santo, the proportions of single and married mothers are nearly equal, while in most other states single mothers are more prevalent.

Distribution of Prenatal Visits

This plot shows mothers in the southern states are typically having more prenatal visits than those in the north.

Distribution of Binned Mother’s Age

Here we have binned the mother’s age into teens, twenties, thirties, and forties+. We see the general trend of mothers having babies at a later age over time, with some states such as São Paulo and the Federal District having nearly half of births coming from women aged 30 or older in 2015.
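The age binning can be illustrated with a short Python sketch (the original analysis was done in R). The exact cut points are not given in the text, so the boundaries used here (under 20, 20-29, 30-39, 40 and over) are an assumption, and the function names and toy ages are hypothetical.

```python
from collections import Counter


def bin_mother_age(age_yrs):
    """Map an age in years to a coarse bin; cut points are assumed, not from the source."""
    if age_yrs is None:
        return None
    if age_yrs < 20:
        return "teens"
    if age_yrs < 30:
        return "twenties"
    if age_yrs < 40:
        return "thirties"
    return "forties+"


def age_bin_shares(ages):
    """Share of births falling in each age bin, ignoring missing ages."""
    bins = Counter(bin_mother_age(a) for a in ages if a is not None)
    total = sum(bins.values())
    return {label: count / total for label, count in bins.items()}


# Hypothetical toy ages (not real SINASC data)
shares = age_bin_shares([17, 22, 25, 31, 38, 41, None, 29])
```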

Median Mother’s Age

This shows the median yearly mother’s age for each state. Clearly mother’s age has been increasing over time everywhere, but many of the northern states have a lower median mother’s age.

By Municipality

There are over 5500 municipalities, so there are many possible ways to look at the data at the municipality level.

Mean Birth Weight by Municipality

To start, below is a plot of the mean birth weight for each municipality plotted by rank within each region of Brazil.

For reference, below is a map with these regions indicated in the same colors.

Choropleth of Mean Birth Weight

Below is a screenshot of an interactive choropleth map that shows the average birth weight for each municipality.

The interactive map is large and takes a while to load, which is the reason for showing a screenshot here. To interact with the map, follow this link. There, you will be able to pan and zoom and hover to get more information about each municipality.

More to come soon.

Census Exploration

There are several census datasets (see census data preparation section below), but we will focus on the “RENDABR” dataset, which provides data for the average household income per capita at the municipality level.

Variables in this dataset include:

  • muni_code: Municipality code
  • year: Year of the data
  • race: color / race
  • house_inc: sum of the average household income (numerator)
  • pop: population considered (denominator)
  • n_child: number of children considered
  • pop_2mw: population with average household income per capita less than 1/2 minimum wage
  • pop_4mw: population with average household income per capita less than 1/4 minimum wage
  • n_child_2mw: children in a situation of average household income per capita less than 1/2 minimum wage
  • n_child_4mw: children in a situation of average household income per capita less than 1/4 minimum wage
  • pop_16unemp: resident economically active population aged 16 and over who are unemployed
  • pop_16: resident economically active population aged 16 and over
  • pop_10work: resident population with 10 to 15 years of age who is working or looking for work
  • pop_10: total resident population with 10 to 15 years of age

The census data is recorded for 1991, 2000, and 2010. For each municipality and race, numbers are provided that can be used to construct statistics (means and proportions) at different levels of aggregation.
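Because the dataset stores numerators and denominators, statistics at a coarser level are built by summing both before dividing, rather than by averaging municipality-level averages. A minimal Python sketch of this idea follows (the original analysis was done in R); the function name and the toy rows are hypothetical, while house_inc, pop, and pop_4mw are the variables listed above.

```python
def aggregate_income(rows):
    """Combine municipality/race rows into one mean income and one poverty proportion."""
    total_income = sum(r["house_inc"] for r in rows)
    total_pop = sum(r["pop"] for r in rows)
    total_pop_4mw = sum(r["pop_4mw"] for r in rows)
    return {
        # ratio of sums, which weights each row by its population
        "mean_income": total_income / total_pop,
        "prop_below_quarter_mw": total_pop_4mw / total_pop,
    }


# Hypothetical toy rows for two municipalities in the same state (not real census data)
toy_rows = [
    {"house_inc": 100000.0, "pop": 200, "pop_4mw": 50},
    {"house_inc": 300000.0, "pop": 300, "pop_4mw": 25},
]
state_summary = aggregate_income(toy_rows)
```

In this toy example the population-weighted mean is 800, whereas a naive average of the two municipality means (500 and 1000) would give 750.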

Population Summaries

2010 distribution of income less than 1/4 minimum wage by race

2010 distribution of average monthly household income by race

There are many other variables and other census datasets to add visualizations for here as well.

By State

2010 average income by state

Here we see that the southern states are the wealthiest, and the northern states mostly have average incomes of less than R$ 500. The federal district really stands out, having an average income that is almost double that of most other states.

Note that since only sums and counts are reported, we do not have enough information to compute standard deviations when we do these aggregations.

2010 proportion of households with income less than 1/4 minimum wage by state

It’s also interesting to look at the tails of the income distribution by looking at the proportion of households with income less than 1/4 minimum wage. Here we see there are some states with over 1/3 of the population in this situation.

Income over time by state

This plot shows geographically the trend of the proportion of households living in poverty (here, income less than 1/4 minimum wage) across the three census years. Even though there is a lot of geographic income disparity, it is encouraging to see that this proportion has improved significantly in every state. The Northeast generally shows the most dramatic improvement.

Income over time by state and race

By Municipality

Average Income

Above are boxplots of the municipality-level average household income for each state. The x-axis is on a log base 10 scale. This helps us see the variability of municipality income within each state and across all municipalities, and we see a large span. Another observation from this plot is that there are wealthy municipalities with the same average income as the federal district, so the federal district does not seem like as much of an outlier when considering other wealthy pockets of geography.

Birth Weight vs Income

One ultimate goal of looking at the SINASC and census data is to join them together to explore the relationship between birth weight and income. Ideally we would be able to have individual-level income data and link this to the SINASC data. Since we don’t have that, the best we can do is summarize birth weight at the municipality level and merge that with the municipality-level census income data.

We have census data observed at 2000 and 2010. The SINASC data runs from 2001 to 2015. To join the data, we decided to match the 2000 census data with the 2001 SINASC data, and in a similar fashion match the 2010 census data to the 2011 SINASC data.
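The year matching described above amounts to a join on municipality code with a fixed census-year to SINASC-year mapping. The Python sketch below illustrates this (the original analysis was done in R); the function name, the mean_income and mean_brthwt_g field names, and all values are hypothetical.

```python
# Census year -> SINASC birth year it is matched with, as described in the text
CENSUS_TO_SINASC_YEAR = {2000: 2001, 2010: 2011}


def join_birth_weight_income(census_rows, sinasc_rows):
    """Join municipality-level income to municipality-level birth weight summaries."""
    sinasc_index = {(r["muni_code"], r["birth_year"]): r for r in sinasc_rows}
    joined = []
    for c in census_rows:
        sinasc_year = CENSUS_TO_SINASC_YEAR.get(c["year"])  # None for unmatched years
        match = sinasc_index.get((c["muni_code"], sinasc_year))
        if match is not None:
            joined.append(
                {
                    "muni_code": c["muni_code"],
                    "census_year": c["year"],
                    "mean_income": c["mean_income"],
                    "mean_brthwt_g": match["mean_brthwt_g"],
                }
            )
    return joined


# Hypothetical toy summaries for one made-up municipality code
census_rows = [
    {"muni_code": 123456, "year": 1991, "mean_income": 500.0},
    {"muni_code": 123456, "year": 2000, "mean_income": 600.0},
    {"muni_code": 123456, "year": 2010, "mean_income": 900.0},
]
sinasc_rows = [
    {"muni_code": 123456, "birth_year": 2001, "mean_brthwt_g": 3150.0},
    {"muni_code": 123456, "birth_year": 2011, "mean_brthwt_g": 3140.0},
    {"muni_code": 123456, "birth_year": 2012, "mean_brthwt_g": 3145.0},
]
joined = join_birth_weight_income(census_rows, sinasc_rows)
```

The 1991 census row and the 2012 SINASC row have no counterpart under this mapping, so they drop out of the joined result.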

Average birth weight vs. average income for each municipality

This plot shows average birth weight vs. average income for each municipality for 2000 and 2010. Each point represents a municipality for the given year. A blue trend line is added. Note that the x-axis is on a log scale.

There is an interesting result here: a clear, non-trivial (although not very large) decrease in overall average birth weight as income increases.

This warrants further exploration.

Average birth weight vs. average income for each municipality by gestational age group

There was a lot of variability in the previous plot when averaging across all births within each municipality. To try to see things more clearly, here we also take into account the most important factor impacting birth weight – gestational age at birth.

This plot shows average birth weight vs. average income for each municipality by gestational age group for 2000 and 2010. Each point represents a municipality for the given year and gestational age group. A blue trend line is added, and the dashed line indicates the cutoff for low birth weight. Note that both the x and y axes are on the log scale, with axis tick labels at “nice” break points.

We again see a very interesting result here, where low-income municipalities have children with higher birth weight on average in all of the pre-term gestational age groups, while income does not seem to be related to birth weight for term babies. Breaking out by gestational age helps us see that the relationship is more pronounced for certain gestational age groups. It also seems more pronounced in 2010 vs. 2000.

This phenomenon is interesting, and I cannot think of a logical explanation for it. It is good here to consider the idea of ecological fallacies, where when looking at relationships based on aggregate measures, it can be dangerous to draw conclusions about individuals.

A closer look at delivery type

One idea to investigate to try to get a better understanding of the lower-income / higher-birth-weight phenomenon that is more pronounced in 2010 vs. 2000 is to look at how the number of births in each gestational age group has changed over time.

Below is a plot showing, for each gestational age group, the percentage of births that fall within that group by delivery type and by year. For any given year, the values across all gestational age groups and delivery types sum to 100%. The axes of each panel are rescaled to fit the data in the panel.

The 32-36 week period is interesting. The percentage of births in this range has increased significantly over time, for both cesarean and vaginal deliveries. Is this increase due to people electing to give birth early, or is something happening in the general population that is skewing births earlier? Also, is this increase in pre-term births more prominent in higher-income areas? If people are electing to deliver early, then birth weights will be lower.

Delivery type by gestational age and income bin

We want to explore the idea that maybe wealthier mothers are electing to have babies earlier than term. One thing we can investigate is whether the patterns of the above plot vary across different income groups. To look at this, we binned municipalities into 5 groups by average income and then used these bins to aggregate births across time, delivery type, and gestational age.

For the lower income groups (R$ 0-200 and R$ 200-400), vaginal delivery always dominates cesarean, even in the most recent years. In these income groups, we also see a large jump in the pre-term birth proportion around 2010, particularly for babies born at 32-36 weeks. For the top three income groups, in the 32-36 week gestational age group, we see cesarean births overtaking vaginal births, with both jumping significantly around 2010.
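Assigning each municipality to an income bin can be sketched as follows, in Python rather than the original R. Only the first two bin boundaries (R$ 200 and R$ 400) come from the text; the remaining edges (600 and 800) are placeholders chosen to give five bins, and the function name is hypothetical.

```python
from bisect import bisect_right

# 200 and 400 appear in the text; 600 and 800 are hypothetical placeholders
INCOME_BIN_EDGES = [200, 400, 600, 800]


def income_bin(mean_income, edges=INCOME_BIN_EDGES):
    """Label the income bin (left-closed, right-open) that a mean income falls in."""
    idx = bisect_right(edges, mean_income)
    lower = 0 if idx == 0 else edges[idx - 1]
    if idx == len(edges):
        return f"R$ {lower}+"  # open-ended top bin
    return f"R$ {lower}-{edges[idx]}"


labels = [income_bin(x) for x in (150, 200, 950)]
```

Births can then be aggregated by this label together with year, delivery type, and gestational age group.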

Approximate weight z-scores

Another way to look at this is whether the difference in weight is

Gestational age groups: Less than 22 weeks, 22-27 weeks, 28-31 weeks, 32-36 weeks, 37-41 weeks, 42 weeks and more.

This plot raises a different way of thinking about what might explain the difference between low and high income. It suggests that low-income municipalities on average have pre-term babies with a much higher birth weight than expected against the WHO standard (z-scores greater than 3), while higher-income municipalities have babies with a more “expected” birth weight for pre-term birth. This warrants some consideration as to why lower-income babies would weigh more than they should at birth.

Note that there are some caveats associated with this plot. First, since we don’t know exact gestational age, we are using the midpoint of each gestational age range as the value against which we compute the z-score. Second, we are computing z-scores of the means, which have washed out the sex of the child, and the z-score standards are different for each case. However, the calculation should still be a useful approximation.
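The approximation described in these caveats can be written down as a short Python sketch (the original analysis was done in R). The function name is hypothetical, and the reference mean and standard deviation below are made-up placeholder numbers, not values from the WHO standard; a real calculation would look them up from the standard for the chosen gestational age.

```python
# Midpoint (in weeks) of each closed gestational age range; the open-ended
# groups ("Less than 22 weeks", "42 weeks and more") have no midpoint and are omitted
GEST_AGE_MIDPOINT_WEEKS = {
    "22-27 weeks": 24.5,
    "28-31 weeks": 29.5,
    "32-36 weeks": 34.0,
    "37-41 weeks": 39.0,
}


def approx_weight_zscore(mean_weight_g, gest_age_group, reference):
    """Approximate z-score of a mean birth weight at the group's midpoint age.

    `reference` maps gestational age in weeks to (mean_g, sd_g).
    """
    weeks = GEST_AGE_MIDPOINT_WEEKS[gest_age_group]
    ref_mean_g, ref_sd_g = reference[weeks]
    return (mean_weight_g - ref_mean_g) / ref_sd_g


# Placeholder reference values for illustration only (NOT the WHO standard)
placeholder_reference = {34.0: (2200.0, 400.0)}
z = approx_weight_zscore(2600.0, "32-36 weeks", placeholder_reference)
```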

SINASC Data Prep

Access

The SINASC data is publicly available for download from the Datasus FTP site.

There is a file for each state and year, and the data currently goes up to 2015.

Data Dictionary

The file located here contains a dictionary for many of the variables in the data, although it does not cover any of the many variables introduced in 2010. This file was used to construct a data dictionary data structure in R that was used to preprocess the data.

Variables in the Dictionary

The following variables with corresponding English-translated names and labels are available in the dictionary:

name        name_en            label_en
NUMERODN    dn_number          DN number sequential by UF and year
LOCNASC     birth_place        Place of birth
CODESTAB    health_estbl_code  Health establishment code
CODBAINASC  birth_nbhd_code    Birth neighborhood code
CODMUNNASC  birth_muni_code    Birth municipal code
IDADEMAE    m_age_yrs          Age of the mother in years
ESTCIVMAE   marital_status     Mother’s marital status
ESCMAE      m_educ             Mother’s education
CODOCUPMAE  occ_code           Mother’s occupation, according to the Brazilian Classification of Occupations (CBO-2002)
QTDFILVIVO  n_live_child       Number of living children
QTDFILMORT  n_dead_child       Number of deceased children
CODBAIRES   res_nbhd_code      Residence neighborhood code
CODMUNRES   m_muni_code        Residence municipal code of the mother
GESTACAO    gest_weeks         Weeks of gestation
GRAVIDEZ    preg_type          Type of pregnancy
PARTO       deliv_type         Type of delivery
CONSULTAS   n_prenatal_visit   Number of prenatal visits
DTNASC      birth_date         Date of birth in ddmmyyyy format
HORANASC    birth_time         Time of birth
SEXO        sex                Sex
APGAR1      apgar1             Apgar in the first minute (00 to 10)
APGAR5      apgar5             Apgar in the fifth minute (00 to 10)
RACACOR     race               Race / Color
PESO        brthwt_g           Birth weight, in grams
IDANOMAL    cong_anom          Congenital anomaly
CODANOMAL   cong_icd10         Code of congenital malformation or chromosomal anomaly, according to ICD-10
DTCADASTRO  sys_reg_date       Date of registration in the system
DTRECEBIM   rec_reg_date       Receipt date at central level, last date registry update
CODINST     reg_gen_code       Registration generation installation code
UFINFORM    rep_uf_code        UF code that reported the record

All Variables

To illustrate the available variables and how they have evolved over time, the plot below shows the variable name on the y-axis and the year on the x-axis. If a dot is plotted for a given variable and year, it means that data is available for that variable in that year. Variable names that are lowercase indicate variables that are present in the data dictionary, while uppercase variable names indicate variables that are not in the data dictionary.

As can be seen, the data dictionary covers the variables that are common across all years, and the dataset I have constructed contains this subset of variables to keep a common data structure across all years.

Given the many more variables that are available from 2010 and beyond, it may be worth considering whether to do a separate analysis with a subset of the data starting at 2010 so that we can make use of these variables.

Preprocessing

  • After downloading the files from the ftp site, I read them into R using the read.dbc package. The format of the files is “.DBC”, which is a compressed DBF file that this package can read in.
  • For each file, I select the 23 variables that all data files have in common.
  • Variables are renamed to be English-readable using the data dictionary.
  • Variables are cast to the appropriate type (such as dates, factors, etc.).
  • Variables that are factors are recast from numbers (e.g. 1, 2) to more meaningful factor levels (e.g. “Male”, “Female”) according to the data dictionary.
  • A few records with strange encodings are fixed.
  • The variable birth_year is added since it is frequently used.
  • The 7th character of birth_muni_code and m_muni_code is dropped to be consistent across all years (2001-2005 have 7 characters while 2006-2015 have 6 characters). The 7th character is extra and not needed.
  • Since only municipal codes are provided, two new variables, birth_state_code and m_state_code are created by merging a municipality / state code lookup table.
  • Municipality codes are converted to integers to save storage space.
  • A few implausible values are set to NA: m_age_yrs of 0 or 99, brthwt_g of 0 or 9999, apgar1 and apgar5 of 99.
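Several of the steps above (date parsing, municipality code truncation, and setting implausible values to missing) can be illustrated on a single record. This is a Python sketch for illustration only; the actual preprocessing was done in R, and the helper names and the toy record (including its made-up municipality code) are hypothetical.

```python
from datetime import datetime


def to_int_or_none(value):
    """Convert a raw string field to int, treating blanks as missing."""
    if value is None or str(value).strip() == "":
        return None
    return int(value)


def none_if(value, implausible):
    """Return None when the value is one of the implausible codes."""
    return None if value in implausible else value


def clean_record(raw):
    """Apply a subset of the preprocessing steps to one raw SINASC-style record."""
    birth_date = datetime.strptime(raw["DTNASC"], "%d%m%Y").date()  # ddmmyyyy
    return {
        # keep the first 6 characters so codes are consistent across years
        "birth_muni_code": int(str(raw["CODMUNNASC"])[:6]),
        "m_muni_code": int(str(raw["CODMUNRES"])[:6]),
        "birth_date": birth_date,
        "birth_year": birth_date.year,
        "m_age_yrs": none_if(to_int_or_none(raw.get("IDADEMAE")), {0, 99}),
        "brthwt_g": none_if(to_int_or_none(raw.get("PESO")), {0, 9999}),
        "apgar1": none_if(to_int_or_none(raw.get("APGAR1")), {99}),
        "apgar5": none_if(to_int_or_none(raw.get("APGAR5")), {99}),
    }


# Hypothetical raw record with made-up values
raw = {
    "DTNASC": "06032015", "CODMUNNASC": "1234567", "CODMUNRES": "123456",
    "IDADEMAE": "27", "PESO": "9999", "APGAR1": "09", "APGAR5": "99",
}
cleaned = clean_record(raw)
```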

Census Data Prep

Text coming soon. See code linked to below for more information.

Code

All of the R code to download, process, and replicate this analysis (including the code that generates this document) is available here.

Note that the SINASC dataset is quite large and I processed it in memory on a mid-2014 MacBook Pro (2.5GHz, 16GB RAM). A machine with less RAM may have difficulty, and you may need to resort to other means. Just reading in the SINASC data takes a few minutes, and doing computations against the data could take ~30 seconds on average. I separated the code for intensive computation of artifacts against the SINASC data into separate files and saved the artifacts for use in exploration and visualization.