Chapter 2 Reading in Labelled Data: Stata

The first step to working with labeled data in R is loading a labeled data file and making sure the labels are attached. Several of the packages in Chapter 1 have functions for loading labeled data, while others have functions for checking for labels. A few packages do both. Table 2.1 shows which packages read .dta files into R and the functions that do so. Table 2.2 shows which packages let you check the labels of the loaded data.

Table 2.1: Functions to read in Stata data
Package        Function for loading .dta file    Stata versions
haven          read_dta()
sjlabelled     read_stata()
foreign        read.dta()                        Up to 12
readstata13    read.dta13()                      Past 12

Table 2.2: Functions to look at labels
Package        Labels Exist      View Variable Label    View Value Labels
labelled       is.labelled()     var_label()            val_labels()
sjlabelled     is_labelled()     get_label()            get_labels()
expss          is.labelled()     var_lab()              val_lab()
readstata13    N/A               varlabel()             get.label()

Stata can attach labels both to the variable name and the variable values. Packages that work with labels tend to check for variable and value labels separately.

This chapter is organized as follows. Each section reads in .dta files according to a package: haven, sjlabelled, and readstata13. The foreign package is not covered as foreign::read.dta() is frozen and will not be updated to support Stata formats after 12. Once the data is loaded, we look at the class of the data as well as the class and structure of a variable that should be labeled (hint: not all packages retain labels of the data). The data used is City Temps, an example data frame found in Stata. The variable division has labeled numeric values. To follow along, you can find the structure of the data in the Appendix.

Each section has a subsection that fills in the table of functions to look at labels. That is, a variable that should be labeled is passed into each function in the table, and TRUE, FALSE, or ERROR is recorded as the result. A TRUE result means the function finds the labels. A FALSE result means labels should appear, but the function returns NULL (or, for the existence checks, FALSE). An ERROR result means that an error occurs even when the function is given proper syntax. This happens most commonly when the data is not of an appropriate class for the function.
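The recording convention can be sketched as a small helper in base R. This is only an illustration: label_status() and the three toy functions are hypothetical names made up here, not part of any package discussed in this chapter.

```r
# Hypothetical helper: classify the outcome of a label-lookup call
# as "TRUE" (something returned), "FALSE" (NULL returned), or "ERROR".
label_status <- function(f, ...) {
  tryCatch({
    result <- f(...)
    if (is.null(result)) "FALSE" else "TRUE"
  }, error = function(e) "ERROR")
}

# Toy stand-ins for label functions (assumptions for illustration only)
x <- structure(c(1, 2), label = "Census division")
find_label   <- function(x) attr(x, "label", exact = TRUE)
find_missing <- function(x) attr(x, "no.such.attribute", exact = TRUE)
always_fails <- function(x) stop("object is not of the expected class")

status_found   <- label_status(find_label, x)
status_missing <- label_status(find_missing, x)
status_error   <- label_status(always_fails, x)
```

The same helper could wrap the real package functions, provided the packages are installed.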

A note: The haven and labelled packages were written to work together, and thus inherit some functions from each other. In particular, labelled::is.labelled() is identical to haven::is.labelled(). Therefore, even though haven is a separate package, when it comes to checking for the existence of labels, we will only use labelled to avoid redundancy.

Another note: If you do not want to gain a deeper understanding of how the various packages treat labeled data and just want to see which functions work with which data inputs, you can skip to the summary tables in Section 2.4.

2.1 Loading .dta file with haven

The CityTemp_haven data structure can be copied from here.

CityTemp_haven <- haven::read_dta("CityTemp.dta")

names(attributes(CityTemp_haven))
## [1] "class"     "row.names" "label"     "names"
class(CityTemp_haven)
## [1] "tbl_df"     "tbl"        "data.frame"
attributes(CityTemp_haven$division)
## $label
## [1] "Census division"
## 
## $format.stata
## [1] "%16.0g"
## 
## $class
## [1] "haven_labelled" "vctrs_vctr"     "double"        
## 
## $labels
##   N Eng Mid Atl     ENC     WNC   S Atl     ESC     WSC     Mtn Pacific 
##       1       2       3       4       5       6       7       8       9
str(CityTemp_haven$division)
##  dbl+lbl [1:956] 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1,...
##  @ label       : chr "Census division"
##  @ format.stata: chr "%16.0g"
##  @ labels      : Named num [1:9] 1 2 3 4 5 6 7 8 9
##   ..- attr(*, "names")= chr [1:9] "N Eng" "Mid Atl" "ENC" "WNC" ...

Since the haven package is part of the tidyverse, read_dta() reads in the data as a tibble. The variable division has four attributes: label, format.stata, class, and labels. Importantly, the variable belongs to the "haven_labelled" class. The "haven_labelled" class is another product of the tidyverse, so functions from outside the tidyverse may not recognize it.
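To make the attribute layout concrete, here is a minimal sketch that builds a vector by hand with the same four attributes. It uses only base R and does not call haven, so division_mock is a stand-in with made-up values, not a real "haven_labelled" object created by the package.

```r
# Mock vector with the same attribute layout as CityTemp_haven$division.
# Built by hand in base R; haven itself is not used here.
division_mock <- structure(
  c(1, 1, 2),
  label = "Census division",
  format.stata = "%16.0g",
  labels = c("N Eng" = 1, "Mid Atl" = 2),
  class = c("haven_labelled", "vctrs_vctr", "double")
)

attribute_names <- names(attributes(division_mock))
has_class <- inherits(division_mock, "haven_labelled")

# Dropping the class leaves the label attributes in place:
# class membership and label storage are two separate things.
stripped <- unclass(division_mock)
still_has_class <- inherits(stripped, "haven_labelled")
stripped_label  <- attr(stripped, "label", exact = TRUE)
```

The distinction between the class and the attributes matters in the following subsections, where some functions test the class and others read the attributes.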

2.1.1 Label Existence

labelled::is.labelled(CityTemp_haven$division)
## [1] TRUE
sjlabelled::is_labelled(CityTemp_haven$division)
## [1] TRUE
expss::is.labelled(CityTemp_haven$division)
## [1] FALSE

According to the packages labelled and sjlabelled, the division variable is labeled, but the expss package cannot find any labels. Why? Let’s look at what these functions are looking for.

labelled::is.labelled
## function (x) 
## inherits(x, "haven_labelled")
## <bytecode: 0x7fc916a31200>
## <environment: namespace:haven>
sjlabelled::is_labelled
## function (x) 
## inherits(x, c("labelled", "haven_labelled"))
## <bytecode: 0x7fc9164acc80>
## <environment: namespace:sjlabelled>
expss::is.labelled
## function (x) 
## {
##     inherits(x, "labelled")
## }
## <bytecode: 0x7fc91648b630>
## <environment: namespace:expss>

The function labelled::is.labelled() returns TRUE if the object is of the class "haven_labelled". The function sjlabelled::is_labelled() returns TRUE if the object is of class "haven_labelled" or "labelled". The function expss::is.labelled() returns TRUE if the object is of the class "labelled". When we read in the data, we saw that the variable division belongs to the class "haven_labelled", which causes labelled::is.labelled() and sjlabelled::is_labelled() to return TRUE, and expss::is.labelled() to return FALSE.
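The three checks can be compared side by side without installing any of the packages. The sketch below copies the function bodies from the printed output above into local functions (the local names are made up here) and applies them to a hand-built mock vector.

```r
# Local re-statements of the three checks; bodies copied from the output above.
checks <- list(
  labelled   = function(x) inherits(x, "haven_labelled"),
  sjlabelled = function(x) inherits(x, c("labelled", "haven_labelled")),
  expss      = function(x) inherits(x, "labelled")
)

# Mock objects (assumptions): one per class of interest
haven_style    <- structure(c(1, 2), class = "haven_labelled")
labelled_style <- structure(c(1, 2), class = "labelled")
plain_factor   <- factor(c("a", "b"))

on_haven    <- sapply(checks, function(f) f(haven_style))
on_labelled <- sapply(checks, function(f) f(labelled_style))
on_factor   <- sapply(checks, function(f) f(plain_factor))
```

Each result is a named logical vector with one entry per package, which makes the class dependence of each check easy to read off.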

The next step is to try to see the variable and value labels.

2.1.2 Variable Labels

labelled::var_label(CityTemp_haven$division)
## [1] "Census division"
sjlabelled::get_label(CityTemp_haven$division)
## [1] "Census division"
expss::var_lab(CityTemp_haven$division)
## [1] "Census division"
readstata13::varlabel(CityTemp_haven,"division")
## Error in names(varlabel) <- vnames: attempt to set an attribute on NULL

Both the labelled and sjlabelled packages told us the variable division is labeled. It makes sense then that we can use their functions to look at the variable’s labels. The expss function, on the other hand, told us the variable was not labeled, and yet we can still see the labels.

It takes a little investigating, but eventually we can find that labelled::var_label(), sjlabelled::get_label(), and expss::var_lab() all fundamentally call the same method: attr(x, "label", exact = TRUE). As long as an object’s label is stored in an attribute called "label", any of these functions will return the label.
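A short base R sketch shows why the class does not matter for this lookup, and what exact = TRUE guards against. The objects below are mock data built by hand; the behavior shown is that of base R's attr(), not of any package function.

```r
labs <- c("N Eng" = 1, "Mid Atl" = 2)

# A plain factor that carries a "label" attribute (mock object).
# The lookup succeeds regardless of the object's class.
f <- factor(c("1", "2"))
attr(f, "label") <- "Census division"
factor_label <- attr(f, "label", exact = TRUE)

# An object that has "labels" but no "label" attribute.
only_labels <- structure(c(1, 2), labels = labs)

# Without exact = TRUE, attr() falls back to a unique partial match,
# so asking for "label" silently returns the "labels" attribute.
partial <- attr(only_labels, "label")

# With exact = TRUE, the missing attribute is reported as NULL.
strict <- attr(only_labels, "label", exact = TRUE)
```

Since variable labels ("label") and value labels ("labels") differ by a single character, exact matching keeps the two from being confused.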

The functions in the readstata13 package are formulated differently. Trying to look at the variable label using readstata13::varlabel() throws an error. We can take a closer look at the function to see why.

readstata13::varlabel
## function (dat, var.name = NULL, lang = NA) 
## {
##     vnames <- names(dat)
##     if (is.na(lang) | lang == get.lang(dat, F)$default) {
##         varlabel <- attr(dat, "var.labels")
##         names(varlabel) <- vnames
##     }
##     else if (is.character(lang)) {
##         ex <- attr(dat, "expansion.fields")
##         varname <- sapply(ex[grep(paste0("_lang_v_", lang), ex)], 
##             function(x) x[1])
##         varlabel <- sapply(ex[grep(paste0("_lang_v_", lang), 
##             ex)], function(x) x[3])
##         names(varlabel) <- varname
##     }
##     if (is.null(var.name)) {
##         return(varlabel[vnames])
##     }
##     else {
##         return(varlabel[var.name])
##     }
## }
## <bytecode: 0x7fc9164788c8>
## <environment: namespace:readstata13>

The third line of the function body is varlabel <- attr(dat, "var.labels"). We know from looking at the attributes of CityTemp_haven earlier that the data frame does not have an attribute called "var.labels". When attr(CityTemp_haven, "var.labels") is evaluated, the result is NULL, hence the error message attempt to set an attribute on NULL.
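The failure can be reproduced in base R without readstata13. The sketch below uses a small mock data frame (not the City Temps data) and repeats the two relevant steps of the function body shown above.

```r
# Mock data frame with no "var.labels" attribute
df <- data.frame(division = c(1, 1, 2))

# Step 1 of the function body: the attribute lookup yields NULL
var_labels <- attr(df, "var.labels")

# Step 2: assigning names to NULL is an error; capture it instead of stopping
result <- tryCatch({
  names(var_labels) <- names(df)
  var_labels
}, error = function(e) e)

lookup_was_null <- is.null(var_labels)
failed <- inherits(result, "error")
```

Calling conditionMessage(result) shows the captured error text.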

2.1.3 Value Labels

labelled::val_labels(CityTemp_haven$division)
##   N Eng Mid Atl     ENC     WNC   S Atl     ESC     WSC     Mtn Pacific 
##       1       2       3       4       5       6       7       8       9
sjlabelled::get_labels(CityTemp_haven$division)
## [1] "N Eng"   "Mid Atl" "ENC"     "WNC"     "S Atl"   "ESC"    
## [7] "WSC"     "Mtn"     "Pacific"
expss::val_lab(CityTemp_haven$division)
##   N Eng Mid Atl     ENC     WNC   S Atl     ESC     WSC     Mtn Pacific 
##       1       2       3       4       5       6       7       8       9
readstata13::get.label(CityTemp_haven,"division")
## NULL

The same pattern seen for variable labels is repeated for value labels. The functions from the packages labelled, sjlabelled, and expss all return the value labels, even though expss::is.labelled() returned FALSE. While readstata13::get.label() does not throw an error, it returns NULL.

The functions labelled::val_labels(), sjlabelled::get_labels(), and expss::val_lab() are internally very similar to their variable label counterparts. All three functions call attr(x, "labels", exact = TRUE). The function sjlabelled::get_labels() further processes the attributes, resulting in a slightly different output than labelled::val_labels() and expss::val_lab(). Notably, labelled::val_labels() only does this if the object belongs to either the "haven_labelled" class or the "data.frame" class. Otherwise, it will just return NULL.

The reason behind readstata13::get.label() returning NULL is similar to that of its variable label counterpart. Let’s look at the function.

readstata13::get.label
## function (dat, label.name) 
## {
##     return(attr(dat, "label.table")[label.name][[1]])
## }
## <bytecode: 0x7fc916483070>
## <environment: namespace:readstata13>

It turns out readstata13::get.label() is a straightforward wrapper that pulls the value labels of a specific variable from the label.table attribute of the data frame. Again, we know that the CityTemp_haven data frame does not have an attribute named "label.table". When we run attr(CityTemp_haven, "label.table"), we get NULL. Any indexing of NULL will simply return NULL.
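The NULL-in, NULL-out behavior can be checked directly in base R. The sketch below copies the one-line body of get.label() into a local function (lookup() is a name made up here) and runs it on a mock data frame, first without and then with a hand-made "label.table" attribute.

```r
# Local copy of the lookup expression from the function body above
lookup <- function(dat, label.name) {
  attr(dat, "label.table")[label.name][[1]]
}

# Mock data frame with no "label.table" attribute
df <- data.frame(division = c(1, 1, 2))
without_table <- lookup(df, "division")

# Attach a hand-made label table (mock values) and look again
attr(df, "label.table") <- list(division = c("N Eng" = 1, "Mid Atl" = 2))
with_table <- lookup(df, "division")
```

No error is raised in the first call because both `[` and `[[` return NULL when applied to NULL.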

With that, we can fill in our table for data read into R using the haven package.

Table 2.3: Table 2.2 filled in with data read in by haven::read_dta()
Package        Labels Exist    Variable Label    Value Labels
labelled       TRUE            TRUE              TRUE
sjlabelled     TRUE            TRUE              TRUE
expss          FALSE           TRUE              TRUE
readstata13    N/A             ERROR             FALSE

2.2 Loading .dta file with sjlabelled

The CityTemp_sjlabelled data structure can be copied from here.

CityTemp_sjlabelled <- sjlabelled::read_stata("CityTemp.dta")
## Converting atomic to factors. Please wait...
names(attributes(CityTemp_sjlabelled))
## [1] "names"     "class"     "row.names"
class(CityTemp_sjlabelled)
## [1] "data.frame"
attributes(CityTemp_sjlabelled$division)
## $levels
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9"
## 
## $class
## [1] "factor"
## 
## $labels
##   N Eng Mid Atl     ENC     WNC   S Atl     ESC     WSC     Mtn Pacific 
##       1       2       3       4       5       6       7       8       9 
## 
## $label
## [1] "Census division"
str(CityTemp_sjlabelled$division)
##  Factor w/ 9 levels "1","2","3","4",..: 1 1 2 1 1 2 2 1 2 2 ...
##  - attr(*, "labels")= Named num [1:9] 1 2 3 4 5 6 7 8 9
##   ..- attr(*, "names")= chr [1:9] "N Eng" "Mid Atl" "ENC" "WNC" ...
##  - attr(*, "label")= chr "Census division"

The sjlabelled package is not part of the tidyverse. We see the data read in by sjlabelled::read_stata() only belongs to the class "data.frame". It is not a tibble. The variable division again has four attributes. The attributes are similar to those of the variable division when the data set is read in with haven::read_dta(). There are two notable differences. First, the class is "factor", whereas the division variable from CityTemp_haven belonged to the classes "haven_labelled", "vctrs_vctr", and "double". Second, since the variable is a factor, it has an attribute levels. The levels attribute of division from CityTemp_sjlabelled takes the place of the format.stata attribute that division from CityTemp_haven has. An important similarity between the two variables is that they both have attributes named label and labels. With this knowledge, a reader could reasonably fill out their expectations for Table 2.2 with sjlabelled loaded data.

2.2.1 Label Existence

labelled::is.labelled(CityTemp_sjlabelled$division)
## [1] FALSE
sjlabelled::is_labelled(CityTemp_sjlabelled$division)
## [1] FALSE
expss::is.labelled(CityTemp_sjlabelled$division)
## [1] FALSE

All three functions return FALSE. This should not be surprising, since the variable division in this case does not belong to either the "haven_labelled" class or the "labelled" class. Thus, even when the user uses the sjlabelled package both to load the data and to check for labels, sjlabelled will report that the data is not labeled.

2.2.2 Variable Labels

labelled::var_label(CityTemp_sjlabelled$division)
## [1] "Census division"
sjlabelled::get_label(CityTemp_sjlabelled$division)
## [1] "Census division"
expss::var_lab(CityTemp_sjlabelled$division)
## [1] "Census division"
readstata13::varlabel(CityTemp_sjlabelled,"division")
## Error in names(varlabel) <- vnames: attempt to set an attribute on NULL

Just as with the haven loaded data, the first three functions return the variable label, while readstata13::varlabel() throws an error. The reasoning is exactly the same as explained in Section 2.1.2: labelled::var_label(), sjlabelled::get_label(), and expss::var_lab() all look for an attribute label on the variable, while readstata13::varlabel() looks for an attribute var.labels on the data frame. The former exists while the latter does not.

2.2.3 Value Labels

labelled::val_labels(CityTemp_sjlabelled$division)
## NULL
sjlabelled::get_labels(CityTemp_sjlabelled$division)
## [1] "N Eng"   "Mid Atl" "ENC"     "WNC"     "S Atl"   "ESC"    
## [7] "WSC"     "Mtn"     "Pacific"
expss::val_lab(CityTemp_sjlabelled$division)
##   N Eng Mid Atl     ENC     WNC   S Atl     ESC     WSC     Mtn Pacific 
##       1       2       3       4       5       6       7       8       9
readstata13::get.label(CityTemp_sjlabelled,"division")
## NULL

In a turn of events, sjlabelled::get_labels() and expss::val_lab() return value labels, while both labelled::val_labels() and readstata13::get.label() return NULL. But… the functions from sjlabelled, expss and labelled are supposed to return the value of the labels attribute, right? WRONG. Recall: labelled::val_labels() will only return the value of the object’s labels attribute if the object belongs to the "haven_labelled" or "data.frame" class. Otherwise, it will return NULL, as it does in this case. The variable division from the CityTemp_sjlabelled data only belongs to the "factor" class, so labelled::val_labels() returns NULL.
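The class dependence described here is what S3 method dispatch would produce. The sketch below is an assumption about the mechanism, written from the behavior described in this chapter rather than copied from the labelled source code; val_labels_sketch() and the mock objects are hypothetical.

```r
# Hypothetical generic that mimics the described behavior of val_labels()
val_labels_sketch <- function(x) UseMethod("val_labels_sketch")

# Objects of any other class (including factors) fall through to NULL
val_labels_sketch.default <- function(x) NULL

# "haven_labelled" vectors return their "labels" attribute
val_labels_sketch.haven_labelled <- function(x) attr(x, "labels", exact = TRUE)

# Data frames apply the generic to each column
val_labels_sketch.data.frame <- function(x) lapply(x, val_labels_sketch)

labs <- c("N Eng" = 1, "Mid Atl" = 2)

# Mock objects: same "labels" attribute, different classes
haven_style  <- structure(c(1, 2), labels = labs, class = "haven_labelled")
factor_style <- structure(factor(c("1", "2")), labels = labs)

from_haven  <- val_labels_sketch(haven_style)
from_factor <- val_labels_sketch(factor_style)
```

Both mock objects store identical value labels; only the class decides whether the labels are returned.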

The function readstata13::get.label() runs into the same problem as before; namely, the data CityTemp_sjlabelled does not have an attribute "label.table".

Now, we can fill in Table 2.2 for data read in by sjlabelled::read_stata().

Table 2.4: Table 2.2 filled in with data read in by sjlabelled::read_stata()
Package        Labels Exist    Variable Label    Value Labels
labelled       FALSE           TRUE              FALSE
sjlabelled     FALSE           TRUE              TRUE
expss          FALSE           TRUE              TRUE
readstata13    N/A             ERROR             FALSE

2.3 Loading .dta file with readstata13

The final examination of this chapter is data read in by readstata13::read.dta13(). The CityTemp_readstata13 data structure can be copied from here.

CityTemp_readstata13 <- readstata13::read.dta13("CityTemp.dta")

names(attributes(CityTemp_readstata13))
##  [1] "row.names"        "names"            "datalabel"       
##  [4] "time.stamp"       "formats"          "types"           
##  [7] "val.labels"       "var.labels"       "version"         
## [10] "label.table"      "expansion.fields" "byteorder"       
## [13] "orig.dim"         "data.label"       "class"
class(CityTemp_readstata13)
## [1] "data.frame"
attributes(CityTemp_readstata13$division)
## $levels
## [1] "N Eng"   "Mid Atl" "ENC"     "WNC"     "S Atl"   "ESC"    
## [7] "WSC"     "Mtn"     "Pacific"
## 
## $class
## [1] "factor"
str(CityTemp_readstata13$division)
##  Factor w/ 9 levels "N Eng","Mid Atl",..: 1 1 2 1 1 2 2 1 2 2 ...

It is immediately clear that data read into R by readstata13::read.dta13() has an entirely different structure than data read in by haven::read_dta() or sjlabelled::read_stata(). The data frame CityTemp_readstata13 has 15 attributes instead of four like CityTemp_haven or three like CityTemp_sjlabelled. Conversely, its variable division only has two attributes, levels and class, compared to the four that the division variables from CityTemp_haven and CityTemp_sjlabelled have. Notice that division has neither a label nor a labels attribute, and belongs to the class "factor".

2.3.1 Label Existence

labelled::is.labelled(CityTemp_readstata13$division)
## [1] FALSE
sjlabelled::is_labelled(CityTemp_readstata13$division)
## [1] FALSE
expss::is.labelled(CityTemp_readstata13$division)
## [1] FALSE

As should be expected, all three functions checking for labels return FALSE. The variable division only belongs to the "factor" class, and these functions are checking if the object passed to them belongs to either the "labelled" or "haven_labelled" class.

2.3.2 Variable Labels

labelled::var_label(CityTemp_readstata13$division)
## NULL
sjlabelled::get_label(CityTemp_readstata13$division)
## NULL
expss::var_lab(CityTemp_readstata13$division)
## NULL
readstata13::varlabel(CityTemp_readstata13,"division")
##          division 
## "Census division"

At this point the user should not be surprised that none of labelled::var_label(), sjlabelled::get_label(), and expss::var_lab() returns a variable label. The variable division in this case only has two attributes, neither of which is label. Note that the "factor" class is not the obstacle: in Section 2.2.2, division was also a factor, and all three functions returned its label because the label attribute was present.

For the first time, however, we see that readstata13::varlabel() does not return an error, and in fact returns the correct label for the variable. This is because the data frame CityTemp_readstata13 has an attribute "var.labels" that contains all the variable labels for all the variables in the data frame. The function readstata13::varlabel() uses this to look up the specific label for division.
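The difference comes down to where the label is stored: on the data frame as a whole, or on the individual column. The sketch below imitates the data-frame-level storage with a mock data frame in base R. The second column, temp, and its label are invented for illustration.

```r
# Mock data frame; labels are stored on the data frame, one per column
df <- data.frame(division = c(1, 1, 2), temp = c(30, 32, 41))
attr(df, "var.labels") <- c("Census division", "Temperature")

# Same steps as the default branch of the varlabel() body shown in Section 2.1.2
var_labels <- attr(df, "var.labels")
names(var_labels) <- names(df)
division_label <- var_labels["division"]

# The column itself carries no "label" attribute, so attribute-based
# lookups on the column find nothing
column_label <- attr(df$division, "label", exact = TRUE)
```

This is why the column-oriented functions and the data-frame-oriented function succeed on complementary inputs.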

2.3.3 Value Labels

labelled::val_labels(CityTemp_readstata13$division)
## NULL
sjlabelled::get_labels(CityTemp_readstata13$division)
## [1] "N Eng"   "Mid Atl" "ENC"     "WNC"     "S Atl"   "ESC"    
## [7] "WSC"     "Mtn"     "Pacific"
expss::val_lab(CityTemp_readstata13$division)
## NULL
readstata13::get.label(CityTemp_readstata13,"division")
##   N Eng Mid Atl     ENC     WNC   S Atl     ESC     WSC     Mtn Pacific 
##       1       2       3       4       5       6       7       8       9

Neither labelled::val_labels() nor expss::val_lab() returns value labels. Recall that both functions look for the attribute labels, which division in this case does not have.

The sjlabelled::get_labels() function, on the other hand, does return value labels. Huh? Didn’t we establish that sjlabelled::get_labels() also looks for the labels attribute, just like the others? Yes. But, recall it was also mentioned that sjlabelled::get_labels() is a bit more complicated than labelled::val_labels() and expss::val_lab(). Unlike the other two functions, if sjlabelled::get_labels() does not see a labels attribute, it will look for factor levels. Here, division belongs to the "factor" class and has a "levels" attribute, so sjlabelled::get_labels() returns the factor levels as value labels.
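The fallback can be sketched in a few lines of base R. This is a simplified illustration of the behavior described above, not the actual sjlabelled implementation; get_labels_sketch() is a hypothetical name and the inputs are mock data.

```r
# Hypothetical sketch: prefer the "labels" attribute, fall back to factor levels
get_labels_sketch <- function(x) {
  labs <- attr(x, "labels", exact = TRUE)
  if (!is.null(labs)) {
    return(names(labs))   # label text is stored as the names of the vector
  }
  if (is.factor(x)) {
    return(levels(x))     # no "labels" attribute: use the factor levels
  }
  NULL
}

# Mock inputs
with_labels  <- structure(c(1, 2), labels = c("N Eng" = 1, "Mid Atl" = 2))
plain_factor <- factor(c("N Eng", "Mid Atl"), levels = c("N Eng", "Mid Atl"))
plain_number <- c(1, 2)

from_attribute <- get_labels_sketch(with_labels)
from_levels    <- get_labels_sketch(plain_factor)
from_nothing   <- get_labels_sketch(plain_number)
```

With this ordering, a factor that has both levels and a labels attribute, like division in CityTemp_sjlabelled, yields the label text rather than the levels.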

Finally, readstata13::get.label() returns the value labels for division. The data frame CityTemp_readstata13 has a "label.table" attribute. From that attribute, readstata13::get.label() can look up the value labels specific to division.

With that, we can fill out our final iteration of Table 2.2.

Table 2.5: Table 2.2 filled in with data read in by readstata13::read.dta13()
Package        Labels Exist    Variable Label    Value Labels
labelled       FALSE           FALSE             FALSE
sjlabelled     FALSE           FALSE             TRUE
expss          FALSE           FALSE             FALSE
readstata13    N/A             TRUE              TRUE

2.4 Function Cohesion Summary

2.4.1 Variable Labels

City Temp Data               labelled::var_label()    sjlabelled::get_label()    expss::var_lab()    readstata13::varlabel()
haven::read_dta()            TRUE                     TRUE                       TRUE                ERROR
sjlabelled::read_stata()     TRUE                     TRUE                       TRUE                ERROR
readstata13::read.dta13()    FALSE                    FALSE                      FALSE               TRUE

2.4.2 Value Labels

City Temp Data               labelled::val_labels()    sjlabelled::get_labels()    expss::val_lab()    readstata13::get.label()
haven::read_dta()            TRUE                      TRUE                        TRUE                FALSE
sjlabelled::read_stata()     FALSE                     TRUE                        TRUE                FALSE
readstata13::read.dta13()    FALSE                     TRUE                        FALSE               TRUE