Chapter 2 Reading in Labelled Data: Stata
The first step in working with labeled data in R is loading a labeled data file and making sure the labels are attached. Several of the packages in Chapter 1 have functions for loading labeled data, while others have functions for checking for labels. A few packages do both. The first table shows which packages read .dta files into R and the functions that do so. The second table shows which packages let you check the labels of the loaded data.
Package | Function for loading .dta file |
---|---|
haven | read_dta() |
sjlabelled | read_stata() |
foreign | read.dta() (Stata versions up to 12) |
readstata13 | read.dta13() (Stata versions past 12) |
Package | Labels Exist | View Variable Label | View Value Labels |
---|---|---|---|
labelled | is.labelled() | var_label() | val_labels() |
sjlabelled | is_labelled() | get_label() | get_labels() |
expss | is.labelled() | var_lab() | val_lab() |
readstata13 | N/A | varlabel() | get.label() |
Stata can attach labels both to the variable name and the variable values. Packages that work with labels tend to check for variable and value labels separately.
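To make the distinction concrete, here is a minimal base R sketch of a vector that carries both kinds of label as attributes. The object division_toy and its values are made up for illustration. The attribute names label and labels are an assumption at this point, chosen to match the convention that appears in the haven output later in the chapter.

```r
# A toy vector (hypothetical data) carrying both kinds of label as attributes.
division_toy <- structure(
  c(1, 1, 2, 3),
  label  = "Census division",                        # variable label: describes the whole variable
  labels = c("N Eng" = 1, "Mid Atl" = 2, "ENC" = 3)  # value labels: one name per numeric code
)

# The two kinds of label live in separate attributes and are retrieved separately.
variable_label <- attr(division_toy, "label", exact = TRUE)
value_labels   <- attr(division_toy, "labels", exact = TRUE)

print(variable_label)
print(value_labels)
```

Because the two labels are stored independently, a variable can have one without the other, which is why the packages check for them with separate functions.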
This chapter is organized as follows. Each section reads in .dta files with one package: haven, sjlabelled, and readstata13. The foreign package is not covered because foreign::read.dta() is frozen and will not be updated to support Stata formats after version 12. Once the data is loaded, we look at the class of the data as well as the class and structure of a variable that should be labeled (hint: not all packages retain the labels of the data). The data used is City Temps, an example data set found in Stata. The variable division has labeled numeric values. To follow along, you can find the structure of the data in the Appendix.
Each section has subsections that fill in the table of functions for looking at labels. That is, a variable that should be labeled is passed to each function in the table, and TRUE, FALSE, or ERROR is recorded in the table as a result. A FALSE result means labels should appear, but the function returns NULL (or FALSE, for the existence checks). An ERROR result means that an error occurs even when the function is given proper syntax. This happens most commonly when the data is not of an appropriate class for the function.
A note: The haven and labelled packages were written to work together, and thus share some functions. In particular, labelled::is.labelled() is identical to haven::is.labelled(). Therefore, even though haven is a separate package, when it comes to checking for the existence of labels, we will only use labelled to avoid redundancy.
Another note: If you do not want to gain a deeper understanding of how the various packages treat labeled data and just want to see which functions work with which data inputs, you can use the following links to skip to the answers.
- Answers to Table 2.2 with data read in with haven::read_dta()
- Answers to Table 2.2 with data read in with sjlabelled::read_stata()
- Answers to Table 2.2 with data read in with readstata13::read.dta13()
2.1 Loading .dta file with haven
The CityTemp_haven data structure can be copied from here.
CityTemp_haven <- haven::read_dta("CityTemp.dta")
names(attributes(CityTemp_haven))
## [1] "class" "row.names" "label" "names"
class(CityTemp_haven)
## [1] "tbl_df" "tbl" "data.frame"
attributes(CityTemp_haven$division)
## $label
## [1] "Census division"
##
## $format.stata
## [1] "%16.0g"
##
## $class
## [1] "haven_labelled" "vctrs_vctr" "double"
##
## $labels
## N Eng Mid Atl ENC WNC S Atl ESC WSC Mtn Pacific
## 1 2 3 4 5 6 7 8 9
str(CityTemp_haven$division)
## dbl+lbl [1:956] 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1,...
## @ label : chr "Census division"
## @ format.stata: chr "%16.0g"
## @ labels : Named num [1:9] 1 2 3 4 5 6 7 8 9
## ..- attr(*, "names")= chr [1:9] "N Eng" "Mid Atl" "ENC" "WNC" ...
Since the haven package is part of the tidyverse, read_dta() naturally reads in the data as a tibble. The variable division has four attributes: label, format.stata, class, and labels. Importantly, the variable belongs to the "haven_labelled" class. The "haven_labelled" class is another product of the tidyverse, so functions from outside the tidyverse may not work very well with it.
2.1.1 Label Existence
labelled::is.labelled(CityTemp_haven$division)
## [1] TRUE
sjlabelled::is_labelled(CityTemp_haven$division)
## [1] TRUE
expss::is.labelled(CityTemp_haven$division)
## [1] FALSE
According to the packages labelled and sjlabelled, the division variable is labeled, but the expss package reports that it is not. Why? Let’s look at what these functions are looking for.
labelled::is.labelled
## function (x)
## inherits(x, "haven_labelled")
## <bytecode: 0x7fc916a31200>
## <environment: namespace:haven>
sjlabelled::is_labelled
## function (x)
## inherits(x, c("labelled", "haven_labelled"))
## <bytecode: 0x7fc9164acc80>
## <environment: namespace:sjlabelled>
expss::is.labelled
## function (x)
## {
## inherits(x, "labelled")
## }
## <bytecode: 0x7fc91648b630>
## <environment: namespace:expss>
The function labelled::is.labelled() returns TRUE if the object is of the class "haven_labelled". The function sjlabelled::is_labelled() returns TRUE if the object is of class "haven_labelled" or "labelled". The function expss::is.labelled() returns TRUE if the object is of the class "labelled". When we read in the data, we saw that the variable division belongs to the class "haven_labelled", which causes labelled::is.labelled() and sjlabelled::is_labelled() to return TRUE, and expss::is.labelled() to return FALSE.
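The three checks can be reproduced without loading any of the packages. The following is a minimal base R sketch, assuming the inherits() calls printed above are the whole story. The objects x_haven, x_old, and x_fct and the check_* helpers are hypothetical stand-ins, not package functions.

```r
# Toy objects (hypothetical) that differ only in their class attribute.
x_haven <- structure(c(1, 2), class = c("haven_labelled", "vctrs_vctr", "double"))
x_old   <- structure(c(1, 2), class = "labelled")
x_fct   <- factor(c("a", "b"))

# Stand-ins for the three checks, written as the inherits() calls shown above.
check_labelled   <- function(x) inherits(x, "haven_labelled")
check_sjlabelled <- function(x) inherits(x, c("labelled", "haven_labelled"))
check_expss      <- function(x) inherits(x, "labelled")

# Apply every check to every object and collect the answers in one table.
objs <- list(haven_labelled = x_haven, labelled = x_old, factor = x_fct)
results <- data.frame(
  labelled   = vapply(objs, check_labelled, logical(1)),
  sjlabelled = vapply(objs, check_sjlabelled, logical(1)),
  expss      = vapply(objs, check_expss, logical(1))
)
print(results)
```

Each row of results is an object class and each column is a check, so the disagreement between the packages comes down to which class names each one accepts.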
The next step is to try to see the variable and value labels.
2.1.2 Variable Labels
labelled::var_label(CityTemp_haven$division)
## [1] "Census division"
sjlabelled::get_label(CityTemp_haven$division)
## [1] "Census division"
expss::var_lab(CityTemp_haven$division)
## [1] "Census division"
readstata13::varlabel(CityTemp_haven, "division")
## Error in names(varlabel) <- vnames: attempt to set an attribute on NULL
Both the labelled and sjlabelled packages told us the variable division is labeled. It makes sense then that we can use their functions to look at the variable’s label. The expss function, on the other hand, told us the variable was not labeled, and yet we can still see the label.
It takes a little investigating, but eventually we can find that labelled::var_label(), sjlabelled::get_label(), and expss::var_lab() all fundamentally make the same call: attr(x, "label", exact = TRUE). As long as an object’s label is stored in an attribute called "label", any of these functions will return the label.
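The exact = TRUE argument in that call is worth a closer look, because base R's attr() falls back to partial matching of attribute names by default. The sketch below uses a made-up vector x that has value labels but no variable label, to show what would go wrong without it.

```r
# Toy vector (hypothetical) with value labels but no variable label.
x <- structure(c(1, 2), labels = c("N Eng" = 1, "Mid Atl" = 2))

# Default partial matching: "label" is a unique partial match for "labels",
# so the value labels come back even though we asked for the variable label.
partial <- attr(x, "label")

# Exact matching: there is no attribute named exactly "label", so the result is NULL.
exact <- attr(x, "label", exact = TRUE)

print(partial)
print(exact)
```

With exact = TRUE, a variable that only has value labels is correctly reported as having no variable label.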
The functions in the readstata13 package are built differently. Trying to look at the variable label using readstata13::varlabel() throws an error. We can take a closer look at the function to see why.
readstata13::varlabel
## function (dat, var.name = NULL, lang = NA)
## {
## vnames <- names(dat)
## if (is.na(lang) | lang == get.lang(dat, F)$default) {
## varlabel <- attr(dat, "var.labels")
## names(varlabel) <- vnames
## }
## else if (is.character(lang)) {
## ex <- attr(dat, "expansion.fields")
## varname <- sapply(ex[grep(paste0("_lang_v_", lang), ex)],
## function(x) x[1])
## varlabel <- sapply(ex[grep(paste0("_lang_v_", lang),
## ex)], function(x) x[3])
## names(varlabel) <- varname
## }
## if (is.null(var.name)) {
## return(varlabel[vnames])
## }
## else {
## return(varlabel[var.name])
## }
## }
## <bytecode: 0x7fc9164788c8>
## <environment: namespace:readstata13>
The third line of the function body is varlabel <- attr(dat, "var.labels"). We know from looking at the attributes of CityTemp_haven earlier that the data frame does not have an attribute called "var.labels". When attr(CityTemp_haven, "var.labels") is evaluated, the result is NULL, hence the error message attempt to set an attribute on NULL.
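The failure can be reproduced in base R without readstata13. This is a minimal sketch using a made-up data frame toy that, like CityTemp_haven, has no "var.labels" attribute. It repeats the two assignments from the function body inside tryCatch() so the error can be inspected.

```r
# Toy data frame (hypothetical) with no "var.labels" attribute.
toy <- data.frame(division = c(1, 2), temp = c(30.5, 41.2))

result <- tryCatch({
  varlabel <- attr(toy, "var.labels")  # NULL: the attribute does not exist
  names(varlabel) <- names(toy)        # assigning names to NULL is an error
  varlabel
}, error = function(e) e)

# The captured condition carries the same message seen above.
print(conditionMessage(result))
```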
2.1.3 Value Labels
labelled::val_labels(CityTemp_haven$division)
## N Eng Mid Atl ENC WNC S Atl ESC WSC Mtn Pacific
## 1 2 3 4 5 6 7 8 9
sjlabelled::get_labels(CityTemp_haven$division)
## [1] "N Eng" "Mid Atl" "ENC" "WNC" "S Atl" "ESC"
## [7] "WSC" "Mtn" "Pacific"
expss::val_lab(CityTemp_haven$division)
## N Eng Mid Atl ENC WNC S Atl ESC WSC Mtn Pacific
## 1 2 3 4 5 6 7 8 9
readstata13::get.label(CityTemp_haven, "division")
## NULL
The same pattern seen for variable labels is repeated for value labels. The functions from the packages labelled, sjlabelled, and expss all return the value labels, even though expss::is.labelled() returned FALSE. While readstata13::get.label() does not throw an error, it returns NULL.
The functions labelled::val_labels(), sjlabelled::get_labels(), and expss::val_lab() are internally very similar to their variable label counterparts. All three functions call attr(x, "labels", exact = TRUE). The function sjlabelled::get_labels() further processes the attribute, resulting in slightly different output than labelled::val_labels() and expss::val_lab(). Notably, labelled::val_labels() only does this if the object belongs to either the "haven_labelled" class or the "data.frame" class. Otherwise, it just returns NULL.
The reason readstata13::get.label() returns NULL is similar to that of its variable label counterpart. Let’s look at the function.
readstata13::get.label
## function (dat, label.name)
## {
## return(attr(dat, "label.table")[label.name][[1]])
## }
## <bytecode: 0x7fc916483070>
## <environment: namespace:readstata13>
It turns out readstata13::get.label() is a straightforward wrapper that looks up the labels of a specific variable in the data frame’s attribute list. Again, we know that the CityTemp_haven data does not have an attribute named "label.table". When we run attr(CityTemp_haven, "label.table"), we get NULL. Any indexing of NULL simply returns NULL.
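The same chain of lookups can be traced in base R. This is a minimal sketch with a made-up data frame toy that has no "label.table" attribute. It evaluates the expression from the function body one step at a time.

```r
# Toy data frame (hypothetical) with no "label.table" attribute.
toy <- data.frame(division = c(1, 2))

label_table  <- attr(toy, "label.table")  # NULL: the attribute does not exist
by_name      <- label_table["division"]   # indexing NULL by name gives NULL
value_labels <- by_name[[1]]              # extracting from NULL gives NULL again

print(is.null(value_labels))
```

No step raises an error, which is why get.label() returns NULL quietly while varlabel() fails loudly.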
With that, we can fill in our table for data read into R using the haven package.
Package | Labels Exist | Variable Labels | Value Labels |
---|---|---|---|
labelled | TRUE | TRUE | TRUE |
sjlabelled | TRUE | TRUE | TRUE |
expss | FALSE | TRUE | TRUE |
readstata13 | NA | ERROR | FALSE |
2.2 Loading .dta file with sjlabelled
The CityTemp_sjlabelled data structure can be copied from here.
CityTemp_sjlabelled <- sjlabelled::read_stata("CityTemp.dta")
## Converting atomic to factors. Please wait...
names(attributes(CityTemp_sjlabelled))
## [1] "names" "class" "row.names"
class(CityTemp_sjlabelled)
## [1] "data.frame"
attributes(CityTemp_sjlabelled$division)
## $levels
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9"
##
## $class
## [1] "factor"
##
## $labels
## N Eng Mid Atl ENC WNC S Atl ESC WSC Mtn Pacific
## 1 2 3 4 5 6 7 8 9
##
## $label
## [1] "Census division"
str(CityTemp_sjlabelled$division)
## Factor w/ 9 levels "1","2","3","4",..: 1 1 2 1 1 2 2 1 2 2 ...
## - attr(*, "labels")= Named num [1:9] 1 2 3 4 5 6 7 8 9
## ..- attr(*, "names")= chr [1:9] "N Eng" "Mid Atl" "ENC" "WNC" ...
## - attr(*, "label")= chr "Census division"
The sjlabelled package is not part of the tidyverse. We see that the data read in by sjlabelled::read_stata() only belongs to the class "data.frame". It is not a tibble. The variable division again has four attributes. The attributes are similar to those of the variable division when the data set is read in with haven::read_dta(). There are two notable differences. First, the class is "factor", whereas the division variable from CityTemp_haven belonged to the classes "haven_labelled", "vctrs_vctr", and "double". Second, since the variable is a factor, it has a levels attribute. The levels attribute of division from CityTemp_sjlabelled takes the place of the format.stata attribute that division from CityTemp_haven has. An important similarity between the two variables is that both have attributes named label and labels. With this knowledge, a reader could reasonably fill out their expectations for Table 2.2 with sjlabelled loaded data.
2.2.1 Label Existence
labelled::is.labelled(CityTemp_sjlabelled$division)
## [1] FALSE
sjlabelled::is_labelled(CityTemp_sjlabelled$division)
## [1] FALSE
expss::is.labelled(CityTemp_sjlabelled$division)
## [1] FALSE
All three functions return FALSE. This should not be surprising, since the variable division in this case does not belong to either the "haven_labelled" class or the "labelled" class. Thus, even when the user relies on the sjlabelled package both to load the data and to check for labels, sjlabelled reports that the data is not labeled.
2.2.2 Variable Labels
labelled::var_label(CityTemp_sjlabelled$division)
## [1] "Census division"
sjlabelled::get_label(CityTemp_sjlabelled$division)
## [1] "Census division"
expss::var_lab(CityTemp_sjlabelled$division)
## [1] "Census division"
readstata13::varlabel(CityTemp_sjlabelled, "division")
## Error in names(varlabel) <- vnames: attempt to set an attribute on NULL
Just as with the haven loaded data, the first three functions return the variable label, while readstata13::varlabel() throws an error. The reasoning is exactly the same as explained in Section 2.1.2: labelled::var_label(), sjlabelled::get_label(), and expss::var_lab() all look for an attribute label on the variable, while readstata13::varlabel() looks for an attribute var.labels on the data frame. The former exists while the latter does not.
2.2.3 Value Labels
labelled::val_labels(CityTemp_sjlabelled$division)
## NULL
sjlabelled::get_labels(CityTemp_sjlabelled$division)
## [1] "N Eng" "Mid Atl" "ENC" "WNC" "S Atl" "ESC"
## [7] "WSC" "Mtn" "Pacific"
expss::val_lab(CityTemp_sjlabelled$division)
## N Eng Mid Atl ENC WNC S Atl ESC WSC Mtn Pacific
## 1 2 3 4 5 6 7 8 9
readstata13::get.label(CityTemp_sjlabelled, "division")
## NULL
In a turn of events, sjlabelled::get_labels() and expss::val_lab() return value labels, while both labelled::val_labels() and readstata13::get.label() return NULL. But… the functions from sjlabelled, expss, and labelled are all supposed to return the value of the labels attribute, right? Wrong. Recall that labelled::val_labels() only returns the value of the object’s labels attribute if the object belongs to the "haven_labelled" or "data.frame" class. Otherwise, it returns NULL, as it does in this case. The variable division from the CityTemp_sjlabelled data only belongs to the "factor" class, so labelled::val_labels() returns NULL.
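One way to picture this class dependence is S3 method dispatch. The sketch below defines a hypothetical generic, toy_val_labels(), that mimics the behaviour described above. It is an illustration under that assumption, not the source code of labelled::val_labels(), and the data.frame case is left out for brevity.

```r
# Hypothetical generic mimicking the described behaviour (not the labelled source).
toy_val_labels <- function(x) UseMethod("toy_val_labels")

# Any class without its own method falls through to the default and gets NULL.
toy_val_labels.default <- function(x) NULL

# Only "haven_labelled" objects have their "labels" attribute returned.
toy_val_labels.haven_labelled <- function(x) attr(x, "labels", exact = TRUE)

labs <- c("N Eng" = 1, "Mid Atl" = 2)

# Two toy objects (hypothetical) with the same "labels" attribute but different classes.
x_haven  <- structure(c(1, 2), labels = labs, class = c("haven_labelled", "double"))
x_factor <- structure(factor(c("1", "2")), labels = labs)

haven_result  <- toy_val_labels(x_haven)   # dispatches to the haven_labelled method
factor_result <- toy_val_labels(x_factor)  # dispatches to the default method
print(haven_result)
print(factor_result)
```

Both objects carry identical value labels, yet only one is returned, because the lookup is gated by class rather than by the presence of the attribute.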
The function readstata13::get.label() runs into the same problem as before; namely, the data CityTemp_sjlabelled does not have an attribute "label.table".
Now, we can fill in Table 2.2 for data read in by sjlabelled::read_stata().
Package | Labels Exist | Variable Labels | Value Labels |
---|---|---|---|
labelled | FALSE | TRUE | FALSE |
sjlabelled | FALSE | TRUE | TRUE |
expss | FALSE | TRUE | TRUE |
readstata13 | NA | ERROR | FALSE |
2.3 Loading .dta file with readstata13
The final examination of this chapter is data read in by readstata13::read.dta13(). The CityTemp_readstata13 data structure can be copied from here.
CityTemp_readstata13 <- readstata13::read.dta13("CityTemp.dta")
names(attributes(CityTemp_readstata13))
## [1] "row.names" "names" "datalabel"
## [4] "time.stamp" "formats" "types"
## [7] "val.labels" "var.labels" "version"
## [10] "label.table" "expansion.fields" "byteorder"
## [13] "orig.dim" "data.label" "class"
class(CityTemp_readstata13)
## [1] "data.frame"
attributes(CityTemp_readstata13$division)
## $levels
## [1] "N Eng" "Mid Atl" "ENC" "WNC" "S Atl" "ESC"
## [7] "WSC" "Mtn" "Pacific"
##
## $class
## [1] "factor"
str(CityTemp_readstata13$division)
## Factor w/ 9 levels "N Eng","Mid Atl",..: 1 1 2 1 1 2 2 1 2 2 ...
It is immediately clear that data read into R by readstata13::read.dta13() has an entirely different structure than data read in by haven::read_dta() or sjlabelled::read_stata(). The data frame CityTemp_readstata13 has 15 attributes instead of four like CityTemp_haven or three like CityTemp_sjlabelled. Conversely, its variable division only has two attributes, levels and class, compared to the four that the division variables from CityTemp_haven and CityTemp_sjlabelled have. Notice that division has neither a label nor a labels attribute, and belongs to the class "factor".
2.3.1 Label Existence
labelled::is.labelled(CityTemp_readstata13$division)
## [1] FALSE
sjlabelled::is_labelled(CityTemp_readstata13$division)
## [1] FALSE
expss::is.labelled(CityTemp_readstata13$division)
## [1] FALSE
As should be expected, all three functions checking for labels return FALSE. The variable division only belongs to the "factor" class, and these functions check whether the object passed to them belongs to either the "labelled" or "haven_labelled" class.
2.3.2 Variable Labels
labelled::var_label(CityTemp_readstata13$division)
## NULL
sjlabelled::get_label(CityTemp_readstata13$division)
## NULL
expss::var_lab(CityTemp_readstata13$division)
## NULL
readstata13::varlabel(CityTemp_readstata13, "division")
## division
## "Census division"
At this point the reader should not be surprised that neither labelled::var_label(), sjlabelled::get_label(), nor expss::var_lab() returns a variable label. The variable division in this case only has two attributes, neither of which is label. Its "factor" class is not the obstacle: in Section 2.2.2, the same three functions returned the label of a factor that did have a label attribute.
For the first time, however, we see that readstata13::varlabel() does not return an error, and in fact returns the correct label for the variable. This is because the data frame CityTemp_readstata13 has an attribute "var.labels" that contains the variable labels for all the variables in the data frame. The function readstata13::varlabel() uses this to look up the specific label for division.
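The storage convention here is different from the per-variable label attribute seen earlier: the labels sit on the data frame itself. The following base R sketch imitates that layout with a made-up data frame toy. The second label, "Average temperature", is invented for illustration, and the lookup repeats the default branch of the function body printed in Section 2.1.2.

```r
# Toy data frame (hypothetical) storing all variable labels in one
# data-frame-level attribute, in the same order as the columns.
toy <- data.frame(division = c(1, 2), temp = c(30.5, 41.2))
attr(toy, "var.labels") <- c("Census division", "Average temperature")

# Same steps as the default branch of the function body shown earlier.
var_labels <- attr(toy, "var.labels")
names(var_labels) <- names(toy)           # pair each label with its column name
division_label <- var_labels["division"]  # look up one variable by name

print(division_label)
```

Since the labels are matched to columns by position, this layout depends on the attribute staying in step with the column order of the data frame.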
2.3.3 Value Labels
labelled::val_labels(CityTemp_readstata13$division)
## NULL
sjlabelled::get_labels(CityTemp_readstata13$division)
## [1] "N Eng" "Mid Atl" "ENC" "WNC" "S Atl" "ESC"
## [7] "WSC" "Mtn" "Pacific"
expss::val_lab(CityTemp_readstata13$division)
## NULL
readstata13::get.label(CityTemp_readstata13, "division")
## N Eng Mid Atl ENC WNC S Atl ESC WSC Mtn Pacific
## 1 2 3 4 5 6 7 8 9
Neither labelled::val_labels() nor expss::val_lab() returns value labels. Recall that both functions look for the attribute labels, which division in this case does not have.
The sjlabelled::get_labels() function, on the other hand, does return value labels. Huh? Didn’t we establish that sjlabelled::get_labels() also looks for the labels attribute, just like the others? Yes. But recall it was also mentioned that sjlabelled::get_labels() is a bit more complicated than labelled::val_labels() and expss::val_lab(). Unlike the other two functions, if sjlabelled::get_labels() does not see a labels attribute, it looks for factor levels. Here, division belongs to the "factor" class and has a levels attribute, so sjlabelled::get_labels() returns the factor levels as value labels.
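The fallback can be sketched in a few lines of base R. The function toy_get_labels() below is a hypothetical stand-in that follows the behaviour described above. It is not the source code of sjlabelled::get_labels(), which does more processing.

```r
# Hypothetical stand-in for the described fallback (not the sjlabelled source).
toy_get_labels <- function(x) {
  labs <- attr(x, "labels", exact = TRUE)
  if (!is.null(labs)) return(names(labs))  # first choice: names of the labels attribute
  if (is.factor(x)) return(levels(x))      # fallback: factor levels
  NULL                                     # nothing that looks like value labels
}

# Toy inputs (hypothetical): one with a labels attribute, one plain factor, one bare vector.
with_labels  <- structure(c(1, 2), labels = c("N Eng" = 1, "Mid Atl" = 2))
plain_factor <- factor(c("N Eng", "Mid Atl"), levels = c("N Eng", "Mid Atl"))
bare_vector  <- c(1, 2)

from_attribute <- toy_get_labels(with_labels)
from_levels    <- toy_get_labels(plain_factor)
from_nothing   <- toy_get_labels(bare_vector)
print(from_attribute)
print(from_levels)
```

The first two calls return the same character vector by different routes, which matches how get_labels() produced the same output for all three versions of the data.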
Finally, readstata13::get.label() returns the value labels for division. The data frame CityTemp_readstata13 has a "label.table" attribute. From that attribute, readstata13::get.label() can look up the value labels specific to division.
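As a rough illustration of that lookup, the sketch below attaches a made-up "label.table" attribute to a toy data frame and applies the one-line body printed in Section 2.1.3. The helper name toy_get_label() and the assumption that the table is a named list keyed by division are illustrative only.

```r
# Toy data frame (hypothetical) with a data-frame-level table of value labels.
toy <- data.frame(division = factor(c("N Eng", "Mid Atl")))
attr(toy, "label.table") <- list(division = c("N Eng" = 1, "Mid Atl" = 2))

# Same one-line lookup as the function body shown earlier, under a hypothetical name.
toy_get_label <- function(dat, label.name) {
  attr(dat, "label.table")[label.name][[1]]
}

found   <- toy_get_label(toy, "division")  # entry exists: the named vector comes back
missing <- toy_get_label(toy, "temp")      # no such entry: NULL

print(found)
print(is.null(missing))
```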
With that, we can fill out our final iteration of Table 2.2.
Package | Labels Exist | Variable Labels | Value Labels |
---|---|---|---|
labelled | FALSE | FALSE | FALSE |
sjlabelled | FALSE | FALSE | TRUE |
expss | FALSE | FALSE | FALSE |
readstata13 | NA | TRUE | TRUE |