Before we begin, let’s learn a little about our data. The babynames dataset comes in the babynames package. The package is pre-installed for you, just as ggplot2 was pre-installed in the last tutorial. But unlike in the last tutorial, I have not pre-loaded babynames, or any other package.
What does this mean? In R, whenever you want to use a package that is not part of base R, you need to load the package with the library() function. Until you load a package, R will not be able to find the datasets and functions contained in the package. For example, if we asked R to display the babynames dataset, which comes in the babynames package, right now, we’d get the message below. R cannot find the dataset because we haven’t loaded the babynames package.
## Error: object 'babynames' not found
To load the babynames package, you would run the command library(babynames). After you load a package, R will be able to find its contents until you close R. The next time you open R, you will need to reload the package if you wish to use it again.
This might sound like an inconvenience, but choosing which packages to load keeps your R experience simple and orderly.
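As a rough illustration of the point above, the sketch below checks whether R can see babynames before and after the package is loaded. It assumes a fresh R session in which nothing else named babynames exists; exists() and search() are base R functions.

```r
# Assumes a fresh session: nothing named `babynames` has been created or loaded yet.
exists("babynames")                  # can R find the dataset by name?

library(babynames)                   # attach the package for this session

exists("babynames")                  # R can now find the dataset
"package:babynames" %in% search()    # the package is on the search path
```

The effect lasts only for the current session, which is why library() calls usually sit at the top of a script.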
In the chunk below, load babynames (the package) and then open the help page for babynames (the dataset). Be sure to read the help page before going on.
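One way to do this is sketched below. It is not necessarily the tutorial's own solution, and it assumes the babynames package is installed as described earlier.

```r
library(babynames)   # load the package

help("babynames")    # open the help page for the dataset; `?babynames` is equivalent
```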
Now that you know a little about the dataset, let’s examine its contents. If you were to run babynames at your R console, you would get output that looks like this:
#> 187 1880 F Christina 65 6.659495e-04
#> 188 1880 F Lelia 65 6.659495e-04
#> 189 1880 F Nelle 65 6.659495e-04
#> 190 1880 F Sue 65 6.659495e-04
#> 191 1880 F Johanna 64 6.557041e-04
#> 192 1880 F Lilly 64 6.557041e-04
#> 193 1880 F Lucinda 63 6.454587e-04
#> 194 1880 F Minerva 63 6.454587e-04
#> 195 1880 F Lettie 62 6.352134e-04
#> 196 1880 F Roxie 62 6.352134e-04
#> 197 1880 F Cynthia 61 6.249680e-04
#> 198 1880 F Helena 60 6.147226e-04
#> 199 1880 F Hilda 60 6.147226e-04
#> 200 1880 F Hulda 60 6.147226e-04
#> [ reached getOption("max.print") -- omitted 1825233 rows ]
Yikes. What is happening?
Displaying large data
babynames is a large data frame, and R is not well equipped to display the contents of large data frames. R prints rows until it reaches the limit set by the max.print option, as the last line of the output above shows. At that point, R stops, leaving you to look at an arbitrary section of your data.
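The cutoff can be reproduced on a small scale. The sketch below uses a hypothetical toy data frame (not babynames) and temporarily lowers max.print so the truncation is easy to see. For a data frame the limit counts entries rather than rows, so the 200 rows of 5 columns shown above would be consistent with a limit of 1,000 entries.

```r
# Hypothetical toy data: 100 rows and 2 columns (200 entries in total).
df <- data.frame(id = 1:100, value = (1:100)^2)

# Temporarily lower the print limit to 50 entries; options() returns the old value.
old <- options(max.print = 50)

# Capture what print() would write to the console instead of displaying it.
out <- capture.output(print(df))

options(old)        # restore the previous setting

length(out)         # header line + printed rows + the "omitted" notice
tail(out, 1)        # the notice reporting how many rows were left out
```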
You can avoid this behaviour by transforming your data frame to a tibble.
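A minimal sketch of that conversion follows. It assumes the tibble package is installed alongside babynames, and the object names babynames_df and babynames_tbl are made up for illustration.

```r
library(babynames)
library(tibble)      # assumed to be installed; provides as_tibble()

# Start from a plain data frame, the kind that floods the console when printed.
babynames_df <- as.data.frame(babynames)

# Convert it to a tibble; printing a tibble shows only the first rows
# and the columns that fit on screen.
babynames_tbl <- as_tibble(babynames_df)
babynames_tbl

is_tibble(babynames_df)    # FALSE: a plain data frame
is_tibble(babynames_tbl)   # TRUE
```

The data itself is unchanged by the conversion; only the way R prints it differs.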