class: center, middle, title-slide

.title[
# Some Useful Data Sets
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2023-05-05
]

---
layout: true

<link rel="stylesheet" href="stat4580.css" type="text/css" />

## Population and Size of Cities Ca. 1800

---

Data used by Playfair on the population and size of some major European
cities around 1800 is available in a file at
<https://stat.uiowa.edu/~luke/data/Playfair>.

--

This file can be read using `read.table`.

--

We can read it directly from the web as

```r
Playfair <- read.table("https://stat.uiowa.edu/~luke/data/Playfair")
```

--

If you might want to work on an R Markdown file while you are not
connected to the internet, it is a good idea to download a local copy
of the data if you don't have one and then use the local copy:

```r
if (! file.exists("Playfair.dat"))
    download.file("https://stat.uiowa.edu/~luke/data/Playfair",
                  "Playfair.dat")
```

--

You can hide this chunk with the chunk option `include = FALSE`.

---

Using the local file:

```r
Playfair <- read.table("Playfair.dat")
names(Playfair)
## [1] "population" "diameter"
head(Playfair, 2)
##           population diameter
## Edinburgh         60    9.144
## Stockholm         63    9.652
```

--

This data frame isn't in tidy form if we want to be able to use the city
names as a variable since `read.table` stores these as the row names.
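---

The city names end up as row names because the header line names only two columns while each data line has three fields. As a minimal sketch, assuming the file is laid out like the `head` output shown earlier, the behavior can be reproduced with inline text standing in for the file (the names `txt` and `pf` are only for illustration):

```r
# Two data rows copied from the head() output; the inline text is an
# assumed stand-in for the first lines of Playfair.dat.
txt <- "population diameter
Edinburgh 60 9.144
Stockholm 63 9.652"

# The first line has one fewer field than the others, so read.table
# treats it as a header and uses the first column as row names.
pf <- read.table(text = txt)

names(pf)     # only the two measured variables
rownames(pf)  # the city names are stored here
```

This is the documented default of `read.table` when the first line has one fewer field than the remaining lines.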
---

One approach to tidying this data frame is to

* use the row names to create a new `city` variable;

* remove the now redundant row names:

--

```r
Playfair$city <- rownames(Playfair)
rownames(Playfair) <- NULL
head(Playfair, 2)
##   population diameter      city
## 1         60    9.144 Edinburgh
## 2         63    9.652 Stockholm
```

---

Another option is to read the data as three variables by skipping the
first row and then adding the variable names:

```r
Playfair <- read.table("Playfair.dat", skip = 1)
names(Playfair)
## [1] "V1" "V2" "V3"
names(Playfair) <- c("city", "population", "diameter")
```

--

A third option is to use the `rownames_to_column` function in the
`tibble` package.

---

Useful checks:

--

* make sure the output of `head` and `tail` matches the beginning and
  end of the file;

--

* make sure the variable types are what they should be;

--

* make sure the number of rows matches the number of lines in the file.

--

`str` can help:

```r
str(Playfair)
## 'data.frame':    22 obs. of  3 variables:
##  $ city      : chr  "Edinburgh" "Stockholm" "Florence" "Genoa" ...
##  $ population: int  60 63 75 80 80 80 90 120 130 140 ...
##  $ diameter  : num  9.14 9.65 10.16 10.67 10.16 ...
```

---

One way to find the number of lines in the file:

```r
length(readLines("Playfair.dat"))
## [1] 23
```

--

Because of the header line, the number of lines in the file should be
one more than the number of rows.

--

It is not a bad idea to put a check in your file:

```r
stopifnot(nrow(Playfair) + 1 == length(readLines("Playfair.dat")))
```

---
layout: false

## City Temperatures

The website <https://www.timeanddate.com/weather/> provides current
temperatures for a number of cities around the world.

--

Values from January 13, 2023, were saved in a file you can download from
<https://stat.uiowa.edu/~luke/data/citytemps.dat>.
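--

The download-if-missing pattern used for the Playfair file applies here as well. The sketch below wraps it in a small helper; `fetch_if_missing` is a hypothetical name introduced for illustration, not a function from any package, and the demonstration uses a temporary file so that it runs without a network connection:

```r
# Hypothetical helper (not from the slides): download `url` to `dest`
# only when `dest` does not already exist, then return the local path.
fetch_if_missing <- function(url, dest) {
    if (! file.exists(dest))
        download.file(url, dest)
    dest
}

# Offline demonstration: a file that already exists is left untouched.
# The two lines written here are made-up stand-ins for the real file.
tmp <- tempfile(fileext = ".dat")
writeLines(c("city temp", "Accra 81"), tmp)
local_copy <- fetch_if_missing(
    "https://stat.uiowa.edu/~luke/data/citytemps.dat", tmp)
readLines(local_copy)
```

For the real data the call would use `"citytemps.dat"` as `dest`, after which the file can be read from the working directory.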
--

```r
citytemps <- read.table("citytemps.dat", header = TRUE)
dim(citytemps)
## [1] 140   2
head(citytemps)
##          city temp
## 1       Accra   81
## 2 Addis Ababa   55
## 3    Adelaide   85
## 4     Algiers   61
## 5      Almaty   -6
## 6       Amman   46
```

---

## Gapminder Data

The `gapminder` variable from the `gapminder` package contains select
data from the [GapMinder](https://www.gapminder.org) project.

--

Additional data is available from the
[GapMinder web site](https://www.gapminder.org).

--

```r
library(gapminder)
dim(gapminder)
## [1] 1704    6
head(gapminder)
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
```

---

## Barley

The `barley` data set available in the `lattice` package records total
yield in bushels per acre for 10 varieties at 6 experimental sites in
Minnesota in each of two years.

--

```r
library(lattice)
dim(barley)
## [1] 120   4
head(barley)
##      yield   variety year            site
## 1 27.00000 Manchuria 1931 University Farm
## 2 48.86667 Manchuria 1931          Waseca
## 3 27.43334 Manchuria 1931          Morris
## 4 39.93333 Manchuria 1931       Crookston
## 5 32.96667 Manchuria 1931    Grand Rapids
## 6 28.96667 Manchuria 1931          Duluth
```

---

## Diamonds

The `diamonds` data set available in the `ggplot2` package contains
prices and other attributes of almost 54,000 diamonds.
--

```r
library(ggplot2)
dim(diamonds)
## [1] 53940    10
head(diamonds)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
```

---

## Palmer Penguins

The [`palmerpenguins` package](https://allisonhorst.github.io/palmerpenguins/)
includes data for adult foraging Adélie, Chinstrap, and Gentoo penguins
observed on islands in the Palmer Archipelago near Palmer Station,
Antarctica.

--

Data were collected and made available by Dr. Kristen Gorman and the
Palmer Station Long Term Ecological Research (LTER) Program.

--

```r
library(palmerpenguins)
penguins
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
```

---
layout: true

## Other Sources and Forms of Data

---

The [Data Sources link](../datasources.html) contains pointers to a range
of different data sources.

--

Data is available in many forms.

--

Some are intended for human reading or editing:

* Text files in various formats.

--

* Excel spreadsheets.

--

* Tables in web pages.

--

* Tables in PDF files.
--

Other forms are designed for reading by programs but may still be human
readable:

--

* CSV files.

--

* JSON and GeoJSON files.

--

* Web APIs.

--

In some cases data has been wrapped in R packages, or an R package is
available for working with a specific API.

---

Some tools that are available for human-friendly forms:

* Text files: `read.table`, `readr::read_table`, `readLines`.

* Excel: `readxl::read_xls`.

* Web pages: `rvest` package.

--

Some tools for machine-friendly forms:

* CSV files: `read.csv`, `readr::read_csv`.

* JSON: `jsonlite::fromJSON`.

* Web APIs: `httr` package.

---
layout: false

### Example: Reading a table from Wikipedia

<!-- from Rafa's book -->

A Wikipedia page
([permanent link to an older version from 2017](https://en.wikipedia.org/w/index.php?title=Gun_violence_in_the_United_States_by_state&direction=prev&oldid=810166167))
contains a table with data on gun murders in the US by state.

```r
url <- paste0("https://en.wikipedia.org/w/index.php?title=",
              "Gun_violence_in_the_United_States_by_state&",
              "direction=prev&oldid=810166167")

library(tidyverse)
library(rvest)
h <- read_html(url)

## extract every table on the page as a data frame
tbs <- html_table(html_nodes(h, "table"))

## the first table holds the murder data
murders <- tbs[[1]]
names(murders) <- c("state", "pop", "murders", "rate")
## drop thousands separators and convert columns 2 and 3 to numeric
murders <- mutate(murders,
                  across(2 : 3,
                         function(x) as.numeric(gsub(",", "", x))))

p <- ggplot(murders, aes(x = pop, y = murders))
p + geom_point()

## interactive version with the state name in the tooltip
library(plotly)
ggplotly(p + geom_point(aes(text = state)))
```

---

### Example: National Weather Service Data for Iowa City

Current weather data is available in JSON format from
<https://forecast.weather.gov/MapClick.php?lat=41.7&lon=-91.5&FcstType=json>.

Getting the current temperature:

```r
icurl <-
    "https://forecast.weather.gov/MapClick.php?lat=41.7&lon=-91.5&FcstType=json"
icjson <- jsonlite::fromJSON(icurl)
icjson$currentobservation$Temp
```

---

### Example: Local Area Unemployment Statistics

The Bureau of Labor Statistics provides
[data on unemployment](https://www.bls.gov/lau/) by county for the most
recent 14 months at <https://www.bls.gov/web/metro/laucntycur14.txt>.

This can be read using `read.table`.

---

### Example: Test Scores in an Excel Spreadsheet

A simple Excel file is available in the `dslabs` package
([GitHub source](https://github.com/rafalab/dslabs/blob/master/inst/extdata/2010_bigfive_regents.xls?raw=true)):

```r
score_file <- system.file("extdata", "2010_bigfive_regents.xls",
                          package = "dslabs")
readxl::read_xls(score_file)
```

[More complicated files](https://stat.uiowa.edu/~luke/data/Arrivals-2017-01-06.xls)
will need more processing.

---

## Reading

More information on data import is provided in the chapter
[_Data Import_](https://r4ds.had.co.nz/data-import.html) of
[_R for Data Science_](https://r4ds.had.co.nz).

Another reference is the chapter
[_Importing Data_](https://rafalab.dfci.harvard.edu/dsbook/importing-data.html)
in
[_Introduction to Data Science: Data Analysis and Prediction Algorithms with R_](https://rafalab.dfci.harvard.edu/dsbook/).

---

## Exercises

1. Read in the Playfair data using the `read.table` function and tidy
   the result using the `rownames_to_column` function from the
   `tibble` package.

<!--
Playfair <- read.table("http://www.stat.uiowa.edu/~luke/data/Playfair")
rownames_to_column(Playfair, "city")
-->

2. The variables `site`, `variety`, and `year` in the `barley` data
   set are _factors_. How many _levels_ does each of these variables
   have? [You can use the functions `levels` and `length`.]
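
As a hint for the second exercise, here is a minimal sketch of counting factor levels on a different data set, the built-in `iris` data frame. It is chosen only for illustration and is not used elsewhere in these notes; the name `sp` is likewise arbitrary.

```r
# Species is a factor in the built-in iris data frame.
sp <- iris$Species

levels(sp)          # the distinct level labels
length(levels(sp))  # how many levels there are
nlevels(sp)         # a shortcut that gives the same count
```

The same two calls can be applied to each factor in `barley` once `lattice` is loaded.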