---
title: "Some Useful Data Sets"
output:
  html_document:
    toc: yes
    code_download: true
---

```{r setup, include = FALSE}
source(here::here("setup.R"))
knitr::opts_chunk$set(collapse = TRUE, message = FALSE,
                      fig.height = 5, fig.width = 6, fig.align = "center")
```

## Population and Size of Cities Ca. 1800

Data used by Playfair on population and size of some major European
cities around 1800 is available in a file at
<https://stat.uiowa.edu/~luke/data/Playfair>.

This file can be read using `read.table`. We can read it directly from
the web with:

```{r, eval = FALSE}
Playfair <- read.table("https://stat.uiowa.edu/~luke/data/Playfair")
```

If you might want to work on an R Markdown file when you are not
connected to the internet, it is a good idea to download a local copy
if you don't already have one, and then use the local copy:

```{r}
if (! file.exists("Playfair.dat"))
    download.file("https://stat.uiowa.edu/~luke/data/Playfair",
                  "Playfair.dat")
```

You can hide this chunk with the chunk option `include = FALSE`.

Using the local file:

```{r}
Playfair <- read.table("Playfair.dat")
names(Playfair)
head(Playfair, 2)
```

This data frame isn't in tidy form if we want to be able to use the
city names as a variable, since `read.table` stores these as the row
names.

One approach to tidying this data frame is to

* use the row names to create a new `city` variable;

* remove the now redundant row names:

```{r}
Playfair$city <- rownames(Playfair)
rownames(Playfair) <- NULL
head(Playfair, 2)
```

Another option is to read the data as three variables by skipping the
first row and then adding the variable names:

```{r}
Playfair <- read.table("Playfair.dat", skip = 1)
names(Playfair)
names(Playfair) <- c("city", "population", "diameter")
```

A third option is to use the `rownames_to_column` function in the
`tibble` package.

Useful checks:

* make sure the beginning and end of the file match the output of
  `head` and `tail`;

* make sure the variable types are what they should be;

* make sure the number of rows matches the number of lines in the file.

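
The first check in the list above can be sketched on a small stand-in file. This is only an illustration: the file contents, the city names `Alpha`, `Beta`, `Gamma`, and the object names `toy_file`, `toy`, and `raw` are all made up here and are not part of the Playfair data. The stand-in file has the same layout as a file whose header line has one fewer field than the data lines, so `read.table` uses the first column as row names.

```{r}
## Toy stand-in for a data file; the values are made up for illustration.
toy_file <- tempfile(fileext = ".dat")
writeLines(c("population diameter",
             "Alpha 10 1",
             "Beta 20 2",
             "Gamma 30 3"),
           toy_file)
toy <- read.table(toy_file)

## Raw text at the start and end of the file ...
raw <- readLines(toy_file)
head(raw, 2)
tail(raw, 1)

## ... should line up with the first and last rows of the data frame.
head(toy, 1)
tail(toy, 1)
```

Comparing the raw lines with the parsed rows by eye is usually enough to catch a skipped line or a misread header.
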
`str` can help:

```{r}
str(Playfair)
```

One way to find the number of lines in the file:

```{r}
length(readLines("Playfair.dat"))
```

The number of lines in the file should be one more than the number of
rows, because the first line holds the variable names.

It is not a bad idea to put a check in your file:

```{r}
stopifnot(nrow(Playfair) + 1 == length(readLines("Playfair.dat")))
```

## City Temperatures

```{r, echo = FALSE, eval = FALSE}
## scraping and cleaning
library(rvest)
weather <- read_html("https://www.timeanddate.com/weather/")
w <- html_table(html_nodes(weather, "table"))[[1]]
names(w) <- paste0("V", 1 : 12)
w1 <- w[1 : 4]
w2 <- w[5 : 8]; names(w2) <- paste0("V", 1 : 4)
w3 <- w[9 : 12]; names(w3) <- paste0("V", 1 : 4)
ww <- rbind(w1, w2, w3)
## maybe try to parse the local times in V2?
## the stars indicate DST -- southern hemisphere in January
www <- data.frame(city = sub(" \\*", "", ww$V1),
                  temp = as.numeric(sub("([-+]?[[:digit:]]+).*", "\\1", ww$V4)))
www <- www[complete.cases(www), ]
write.table(www, row.names = FALSE, file = "citytemps.dat")
# scraped from https://www.timeanddate.com/weather/ on 2023-01-13 16:03:41 CST
head(read.table("citytemps.dat", header = TRUE))
```

```{r, echo = FALSE}
if (! file.exists("citytemps.dat"))
    download.file("https://stat.uiowa.edu/~luke/data/citytemps.dat",
                  "citytemps.dat")
```

The website <https://www.timeanddate.com/weather/> provides current
temperatures for a number of cities around the world.

Values from January 13, 2023, were saved in a file you can download
from <https://stat.uiowa.edu/~luke/data/citytemps.dat>.

```{r}
citytemps <- read.table("citytemps.dat", header = TRUE)
dim(citytemps)
head(citytemps)
```

## Gapminder Data

The `gapminder` variable from the `gapminder` package contains selected
data from the [GapMinder](https://www.gapminder.org) project.
Additional data is available from the
[GapMinder web site](https://www.gapminder.org).

```{r}
library(gapminder)
dim(gapminder)
head(gapminder)
```

## Barley

The `barley` data set available in the `lattice` package records total
yield in bushels per acre for 10 varieties at 6 experimental sites in
Minnesota in each of two years.

```{r}
library(lattice)
dim(barley)
head(barley)
```

## Diamonds

The `diamonds` data set available in the `ggplot2` package contains
prices and other attributes of almost 54,000 diamonds.

```{r}
library(ggplot2)
dim(diamonds)
head(diamonds)
```

## Palmer Penguins

The
[`palmerpenguins` package](https://allisonhorst.github.io/palmerpenguins/)
includes data for adult foraging Adélie, Chinstrap, and Gentoo
penguins observed on islands in the Palmer Archipelago near Palmer
Station, Antarctica.

Data were collected and made available by Dr. Kristen Gorman and the
Palmer Station Long Term Ecological Research (LTER) Program.

```{r, echo = TRUE}
library(palmerpenguins)
penguins
```

## Other Sources and Forms of Data

The [Data Sources link](`r WLNK("datasources.html")`) contains
pointers to a range of different data sources.

Data is available in many forms.

Some are intended for human reading or editing:

* Text files in various formats.
* Excel spreadsheets.
* Tables in web pages.
* Tables in PDF files.

Other forms are designed for reading by programs but may still be
human readable:

* CSV files.
* JSON and GeoJSON files.
* Web APIs.

In some cases data has been wrapped in R packages, or an R package is
available for working with a specific API.

Some tools that are available for human-friendly forms:

* Text files: `read.table`, `readr::read_table`, `readLines`.
* Excel: `readxl::read_xls`.
* Web pages: `rvest` package.

Some tools for machine-friendly forms:

* CSV files: `read.csv`, `readr::read_csv`.
* JSON: `jsonlite::fromJSON`.
* Web APIs: `httr` package.

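
As a minimal sketch of two of the machine-friendly tools listed above, the chunk below parses a small CSV string with `read.csv` and a small JSON string with `jsonlite::fromJSON`. The city names and temperatures are made-up values used only for illustration, the object names are hypothetical, and in-memory text stands in for a real file or URL.

```{r}
## Made-up records, used only for illustration.
csv_text <- "city,temp
Alpha,10
Beta,-3"
json_text <- '[{"city": "Alpha", "temp": 10}, {"city": "Beta", "temp": -3}]'

## read.csv accepts in-memory text through its `text` argument.
from_csv <- read.csv(text = csv_text)

## fromJSON turns a JSON array of records into a data frame.
from_json <- jsonlite::fromJSON(json_text)

str(from_csv)
str(from_json)
```

Both calls return a data frame with one row per record, so the checks used for `read.table` results apply here as well.
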
### Example: Reading a table from Wikipedia

A Wikipedia page
([permanent link to an older version from 2017](https://en.wikipedia.org/w/index.php?title=Gun_violence_in_the_United_States_by_state&direction=prev&oldid=810166167))
contains a table with data on gun murders in the US by state.

```{r, eval = FALSE}
url <- paste0("https://en.wikipedia.org/w/index.php?title=",
              "Gun_violence_in_the_United_States_by_state&",
              "direction=prev&oldid=810166167")

library(tidyverse)
library(rvest)
h <- read_html(url)

## extract every HTML table on the page as a data frame
tbs <- html_table(html_nodes(h, "table"))

## use the first table on the page
murders <- tbs[[1]]
names(murders) <- c("state", "pop", "murders", "rate")
## strip thousands separators and convert columns 2 and 3 to numeric
murders <- mutate(murders,
                  across(2 : 3, function(x) as.numeric(gsub(",", "", x))))

p <- ggplot(murders, aes(x = pop, y = murders))
p + geom_point()

library(plotly)
ggplotly(p + geom_point(aes(text = state)))
```

### Example: National Weather Service Data for Iowa City

Current weather data is available in JSON format from
<http://forecast.weather.gov/MapClick.php?lat=41.7&lon=-91.5&FcstType=json>.

Getting the current temperature:

```{r, eval = FALSE}
icurl <-
    "http://forecast.weather.gov/MapClick.php?lat=41.7&lon=-91.5&FcstType=json"
icjson <- jsonlite::fromJSON(icurl)
icjson$currentobservation$Temp
```

### Example: Local Area Unemployment Statistics

The Bureau of Labor Statistics provides
[data on unemployment](https://www.bls.gov/lau/) by county for the
most recent 14 months. This can be read using `read.table`.

### Example: Test Scores in an Excel Spreadsheet

A simple Excel file is available in the `dslabs` package
([GitHub source](https://github.com/rafalab/dslabs/blob/master/inst/extdata/2010_bigfive_regents.xls?raw=true)):

```{r, eval = FALSE}
score_file <- system.file("extdata", "2010_bigfive_regents.xls",
                          package = "dslabs")
readxl::read_xls(score_file)
```

[More complicated files](https://stat.uiowa.edu/~luke/data/Arrivals-2017-01-06.xls)
will need more processing.

## Reading

More information on data import is provided in the chapter
[_Data Import_](https://r4ds.had.co.nz/data-import.html) of
[_R for Data Science_](https://r4ds.had.co.nz).

Another reference is the chapter
[_Importing Data_](https://rafalab.dfci.harvard.edu/dsbook/importing-data.html)
in
[_Introduction to Data Science: Data Analysis and Prediction Algorithms with R_](https://rafalab.dfci.harvard.edu/dsbook/).

## Exercises

1. Read in the Playfair data using the `read.table` function and tidy
   the result using the `rownames_to_column` function from the
   `tibble` package.

2. The variables `site`, `variety`, and `year` in the `barley` data
   set are _factors_. How many _levels_ does each of these variables
   have? [You can use the functions `levels` and `length`.]

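
As a hint for the second exercise, here is a small sketch of how `levels` and `length` behave on a factor. The factor `size` is made up for illustration and is not taken from the `barley` data.

```{r}
## A made-up factor, not taken from the `barley` data.
size <- factor(c("small", "large", "medium", "small", "large"))

## The distinct levels, in the order R stores them.
levels(size)

## The number of levels.
length(levels(size))
```

The same two calls can be applied to a column of a data frame selected with `$`.
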