---
title: "Some Useful Data Sets"
output:
  html_document:
    toc: yes
    code_download: true
---

<link rel="stylesheet" href="stat4580.css" type="text/css" />

```{r setup, include = FALSE}
source(here::here("setup.R"))
knitr::opts_chunk$set(collapse = TRUE, message = FALSE,
                      fig.height = 5, fig.width = 6, fig.align = "center")
```


## Population and Size of Cities Ca. 1800

Data used by Playfair on population and size of some major European
cities around 1800 is available in a file at
<https://stat.uiowa.edu/~luke/data/Playfair>.

This file can be read using `read.table`.

We can read it directly from the web as

```{r, eval = FALSE}
Playfair <- read.table("https://stat.uiowa.edu/~luke/data/Playfair")
```

If you might work on an R Markdown file when you are not connected to
the internet, it is a good idea to download a local copy of the data
if you don't already have one and then use the local copy:

```{r}
if (! file.exists("Playfair.dat"))
    download.file("https://stat.uiowa.edu/~luke/data/Playfair",
                  "Playfair.dat")
```
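
Since the same download-if-missing pattern is useful for other data
files, it can be wrapped in a small function. This is only a sketch:
`download_if_missing` is a hypothetical helper name, not a function
from any package.

```{r}
## Hypothetical helper: download `url` to `destfile` only when the
## local file does not exist yet; returns the local file name.
download_if_missing <- function(url, destfile) {
    if (! file.exists(destfile))
        download.file(url, destfile)
    invisible(destfile)
}

## A file that already exists is left alone, so the URL is never
## contacted in this demonstration.
tmp <- tempfile(fileext = ".dat")
writeLines("placeholder", tmp)
download_if_missing("https://stat.uiowa.edu/~luke/data/Playfair", tmp)
readLines(tmp)
```

Because the helper returns the file name, it could also be used
directly inside a call such as `read.table(download_if_missing(url, file))`.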

You can hide this chunk with the chunk option `include = FALSE`.

Using the local file:
```{r}
Playfair <- read.table("Playfair.dat")
names(Playfair)
head(Playfair, 2)
```

This data frame isn't in tidy form if we want to be able to use the
city names as a variable since `read.table` stores these as the row
names.

One approach to tidying this data frame is to

* use the row names to create a new `city` variable;
* remove the now redundant row names:

```{r}
Playfair$city <- rownames(Playfair)
rownames(Playfair) <- NULL
head(Playfair, 2)
```

Another option is to read the data as three variables by skipping the
first row and then adding the variable names:
```{r}
Playfair <- read.table("Playfair.dat", skip = 1)
names(Playfair)
names(Playfair) <- c("city", "population", "diameter")
```
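
The renaming step can also be folded into the call with the
`col.names` argument of `read.table`. The sketch below uses the
`text` argument with the header line and the first two cities,
assuming whitespace-separated fields as in the file; with the real
data you would pass `"Playfair.dat"` instead. The name `pf` is used
only to avoid overwriting `Playfair`.

```{r}
## Header line plus the first two data lines, typed in for illustration
txt <- "population diameter
Edinburgh 60 9.144
Stockholm 63 9.652"

## Skip the header line and supply all three variable names directly
pf <- read.table(text = txt, skip = 1,
                 col.names = c("city", "population", "diameter"))
pf
```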

A third option is to use the `rownames_to_column()` function in the
`tibble` package.

Useful checks:

* make sure the beginning and end of the file match `head()` and `tail()`;

* make sure the variable types are what they should be;

* make sure the number of rows is consistent with the number of lines in the file.

`str` can help:

```{r}
str(Playfair)
```

One way to find the number of lines in the file:
```{r}
length(readLines("Playfair.dat"))
```

The number of lines in the file should be one more than the number of
rows, since the first line of the file contains the variable names.

It is not a bad idea to put a check in your file:

```{r}
stopifnot(nrow(Playfair) + 1 == length(readLines("Playfair.dat")))
```
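
The first check in the list can be done by printing the first and last
lines of the raw file next to `head()` and `tail()` of the data frame.
A minimal sketch, using a small temporary file that contains only the
header line and the first two cities rather than the full data:

```{r}
## Small stand-in file with the same layout as the Playfair file
tmp <- tempfile(fileext = ".dat")
writeLines(c("population diameter",
             "Edinburgh 60 9.144",
             "Stockholm 63 9.652"),
           tmp)
d <- read.table(tmp)
raw <- readLines(tmp)

head(raw, 2)   # header line and first data line of the raw file
head(d, 1)     # first row of the data frame: same city and values
tail(raw, 1)   # last line of the raw file
tail(d, 1)     # last row of the data frame: same city and values

## one header line plus one line per row
stopifnot(nrow(d) + 1 == length(raw))
```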


## City Temperatures

```{r, echo = FALSE, eval = FALSE}
## scraping and cleaning
library(rvest)
weather <- read_html("https://www.timeanddate.com/weather/")
w <- html_table(html_nodes(weather, "table"))[[1]]

names(w) <- paste0("V", 1 : 12)
w1 <- w[1 : 4]
w2 <- w[5 : 8]; names(w2) <- paste0("V", 1 : 4)
w3 <- w[9 : 12]; names(w3) <- paste0("V", 1 : 4)
ww <- rbind(w1, w2, w3)
## maybe try to parse the local times in V2?
## the stars indicate DST -- southern hemisphere in January
www <- data.frame(city = sub(" \\*", "", ww$V1),
                  temp = as.numeric(sub("([-+]?[[:digit:]]+).*", "\\1", ww$V4)))
www <- www[complete.cases(www), ]

write.table(www, row.names = FALSE, file = "citytemps.dat")
# scraped from https://www.timeanddate.com/weather/ on 2026-01-12 13:05 CST
head(read.table("citytemps.dat", header = TRUE))
```

```{r, echo = FALSE}
if (! file.exists("citytemps.dat"))
    download.file("https://stat.uiowa.edu/~luke/data/citytemps.dat",
                  "citytemps.dat")
```

The website <https://www.timeanddate.com/weather/> provides current
temperatures for a number of cities around the world.

Values from January 12, 2026, were saved in a file you can download
from <https://stat.uiowa.edu/~luke/data/citytemps.dat>.

```{r}
citytemps <- read.table("citytemps.dat", header = TRUE)
dim(citytemps)
head(citytemps)
```


## Gapminder Data

The `gapminder` variable from the `gapminder` package contains selected
data from the [Gapminder](https://www.gapminder.org) project.

Additional data is available from the [Gapminder web
site](https://www.gapminder.org).

```{r}
library(gapminder)
dim(gapminder)
head(gapminder)
```


## Barley

The `barley` data set available in the `lattice` package records total
yield in bushels per acre for 10 varieties at 6 experimental sites in
Minnesota in each of two years.

```{r}
library(lattice)
dim(barley)
head(barley)
```


## Diamonds

The `diamonds` data set available in the `ggplot2` package contains
prices and other attributes of almost 54,000 diamonds.

```{r}
library(ggplot2)
dim(diamonds)
head(diamonds)
```


## Palmer Penguins

The [`palmerpenguins`
package](https://allisonhorst.github.io/palmerpenguins/) includes data
for adult foraging Adélie, Chinstrap, and Gentoo penguins observed on
islands in the Palmer Archipelago near Palmer Station, Antarctica.

Data were collected and made available by
Dr. Kristen Gorman and the Palmer Station Long Term Ecological
Research (LTER) Program.

```{r, echo = TRUE}
library(palmerpenguins)
penguins
```


## Other Sources and Forms of Data

The [Data Sources link](`r WLNK("datasources.html")`) contains
pointers to a range of different data sources.

Data is available in many forms.

Some are intended for human reading or editing:

* Text files in various formats.

* Excel spreadsheets.

* Tables in web pages.

* Tables in PDF files.

Other forms are designed for reading by programs but may still be
human readable:

* CSV files.

* JSON and GeoJSON files.

* Web APIs.

In some cases data has been wrapped in R packages, or an R package is
available for working with a specific API.

Some tools that are available for human-friendly forms:

* Text files: `read.table`, `readr::read_table`, `readLines`.
* Excel: `readxl::read_xls`.
* Web pages: `rvest` package.

Some tools for machine-friendly forms:

* CSV files: `read.csv`, `readr::read_csv`.
* JSON: `jsonlite::fromJSON`.
* Web APIs: `httr` package.
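
As a small illustration of the machine-friendly tools, the sketch
below parses the same two records once from CSV text and once from
JSON text. The records are the first two rows of the city temperature
data; the inline strings stand in for files or URLs, and the example
assumes the `jsonlite` package is installed.

```{r}
## CSV: one header line, then one record per line
csv_text <- "city,temp
Accra,84
Addis Ababa,73"
from_csv <- read.csv(text = csv_text)

## JSON: an array of objects, one object per record
json_text <- '[{"city": "Accra", "temp": 84},
               {"city": "Addis Ababa", "temp": 73}]'
from_json <- jsonlite::fromJSON(json_text)

from_csv
from_json
```

Both calls produce a data frame with `city` and `temp` columns.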


### Example: Reading a table from Wikipedia

<!-- from Rafa's book -->
A Wikipedia page ([permanent link to an older
version from 2017](https://en.wikipedia.org/w/index.php?title=Gun_violence_in_the_United_States_by_state&direction=prev&oldid=810166167))
contains a table with data on gun murders in the US by state.

```{r, eval = FALSE}
url <- paste0("https://en.wikipedia.org/w/index.php?title=",
              "Gun_violence_in_the_United_States_by_state&",
              "direction=prev&oldid=810166167")
library(tidyverse)
library(rvest)
h <- read_html(url)
tbs <- html_table(html_nodes(h, "table"))
murders <- tbs[[1]]
names(murders) <- c("state", "pop", "murders", "rate")
murders <- mutate(murders,
                  across(2 : 3, function(x) as.numeric(gsub(",", "", x))))
p <- ggplot(murders, aes(x = pop, y = murders))
p + geom_point()
library(plotly)
ggplotly(p + geom_point(aes(text = state)))
```
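
The `across()` call in this example removes the thousands separators
so that the population and murder counts can be converted to numbers.
The conversion step on its own, using made-up strings rather than
values from the table:

```{r}
x <- c("1,234,567", "890")         # made-up values for illustration
suppressWarnings(as.numeric(x))    # the commas block conversion of the first value
as.numeric(gsub(",", "", x))       # remove the commas first, then convert
```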


### Example: National Weather Service Data for Iowa City

Current weather data is available in JSON format from

<https://forecast.weather.gov/MapClick.php?lat=41.7&lon=-91.5&FcstType=json>

Getting current temperature:

```{r, eval = FALSE}
icurl <-
    "https://forecast.weather.gov/MapClick.php?lat=41.7&lon=-91.5&FcstType=json"
icjson <- jsonlite::fromJSON(icurl)
icjson$currentobservation$Temp
```


### Example: Local Area Unemployment Statistics

The Bureau of Labor Statistics provides [data on
unemployment](https://www.bls.gov/lau/) by county for the most recent
14 months.

Previously these data were made available as text files that can be
read using `read.table`; data from 2023 is available locally at

<https://homepage.divms.uiowa.edu/~luke/data/laus/laucntycur14-2023.txt>

Current data is available at

<https://www.bls.gov/web/metro/laucntycur14.zip>

This zip file contains an Excel spreadsheet.


### Example: Test Scores in an Excel Spreadsheet

A simple Excel file is available in the `dslabs` package ([GitHub
source](https://github.com/rafalab/dslabs/blob/master/inst/extdata/2010_bigfive_regents.xls?raw=true)):

```{r, eval = FALSE}
score_file <-
    system.file("extdata", "2010_bigfive_regents.xls", package = "dslabs")
readxl::read_xls(score_file)
```

[More complicated files](https://stat.uiowa.edu/~luke/data/Arrivals-2017-01-06.xls) will need more processing.


## Reading

More information on data import is provided in the chapter [_Data
Import_](https://r4ds.hadley.nz/data-import.html) of [_R for Data
Science_](https://r4ds.hadley.nz).

Another reference is the chapter [_Importing
Data_](https://rafalab.dfci.harvard.edu/dsbook-part-1/R/importing-data.html) in
[_Introduction to Data Science: Data Analysis and Prediction
Algorithms with R_](https://rafalab.dfci.harvard.edu/dsbook-part-1/).


## Exercises

1. Read in the Playfair data using the `read.table` function and tidy
   the result using the `rownames_to_column` function from the
   `tibble` package.

<!--
Playfair <- read.table("http://www.stat.uiowa.edu/~luke/data/Playfair")
rownames_to_column(Playfair, "city")
-->

2. The variables `site`, `variety`, and `year` in the `barley` data
   set are _factors_.  How many _levels_ does each of these variables
   have? [You can use the functions `levels` and `length`.]
