class: center, middle, title-slide

.title[
# Data and Data Frames
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2023-05-06
]

---
layout: true

<link rel="stylesheet" href="stat4580.css" type="text/css" />

## Data Structures and Data Attribute Types

---

Data comes in many different forms.

--

Some of the most common data structures are

* rectangular tables
* networks and trees
* geometries, regions

--

Other forms include

* collections of text
* video and audio recordings
* ...

--

We will work mostly with tables.

--

Many other forms can be reduced to tables.

---

[Stevens](https://en.wikipedia.org/wiki/Stanley_Smith_Stevens) (1945) classifies [scales of measurement](https://en.wikipedia.org/wiki/Level_of_measurement) for attributes, or variables, as

<!-- permanent link: https://en.wikipedia.org/w/index.php?title=Level_of_measurement&oldid=1060772847 -->

* nominal (e.g. hair color)
* ordinal (e.g. dislike, neutral, like)
* interval (e.g. temperature) --- compare by difference
* ratio (e.g. counts) --- compare by ratio

--

These are sometimes grouped as

* qualitative: nominal
* quantitative: ordinal, interval, ratio

--

These can be viewed as _semantic classifications_.

---

_Computational considerations_ often classify variables as

* categorical
* integer, discrete
* real, continuous

--

Another consideration is that some scales may be cyclic:

* hours of the day
* angles, longitude

--

These distinctions can be important in choosing visual representations.

---

Other typologies include one proposed by [Mosteller and Tukey (1977)](https://en.wikipedia.org/wiki/Level_of_measurement#Mosteller_and_Tukey's_typology_(1977)):

1. Names
2. Grades (ordered labels like beginner, intermediate, advanced)
3. Ranks (orders with 1 being the smallest or largest, 2 the next smallest or largest, and so on)
4. Counted fractions (bound by 0 and 1)
5. Counts (non-negative integers)
6. Amounts (non-negative real numbers)
7. Balances (any real number)

---
layout: true

## Data Types in R

---

R variables can be of different types.

--

The most common types are

* `numeric` for real numbers
* `integer`
* `character` for text data or nominal data
* `factor` for nominal or ordinal data

--

Factors can be

* unordered, for nominal data
* ordered, for ordinal data

--

.alert[
`factors` are more efficient and powerful for representing nominal or ordinal data than `character` data but can take a bit more getting used to.
]

---

Membership predicates and coercion functions are

| Predicate       | Coercion         |
|:----------------|:-----------------|
| `is.numeric`    | `as.numeric`     |
| `is.integer`    | `as.integer`     |
| `is.character`  | `as.character`   |
| `is.factor`     | `as.factor`      |
| `is.ordered`    | `as.ordered`     |

--

.alert[
Conversion of factors with numeric-looking labels to numeric data should always go through `as.character` first.
]

---
layout: true

## Data Frames: Organizing Cases and Variables

---

Tabular data in R is usually stored as a _data frame_.

--

A data frame is a collection of _variables_, each with a value for every _case_ or _observation_.

--

The `faithful` data set is a data frame:

```r
class(faithful)
## [1] "data.frame"
names(faithful)
## [1] "eruptions" "waiting"
```

--

Most tools we work with in R use data organized in data frames.
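---

A minimal sketch of these ideas, using a small made-up data frame (the names `d`, `id`, and `dose` are hypothetical and not part of the course data). It shows one variable per column and one case per row, and why a factor with numeric-looking labels should be converted through `as.character`:

```r
## Hypothetical data: three cases (rows) and two variables (columns).
d <- data.frame(id = c("a", "b", "c"),
                dose = factor(c("10", "2.5", "10")),
                stringsAsFactors = FALSE)

sapply(d, class)   ## the class of each variable
levels(d$dose)     ## levels are sorted as text, not as numbers

## as.numeric() applied directly to a factor returns the integer level codes.
codes <- as.numeric(d$dose)

## Converting to character first recovers the labels as numbers.
values <- as.numeric(as.character(d$dose))
```

Here `codes` holds the level codes 1, 2, 1, while `values` holds the intended doses 10, 2.5, 10.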
---

Our `plot()` and `lm()` expressions from the [introductory section](intro.html) can also be written as

```r
plot(waiting ~ eruptions, data = faithful,
     xlab = "Eruption time (min)",
     ylab = "Waiting time to next eruption (min)")
fit <- lm(waiting ~ eruptions, data = faithful)
```

--

.alert[
`plot()` only uses the `data` argument when the plot is specified as a _formula_, like

```r
waiting ~ eruptions
```
]

---
layout: true

## Examining the Data in a Data Frame

---

`head()` provides an idea of what the raw data looks like:

```r
head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55
```

--

`str()` is also useful for an overview of an object's structure:

```r
str(faithful)
## 'data.frame': 272 obs. of 2 variables:
##  $ eruptions: num 3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num 79 54 74 62 85 55 88 85 51 85 ...
```

---

Another useful function available from the `dplyr` or `tibble` packages is `glimpse()`:

```r
library(dplyr)
glimpse(faithful)
## Rows: 272
## Columns: 2
## $ eruptions <dbl> 3.600, 1.800, 3.333, 2.283, 4.533, 2.883, 4.700, 3.600, 1.95…
## $ waiting   <dbl> 79, 54, 74, 62, 85, 55, 88, 85, 51, 85, 54, 84, 78, 47, 83, …
```

--

`summary()` shows basic statistical properties of the variables:

```r
summary(faithful)
##    eruptions        waiting
##  Min.   :1.600   Min.   :43.0
##  1st Qu.:2.163   1st Qu.:58.0
##  Median :4.000   Median :76.0
##  Mean   :3.488   Mean   :70.9
##  3rd Qu.:4.454   3rd Qu.:82.0
##  Max.   :5.100   Max.   :96.0
```

---

`summary()` with a character variable and a factor variable:

```r
ffl <- mutate(faithful,
              type = ifelse(eruptions < 3, "short", "long"),
              ftype = factor(type))
summary(ffl)
##    eruptions        waiting         type             ftype
##  Min.   :1.600   Min.   :43.0   Length:272         long :175
##  1st Qu.:2.163   1st Qu.:58.0   Class :character   short: 97
##  Median :4.000   Median :76.0   Mode  :character
##  Mean   :3.488   Mean   :70.9
##  3rd Qu.:4.454   3rd Qu.:82.0
##  Max.   :5.100   Max.   :96.0
```

---
layout: false

## Variables in a Data Frame

A data frame is a list, or vector, of variables:

```r
length(faithful)
## [1] 2
```

--

The dollar sign `$` can be used to examine individual variables:

```r
class(faithful$eruptions)
## [1] "numeric"
class(faithful$waiting)
## [1] "numeric"
```

--

The variables can also be extracted by numerical or character index using the element extraction operation `[[`:

```r
class(faithful[[1]])
## [1] "numeric"
class(faithful[["waiting"]])
## [1] "numeric"
```

---

## Dimensions

The numbers of rows and columns can be obtained using `nrow()` and `ncol()`:

```r
ncol(faithful)
## [1] 2
nrow(faithful)
## [1] 272
```

--

`dim()` returns a vector of the dimensions:

```r
dim(faithful)
## [1] 272   2
```

---

## Simple Visualizations

`plot` has a _method_ for data frames that tries to provide a reasonable default visualization for numeric data frames:

```r
plot(faithful)
```

<img src="datafrm_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

## Sample Data Sets

The [`datasets`](http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html) package in the base R distribution contains a number of data sets you can explore.

--

Another package with a useful collection of data sets is [`dslabs`](https://cran.r-project.org/package=dslabs).

--

Many other data sets are contained in contributed packages as examples.

--

There are also many contributed packages designed specifically for making particular data sets available.

---
layout: true

## Tidy Data

---

The useful concept of a _tidy_ data frame was introduced fairly [recently](https://www.jstatsoft.org/article/view/v059i10) and is described in a [chapter in _R for Data Science_](https://r4ds.had.co.nz/tidy-data.html).

--

The idea is that

* every observation should correspond to a single row;
* every variable should correspond to a single column.
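--

As a rough illustration, assuming a small made-up table of quiz scores (the names `wide`, `tidy`, `student`, `quiz`, and `score` are hypothetical), the same data can be stored with the score variable spread over two columns or with one row per observation:

```r
## Wide layout: the score variable is spread over two columns.
wide <- data.frame(student = c("A", "B"),
                   quiz1 = c(8, 6),
                   quiz2 = c(9, 7))

## Tidy layout built by hand: one row per student-quiz observation,
## with one column per variable.
tidy <- data.frame(student = rep(wide$student, times = 2),
                   quiz = rep(c("quiz1", "quiz2"), each = nrow(wide)),
                   score = c(wide$quiz1, wide$quiz2))
tidy
```

Both layouts hold the same four scores; only the tidy one has a single `score` column.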
--

Tidy data is computationally convenient, and many of the tools we will use are designed around tidy data frames.

--

A large range of these tools can be accessed by loading the `tidyverse` package:

```r
library(tidyverse)
```

--

But for now I will load the needed packages individually.

---

.alert[
The term _tidy_ is a little unfortunate.

* Data that is not _tidy_ isn't necessarily _bad_.
* For human reading, and for some computations, data in a wider format can be better.
* For other computations data in a longer, or narrower, format can be better.
]

---
layout: true

## Tibbles

---

Many tools in the `tidyverse` produce slightly enhanced data frames called _tibbles_:

--

.pull-left.width-50[
```r
library(tibble)
faithful_tbl <- as_tibble(faithful)
class(faithful_tbl)
## [1] "tbl_df"     "tbl"        "data.frame"
```
]

--

.pull-right.width-40.small-code[
Tibbles print differently from standard data frames:

```r
faithful_tbl
*## # A tibble: 272 × 2
##    eruptions waiting
*##        <dbl>   <dbl>
##  1      3.6       79
##  2      1.8       54
##  3      3.33      74
##  4      2.28      62
##  5      4.53      85
##  6      2.88      55
##  7      4.7       88
##  8      3.6       85
##  9      1.95      51
## 10      4.35      85
*## # ℹ 262 more rows
```
]

--

For the most part data frames and tibbles can be used interchangeably.

---
layout: false

## Tidying Data

Many data sets are in tidy form already.

--

If they are not, they can be put into tidy form.

--

The tools for this are part of _data technologies_.

--

The tasks involved are part of what is sometimes called _data wrangling_.

---
layout: true

## An Example: Global Average Surface Temperatures

---

Among many data sets available at <https://data.giss.nasa.gov/gistemp/> is data on monthly global average surface temperatures over a number of years.

--

These data from 2017 were used for the widely cited [Bloomberg hottest year visualization](https://www.bloomberg.com/graphics/hottest-year-on-record/).
--

The current data are available in a formatted text file at

<https://data.giss.nasa.gov/gistemp/tabledata_v4/GLB.Ts+dSST.txt>

--

or as a [_CSV_ (comma-separated values)](https://en.wikipedia.org/wiki/Comma-separated_values) file at

<https://data.giss.nasa.gov/gistemp/tabledata_v4/GLB.Ts+dSST.csv>

--

The first few lines of the CSV file:

.small-code[
```
Land-Ocean: Global Means
Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,J-D,D-N,DJF,MAM,JJA,SON
1880,-.18,-.24,-.08,-.16,-.09,-.21,-.18,-.10,-.14,-.23,-.21,-.17,-.16,***,***,-.11,-.16,-.19
1881,-.19,-.13,.04,.06,.07,-.18,.01,-.03,-.15,-.21,-.18,-.06,-.08,-.09,-.16,.05,-.07,-.18
1882,.17,.14,.05,-.15,-.13,-.22,-.16,-.07,-.14,-.23,-.16,-.36,-.11,-.08,.08,-.08,-.15,-.18
1883,-.29,-.36,-.12,-.18,-.17,-.07,-.07,-.13,-.22,-.11,-.24,-.11,-.17,-.19,-.33,-.16,-.09,-.19
```
]

---

The CSV file is a little easier (for a computer program) to read in, so we will work with that.

--

The numbers in the CSV file represent deviations in degrees Celsius from the average temperature for the base period 1951-1980.

--

The file available on January 19, 2023, is now available [locally](https://stat.uiowa.edu/~luke/data/GLB.Ts+dSST.csv).

--

We can make sure it has been downloaded to our working directory with

```r
if (! file.exists("GLB.Ts+dSST.csv"))
    download.file("https://stat.uiowa.edu/~luke/data/GLB.Ts+dSST.csv",
                  "GLB.Ts+dSST.csv")
```

---

Assuming this locally available file has been downloaded, we can read in the data and drop some columns we don't need with

```r
library(readr)
gast <- read_csv("GLB.Ts+dSST.csv", skip = 1)[1 : 13]
```

--

.alert[
* The function `read_csv` is from the `readr` package, which is part of the `tidyverse`.
* An alternative is the base R function `read.csv`.
]

---

A look at the first few lines:

.small-code[
```r
head(gast, 5)
## # A tibble: 5 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  1880 -0.18 -0.24 -0.08 -0.16 -0.09 -0.21 -0.18 -0.1  -0.14 -0.23 -0.21 -0.17
## 2  1881 -0.19 -0.13  0.04  0.06  0.07 -0.18  0.01 -0.03 -0.15 -0.21 -0.18 -0.06
## 3  1882  0.17  0.14  0.05 -0.15 -0.13 -0.22 -0.16 -0.07 -0.14 -0.23 -0.16 -0.36
## 4  1883 -0.29 -0.36 -0.12 -0.18 -0.17 -0.07 -0.07 -0.13 -0.22 -0.11 -0.24 -0.11
## 5  1884 -0.12 -0.08 -0.36 -0.4  -0.33 -0.34 -0.3  -0.27 -0.27 -0.24 -0.33 -0.3
```
]

--

And the last few lines:

.small-code[
```r
tail(gast, 5)
## # A tibble: 5 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2018  0.81  0.84  0.88  0.89  0.82  0.77  0.82  0.76  0.8   1.01  0.82  0.92
## 2  2019  0.93  0.94  1.17  1.01  0.85  0.91  0.94  0.94  0.92  1.01  0.99  1.09
## 3  2020  1.16  1.24  1.17  1.13  1.02  0.92  0.9   0.88  0.99  0.89  1.1   0.81
## 4  2021  0.81  0.64  0.89  0.76  0.78  0.84  0.91  0.81  0.92  1     0.93  0.87
## 5  2022  0.91  0.89  1.05  0.84  0.84  0.92  0.93  0.95  0.9   0.97  0.72  0.8
```
]

---

The `print()` method for tibbles abbreviates the output.

--

It is neater and provides some useful additional information on variable data types.

--

But it shows only the first few rows and may not explicitly show some columns.
--

If some columns are skipped, you can ask to see more by calling `print()` explicitly:

.small-code[
```r
print(tail(gast), width = 100)
## # A tibble: 6 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2017  1.02  1.13  1.16  0.94  0.91  0.72  0.81  0.87  0.76  0.9   0.88  0.93
## 2  2018  0.81  0.84  0.88  0.89  0.82  0.77  0.82  0.76  0.8   1.01  0.82  0.92
## 3  2019  0.93  0.94  1.17  1.01  0.85  0.91  0.94  0.94  0.92  1.01  0.99  1.09
## 4  2020  1.16  1.24  1.17  1.13  1.02  0.92  0.9   0.88  0.99  0.89  1.1   0.81
## 5  2021  0.81  0.64  0.89  0.76  0.78  0.84  0.91  0.81  0.92  1     0.93  0.87
## 6  2022  0.91  0.89  1.05  0.84  0.84  0.92  0.93  0.95  0.9   0.97  0.72  0.8
```
]

<!--
* The last values in the `Aug` - `Dec` columns are missing and coded as `***`.

* This causes these columns to be read as character vectors indicated by `<chr>`.

* The others have been read as numeric, coded `<dbl>` (for _double precision_).

* We can fix these columns now or deal with them later.

One way to fix them now is to use `mutate` with `across`:

```r
gast <- mutate(gast, across(where(is.character), as.numeric))
head(gast)
## # A tibble: 6 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  1880 -0.18 -0.24 -0.08 -0.16 -0.09 -0.21 -0.18 -0.1  -0.14 -0.23 -0.21 -0.17
## 2  1881 -0.19 -0.13  0.04  0.06  0.07 -0.18  0.01 -0.03 -0.15 -0.21 -0.18 -0.06
## 3  1882  0.17  0.14  0.05 -0.15 -0.13 -0.22 -0.16 -0.07 -0.14 -0.23 -0.16 -0.36
## 4  1883 -0.29 -0.36 -0.12 -0.18 -0.17 -0.07 -0.07 -0.13 -0.22 -0.11 -0.24 -0.11
## 5  1884 -0.12 -0.08 -0.36 -0.4  -0.33 -0.34 -0.3  -0.27 -0.27 -0.24 -0.33 -0.3
## 6  1885 -0.58 -0.33 -0.26 -0.41 -0.45 -0.43 -0.33 -0.31 -0.28 -0.23 -0.23 -0.1
```

Each observation consists of a year, a month, and a temperature.
-->

---

The format with one column per month is compact and useful for viewing and data entry.
--

But it is not in _tidy format_ since

* the monthly temperature variable is spread over 12 columns;
* the month variable is encoded in the column headings.

--

For obvious reasons this data format is often referred to as _wide format_.

--

The tidy, or _long_, format would have three variables: `Year`, `Month`, and `Temp`.

---

One way to put this data frame in tidy, or long, format uses `pivot_longer` from the `tidyr` package:

```r
library(tidyr)
lgast <- pivot_longer(gast, -Year, ## specifies the columns to use -- all but Year
                      names_to = "Month",
                      values_to = "Temp")
```

--

The first few rows of the result:

```r
head(lgast)
## # A tibble: 6 × 3
##    Year Month  Temp
##   <dbl> <chr> <dbl>
## 1  1880 Jan   -0.18
## 2  1880 Feb   -0.24
## 3  1880 Mar   -0.08
## 4  1880 Apr   -0.16
## 5  1880 May   -0.09
## 6  1880 Jun   -0.21
```

---

During plotting it is likely that the `Month` variable will be converted to a _factor_.

--

By default, this will order levels alphabetically, which is not what we want:

```r
levels(as.factor(lgast$Month))
##  [1] "Apr" "Aug" "Dec" "Feb" "Jan" "Jul" "Jun" "Mar" "May" "Nov" "Oct" "Sep"
```

--

We can guard against this by converting `Month` to a factor with the right levels now:

```r
lgast <- mutate(lgast, Month = factor(Month, levels = month.abb))
levels(lgast$Month)
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
```

--

We can now use this tidy version of the data to create a static version of the [Bloomberg hottest year visualization](https://www.bloomberg.com/graphics/hottest-year-on-record/).

---

The basic framework is set up with

.pull-left.small-code[
```r
library(ggplot2)
p <- ggplot(lgast) +
    ggtitle("Global Average Surface Temperatures") +
    theme(plot.title = element_text(hjust = 0.5))
```

.alert[
`ggplot` objects only produce output when they are printed.

To see the plot in `p` we need to print it, for example by using a line with only `p`.
]
]

--

.pull-right[
```r
p
```

<img src="datafrm_files/figure-html/gast-base-1.png" style="display: block; margin: auto;" />
]

---

Then add a layer with lines for each year (specified by the `group` aesthetic in `geom_line`).

.pull-left.small-code[
```r
p + geom_line(aes(x = Month,
                  y = Temp,
*                 group = Year))
```
]

.pull-right[
<img src="datafrm_files/figure-html/gast-lines-1.png" style="display: block; margin: auto;" />
]

---

We can use color to distinguish the years.

.pull-left.small-code[
```r
p1 <- p +
    geom_line(aes(x = Month,
                  y = Temp,
*                 color = Year,
                  group = Year),
              na.rm = TRUE)
p1
```

Saving the plot specification in the variable `p1` will make it easier to experiment with color variations.
]

.pull-right[
<img src="datafrm_files/figure-html/gast-color1-1.png" style="display: block; margin: auto;" />
]

---

One way to highlight the `past_year` 2022:

.pull-left.small-code[
```r
past_year <- 2022 ## the last full year in the data
lgast_last <- filter(lgast, Year == past_year)
p1 + geom_line(aes(x = Month, y = Temp, group = Year),
               linewidth = 1, color = "red",
               data = lgast_last, na.rm = TRUE)
```
]

.pull-right[
<img src="datafrm_files/figure-html/gast-color2-1.png" style="display: block; margin: auto;" />
]

---

A useful way to show more recent data in the context of the full data is to show the full data in grey and the more recent years in black:

.pull-left.small-code[
```r
lgast2k <- filter(lgast, Year >= 2000)
ggplot(lgast, aes(x = Month, y = Temp, group = Year)) +
    geom_line(color = "grey80") +
    theme_minimal() +
    geom_line(data = lgast2k)
```
]

.pull-right[
<img src="datafrm_files/figure-html/gast-full-grey-1.png" style="display: block; margin: auto;" />
]

--

If you want to update your plot later in the year then the current year's entry may contain missing value indicators that you will have to deal with.

<!--
The version of the data [on the web](https://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.txt) may have been updated to include the missing values for 2022.

If you want to update your plot later in the year you will see similar missing value markers for the remaining months of 2023.
-->

---

The New York Times on January 18, 2018, published [another visualization](https://www.nytimes.com/interactive/2018/01/18/climate/hottest-year-2017.html) of these data showing average yearly temperatures ([via Google may work better](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwj_4PmhnM31AhXFjYkEHfQxATEQFnoECAYQAQ&url=https%3A%2F%2Fwww.nytimes.com%2Finteractive%2F2018%2F01%2F18%2Fclimate%2Fhottest-year-2017.html&usg=AOvVaw3qouxHQAJCqrt_F3fbLEGE)).

--

To recreate this plot we first need to compute yearly average temperatures.

--

This is easy to do with the `summarize` and `group_by` functions from `dplyr`:

.pull-left.small-code[
```r
library(dplyr)
atemp <- lgast %>%
    group_by(Year) %>%
    summarize(AveTemp = mean(Temp, na.rm = TRUE))
head(atemp)
## # A tibble: 6 × 2
##    Year AveTemp
##   <dbl>   <dbl>
## 1  1880 -0.166
## 2  1881 -0.0792
## 3  1882 -0.105
## 4  1883 -0.172
## 5  1884 -0.278
## 6  1885 -0.328
```
]

--

.pull-right[
Using `na.rm = TRUE` ensures that the mean is based on the available months if data for some months is missing.
]

---

A simple version of the plot is then produced by

.pull-left.small-code[
```r
ggplot(atemp) +
    geom_point(aes(x = Year, y = AveTemp))
```
]

.pull-right[
<img src="datafrm_files/figure-html/gast-nyt-1.png" style="display: block; margin: auto;" />
]

---

Another variation on the Bloomberg plot showing just a few years 20 years apart:

.pull-left.small-code[
```r
lg_by_20 <- filter(lgast, Year %in% seq(2020, by = -20, length.out = 5)) %>%
    mutate(Year = factor(Year))
ggplot(lg_by_20, aes(x = Month, y = Temp, group = Year, color = Year)) +
    geom_line()
```

Converting `Year` to a factor results in a discrete color scale and legend.
]

.pull-right[
<img src="datafrm_files/figure-html/gast-skip-1.png" style="display: block; margin: auto;" />
]

---
layout: true

## Handling Missing Values

---

The data for 2019 available in early 2020 is also available [locally](https://stat.uiowa.edu/~luke/data/GLB.Ts+dSST-2019.csv).

--

Assuming this locally available file has been downloaded, we can read in the data and drop some columns we don't need with

```r
gast2019 <- read_csv("GLB.Ts+dSST-2019.csv", skip = 1)[1 : 13]
```

--

The last three columns are read as character variables:

.small-code[
```r
head(gast2019, 5)
## # A tibble: 5 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1  1880 -0.17 -0.23 -0.08 -0.15 -0.08 -0.2  -0.17 -0.09 -0.13 -.22  -.20  -.16
## 2  1881 -0.18 -0.13  0.04  0.06  0.08 -0.17  0.02 -0.02 -0.14 -.20  -.17  -.05
## 3  1882  0.18  0.15  0.06 -0.15 -0.13 -0.21 -0.15 -0.06 -0.13 -.23  -.15  -.34
## 4  1883 -0.28 -0.36 -0.11 -0.17 -0.16 -0.07 -0.05 -0.12 -0.2  -.10  -.22  -.10
## 5  1884 -0.12 -0.07 -0.36 -0.39 -0.33 -0.34 -0.32 -0.27 -0.26 -.24  -.32  -.30
```
]

---

The reason is that data for October through December were not available:

.small-code[
```r
tail(gast2019, 2)
## # A tibble: 2 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1  2018  0.82  0.85  0.9   0.89  0.83  0.78  0.83  0.76  0.81 1.02  .83   .92
## 2  2019  0.94  0.96  1.18  1.02  0.86  0.93  0.95  0.94  0.93 ***   ***   ***
```
]

--

We want to convert these columns to numeric, with missing values represented as `NA`.
--

One option is to handle them individually:

.small-code[
```r
gast2019$Oct <- as.numeric(gast2019$Oct)
## Warning: NAs introduced by coercion
tail(gast2019, 2)
## # A tibble: 2 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1  2018  0.82  0.85  0.9   0.89  0.83  0.78  0.83  0.76  0.81  1.02 .83   .92
## 2  2019  0.94  0.96  1.18  1.02  0.86  0.93  0.95  0.94  0.93 NA    ***   ***
```
]

---

Another option is to convert all character columns to numeric with

.small-code[
```r
gast2019 <- mutate(gast2019, across(where(is.character), as.numeric))
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(where(is.character), as.numeric)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
tail(gast2019, 2)
## # A tibble: 2 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2018  0.82  0.85  0.9   0.89  0.83  0.78  0.83  0.76  0.81  1.02  0.83  0.92
## 2  2019  0.94  0.96  1.18  1.02  0.86  0.93  0.95  0.94  0.93 NA    NA    NA
```
]

--

The warnings are benign and can be suppressed with the `warning = FALSE` chunk option.
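---

Outside of a knitted document, one option is to wrap the conversion in the base R function `suppressWarnings()`. A minimal sketch, assuming a made-up character vector `x` that uses the same `***` missing value marker:

```r
## Hypothetical values mimicking a column read as character.
x <- c("1.02", ".83", "***")

## "***" cannot be parsed as a number, so it becomes NA;
## suppressWarnings() silences the coercion warning.
y <- suppressWarnings(as.numeric(x))
y
is.na(y)
```

Silencing the warning is only reasonable once the source of the `NA` values is understood, as it is here.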
---

Since we know the missing value pattern `***` we can also avoid the need to fix the data after the fact by specifying this at read time:

```r
gast2019 <- read_csv("GLB.Ts+dSST-2019.csv", na = "***", skip = 1)[1 : 13]
tail(gast2019, 2)
## # A tibble: 2 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2018  0.82  0.85  0.9   0.89  0.83  0.78  0.83  0.76  0.81  1.02  0.83  0.92
## 2  2019  0.94  0.96  1.18  1.02  0.86  0.93  0.95  0.94  0.93 NA    NA    NA
```

---

A plot highlighting the year 2019 shows only the months with available data:

.pull-left.small-code[
```r
lgast2019 <- gast2019 %>%
    pivot_longer(-Year, names_to = "Month", values_to = "Temp") %>%
    mutate(Month = factor(Month, levels = month.abb))
ggplot(lgast2019, aes(x = Month, y = Temp, group = Year)) +
    geom_line(color = "grey80", na.rm = TRUE) +
    geom_line(data = filter(lgast2019, Year == 2019), na.rm = TRUE)
```

Adding `na.rm = TRUE` in the `geom_line` calls suppresses warnings; the plot would be the same without these.
]

.pull-right[
<img src="datafrm_files/figure-html/gast-2019-1.png" style="display: block; margin: auto;" />
]

---
layout: false

.alert[
Outline of the tools used:

* Data processing:
    * Reading: `read_csv`, `read.csv`;
    * Reshaping: `pivot_longer`;
    * Cleaning: `is.character`, `as.numeric`, `mutate`, `across`, `factor`;
    * Summarizing: `group_by`, `summarize`.

* Visualization geometries:
    * `geom_line` for a line plot;
    * `geom_point` for a scatter plot.
]

---
layout: false

## Reading

Stevens' classification of scales of measurement is described in a [Wikipedia article](https://en.wikipedia.org/wiki/Level_of_measurement).

A good introduction to the concept of _tidy data_ is provided in a [chapter in _R for Data Science_](https://r4ds.had.co.nz/tidy-data.html).

## Interactive Tutorial

An interactive [`learnr`](https://rstudio.github.io/learnr/) tutorial for these notes is [available](../tutorials/datafrm.Rmd).
You can run the tutorial with

```r
STAT4580::runTutorial("datafrm")
```

---
layout: true

## Exercises

---

1) Which of the Stevens classifications (nominal, ordinal, interval, ratio) best characterizes these variables:

* a. Daily maximal temperatures in Iowa City.
* b. Population counts for Iowa counties.
* c. Education level of job applicants using the [Bureau of Labor Statistics classification](https://www.bls.gov/careeroutlook/2014/article/education-level-and-jobs.htm).
* d. Major of UI students.

<!--
Which answers are correct for exercise 1:
* a: interval; b: ratio; c: ordinal; d: nominal
a: nominal; b: interval; c: ordinal; d: ratio
a: nominal; b: interval; c: ordinal; d: ratio
a: interval; b: ratio; c: nominal; d: ordinal

fmt <- function(x) paste(letters[seq_along(x)], x, sep = ": ", collapse = "; ")
stev <- c("nominal", "ordinal", "interval", "ratio")
fmt(sample(stev, 4))
-->

2) Which of these data sets are in tidy form?

* a. The builtin data set `co2`
* b. The builtin data set `BOD`
* c. The `who` data set in package `tidyr` (`tidyr::who`)
* d. The `mpg` data set in package `ggplot2` (`ggplot2::mpg`)

<!--
Which answers are correct for exercise 2:
* a: not tidy; b: tidy; c: not tidy; d: tidy
* a: tidy; b: tidy; c: not tidy; d: tidy
* a: not tidy; b: not tidy; c: not tidy; d: tidy
* a: not tidy; b: tidy; c: tidy; d: not tidy
-->

---

The next exercises use the data in the variable `gapminder` in the package `gapminder`. You can make it available with

```r
data(gapminder, package = "gapminder")
```

3) Use the function `str` to examine the value of the gapminder variable. How many cases are there in the data set? How many of the variables are factors?

4) Use the functions `class` and `names` to find the class and variable names in the `gapminder` data.

5) Use `summary` to compute summary information for the variables.

6) Fill in the values for `---` needed to produce plots of life expectancy against year for the countries in continent Oceania.

<!-- ## nolint start -->
```r
library(dplyr)
library(ggplot2)
data(gapminder, package = "gapminder")
ggplot(filter(gapminder, continent == "Oceania"),
       aes(x = ---, y = ---, color = country)) +
    geom_line()
```
<!-- ## nolint end -->