class: center, middle, title-slide

.title[
# String Parsing and Regular Expressions
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2023-05-06
]

---
layout: true

<link rel="stylesheet" href="stat4580.css" type="text/css" />
<style type="text/css"> .remark-code { font-size: 85%; } </style>

## Data Cleaning

---

After reading data from text files or web pages it is common to have to

--

* clean up string variables;

--

* extract numbers.

--

This can be done using

--

* high-level functions that usually, but not always, work;

--

* string subsetting;

--

* _regular expressions_.

---

Some examples of strings that need to be processed:

--

```r
"12%"
"New York *"
"2,100"
"Temp: 12 \u00b0F"
```

--

Some of the most common cases are covered here.

--

Much more is available in [_R for Data Science_](https://r4ds.had.co.nz/), in particular in the chapters

* [_Data Import_](https://r4ds.had.co.nz/data-import.html);

* [_Strings_](https://r4ds.had.co.nz/strings.html).

---
layout: true

## Removing a Percent Sign

---

Reading the GDP growth rate data from a [web page](https://www.multpl.com/us-real-gdp-growth-rate/table/by-quarter) produced a data frame with a column like

```r
s <- c("12%", "2%")
```

This can be converted to a numeric variable by

--

* extracting the sub-string without the `%`

--

* and then using `as.numeric`.

--

```r
nchar(s)
## [1] 3 2
substr(s, 1, nchar(s) - 1)
## [1] "12" "2"
as.numeric(substr(s, 1, nchar(s) - 1))
## [1] 12  2
```

---

An alternative is to use the `sub()` function to replace `"%"` by the empty string `""`.
--

```r
as.numeric(sub("%", "", s))
## [1] 12  2
```

--

The function `parse_number` in the `readr` package ignores the percent sign and extracts the numbers correctly:

--

```r
library(readr)
parse_number(s)
## [1] 12  2
```

---
layout: true

## Removing Grouping Characters

---

Numbers are sometimes written using _grouping characters_:

--

```r
s1 <- c("800", "2,100")
s2 <- c("800", "2,100", "3,123,500")
```

--

* The comma is often used as a grouping character in the US.

--

* Other countries use different characters.

--

* Other countries also use different characters for the decimal separator.

---

`sub` and `gsub` can be used to remove grouping characters:

--

```r
sub(",", "", s1)
## [1] "800"  "2100"
sub(",", "", s2)
## [1] "800"      "2100"     "3123,500"
```

--

```r
gsub(",", "", s2)
## [1] "800"     "2100"    "3123500"
as.numeric(gsub(",", "", s2))
## [1]     800    2100 3123500
```

--

* `sub` replaces the first match of a pattern;

--

* `gsub` replaces all matches.

---

`parse_number` can again be used:

--

```r
parse_number(s2)
## [1]     800    2100 3123500
```

--

`parse_number` is convenient but may be less robust:

--

* In Switzerland the grouping character is `'`.

--

* If `parse_number` has its defaults set to Swiss conventions, comma-grouped numbers are misread:

--

```r
parse_number(s2, locale = locale(grouping_mark = "'"))
## [1] 800   2   3
```

---
layout: false

## Separating City and State

Data often has city and state specified in a variable like

```r
s <- c("Boston, MA", "Iowa City, IA")
```

--

If all state specifications are in two-letter form then city and state can be extracted as sub-strings:

--

```r
substr(s, 1, nchar(s) - 4)
## [1] "Boston"    "Iowa City"
substr(s, nchar(s) - 1, nchar(s))
## [1] "MA" "IA"
```

--

This would not work if full state names are used.

--

An alternative is to use a _regular expression_.

---

## Regular Expressions

Regular expressions are a language for expressing patterns in strings.

--

Regular expressions should be developed carefully, like any program, starting with simple steps and building up.
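--

As a rough illustration of building up in steps, the sketch below develops a pattern for percent strings one piece at a time, checking the matches with `grepl` after each change. The vector `x` and the names `step1`, `step2`, and `pct` are made up for this example; the meta-characters it uses (`+`, character classes, sub-patterns) are explained on the slides that follow.

```r
## Hypothetical input: two clean values and two entries that should be rejected
x <- c("12%", "2%", "n/a", "% change")

## Step 1: a literal pattern; see which elements contain a percent sign
step1 <- grepl("%", x)
step1
## [1]  TRUE  TRUE FALSE  TRUE

## Step 2: require one or more digits immediately before the percent sign
step2 <- grepl("[[:digit:]]+%", x)
step2
## [1]  TRUE  TRUE FALSE FALSE

## Step 3: keep only the digits from the elements that matched in step 2
pct <- as.numeric(sub("([[:digit:]]+)%", "\\1", x[step2]))
pct
## [1] 12  2
```

Checking each intermediate result makes it easier to see which step is responsible when a pattern matches too much or too little.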
--

The simplest regular expressions are literal strings, like `%`.

--

More complex expressions are built up using _meta-characters_ that have special meanings in regular expressions.

--

Many punctuation characters are regular expression meta-characters.

--

Paul Murrell's [_Introduction to Data Technologies_](http://www.stat.auckland.ac.nz/~paul/ItDT/) provides a good introduction in Section 9.9.2 and an extensive reference in Chapter 11.

--

The [Strings chapter](https://r4ds.had.co.nz/strings.html) in [R for Data Science](https://r4ds.had.co.nz/) also provides an introduction to regular expressions, but uses its own set of functions from the `tidyverse`.

--

The web site [Regular-Expressions.info](https://www.regular-expressions.info/) is a useful on-line resource.

---

## Regular Expression Meta-Characters

Some important meta-characters are `.`, `*`, `+`, and `?`:

--

* The period `.` stands for any character.

--

* The asterisk `*` means zero, one, or more of the preceding character specification.

--

* The plus sign `+` means one or more of the preceding character specification.

--

* The question mark `?` means zero or one of the preceding character specification.

--

The pattern `",.*"` matches a comma `,` followed by zero or more characters:

--

```r
sub(",.*", "", s)
## [1] "Boston"    "Iowa City"
```

--

The pattern `".*, "` matches zero or more characters followed by a comma and a space:

--

```r
sub(".*, ", "", s)
## [1] "MA" "IA"
```

---

## Trimming White Space

If the data file is not consistent in the use of spaces in the separator, another possibility is to

--

* remove the characters through the comma;

--

* trim the white space from the result.
--

```r
sub(".*,", "", s)
## [1] " MA" " IA"
```

--

```r
trimws(sub(".*,", "", s))
## [1] "MA" "IA"
```

---

## Using `separate`

If the city-state variable is already in a data frame or tibble then the `separate` function from the `tidyr` package can be used:

--

```r
library(tibble)
library(tidyr)
d <- tibble(citystate = s)
d
## # A tibble: 2 × 1
##   citystate
##   <chr>
## 1 Boston, MA
## 2 Iowa City, IA
```

--

```r
separate(d, citystate, c("city", "state"), sep = ", ")
## # A tibble: 2 × 2
##   city      state
##   <chr>     <chr>
## 1 Boston    MA
## 2 Iowa City IA
```

---
layout: true

## Escaping Meta-Characters

---

Reading data from [city temperatures](https://www.timeanddate.com/weather/) produces a variable that looks like

```r
s <- c("London *", "Sydney")
```

--

The `*` indicates daylight saving or summer time.

--

We would like to

--

* extract the city name;

--

* extract whether there is a `*`.

--

The `*` is a meta-character.

--

To include a literal meta-character in a pattern the meta-character needs to be _escaped_.

---

A meta-character is escaped by preceding it by a backslash `\`.

--

But the backslash is a meta-character for R strings!

--

To put a backslash into an R string it needs to be written as `\\`.

--

The pattern we want, to match a space followed by a `*`, is `␣\*`, with `␣` denoting a space character.

--

An R string containing these three characters is written as `" \\*"`.

---

It is often useful to write a pattern once and save it in a variable:

```r
(pat <- " \\*")
## [1] " \\*"
```

--

This string contains three characters:

```r
nchar(pat)
## [1] 3
```

--

Standard printing includes the backslash escape, and other escape characters, so the printed string can be read back into R:

```r
"a, b
 and c"
## [1] "a, b\n and c"
```

---

The `writeLines` function is useful to see the characters in a string.
```r
writeLines(pat)
##  \*
```

--

To help make the space more visible we can add a delimiter:

```r
writeLines(paste0("'", pat, "'"))
## ' \*'
```

--

Another option is to use `sprintf` with `writeLines`:

```r
writeLines(sprintf("'%s'", pat))
## ' \*'
```

---

This pattern removes the space and asterisk if present:

```r
s
## [1] "London *" "Sydney"
sub(pat, "", s)
## [1] "London" "Sydney"
```

--

The `grep` and `grepl` functions check whether a pattern matches in elements of a character vector.

--

* The name `grep` comes from the `ed` editor command `g/re/p`: globally search for a regular expression and print the matching lines.

--

* `grep` is a standard Linux command-line utility for searching text files.

--

* In R, `grep` returns the indices of the elements that match the pattern.

--

* `grepl` returns a logical vector indicating whether there is a match:

--

```r
grep(pat, s)
## [1] 1
grepl(pat, s)
## [1]  TRUE FALSE
```

---
layout: true

## Matching Numbers

---

Reading temperature data might produce a string like

```r
s <- c("32F", "-11F")
```

--

This can be processed as

```r
substr(s, 1, nchar(s) - 1)
## [1] "32"  "-11"
```

--

or as

```r
sub("F", "", s)
## [1] "32"  "-11"
```

---

An alternative uses some more regular expression features:

--

* match the number within the string;

--

* extract a sub-match.

--

A pattern to match an integer, possibly preceded by a sign, is

```r
intpat <- "[-+]?[[:digit:]]+"
s
## [1] "32F"  "-11F"
sub(intpat, "X", s)
## [1] "XF" "XF"
```

--

The `[` and `]` meta-characters define character sets; any character between these will match.

--

`[:digit:]` specifies a _character class_ of digits.

---

There are a number of _character classes_, including

--

* `[:alpha:]` alphabetic letters;

--

* `[:digit:]` digits;

--

* `[:space:]` white space (spaces, tabs).

--

A _sub-pattern_ can be extracted using _back references_:

```r
sub("([-+]?[[:digit:]]+).*", "\\1", s)
## [1] "32"  "-11"
```

--

* Sub-patterns can be specified with `(` and `)`.

--

* _Back references_ can be used to refer to previous sub-patterns by number.
--

* The digit needs to be escaped with a `\`;

--

* In an R string the `\` needs to be escaped with a second `\`.

---

A sub-string approach for a temperature embedded in a string:

--

```r
s <- c("Temp: 32F", "Temp: -11F")
(s1 <- substr(s, 6, nchar(s)))
## [1] " 32F"  " -11F"
(s2 <- substr(s1, 1, nchar(s1) - 1))
## [1] " 32"  " -11"
as.numeric(s2)
## [1]  32 -11
```

--

Using regular expressions, sub-patterns, and back references:

```r
sub(".*[[:space:]]+([-+]?[[:digit:]]+).*", "\\1", s)
## [1] "32"  "-11"
```

--

`parse_number` is again an alternative:

```r
parse_number(s)
## [1]  32 -11
```

---
layout: true

## City Temperatures

---

The [city temperatures](https://www.timeanddate.com/weather/) data used previously can be read using

```r
library(rvest)
library(dplyr)
weather <- read_html("https://www.timeanddate.com/weather/")
w <- html_table(html_nodes(weather, "table"))[[1]]
w1 <- w[c(1, 4)]; names(w1) <- c("city", "temp")
w2 <- w[c(5, 8)]; names(w2) <- c("city", "temp")
w3 <- w[c(9, 12)]; names(w3) <- c("city", "temp")
ww <- rbind(w1, w2, w3)
ww <- filter(ww, city != "")
head(ww)
## # A tibble: 6 × 2
##   city        temp
##   <chr>       <chr>
## 1 Accra       75 °F
## 2 Addis Ababa 73 °F
## 3 Adelaide    43 °F
## 4 Algiers     73 °F
## 5 Almaty      46 °F
## 6 Amman       64 °F
```

---

Cleaning up and extracting `dst`:

```r
www <- mutate(ww,
              dst = grepl(" \\*", city),
              city = sub(" \\*", "", city),
              temp.txt = temp, ## for checking on conversion failures
              temp = as.numeric(sub("([-+]?[[:digit:]]+).*", "\\1", temp)))
```

--

Check on `NA` values from conversion:

```r
filter(www, is.na(temp))
## # A tibble: 0 × 4
## # ℹ 4 variables: city <chr>, temp <dbl>, dst <lgl>, temp.txt <chr>
www <- select(www, -temp.txt)
```

---

Five highest and lowest temperatures (ties are kept, so more than five rows can appear):

.pull-left[
```r
slice_max(www, temp, n = 5)
## # A tibble: 8 × 3
##   city         temp dst
##   <chr>       <dbl> <lgl>
## 1 Baghdad        95 FALSE
## 2 Kinshasa       93 FALSE
## 3 Managua        91 FALSE
## 4 Riyadh         91 FALSE
## 5 Bangkok        90 FALSE
## 6 Havana         90 TRUE
## 7 Kuwait City    90 FALSE
## 8 San Juan       90 FALSE
```
]

.pull-right[
```r
slice_min(www, temp, n = 5)
## # A tibble: 7 × 3
##   city        temp dst
##   <chr>      <dbl> <lgl>
## 1 Anadyr        22 FALSE
## 2 St. John's    39 TRUE
## 3 Anchorage     40 TRUE
## 4 Moscow        41 FALSE
## 5 Adelaide      43 FALSE
## 6 Melbourne     43 FALSE
## 7 Tallinn       43 TRUE
```
]

---

Temperatures for northern and southern hemisphere (approximately):

.pull-left[
```r
library(ggplot2)
ggplot(www, aes(x = temp, fill = dst)) +
    geom_density(alpha = 0.5)
```
]

.pull-right[
<img src="regex_files/figure-html/temp-densities-1.png" style="display: block; margin: auto;" />
]

---
layout: true

## Tricky Characters

---

Some examples:

```r
(s <- head(ww$temp))
## [1] "75 °F" "73 °F" "43 °F" "73 °F" "46 °F" "64 °F"
nchar(s)
## [1] 5 5 5 5 5 5
substr(s, 1, nchar(s) - 3)
## [1] "75" "73" "43" "73" "46" "64"
substr(s, 1, nchar(s) - 2)
## [1] "75 " "73 " "43 " "73 " "46 " "64 "
as.numeric(substr(s, 1, nchar(s) - 2))
## Warning: NAs introduced by coercion
## [1] NA NA NA NA NA NA
as.numeric("82 ")
## [1] 82
sub(" .*", "", s)
## [1] "75 °F" "73 °F" "43 °F" "73 °F" "46 °F" "64 °F"
```

--

The problem is _two_ non-ASCII characters.

---

The `stri_escape_unicode` function from the `stringi` package can make these characters more visible:

```r
stringi::stri_escape_unicode(s)
## [1] "75\\u00a0\\u00b0F" "73\\u00a0\\u00b0F" "43\\u00a0\\u00b0F"
## [4] "73\\u00a0\\u00b0F" "46\\u00a0\\u00b0F" "64\\u00a0\\u00b0F"
```

--

The troublesome characters are:

--

* [No-break space U+00A0](https://www.fileformat.info/info/unicode/char/00a0/index.htm).

--

* [Degree symbol U+00B0](https://www.fileformat.info/info/unicode/char/00b0/index.htm).

--

Using the Unicode escape for the no-break space does work:

```r
sub("\u00a0.*", "", s)
## [1] "75" "73" "43" "73" "46" "64"
```

---
layout: true

## Variations in Regular Expression Engines

---

<!-- Note on unicode digits: https://unix.stackexchange.com/questions/414226/difference-between-0-9-digit-and-d -->

Many tools and languages support working with regular expressions.
--

R supports:

--

* The POSIX standard for _extended regular expressions_. This is the default engine.

--

* Perl-compatible regular expressions ([PCRE](https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions)). This engine is selected by adding `perl = TRUE` in calls to functions using regular expressions.

--

Different engines can differ in how certain expressions are interpreted, especially when non-ASCII characters are involved.

--

Different engines also sometimes offer shorthand notations, in particular for character classes.

---

Some examples:

--

| POSIX class | similar to  | shorthand | description                    |
| ----------- | ----------- | --------- | ------------------------------ |
| `[:digit:]` | `[0-9]`     | `\d`      | digits                         |
| `[:upper:]` | `[A-Z]`     | `\u`      | upper-case letters             |
| `[:lower:]` | `[a-z]`     | `\l`      | lower-case letters             |
| `[:alpha:]` | `[A-Za-z]`  |           | upper- and lower-case letters  |
| `[:space:]` | `[ \t\n]`   | `\s`      | whitespace characters          |

--

The shorthand versions need to have their `\` escaped when used in an R string:

--

```r
intpat <- ".*\\s([-+]?\\d+).*"
sub(intpat, "\\1", c("Temp: 32F", "Temp: -11F"))
## [1] "32"  "-11"
```

---
layout: false

## Raw Strings

_Raw strings_ can make writing regular expressions a little easier.

--

In R a raw string is specified as `r"(...)"`.

--

The `...` characters can be any characters and are taken literally, without special interpretation that might require escaping.

--

For the integer pattern:

--

```r
intpat <- ".*\\s([-+]?\\d+).*"
intpat_raw <- r"(.*\s([-+]?\d+).*)"
intpat == intpat_raw
## [1] TRUE
```

--

Python, C++, and other languages provide similar facilities.

--

R's raw string syntax is modeled after the one used in C++.

---
layout: true

## A Note on Sorting

---

Non-ASCII characters can create issues for sorting strings, but even ASCII character sort order is not the same in all [locales](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes).
--

The `LETTERS` data set contains the upper-case letters in alphabetical order for the English sorting convention and most other locales:

```r
LETTERS
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
```

--

But in Estonian, Latvian, and Lithuanian:

```r
stringr::str_sort(LETTERS, locale = "est")
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "Z" "T" "U" "V" "W" "X" "Y"
```

---

Ordering of lower-case and upper-case letters in English and most other locales:

```r
stringr::str_sort(c("A", "a"), locale = "eng")
## [1] "a" "A"
```

--

But in Danish, and also Maltese:

```r
stringr::str_sort(c("A", "a"), locale = "dan")
## [1] "A" "a"
```

---
layout: true

## Encoding Issues

---

Files contain a sequence of 8-bit integers, or _bytes_.

--

These are the integers from 0 through 255.

--

For text files, these bytes are interpreted as representing characters.

--

The mapping from bytes to characters is called an _encoding_.

--

The encoding for the characters used in American English is ASCII: the _American Standard Code for Information Interchange_.

--

The [ASCII encoding](https://ascii-tables.com/) uses only the integers 0 through 127.

--

The ASCII encoding is adequate for American English; even UK text files need more, for example the pound sign £.

--

Encodings that use integers 128 through 255:

--

* Latin1 (aka ISO-8859-1) for western European languages (includes £);

--

* Latin2 (aka ISO-8859-2) for eastern European languages.

--

Many encodings are available to support other alphabets and character sets.

---

Fortunately most systems now use [Unicode](https://home.unicode.org/) with the [UTF-8 encoding](https://en.wikipedia.org/wiki/UTF-8) for representing non-ASCII characters.

--

If you need to read a file with non-ASCII characters using `read.csv()` or similar base functions, a good place to start is to specify `encoding = "UTF-8"`.
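--

A minimal sketch of this, assuming a small made-up file: the city names, the temporary file `tmp`, and the data frame `d` are all hypothetical and only serve to give `read.csv()` some UTF-8 input to read.

```r
## Hypothetical data containing non-ASCII characters, written as Unicode escapes
lines <- c("city,temp", "Z\u00fcrich,12", "Qu\u00e9bec,-3")

## Write the UTF-8 bytes unchanged to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(enc2utf8(lines), tmp, useBytes = TRUE)

## Read the file back, declaring the strings to be UTF-8
d <- read.csv(tmp, encoding = "UTF-8")
d$city
Encoding(d$city)

## Clean up the temporary file
unlink(tmp)
```

If the file is in some other encoding, the `fileEncoding` argument of `read.csv()` (for example `fileEncoding = "latin1"`) asks R to re-encode the input while reading.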
--

The functions in the `readr` package default to assuming the encoding is UTF-8.

--

Getting the encoding wrong can result in a few messed up characters or in an entire string being messed up:

--

.pull-left.small-code[
Reading files without specifying the correct encoding might produce strings like

```r
x1
## [1] "El Ni\xf1o was particularly bad this year"
x2
## [1] "\x82\xb1\x82\xf1\x82ɂ\xbf\x82\xcd"
```
]

--

.pull-left.small-code[
If the correct encoding is known, then these can be fixed after the fact with `iconv()`:

```r
iconv(x1, "Latin1", "UTF-8")
## [1] "El Niño was particularly bad this year"
iconv(x2, "Shift-JIS", "UTF-8")
## [1] "こんにちは"
```
]

---

Re-reading the file with the proper encoding specified may be a better option.

--

The `readr` function `guess_encoding()` may help identify the correct encoding if it is not specified in the data documentation.

--

Handling encoding issues in R can be more complicated on Windows, but this will improve with the next release of R.

---
layout: false

## Getting the Current Temperature

The tools described here come in handy when scraping data from the web.

--

This code gets the current temperature in Iowa City from the National Weather Service:

```r
library(xml2)
url <- "http://forecast.weather.gov/zipcity.php?inputstring=Iowa+City,IA"
page <- read_html(url)
xpath <- "//p[@class=\"myforecast-current-lrg\"]"
tempNode <- xml_find_first(page, xpath)
nodeText <- xml_text(tempNode)
as.numeric(sub("([-+]?[[:digit:]]+).*", "\\1", nodeText))
```

--

An example of creating a current temperature map is described [here](/home/luke/writing/classes/uiowa/STAT4580/webpage/weather.html).

--

## Reading

Chapters [_Data Import_](https://r4ds.had.co.nz/data-import.html) and [_Strings_](https://r4ds.had.co.nz/strings.html) in [_R for Data Science_](https://r4ds.had.co.nz/).

Chapter [_String processing_](https://rafalab.dfci.harvard.edu/dsbook/) in [_Introduction to Data Science: Data Analysis and Prediction Algorithms with R_](https://rafalab.dfci.harvard.edu/dsbook/).