# String Parsing and Regular Expressions

Luke Tierney

University of Iowa

2024-04-14

## Data Cleaning

--

* `gsub` replaces all matches.

---

`parse_number` can again be used:

--

```r
parse_number(s2)
## [1] 800 2100 3123500
```

--

`parse_number` is convenient but may be less robust:

--

* In Switzerland the grouping character is `'`.

--

* If `parse_number` has its defaults set to Swiss conventions:

--

```r
parse_number(s2, locale = locale(grouping_mark = "'"))
## [1] 800 2 3
```

---
layout: false

## Separating City and State

Data often has city and state specified in a variable like

```r
s <- c("Boston, MA", "Iowa City, IA")
```

--

If all state specifications are in two-letter form then city and state
can be extracted as sub-strings:

--

```r
substr(s, 1, nchar(s) - 4)
## [1] "Boston"    "Iowa City"
substr(s, nchar(s) - 1, nchar(s))
## [1] "MA" "IA"
```

--

This would not work if full state names are used.

--

An alternative is to use a _regular expression_.

---

## Regular Expressions

Regular expressions are a language for expressing patterns in strings.

--

Regular expressions should be developed carefully, like any program,
starting with simple steps and building up.

--

The simplest regular expressions are literal strings, like `%`.

--

More complex expressions are built up using _meta-characters_ that
have special meanings in regular expressions.

--

Many punctuation characters are regular expression meta-characters.

--

Paul Murrell's _Introduction to Data Technologies_ provides a good
introduction in Section 9.9.2 and an extensive reference in Chapter 11.

--

The Strings chapter in R for Data Science also provides an
introduction to regular expressions, but uses its own set of functions
from the `tidyverse`.

--

The web site []( is a useful on-line resource.

---

## Regular Expression Meta-Characters

Some important meta-characters are `.`, `*`, `+`, and `?`:

--

* The period `.` stands for any character.

--

* The asterisk `*` means zero, one, or more of the preceding character specification.
--

* The plus sign `+` means one or more of the preceding character specification.

--

* The question mark `?` means zero or one of the preceding character specification.

--

The pattern `",.*"` matches a comma `,` followed by zero or more characters:

--

```r
s
## [1] "Boston, MA"    "Iowa City, IA"
sub(",.*", "", s)
## [1] "Boston"    "Iowa City"
```

--

The pattern `".*, "` matches zero or more characters followed by a
comma and a space:

--

```r
sub(".*, ", "", s)
## [1] "MA" "IA"
```

---

## Trimming White Space

If the data file is not consistent in its use of spaces in the
separator, another possibility is to

--

* remove the characters through the comma;

--

* trim the white space from the result.

--

```r
sub(".*,", "", s)
## [1] " MA" " IA"
```

--

```r
trimws(sub(".*,", "", s))
## [1] "MA" "IA"
```

---

## Using `separate`

If the city-state variable is already in a data frame or tibble, then
the `separate` function from the `tidyr` package can be used:

--

```r
library(tibble)
library(tidyr)
d <- tibble(citystate = s)
d
## # A tibble: 2 × 1
##   citystate
##   <chr>
## 1 Boston, MA
## 2 Iowa City, IA
```

--

```r
separate(d, citystate, c("city", "state"), sep = ", ")
## # A tibble: 2 × 2
##   city      state
##   <chr>     <chr>
## 1 Boston    MA
## 2 Iowa City IA
```

---
layout: true

## Escaping Meta-Characters

---

Reading the city temperatures data produces a variable that looks like

```r
s <- c("London *", "Sydney")
```

--

The `*` indicates daylight saving or summer time.

--

We would like to

--

* extract the city name;

--

* extract whether there is a `*`.

--

The `*` is a meta-character.

--

To include a literal meta-character in a pattern the meta-character
needs to be _escaped_.

---

A meta-character is escaped by preceding it with a backslash `\`.

--

But the backslash is a meta-character for R strings!

--

To put a backslash into an R string it needs to be written as `\\`.

--

The pattern we want to match, a space followed by a `*`, is `␣\*`,
with `␣` denoting a space character.

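
As an aside, when the entire pattern is meant literally, escaping can be sidestepped. The following is a minimal sketch, assuming the `fixed = TRUE` argument of base R's `sub` suits the task; the variable names `cities`, `unescaped`, and `literal` are made up for illustration.

```r
cities <- c("London *", "Sydney")

## Unescaped: " *" means "zero or more spaces". It matches the empty
## string at the start of each element, so nothing visible is removed.
unescaped <- sub(" *", "", cities)

## fixed = TRUE treats the pattern as a literal string, not a regular
## expression, so the space and asterisk are matched as written.
literal <- sub(" *", "", cities, fixed = TRUE)

unescaped
literal
```

This only helps when no regular expression features are needed at all; escaping keeps the rest of the pattern language available.
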
--

An R string containing these three characters is written as `" \\*"`.

---

It is often useful to write a pattern once and save it in a variable:

```r
(pat <- " \\*")
## [1] " \\*"
```

--

This string contains three characters:

```r
nchar(pat)
## [1] 3
```

--

Standard printing includes the backslash escape, and other escape
characters, so the printed string can be read back into R:

```r
"a, b
 and c"
## [1] "a, b\n and c"
```

---

The `writeLines` function is useful to see the characters in a string.

```r
writeLines(pat)
##  \*
```

--

To help make the space more visible we can add a delimiter:

```r
writeLines(paste0("'", pat, "'"))
## ' \*'
```

--

Another option is to use `sprintf` with `writeLines`:

```r
writeLines(sprintf("'%s'", pat))
## ' \*'
```

---

This pattern removes the space and asterisk if present:

```r
s
## [1] "London *" "Sydney"
sub(pat, "", s)
## [1] "London" "Sydney"
```

--

The `grep` and `grepl` functions check whether a pattern matches in
elements of a character vector.

--

* `grep` is short for _global regular expression print_, after the `g/re/p` command of the `ed` editor.

--

* `grep` is a standard Linux command-line utility for searching text files.

--

* In R, `grep` returns the indices of the elements that match the pattern.

--

* `grepl` returns a logical vector indicating whether there is a match:

--

```r
grep(pat, s)
## [1] 1
grepl(pat, s)
## [1]  TRUE FALSE
```

---
layout: true

## Matching Numbers

---

Reading temperature data might produce a string like

```r
s <- c("32F", "-11F")
```

--

This can be processed as

```r
substr(s, 1, nchar(s) - 1)
## [1] "32"  "-11"
```

--

or as

```r
sub("F", "", s)
## [1] "32"  "-11"
```

---

An alternative uses some more regular expression features:

--

* match the number within the string;

--

* extract a sub-match.

--

A pattern to match an integer, possibly preceded by a sign, is

```r
intpat <- "[-+]?[[:digit:]]+"
s
## [1] "32F"  "-11F"
sub(intpat, "X", s)
## [1] "XF" "XF"
```

--

The `[` and `]` meta-characters define character sets; any character
between these will match.

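
A small sketch of how character sets behave, using only base R; the example strings and variable names are invented for illustration.

```r
words <- c("sky", "tree", "R2D2")

## Any one of the characters listed between [ and ] will match.
has_vowel <- grepl("[aeiou]", words)

## A range such as 0-9 is shorthand for listing every digit.
has_digit <- grepl("[0-9]", words)

## A leading ^ inside the brackets negates the set: here every
## character that is not a digit is removed.
digits_only <- gsub("[^0-9]", "", "Tel: 555-1234")

has_vowel
has_digit
digits_only
```
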
--

`[:digit:]` specifies a _character class_ of digits.

---

There are a number of _character classes_, including

--

* `[:alpha:]` alphabetic letters;

--

* `[:digit:]` digits;

--

* `[:space:]` white space (spaces, tabs);

--

* `[:lower:]` lower case letters;

--

* `[:upper:]` upper case letters;

--

* `[:alnum:]` characters from `[:alpha:]` and `[:digit:]`;

--

* `[:punct:]` punctuation characters.

---

A _sub-pattern_ can be extracted using _back references_:

```r
sub("([-+]?[[:digit:]]+).*", "\\1", s)
## [1] "32"  "-11"
```

--

* Sub-patterns can be specified with `(` and `)`.

--

* _Back references_ can be used to refer to previous sub-patterns by number.

--

* The digit needs to be escaped with a `\`;

--

* In an R string the `\` needs to be escaped with a second `\`.

---

A sub-string approach for a temperature embedded in a string:

--

```r
s <- c("Temp: 32F", "Temp: -11F")
(s1 <- substr(s, 6, nchar(s)))
## [1] " 32F"  " -11F"
(s2 <- substr(s1, 1, nchar(s1) - 1))
## [1] " 32"  " -11"
as.numeric(s2)
## [1]  32 -11
```

--

Using regular expressions, sub-patterns, and back references:

```r
sub(".*[[:space:]]+([-+]?[[:digit:]]+).*", "\\1", s)
## [1] "32"  "-11"
```

--

`parse_number` is again an alternative:

```r
parse_number(s)
## [1]  32 -11
```

---
layout: true

## City Temperatures

---

The city temperatures data used previously can be read using

```r
library(rvest)
library(dplyr)
weather <- read_html("")
w <- html_table(html_nodes(weather, "table"))[[1]]
w1 <- w[c(1, 4)]; names(w1) <- c("city", "temp")
w2 <- w[c(5, 8)]; names(w2) <- c("city", "temp")
w3 <- w[c(9, 12)]; names(w3) <- c("city", "temp")
ww <- rbind(w1, w2, w3)
ww <- filter(ww, city != "")
head(ww)
## # A tibble: 6 × 2
##   city        temp
##   <chr>       <chr>
## 1 Accra       84 °F
## 2 Addis Ababa 63 °F
## 3 Adelaide    56 °F
## 4 Algiers     64 °F
## 5 Almaty      46 °F
## 6 Amman       57 °F
```

---

Cleaning up and extracting `dst`:

```r
www <- mutate(ww,
              dst = grepl(" \\*", city),
              city = sub(" \\*", "", city),
              temp.txt = temp, ## for checking on conversion failures
              temp = as.numeric(sub("([-+]?[[:digit:]]+).*", "\\1", temp)))
```

--

Check on `NA` values from conversion:

```r
filter(www, is.na(temp))
## # A tibble: 0 × 4
## # ℹ 4 variables: city <chr>, temp <dbl>, dst <lgl>, temp.txt <chr>
www <- select(www, -temp.txt)
```

---

Five highest and lowest temperatures (ties are kept, so six rows appear):

.pull-left[
```r
slice_max(www, temp, n = 5)
## # A tibble: 6 × 3
##   city           temp dst
##   <chr>         <dbl> <lgl>
## 1 San Salvador     95 FALSE
## 2 Managua          93 FALSE
## 3 Bangkok          86 FALSE
## 4 Kiritimati       86 FALSE
## 5 Mexico City      86 FALSE
## 6 Santo Domingo    86 FALSE
```
]

.pull-right[
```r
slice_min(www, temp, n = 5)
## # A tibble: 6 × 3
##   city       temp dst
##   <chr>     <dbl> <lgl>
## 1 Anadyr       23 FALSE
## 2 Reykjavik    36 FALSE
## 3 Helsinki     37 TRUE
## 4 Anchorage    38 TRUE
## 5 Stockholm    39 TRUE
## 6 Tallinn      39 TRUE
```
]

---

Temperatures for northern and southern hemisphere (approximately):

.pull-left[
```r
library(ggplot2)
ggplot(www, aes(x = temp, fill = dst)) +
    geom_density(alpha = 0.5)
```
]

.pull-right[
<img src="regex_files/figure-html/temp-densities-1.png" style="display: block; margin: auto;" />
]

---
layout: true

## Tricky Characters

---

Some examples:

```r
(s <- head(ww$temp))
## [1] "84 °F" "63 °F" "56 °F" "64 °F" "46 °F" "57 °F"
nchar(s)
## [1] 5 5 5 5 5 5
substr(s, 1, nchar(s) - 3)
## [1] "84" "63" "56" "64" "46" "57"
substr(s, 1, nchar(s) - 2)
## [1] "84 " "63 " "56 " "64 " "46 " "57 "
as.numeric(substr(s, 1, nchar(s) - 2))
## Warning: NAs introduced by coercion
## [1] NA NA NA NA NA NA
as.numeric("82 ")
## [1] 82
sub(" .*", "", s)
## [1] "84 °F" "63 °F" "56 °F" "64 °F" "46 °F" "57 °F"
```

--

The problem is _two_ non-ASCII characters.

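
One way to locate such characters with base R alone is to look at the code points. This is a minimal sketch, assuming a string shaped like the values above, with a no-break space and a degree sign written as explicit escapes; the names `x`, `cp`, and `non_ascii` are illustrative.

```r
## A value like those above, with the two special characters spelled
## out as Unicode escapes (an assumption about what the data contains).
x <- enc2utf8("84\u00a0\u00b0F")

## Integer code point of each character in the string.
cp <- utf8ToInt(x)

## ASCII only covers code points 0 through 127.
non_ascii <- cp[cp > 127]

## Show the offending code points in the usual U+XXXX notation.
labels <- sprintf("U+%04X", non_ascii)
labels
```
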
---

The `stri_escape_unicode` function from the `stringi` package can make
these characters more visible:

```r
stringi::stri_escape_unicode(s)
## [1] "84\\u00a0\\u00b0F" "63\\u00a0\\u00b0F" "56\\u00a0\\u00b0F"
## [4] "64\\u00a0\\u00b0F" "46\\u00a0\\u00b0F" "57\\u00a0\\u00b0F"
```

--

The troublesome characters are:

--

* No-break space U+00A0

--

* Degree symbol U+00B0

--

Using the Unicode specification for the no-break space does work:

```r
sub("\u00a0.*", "", s)
## [1] "84" "63" "56" "64" "46" "57"
```

---
layout: true

## Variations in Regular Expression Engines

---

<!-- Note on unicode digits: -->

Many tools and languages support working with regular expressions.

--

R supports:

--

* The POSIX standard for _extended regular expressions_. This is the default engine.

--

* Perl-compatible regular expressions (PCRE). This engine is selected by adding `perl = TRUE` in calls to functions using regular expressions.

--

Different engines can differ in how certain regular expressions are
interpreted, especially when non-ASCII characters are involved.

--

Different engines also sometimes offer shorthand notations, in
particular for character classes.

---

Some examples:

--

| POSIX class | similar to  | shorthand | description                   |
| ----------- | ----------- | --------- | ----------------------------- |
| `[:digit:]` | `[0-9]`     | `\d`      | digits                        |
| `[:upper:]` | `[A-Z]`     | `\u`      | upper-case letters            |
| `[:lower:]` | `[a-z]`     | `\l`      | lower-case letters            |
| `[:alpha:]` | `[A-Za-z]`  |           | upper- and lower-case letters |
| `[:space:]` | `[ \t\n]`   | `\s`      | whitespace characters         |

--

The shorthand versions need to have their `\` escaped when used in an R string:

--

```r
intpat <- ".*\\s([-+]?\\d+).*"
sub(intpat, "\\1", c("Temp: 32F", "Temp: -11F"))
## [1] "32"  "-11"
```

---
layout: false

## Raw Strings

_Raw strings_ can make writing regular expressions a little easier.

--

In R a raw string is specified as `r"(...)"`.

--

The `...` characters can be any characters and are taken literally,
without special interpretation that might require escaping.

--

For the integer pattern:

--

```r
intpat <- ".*\\s([-+]?\\d+).*"
intpat_raw <- r"(.*\s([-+]?\d+).*)"
intpat == intpat_raw
## [1] TRUE
```

--

Python, C++, and other languages provide similar facilities.

--

R's raw string syntax is modeled after the one used in C++.

---
layout: true

## A Note on Sorting

---

Non-ASCII characters can create issues for sorting strings, but even
ASCII character sort order is not the same in all locales.

--

The `LETTERS` data set contains the upper-case letters in alphabetical
order for the English sorting convention and most other locales:

```r
LETTERS
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
```

--

But in Estonian, Latvian, and Lithuanian:

```r
stringr::str_sort(LETTERS, locale = "est")
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "Z" "T" "U" "V" "W" "X" "Y"
```

---

Ordering of lower case and upper case letters in English and most other locales:

```r
stringr::str_sort(c("A", "a"), locale = "eng")
## [1] "a" "A"
```

--

But in Danish, and also Maltese:

```r
stringr::str_sort(c("A", "a"), locale = "dan")
## [1] "A" "a"
```

---
layout: true

## Encoding Issues

---

Files contain a sequence of 8-bit integers, or _bytes_.

--

These are the integers from 0 through 255.

--

For text files, these bytes are interpreted as representing characters.

--

The mapping from bytes to characters is called an _encoding_.

--

The encoding for the characters used in American English is ASCII: the
_American Standard Code for Information Interchange_.

--

The ASCII encoding uses only the integers 0 through 127.

--

The ASCII encoding is adequate for American uses; even UK text files
need more: the pound sign £.

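
The byte values behind a character can be inspected with base R's `charToRaw`. The sketch below is illustrative only: it compares an ASCII letter with the pound sign under two common encodings, UTF-8 and Latin1 (ISO-8859-1), and the variable names are made up.

```r
## "A" is ASCII: a single byte with value 65 (hex 41).
ascii_bytes <- charToRaw("A")

## The pound sign, forced to UTF-8 so the result does not depend on
## the session's locale.
pound <- enc2utf8("\u00a3")

## UTF-8 represents the pound sign with two bytes.
utf8_bytes <- charToRaw(pound)

## Latin1 represents it with a single byte above 127.
latin1_bytes <- charToRaw(iconv(pound, from = "UTF-8", to = "latin1"))

ascii_bytes
utf8_bytes
latin1_bytes
```

The same character thus corresponds to different byte sequences depending on the encoding, which is why a file read with the wrong encoding shows garbled text.
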
--

Encodings that use integers 128 through 255:

--

* Latin1 (aka ISO-8859-1) for western European languages (includes £);

--

* Latin2 (aka ISO-8859-2) for eastern European languages.

--

Many encodings are available to support other alphabets and character sets.

---

Fortunately most systems now use Unicode with the UTF-8 encoding for
representing non-ASCII characters.

--

If you need to read a file with non-ASCII characters using
`read.csv()` or similar base functions, a good place to start is to
specify `encoding = "UTF-8"`.

--

The functions in the `readr` package default to assuming the encoding is UTF-8.

--

Getting the encoding wrong can result in a few messed up characters or
in an entire string being messed up:

--

.pull-left.small-code[
Reading files without specifying the correct encoding might produce strings like

```r
x1
## [1] "El Ni\xf1o was particularly bad this year"
x2
## [1] "\x82\xb1\x82\xf1\x82ɂ\xbf\x82\xcd"
```
]

--

.pull-left.small-code[
If the correct encoding is known, then these can be fixed after the
fact with `iconv()`:

```r
iconv(x1, "Latin1", "UTF-8")
## [1] "El Niño was particularly bad this year"
iconv(x2, "Shift-JIS", "UTF-8")
## [1] "こんにちは"
```
]

---

Re-reading the file with the proper encoding specified may be a better option.

--

The `readr` function `guess_encoding()` may help identify the correct
encoding if it is not specified in the data documentation.

--

Handling encoding issues in R used to be more complicated on Windows,
but this has improved with recent releases of R.

---
layout: false

## Fonts

.pull-left[

Once you have the right encoding for character data, you also need a
set of fonts so your data shows up properly in the console, the
editor, and graphs.

{{content}}

]

--

Fonts may already be installed.

```r
c("\U{1f600}", "\u3041", "\u4a00")
## [1] "😀" "ぁ" "䨀"
```

{{content}}

--

If glyphs are not available, the characters may be shown in various ways.

```r
c("\U{105b8}", "\u3040")
## [1] "𐖸" "\u3040"
```

{{content}}

--

How to install fonts will depend on your platform.

--

.pull-right[
.hide-code[
A plot with emojis:

```r
data.frame(x = 1:10,
           y = runif(10),
           z = sapply(0x1f600 + 0:9, intToUtf8)) |>
    ggplot(aes(x, y, label = z)) +
    geom_text()
```

<img src="regex_files/figure-html/unnamed-chunk-56-1.png" style="display: block; margin: auto;" />
]
]

---

## A Note on Line Endings

On Linux and current macOS, lines of text in files are terminated by a
_line feed_ (LF) character `\n` or `\x0a`.

--

On Windows, and in HTML, lines are terminated by a _carriage return_
(CR, `\r`, `\x0d`) followed by a line feed (CRLF).

--

On Linux and macOS there is no difference in reading or writing to a
file in _text mode_ or _binary mode_.

--

On Windows, reading in text mode translates CRLF line endings to LF,
and writing translates LF to CRLF.

--

Many file transfer operations will do this conversion.

--

Git will check out text files with the appropriate line endings for the platform.

--

Most higher-level tools on Linux will work with CRLF line endings, and
on Windows will work with LF line endings.

--

Occasionally you may see some issues, and need to fix the line endings yourself.

---

## Getting the Current Temperature

The tools described here come in handy when scraping data from the web.

--

This code gets the [current temperature in Iowa City](,IA) from the
National Weather Service:

```r
library(xml2)
url <- ",IA"
page <- read_html(url)
xpath <- "//p[@class=\"myforecast-current-lrg\"]"
tempNode <- xml_find_first(page, xpath)
nodeText <- xml_text(tempNode)
as.numeric(sub("([-+]?[[:digit:]]+).*", "\\1", nodeText))
```

--

An example of creating a current temperature map is described
[here](../weather.html).

--

## Reading

Chapters _Data Import_ and _Strings_ in _R for Data Science_.

Chapter _String processing_ in _Introduction to Data Science: Data
Analysis and Prediction Algorithms with R_.
