A Note on Line Endings 
On Linux and current macOS lines of text in files are terminated by a line feed  (LF) character \n or \x0a.
On Windows, and in HTML, lines are terminated by a carriage return  (CR, \r, or \x0d) followed by a line feed (CRLF).
On Linux and macOS there is no difference in reading or writing to a file in text mode  or binary mode .
On Windows, reading in text mode translates CRLF line endings to LF, and writing translates LF to CRLF.
Many file transfer operations will do this conversion.
Git will check out text files with the appropriate line endings for the platform.
Most higher-level tools on Linux will work with CRLF line endings, and on Windows will work for LF line endings.
Occasionally you may see some issues, and need to fix the line endings yourself.
 
Getting the Current Temperature 
The tools described here come in handy when scraping data from the web.
This code gets the current temperature in Iowa City  from the National Weather Service:
library(xml2)
url <- "http://forecast.weather.gov/zipcity.php?inputstring=Iowa+City,IA"
page <- read_html(url)
xpath <- "//p[@class=\"myforecast-current-lrg\"]"
tempNode <- xml_find_first(page, xpath)
nodeText <- xml_text(tempNode)
as.numeric(sub("([-+]?[[:digit:]]+).*", "\\1", nodeText))
An example of creating a current temperature map is described here.
 
Exercises 
Complete all lessons and exercises in the https://regexone.com/  online interactive tutorial.
Consider the code
library(tidyverse)
filter(mpg, grepl(---, model))
For which of the following regular expressions in place of --- will this code return the subset of rows for all models that contain either 4wd or awd in their model names?
“[a4]wd” 
“[a4]wd” 
“4awd” 
“[[4a]]wd” 
  
 
---
title: "String Parsing and Regular Expressions"
output:
  html_document:
    toc: yes
    code_folding: show
    code_download: true
---

<link rel="stylesheet" href="stat4580.css" type="text/css" />
<style type="text/css"> .remark-code { font-size: 85%; } </style>
```{r setup, include = FALSE}
source(here::here("setup.R"))
knitr::opts_chunk$set(collapse = TRUE, message = FALSE,
                      fig.height = 5, fig.width = 6, fig.align = "center")

options(htmltools.dir.version = FALSE)

set.seed(12345)
library(ggplot2)
library(lattice)
library(tidyverse)
library(gridExtra)
theme_set(theme_minimal() +
          theme(text = element_text(size = 16),
                panel.border = element_rect(color = "grey30", fill = NA)))
```


## Data Cleaning

After reading data from text files or web pages it is common to have to

* clean up string variables;

* extract numbers.

This can be done using

* high-level functions that usually, but not always, work;

* string subsetting;

* _regular expressions_.

Some examples of strings that need to be processed:

```{r, eval = FALSE}
"12%"
"New York *"
"2,100"
"Temp: 12 \u00b0F"
```

Some of the most common cases are covered here.

Much more is available in [_R for Data
Science_](https://r4ds.hadley.nz/), in particular in the chapters

* [_Data Import_](https://r4ds.hadley.nz/data-import.html);
* [_Strings_](https://r4ds.hadley.nz/strings.html).


## Removing a Percent Sign

Reading the GDP growth rate data from a 
[web page](https://www.multpl.com/us-real-gdp-growth-rate/table/by-quarter)
produced a data frame with a column like

```{r}
s <- c("12%", "2%")
```

This can be converted to a numeric variable by

* extracting the sub-string without the `%`

* and then using `as.numeric`.

```{r}
nchar(s)
substr(s, 1, nchar(s) - 1)
as.numeric(substr(s, 1, nchar(s) - 1))
```

An alternative is to use the `sub()` function to replace `"%"` with the
empty string `""`.

```{r}
as.numeric(sub("%", "", s))
```

The function `parse_number` in the `readr` package ignores the percent
sign and extracts the numbers correctly:

```{r}
library(readr)
parse_number(s)
```


## Removing Grouping Characters

Numbers are sometimes written using _grouping characters_:

```{r}
s1 <- c("800", "2,100")
s2 <- c("800", "2,100", "3,123,500")
```

* The comma is often used as a grouping character in the US.

* Other countries use different characters.

* Other countries also use different characters for the decimal separator.

`sub` and `gsub` can be used to remove grouping characters:

```{r}
sub(",", "", s1)
sub(",", "", s2)
```

```{r}
gsub(",", "", s2)
as.numeric(gsub(",", "", s2))
```

* `sub` replaces the first match to a pattern;

* `gsub` replaces all matches.
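
The difference between `sub` and `gsub` also matters when a number uses a period for grouping and a comma as the decimal separator, as is common in parts of Europe. The following is a minimal sketch on made-up strings; the variable names are hypothetical, and `fixed = TRUE` asks for the period to be matched as a literal character.

```{r}
## Hypothetical strings using "." for grouping and "," for decimals
eu_strings <- c("1.234,56", "12,5", "3.000.000")

## Remove every grouping period; fixed = TRUE matches "." literally
no_groups <- gsub(".", "", eu_strings, fixed = TRUE)

## Replace the single decimal comma by a period, then convert
eu_numbers <- as.numeric(sub(",", ".", no_groups, fixed = TRUE))
eu_numbers
```

Using `sub` in the first step would remove only the first period of `"3.000.000"`, and the conversion would then fail.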

`parse_number` can again be used:

```{r}
parse_number(s2)
```

`parse_number` is convenient but may be less robust:

* In Switzerland the grouping character is `'`.

* If `parse_number` has its defaults set to Swiss conventions, the comma
  is no longer treated as a grouping mark and parsing stops at the first
  comma:

```{r}
parse_number(s2, locale = locale(grouping_mark = "'"))
```
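
Conversely, when the data really do follow Swiss conventions, declaring them through `locale()` is what allows the numbers to be read. A small sketch with made-up values; the variable names `swiss` and `swiss_numbers` are hypothetical.

```{r}
library(readr)

## Made-up numbers written with the apostrophe as the grouping mark
swiss <- c("800", "2'100", "3'123'500")

## Declaring the grouping mark lets each whole number be read
swiss_numbers <- parse_number(swiss, locale = locale(grouping_mark = "'"))
swiss_numbers
```

The convenience of `parse_number` therefore depends on knowing which conventions the data source uses.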

## Separating City and State

Data often has city and state specified in a variable like

```{r}
s <- c("Boston, MA", "Iowa City, IA")
```

If all state specifications are in two-letter form then city and state
can be extracted as sub-strings:

```{r}
substr(s, 1, nchar(s) - 4)
substr(s, nchar(s) - 1, nchar(s))
```

This would not work if full state names are used.
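
To see the failure concretely, here is a sketch with hypothetical data in which the states are spelled out; the fixed offsets now cut the strings in the wrong places.

```{r}
## Hypothetical data with full state names
full_names <- c("Boston, Massachusetts", "Iowa City, Iowa")

## Dropping the last four characters no longer isolates the city
bad_city <- substr(full_names, 1, nchar(full_names) - 4)
bad_city

## The last two characters are not a state abbreviation
bad_state <- substr(full_names, nchar(full_names) - 1, nchar(full_names))
bad_state
```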

An alternative is to use a _regular expression_.


## Regular Expressions

Regular expressions are a language for expressing patterns in strings.

Regular expressions should be developed carefully, like any program,
starting with simple steps and building up.

The simplest regular expressions are literal strings, like `%`.

More complex expressions are built up using _meta-characters_ that
have special meanings in regular expressions.

Many punctuation characters are regular expression meta-characters.

Paul Murrell's [_Introduction to Data
  Technologies_](http://www.stat.auckland.ac.nz/~paul/ItDT/) provides
  a good introduction in Section 9.9.2 and an extensive reference in
  Chapter 11.

The [Strings chapter](https://r4ds.hadley.nz/strings.html) in [R for
Data Science](https://r4ds.hadley.nz/) also provides an introduction to
regular expressions, but uses its own set of functions from the
`tidyverse`.

The web site
[Regular-Expressions.info](https://www.regular-expressions.info/) is a
useful on-line resource.


## Regular Expression Meta-Characters

Some important meta-characters are `.`, `*`, `+`, and `?`:

* The period `.` stands for any character.

* The asterisk `*` means zero or more of the preceding character
  specification.

* The plus sign `+` means one or more of the preceding character
  specification.

* The question mark `?` means zero or one of the preceding character
  specification.

The pattern `",.*"` matches a comma `,` followed by zero or more
characters:

```{r}
s
sub(",.*", "", s)
```

The pattern `".*, "` matches zero or more characters followed by a
comma and a space:

```{r}
sub(".*, ", "", s)
```
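
Two further illustrations may help, offered as a sketch on made-up strings with hypothetical variable names. The first two calls show `?` and `+` in isolation; the last part shows that `.*` is _greedy_, meaning it matches as many characters as it can.

```{r}
## `u?` makes the "u" optional, so both spellings match
spellings <- c("color", "colour", "colr")
grepl("colou?r", spellings)

## `b+` requires at least one "b"
grepl("ab+", c("a", "ab", "abbb"))

## With two commas, `.*, ` runs through the last comma and space
place <- "Portland, OR, USA"
first_part <- sub(",.*", "", place)
last_part <- sub(".*, ", "", place)
c(first_part, last_part)
```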


## Trimming White Space

If the data file is not consistent in its use of spaces in the
separator, another possibility is to

* remove the characters through the comma;

* trim the white space from the result.

```{r}
sub(".*,", "", s)
```

```{r}
trimws(sub(".*,", "", s))
```
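
The two steps can also be folded into the pattern itself by allowing zero or more spaces after the comma. A minimal sketch, using hypothetical strings with inconsistent spacing:

```{r}
## Made-up values with zero, one, or two spaces after the comma
messy <- c("Boston,MA", "Ames, IA", "Iowa City,  IA")

## " *" matches zero or more spaces
messy_state <- sub(".*, *", "", messy)
messy_city <- sub(" *,.*", "", messy)
messy_state
messy_city
```

`trimws` remains the simpler choice when white space may also appear at the start or end of the whole string.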


## Using `separate`

If the city-state variable is already in a data frame or tibble, then
the `separate` function from the `tidyr` package can be used:

```{r}
library(tibble)
library(tidyr)
d <- tibble(citystate = s)
d
```

```{r}
separate(d, citystate, c("city", "state"), sep = ", ")
```
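
When `sep` is a character string, `separate` interprets it as a regular expression, so inconsistent spacing can be handled in the same call. A sketch with a hypothetical tibble `d_messy`:

```{r}
library(tibble)
library(tidyr)

## Made-up values with inconsistent spacing after the comma
d_messy <- tibble(citystate = c("Boston,MA", "Iowa City,  IA"))

## ", *" splits at a comma followed by zero or more spaces
d_split <- separate(d_messy, citystate, c("city", "state"), sep = ", *")
d_split
```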


## Escaping Meta-Characters

Reading data from
[city temperatures](https://www.timeanddate.com/weather/)
produces a variable that looks like

```{r}
s <- c("London *", "Sydney")
```

The `*` indicates daylight saving or summer time.

We would like to

* extract the city name;

* extract whether there is a `*`.

The `*` is a meta-character.

To include a literal meta-character in a pattern the meta-character
needs to be _escaped_.

A meta-character is escaped by preceding it with a backslash `\`.

But the backslash is a meta-character for R strings!

To put a backslash into an R string it needs to be written as `\\`.

The pattern we want to match, a space followed by a `*`, is 
`␣\*`, with `␣` denoting a space character.

An R string containing these three characters is written as `" \\*"`.

It is often useful to write a pattern once and save it in a variable:

```{r}
(pat <- " \\*")
```

This string contains three characters:

```{r}
nchar(pat)
```

Standard printing includes the backslash escape, and other escape
characters, so the printed string can be read back into R:

```{r}
"a, b
 and c"
```

The `writeLines` function is useful to see the characters in a string.

```{r}
writeLines(pat)
```

To help make the space more visible we can add a delimiter:

```{r}
writeLines(paste0("'", pat, "'"))
```

Another option is to use `sprintf` with `writeLines`:

```{r}
writeLines(sprintf("'%s'", pat))
```

This pattern removes the space and asterisk if present:

```{r}
s
sub(pat, "", s)
```

The `grep` and `grepl` functions check whether a pattern matches in
elements of a character vector.

* `grep` takes its name from the `ed` editor command `g/re/p`: globally
  search for a regular expression and print the matching lines.

* `grep` is a standard Linux command-line utility for searching text files.

* In R, `grep` returns the indices of the elements that match the pattern.

* `grepl` returns a logical vector indicating whether there is a match:

```{r}
grep(pat, s)
grepl(pat, s)
```
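
When the pattern is purely literal, an alternative to escaping is the `fixed = TRUE` argument accepted by `sub`, `gsub`, `grep`, and `grepl`, which turns off regular expression interpretation altogether. A small sketch with hypothetical variable names:

```{r}
city_example <- c("London *", "Sydney")

## fixed = TRUE: " *" is a literal space followed by a literal asterisk
clean_city <- sub(" *", "", city_example, fixed = TRUE)
has_star <- grepl("*", city_example, fixed = TRUE)
clean_city
has_star

## The period is also a meta-character: unescaped it matches any character
file_example <- c("data.csv", "datacsv")
grepl(".", file_example)     # matches both names
grepl("\\.", file_example)   # matches only the name containing a period
```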


## Matching Numbers

Reading temperature data might produce a string like

```{r}
s <- c("32F", "-11F")
```

This can be processed as

```{r}
substr(s, 1, nchar(s) - 1)
```

or as

```{r}
sub("F", "", s)
```

An alternative uses some more regular expression features:

* match the number within the string;

* extract a sub-match.

A pattern to match an integer, possibly preceded by a sign, is

```{r}
intpat <- "[-+]?[[:digit:]]+"
s
sub(intpat, "X", s)
```

The `[` and `]` meta-characters define character sets; any single character
listed between them will match.

`[:digit:]` specifies a _character class_ of digits.

There are a number of _character classes_, including

* `[:alpha:]` alphabetic letters;

* `[:digit:]` digits;

* `[:space:]` white space (spaces, tabs, newlines);

* `[:lower:]` lower case letters;

* `[:upper:]` upper case letters;

* `[:alnum:]` characters from `[:alpha:]` and `[:digit:]`;

* `[:punct:]` punctuation characters.
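
Character classes are used inside a character set, hence the doubled brackets. A short sketch on made-up strings, with hypothetical variable names:

```{r}
## Remove all punctuation characters
no_punct <- gsub("[[:punct:]]", "", c("U.S.A.", "Hello, world!"))
no_punct

## Collapse runs of white space into a single space
one_space <- gsub("[[:space:]]+", " ", "a   b\tc")
one_space

## A class can be combined with other characters in one set;
## inside a set the period is an ordinary character
has_digit_or_dot <- grepl("[[:digit:].]", c("abc", "a1", "a.b"))
has_digit_or_dot
```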

A _sub-pattern_ can be extracted using _back references_:

```{r}
sub("([-+]?[[:digit:]]+).*", "\\1", s)
```

* Sub-patterns can be specified with `(` and `)`.

* _Back references_ can be used to refer to previous sub-patterns by number.

* The digit needs to be escaped with a `\`;

* In an R string the `\` needs to be escaped with a second `\`.
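
A pattern can contain several sub-patterns, numbered from left to right by their opening parenthesis. As a sketch, city and state strings like the ones used earlier (re-created here under the hypothetical name `cs`) can be rearranged in a single call:

```{r}
cs <- c("Boston, MA", "Iowa City, IA")

## \\1 refers to the city, \\2 to the state
swapped <- sub("(.*), (.*)", "\\2: \\1", cs)
swapped
```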

A sub-string approach for a temperature embedded in a string:

```{r}
s <- c("Temp:  32F", "Temp: -11F")
(s1 <- substr(s, 6, nchar(s)))
(s2 <- substr(s1, 1, nchar(s1) - 1))
as.numeric(s2)
```

Using regular expressions, sub-patterns, and back references:

```{r}
sub(".*[[:space:]]+([-+]?[[:digit:]]+).*", "\\1", s)
```

`parse_number` is again an alternative:

```{r}
parse_number(s)
```
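
Base R can also return the matched text directly, without writing a replacement. The sketch below combines `regexpr`, which locates the first match in each string, with `regmatches`, which extracts it; the variable names are hypothetical. Note that `regmatches` drops elements with no match, so the result can be shorter than the input.

```{r}
temp_strings <- c("Temp:  32F", "Temp: -11F", "no reading")
int_pattern <- "[-+]?[[:digit:]]+"

## Locate the first match in each element, then pull out the matched text
found <- regmatches(temp_strings, regexpr(int_pattern, temp_strings))
found
as.numeric(found)
```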


## City Temperatures

The [city temperatures](https://www.timeanddate.com/weather/)
data used previously can be read using

```{r, message = FALSE}
library(rvest)
library(dplyr)
weather <- read_html("https://www.timeanddate.com/weather/")
w <- html_table(html_nodes(weather, "table"))[[1]]

w1 <- w[c(1, 4)]; names(w1) <- c("city", "temp")
w2 <- w[c(5, 8)]; names(w2) <- c("city", "temp")
w3 <- w[c(9, 12)]; names(w3) <- c("city", "temp")
ww <- rbind(w1, w2, w3)
ww <- filter(ww, city != "")
head(ww)
```

Cleaning up and extracting `dst`:
```{r}
www <- mutate(ww,
              dst = grepl(" \\*", city),
              city = sub(" \\*", "", city),
              temp.txt = temp,   ## for checking on conversion failures
              temp = as.numeric(sub("([-+]?[[:digit:]]+).*", "\\1", temp)))
```

Check on `NA` values from conversion:

```{r}
filter(www, is.na(temp))
www <- select(www, -temp.txt)
```
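
The same check can be tried on a small made-up vector to see what a conversion failure looks like. This is only a sketch: the values and variable names are hypothetical, and `suppressWarnings` hides the coercion warning that `as.numeric` would otherwise issue.

```{r}
## Made-up temperature strings, one of which has no number
temp_check <- c("84 F", "n/a", "-3 F")

## Where the pattern does not match, sub() returns the string unchanged
temp_num <- suppressWarnings(
    as.numeric(sub("([-+]?[[:digit:]]+).*", "\\1", temp_check)))
temp_num

## The original text of any value that failed to convert
temp_check[is.na(temp_num)]
```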

Five highest and lowest temperatures: 
```{r}
slice_max(www, temp, n = 5)
```
```{r}
slice_min(www, temp, n = 5)
```

Temperatures for northern and southern hemisphere (approximately):

```{r temp-densities, eval = FALSE}
ggplot(www, aes(x = temp, fill = dst)) +
    geom_density(alpha = 0.5)
```
```{r temp-densities, echo = FALSE}
```


## Tricky Characters

Some examples:

```{r}
(s <- head(ww$temp))
nchar(s)
substr(s, 1, nchar(s) - 3)
substr(s, 1, nchar(s) - 2)
as.numeric(substr(s, 1, nchar(s) - 2))
as.numeric("82 ")
sub(" .*", "", s)
```

The problem is caused by _two_ non-ASCII characters.

The `stri_escape_unicode` function from the `stringi` package can make
these characters more visible:

```{r}
stringi::stri_escape_unicode(s)
```

The troublesome characters are:

* [No-break space U+00A0](https://www.fileformat.info/info/unicode/char/00a0/index.htm).

* [Degree symbol
  U+00B0](https://www.fileformat.info/info/unicode/char/00b0/index.htm).

Using the Unicode escape for the no-break space does work:

```{r}
sub("\u00a0.*", "", s)
```
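
Code points offer another way to tell the two kinds of space apart. In this sketch (the variable names are hypothetical) `utf8ToInt` reports the code point of a character: 32 for an ordinary space and 160, which is hexadecimal `a0`, for the no-break space.

```{r}
## A temperature string written with explicit Unicode escapes
nbsp_example <- "84\u00a0\u00b0F"

utf8ToInt(" ")                          # ordinary space
utf8ToInt(substr(nbsp_example, 3, 3))   # no-break space

## Replace the no-break space by an ordinary one, then proceed as usual
plain <- gsub("\u00a0", " ", nbsp_example, fixed = TRUE)
plain_num <- as.numeric(sub(" .*", "", plain))
plain_num
```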


## Variations in Regular Expression Engines

<!-- Note on unicode digits:
 https://unix.stackexchange.com/questions/414226/difference-between-0-9-digit-and-d
-->

Many tools and languages support working with regular expressions.

R supports:

* The POSIX standard for _extended regular expressions_. This is the
  default engine.

* Perl-compatible regular expressions
  ([PCRE](https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions)).
  This engine is selected by adding `perl = TRUE` in calls to
  functions using regular expressions.

Different engines can differ in how certain regular expressions are
interpreted, especially when non-ASCII characters are involved.

Different engines also sometimes offer shorthand notations, in
particular for character classes.

Some examples:

| POSIX class | similar to | shorthand | matches                       |
| ----------- | ---------- | --------- | ----------------------------- |
| `[:digit:]` | `[0-9]`    | `\d`      | digits                        |
| `[:upper:]` | `[A-Z]`    | `\u`      | upper-case letters            |
| `[:lower:]` | `[a-z]`    | `\l`      | lower-case letters            |
| `[:alpha:]` | `[A-Za-z]` |           | upper- and lower-case letters |
| `[:space:]` | `[ \t\n]`  | `\s`      | white space characters        |

The shorthand versions need to have their `\` escaped when used in an R
string:

```{r}
intpat <- ".*\\s([-+]?\\d+).*"
sub(intpat, "\\1", c("Temp:  32F", "Temp: -11F"))
```
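
Some features exist in only one engine. As one illustration, R's documentation for `sub` and `gsub` notes that with `perl = TRUE` the replacement may contain `\U` or `\L` to convert what follows to upper or lower case. A minimal sketch on made-up strings, with hypothetical variable names:

```{r}
## \\b marks a word boundary and \\w a word character
title_case <- gsub("\\b(\\w)", "\\U\\1", c("iowa city", "new york"),
                   perl = TRUE)
title_case

## The shorthand classes also work with the Perl-compatible engine
perl_int <- sub(".*\\s([-+]?\\d+).*", "\\1", "Temp: -11F", perl = TRUE)
perl_int
```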


## Raw Strings

_Raw strings_ can make writing regular expressions a little easier.

In R a raw string is specified as `r"(...)"`.

The `...` characters can be any characters and are taken literally,
without special interpretation that might require escaping.

For the integer pattern:

```{r}
intpat <- ".*\\s([-+]?\\d+).*"
intpat_raw <- r"(.*\s([-+]?\d+).*)"
intpat == intpat_raw
```

Python, C++, and other languages provide similar facilities.

R's raw string syntax is modeled after the one used in C++.
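
Raw strings are also handy outside of regular expressions, for example for Windows file paths. The sketch below uses a made-up path and hypothetical variable names. According to R's documentation on quotes, square brackets or braces can serve as delimiters instead of parentheses, which helps when the text itself contains parentheses.

```{r}
## A made-up Windows path: no doubling of backslashes is needed
win_path <- r"(C:\data\gdp.csv)"
identical(win_path, "C:\\data\\gdp.csv")

## An escaped period as a pattern
dot_pat <- r"(\.)"
grepl(dot_pat, c("a.b", "ab"))

## Square brackets as delimiters, so parentheses are ordinary text
identical(r"[(a)]", "(a)")
```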


## A Note on Sorting

Non-ASCII characters can create issues for sorting strings, but even
ASCII character sort order is not the same in all
[locales](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes).

The `LETTERS` data set contains the upper-case letters in alphabetical
order for the English sorting convention and most other locales:

```{r}
LETTERS
```

But in Estonian, Latvian, and Lithuanian:

```{r}
stringr::str_sort(LETTERS, locale = "est")
```

Ordering of lower case and upper case letters in English and most other
locales:

```{r}
stringr::str_sort(c("A", "a"), locale = "eng")
```

But in Danish, and also Maltese:

```{r}
stringr::str_sort(c("A", "a"), locale = "dan")
```
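
Base R's `sort` follows the collation rules of the current locale, so its result for mixed-case strings can differ between machines. One way to get a reproducible order, sketched here on a made-up vector, is `method = "radix"`, which R documents as always collating character vectors in the C locale, where all upper-case ASCII letters come before all lower-case ones.

```{r}
mixed_case <- c("b", "A", "a", "B")

## Depends on the locale of the R session
sort(mixed_case)

## C-locale order, the same everywhere
radix_sorted <- sort(mixed_case, method = "radix")
radix_sorted
```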

```{r, echo = FALSE, eval = FALSE}
## read a table from Wikipedia with the locale abbreviations
url <- "https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes"
library(rvest)
library(dplyr)
w <- read_html(url)

## different letter sort order
tbl <- html_table(html_nodes(w, "table"), fill = TRUE)
ltbl <- tbl[[2]]
lcodes <- ltbl[[6]]
v <- sapply(lcodes,
            function(loc)
                identical(stringr::str_sort(LETTERS, locale = loc),
                          LETTERS))
ltbl[!v, c("Language family", "ISO language name", "639-2/T")]

## different upper/lower case sort order
ss <- c("a", "A")
vv <- sapply(lcodes,
             function(loc)
                identical(stringr::str_sort(ss, locale = loc),
                          ss))
ltbl[!vv, c("Language family", "ISO language name", "639-2/T")]

```


## Encoding Issues

Files contain a sequence of 8-bit integers, or _bytes_.

These are the integers from 0 through 255.

For text files, these bytes are interpreted as representing characters.

The mapping from bytes to characters is called an _encoding_.

The encoding for the characters used in American English is ASCII:
the _American Standard Code for Information Interchange_.

The [ASCII encoding](https://ascii-tables.com/) uses only the
integers 0 through 127.

The ASCII encoding is adequate for most American English text; even UK
text files need more, for example the pound sign £.

Encodings that use integers 128 through 255:

* Latin1 (aka ISO-8859-1) for western European languages (includes £);

* Latin2 (aka ISO-8859-2) for eastern European languages.

Many encodings are available to support other alphabets and character sets.

Fortunately most systems now use [Unicode](https://home.unicode.org/)
with the [UTF-8 encoding](https://en.wikipedia.org/wiki/UTF-8) for
representing non-ASCII characters.

If you need to read a file with non-ASCII characters using `read.csv()`
or similar base functions a good place to start is to specify
`encoding = "UTF-8"`.

The functions in the `readr` package default to assuming the encoding
is UTF-8.
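A sketch of the base R approach, assuming a small made-up CSV file written to a temporary location; the file contents and column names are only for illustration.

```r
## write a small UTF-8 encoded CSV file byte-for-byte
tf <- tempfile(fileext = ".csv")
writeBin(charToRaw("city,event\nLima,El Ni\u00f1o\n"), tf)

## encoding = "UTF-8" asks read.csv() to mark the strings as UTF-8
d <- read.csv(tf, encoding = "UTF-8")
d$event
Encoding(d$event)

unlink(tf)
```

If a file is in some other encoding, the `fileEncoding` argument of `read.csv()` can be used to re-encode the input as it is read.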

Getting the encoding wrong can result in a few messed up characters
or in an entire string being messed up:

```{r include = FALSE}
x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
```

Reading files without specifying the correct encoding
might produce strings like

```{r}
x1
x2
```

If the correct encoding is known, then these can be fixed after the
fact with `iconv()`:

```{r}
iconv(x1, "Latin1", "UTF-8")
iconv(x2, "Shift-JIS", "UTF-8")
```

Re-reading the file with the proper encoding specified may be a better
option.

The `readr` function `guess_encoding()` may help identify the correct
encoding if it is not specified in the data documentation.
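A sketch of checking some bytes before deciding how to read them, assuming the bytes come from a Latin1 file like the first example above. Base R's `validUTF8()` can rule out UTF-8, and `guess_encoding()` accepts a raw vector as well as a file name.

```r
## bytes of a Latin1-encoded string (0xf1 is n-tilde in Latin1)
bytes <- c(charToRaw("El Ni"), as.raw(0xf1),
           charToRaw("o was particularly bad this year"))

## base R: these bytes are not valid UTF-8, so UTF-8 is ruled out
validUTF8(rawToChar(bytes))
## [1] FALSE

## readr: list candidate encodings with a rough confidence score
if (requireNamespace("readr", quietly = TRUE))
    readr::guess_encoding(bytes)
```

The guesses are heuristic and tend to be less reliable for short inputs, so it is worth checking the result by eye.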

Handling encoding issues in R used to be more complicated on Windows, but
this has improved with recent releases of R.


## Fonts

Once you have the right encoding for character data, you also need a set
of fonts so your data shows up properly in the console, the editor, and
graphs.

Fonts may already be installed.

```{r}
c("\U{1f600}", "\u3041", "\u4a00")
```

If glyphs are not available, the characters may be shown in various ways.

```{r}
c("\U{105b8}", "\u3040")
```

How to install fonts will depend on your platform.

A plot with emojis:

```{r, class.source = "fold-hide"}
data.frame(x = 1:10,
           y = runif(10),
           z = sapply(0x1f600 + 0:9,
                      intToUtf8)) |>
ggplot(aes(x, y, label = z)) +
    geom_text()
```
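The labels in the plot above are built from Unicode code points with `intToUtf8()`. This sketch shows the mapping for the first three of those code points; it only uses base R.

```r
## code points for the first three emoji used in the plot
cp <- 0x1f600 + 0:2
sprintf("U+%X", cp)
## [1] "U+1F600" "U+1F601" "U+1F602"

## intToUtf8() maps code points to characters; utf8ToInt() inverts it
z <- intToUtf8(cp, multiple = TRUE)
utf8ToInt(z[1]) == 0x1f600
## [1] TRUE

## one character on screen, four bytes in UTF-8
nchar(z[1], type = "chars")
nchar(z[1], type = "bytes")
```

Whether these characters display as emoji still depends on the fonts available on your system.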


## A Note on Line Endings

On Linux and current macOS, lines of text in files are terminated by a
_line feed_ (LF) character, `\n` or `\x0a`.

On Windows, and in network protocols such as HTTP, lines are terminated
by a _carriage return_ (CR, `\r` or `\x0d`) followed by a line feed (CRLF).

On Linux and macOS there is no difference in reading or writing to a
file in _text mode_ or _binary mode_.

On Windows, reading in text mode translates CRLF line endings to LF,
and writing translates LF to CRLF.

Many file transfer operations will do this conversion.

Git will check out text files with the appropriate line endings for
the platform.

Most higher-level tools on Linux will work with CRLF line endings, and
most on Windows will work with LF line endings.

Occasionally you may see some issues, and need to fix the line endings
yourself.
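One way to check for and fix CRLF line endings from R is to work with the raw bytes, so that no text-mode translation gets in the way. This is a minimal sketch on a made-up temporary file, not a general-purpose tool.

```r
## write two CRLF-terminated lines byte-for-byte (no translation)
tf <- tempfile(fileext = ".csv")
writeBin(charToRaw("a,b\r\n1,2\r\n"), tf)

## read the raw bytes back and look for carriage returns (0x0d)
bytes <- readBin(tf, what = "raw", n = file.size(tf))
any(bytes == as.raw(0x0d))
## [1] TRUE

## replace each CRLF pair by a single LF and write the result back
fixed <- gsub("\r\n", "\n", rawToChar(bytes), fixed = TRUE)
writeBin(charToRaw(fixed), tf)
any(readBin(tf, what = "raw", n = file.size(tf)) == as.raw(0x0d))
## [1] FALSE

unlink(tf)
```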


## Getting the Current Temperature 

The tools described here come in handy when scraping data from the web.

This code gets the [current temperature in Iowa
City](http://forecast.weather.gov/zipcity.php?inputstring=Iowa+City,IA)
from the National Weather Service:

```{r, eval = FALSE}
library(xml2)
url <- "http://forecast.weather.gov/zipcity.php?inputstring=Iowa+City,IA"
page <- read_html(url)
## select the <p> element with the class used in the XPath expression
xpath <- "//p[@class=\"myforecast-current-lrg\"]"
tempNode <- xml_find_first(page, xpath)
nodeText <- xml_text(tempNode)
## keep the leading signed integer and drop the text that follows it
as.numeric(sub("([-+]?[[:digit:]]+).*", "\\1", nodeText))
```
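The final step can be tried without a network connection. This sketch applies the same `sub()` call to a few hypothetical strings of the form the node text is assumed to have; they are not actual output from the weather service.

```r
## hypothetical node text: a signed integer followed by a degree sign and unit
nodeText <- c("72\u00b0F", "-5\u00b0F", "+3\u00b0F")

## the capture group keeps the optional sign and the digits;
## the trailing .* consumes the degree sign and the unit
temps <- as.numeric(sub("([-+]?[[:digit:]]+).*", "\\1", nodeText))
temps
## [1] 72 -5  3
```

The pattern is not anchored, so text before the number would be kept and the conversion would then produce `NA`.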

An example of creating a current temperature map is described
[here](`r WLNK("weather.html")`).


## Reading

Chapters [_Data Import_](https://r4ds.hadley.nz/data-import.html) and
[_Strings_](https://r4ds.hadley.nz/strings.html) in [_R for Data
Science_](https://r4ds.hadley.nz/).

Chapter [_String
processing_](https://rafalab.dfci.harvard.edu/dsbook-part-1/wrangling/string-processing.html)
in [_Introduction to Data Science: Data Analysis and Prediction
Algorithms with R_](https://rafalab.dfci.harvard.edu/dsbook-part-1/).


## Exercises

<!--
from https://rafalab.github.io/dsbook/string-processing.html#exercises-41
-->
1. Complete all lessons and exercises in the https://regexone.com/
   online interactive tutorial.

2. Consider the code

    ```{r, eval = FALSE}
    library(tidyverse)
    filter(mpg, grepl(---, model))
    ```

    For which of the following regular expressions in place of `---` will
    this code return the subset of rows for all models that contain either
    `4wd` or `awd` in their model names?

    a. "[a4]wd "
    b. "[a4]wd"
    c. "4awd"
    d. "[[4a]]wd"
