After reading data from text files or web pages it is common to have to

This can be done using

Some examples of strings that need to be processed:

## [1] "12%"
## [1] "New York *"
## [1] "2,100"
## [1] "Temp: 12 °F"

This note covers some of the most common cases; much more is available in R for Data Science, in particular in the chapters

Removing a Percent Sign

Reading the GDP growth rate data from a web page prduced a data frame with a column like

s <- c("12%", "2%")

This can be converted to a numeric variable by extracting the sub-string without the % and using as.numeric:

nchar(s)
## [1] 3 2
substr(s, 1, nchar(s) - 1)
## [1] "12" "2"
as.numeric(substr(s, 1, nchar(s) - 1))
## [1] 12  2

An alternative is to use sub to replace "%" by the empty string ""

as.numeric(sub("%", "", s))
## [1] 12  2

The function parse_number in the readr package ignores the percent sign and extracts the numbers correctly:

library(readr)
parse_number(s)
## [1] 12  2

Removing Grouping Characters

Numbers are sometimes written using grouping characters:

s1 <- c("800", "2,100")
s2 <- c("800", "2,100", "3,123,500")

sub and gsub can be used to remove grouping characters:

sub(",", "", s1)
## [1] "800"  "2100"
sub(",", "", s2)
## [1] "800"      "2100"     "3123,500"
gsub(",", "", s2)
## [1] "800"     "2100"    "3123500"
as.numeric(gsub(",", "", s2))
## [1]     800    2100 3123500

parse_number can again be used:

parse_number(s2)
## [1]     800    2100 3123500

parse_number is convenient but may be less robust:

parse_number(s2, locale = locale(grouping_mark = "'"))
## [1] 800   2   3

Separating City and State

Data often has city and state specified in a variable like

s <- c("Boston, MA", "Iowa City, IA")

If all state specifications are in two-letter form then city and state can be extracted as sub-strings:

substr(s, 1, nchar(s) - 4)
## [1] "Boston"    "Iowa City"
substr(s, nchar(s) - 1, nchar(s))
## [1] "MA" "IA"

This would not work if full state names are used.

An alternative is to use a regular expression.

Two meta-characters are the period . and the asterisk *:

The pattern ",.*" matches a comma , followed by zero or more characters:

sub(",.*", "", s)
## [1] "Boston"    "Iowa City"

The pattern ".*," matches zero or more characters followed by a comma:

sub(".*, ", "", s)
## [1] "MA" "IA"

If the city-state variable is already in a data frame or tibble the separate function from the tidyr package can be used:

library(tibble)
library(tidyr)
d <- tibble(citystate = s)
separate(d, citystate, c("city", "state"), sep = ", ")
## # A tibble: 2 x 2
##   city      state
##   <chr>     <chr>
## 1 Boston    MA   
## 2 Iowa City IA

Escaping Meta-Characters

Reading date from city temperatures produces a variable that looks like

s <- c("London *", "Sydney")

The * indicates daylight saving or summer time.

We would like to

The * is a meta-character.

To include a literal meta-character in a pattern the meta-characters needs to be escaped.

pat <- " \\*"
pat
## [1] " \\*"
nchar(pat)
## [1] 3
writeLines(pat)
##  \*
writeLines(sprintf("|%s|", pat))
## | \*|
sub(pat, "", s)
## [1] "London" "Sydney"

The grep and grepl functions check whether a pattern matches in elements of a character vector.

grep(pat, s)
## [1] 1
grepl(pat, s)
## [1]  TRUE FALSE

Matching Numbers

Reading temperature data might produce a string like

s <- c("32F", "-11F")

This can be processed as

substr(s, 1, nchar(s) - 1)
## [1] "32"  "-11"
sub("F", "", s)
## [1] "32"  "-11"

An alternative uses some more regular expression features:

A pattern to match an integer, possibly preceded by a sign is

sub("[-+]?[[:digit:]]+", "X", s)
## [1] "XF" "XF"

A sub_pattern can be extracted using back references:

sub("([-+]?[[:digit:]]+).*", "\\1", s)
## [1] "32"  "-11"

For a temperature embedded in a string, like

s <- c("Temp:  32F", "Temp: -11F")

a sub-string approach might be

(s1 <- substr(s, 6, nchar(s)))
## [1] "  32F" " -11F"
(s2 <- substr(s1, 1, nchar(s1) - 1))
## [1] "  32" " -11"
as.numeric(s2)
## [1]  32 -11

Using regular expressions, sub-patterns, and back references:

sub(".*[[:space:]]+([-+]?[[:digit:]]+).*", "\\1", s)
## [1] "32"  "-11"

parse_number is again an alternative:

parse_number(s)
## [1]  32 -11

City Temperatures

The city temperatures data used previously can be read using

library(rvest)
library(dplyr)
weather <- read_html("https://www.timeanddate.com/weather/")
w <- html_table(html_nodes(weather, "table"))[[1]]

w1 <- w[c(1, 4)]; names(w1) <- c("city", "temp")
w2 <- w[c(5, 8)]; names(w2) <- c("city", "temp")
w3 <- w[c(9, 12)]; names(w3) <- c("city", "temp")
ww <- rbind(w1, w2, w3)
ww <- filter(ww, city != "")
head(ww)
##          city  temp
## 1       Accra 81 °F
## 2 Addis Ababa 64 °F
## 3    Adelaide 64 °F
## 4     Algiers 60 °F
## 5      Almaty 48 °F
## 6     Amman * 50 °F

Cleaning up and extracting dst:

www <- mutate(ww,
              dst = grepl(" \\*", city),
              city = sub(" \\*", "", city),
              temp = as.numeric(sub("([-+]?[[:digit:]]+).*", "\\1", temp)))
## Warning in rlang::eval_tidy(~as.numeric(sub("([-+]?[[:digit:]]+).*", "\
## \1", : NAs introduced by coercion

Five highest and lowest temperatures:

top_n(www, 5, temp)
##            city temp   dst
## 1       Bangkok   85 FALSE
## 2      Honolulu   85 FALSE
## 3      Khartoum   88 FALSE
## 4    Kiritimati   88 FALSE
## 5       Managua   90 FALSE
## 6 Santo Domingo   88 FALSE
top_n(www, 5, desc(temp))
##          city temp   dst
## 1      Anadyr    3 FALSE
## 2    Helsinki   23  TRUE
## 3 Minneapolis   32  TRUE
## 4        Oslo   32  TRUE
## 5  St. John's   32  TRUE
## 6   Stockholm   23  TRUE
## 7     Tallinn   25  TRUE

Temperatures for northern and southern hemisphere (approximately):

library(ggplot2)
ggplot(www, aes(x = temp, fill = dst)) + geom_density(alpha = 0.5)
## Warning: Removed 1 rows containing non-finite values (stat_density).

Tricky Characters

Some examples:

(s <- head(ww$temp))
## [1] "81 °F" "64 °F" "64 °F" "60 °F" "48 °F" "50 °F"
substr(s, 1, nchar(s) - 3)
## [1] "81" "64" "64" "60" "48" "50"
substr(s, 1, nchar(s) - 2)
## [1] "81 " "64 " "64 " "60 " "48 " "50 "
as.numeric(substr(s, 1, nchar(s) - 2))
## Warning: NAs introduced by coercion
## [1] NA NA NA NA NA NA
as.numeric("82 ")
## [1] 82
sub(" .*", "", s)
## [1] "81 °F" "64 °F" "64 °F" "60 °F" "48 °F" "50 °F"

The problem is two non-ascii characters.

The stri_escape_unicode function from the stringi can make these more visible:

stringi::stri_escape_unicode(s)
## [1] "81\\u00a0\\u00b0F" "64\\u00a0\\u00b0F" "64\\u00a0\\u00b0F"
## [4] "60\\u00a0\\u00b0F" "48\\u00a0\\u00b0F" "50\\u00a0\\u00b0F"

The troublesome characters are:

Using the unicode specification for the no-break space does work:

sub("\u00a0.*", "", s)
## [1] "81" "64" "64" "60" "48" "50"