Background 
R is a language, or an environment, for data analysis and data visualization.
R is derived from the [S language](https://en.wikipedia.org/wiki/S_(programming_language)) developed at AT&T Bell Laboratories.
R was originally developed for teaching at the University of Auckland, New Zealand, by Ross Ihaka and Robert Gentleman.
R is now maintained by an international group of about 20 statisticians and computer scientists.
A great strength of R is the large number of extension packages that have been developed.
The number available on CRAN  is now over 19,000.
 
Basic Usage 
Interactive R uses a command line interface  (CLI).
The interface runs a read-evaluate-print loop  (REPL).
A simple interaction with the R interpreter:
> 1 + 2
[1] 3
Values can be assigned to variables using a left arrow <- combination:
> x <- c(1, 3, 5)
> x
[1] 1 3 5
The = sign can also be used for assignment, but <- is recommended.
 
Basic arithmetic operations work element-wise on vectors:
> x + x
[1]  2  6 10
Scalars are recycled to the length of the longer operand:
> x + 1
[1] 2 4 6
> 2 * x
[1]  2  6 10
Some ways to create new vectors:
> c(1, 2, 3)
[1] 1 2 3
> c("a", "b", "c")
[1] "a" "b" "c"
> 1 : 3
[1] 1 2 3
These examples show a prompt as you would see in the interpreter.
Usually Rmarkdown documents show code and results like this:
2 * x
## [1]  2  6 10
This makes it easier to copy code for pasting it into another document or the R console.
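A few more vector constructions and one more recycling case, offered as a small supplement to the examples above. The functions seq() and rep() are base R but are not mentioned in the text, and the variable names are only illustrative.

```r
# Illustrative supplement: seq() and rep() are base R functions
# that are not introduced in the text above.
x <- c(1, 3, 5)

# Regular sequences and repeated values
s <- seq(2, 10, by = 2)           # 2 4 6 8 10
r <- rep(c("a", "b"), times = 2)  # "a" "b" "a" "b"

# Recycling also applies when the shorter operand has more than one
# element: c(1, 2) is reused as 1, 2, 1, 2 to match the length-4 vector.
y <- c(10, 20, 30, 40) + c(1, 2)  # 11 22 31 42
```

When the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is usually a sign of a mistake.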
 
Data Frames 
Data sets in R are often organized in named lists  of variables called data frames .
The value of the variable faithful is a data frame with two variables recorded for eruptions of the Old Faithful  geyser in Yellowstone National Park:
eruptions: Eruption duration (minutes)
waiting: Waiting time to next eruption (minutes)
head() shows the first 6 rows:
head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55 
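Besides head(), a few other base R functions are commonly used for a first look at a data frame. The sketch below applies them to faithful; the variable names dims and vars are only illustrative.

```r
# faithful ships with R in the datasets package.
dims <- dim(faithful)     # number of rows and number of columns
vars <- names(faithful)   # variable names

str(faithful)             # compact description of each variable
tail(faithful, 3)         # last 3 rows, by analogy with head()
```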
A Simple Scatter Plot 
with(faithful,
     plot(eruptions, waiting,
          xlab = "Eruption time (min)",
          ylab = "Waiting time to next eruption (min)"))
Several graphics systems are available for R.
plot() is part of base graphics .
 
Fitting a Linear Regression 
fit <- with(faithful, lm(waiting ~ eruptions))
fit
## 
## Call:
## lm(formula = waiting ~ eruptions)
## 
## Coefficients:
## (Intercept)    eruptions  
##       33.47        10.73
You can also use the data argument to lm():
fit <- lm(waiting ~ eruptions, data = faithful)
coef() extracts the coefficients:
coef(fit)
## (Intercept)   eruptions 
##    33.47440    10.72964
summary(fit) provides more details:
summary(fit)
## 
## Call:
## lm(formula = waiting ~ eruptions, data = faithful)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0796  -4.4831   0.2122   3.9246  15.9719 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.4744     1.1549   28.98   <2e-16 ***
## eruptions    10.7296     0.3148   34.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.914 on 270 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108 
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16 
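One common use of a fitted model is prediction. The following sketch uses predict(), a base R function not discussed in the text, and checks it against a calculation by hand from the coefficients. The eruption durations 2 and 4 are arbitrary values chosen for illustration.

```r
fit <- lm(waiting ~ eruptions, data = faithful)

# Predicted waiting times for two hypothetical eruption durations
new_eruptions <- data.frame(eruptions = c(2, 4))
pred <- predict(fit, newdata = new_eruptions)

# The same values by hand: intercept + slope * duration
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["eruptions"]] * c(2, 4)
```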
Adding the Regression Line to the Plot 
Original plot:
with(faithful,
     plot(eruptions, waiting,
          xlab = "Eruption time (min)",
          ylab = "Waiting time to next eruption (min)"))
With a regression line:
with(faithful,
     plot(eruptions, waiting,
          xlab = "Eruption time (min)",
          ylab = "Waiting time to next eruption (min)"))
abline(coef(fit), col = "red", lwd = 3)
 
Packages and Package Libraries 
Extension code and data sets are often made available in packages .
Packages are stored in folders or directories as collections called libraries .
.libPaths() will show you the libraries your R process will search.
search() shows what packages are attached to the global search path.
The library() function is used to find packages in the libraries and attach them to the search path.
The expression pkg::var gets the value of variable var from package pkg without attaching pkg.
You can install packages using the install.packages function or the Install Packages  item in the RStudio Tools  menu.
By default packages are installed from CRAN .
It is also possible to use functions in the remotes package to install packages hosted on GitHub  or GitLab .
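A short sketch of the functions described above. requireNamespace() is a base R function not mentioned in the text; it is used here only to check whether a package is installed. The installation call is left commented out because it needs network access.

```r
# Call a function from a package without attaching the package.
m <- stats::median(c(1, 3, 5, 8))

# Check whether a package is installed before trying to use it.
has_remotes <- requireNamespace("remotes", quietly = TRUE)

# Installing needs network access, so it is commented out here.
# if (!has_remotes) install.packages("remotes")

# library() attaches a package; it then appears on the search path.
library(stats)
attached <- "package:stats" %in% search()
```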
 
A Useful Package: ggplot2 
The ggplot2 package provides a powerful alternative to the base graphics system.
The geyser example can be done in ggplot2 like this:
library(ggplot2)
ggplot(data = faithful) +
    geom_point(mapping = aes(x = eruptions, y = waiting)) +
    geom_smooth(mapping = aes(x = eruptions, y = waiting),
                method = "lm", se = FALSE)
ggplot2 is part of a useful collection of packages called the tidyverse.
ggplot is based on the Grammar of Graphics .
A basic template for creating a plot with ggplot:
ggplot(data = <DATA>) + <GEOM>(mapping = aes(<MAPPINGS>)) 
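As one possible way of filling in the template, the sketch below draws a histogram of the eruption durations. The choice of geom_histogram and of 30 bins is an assumption made for illustration and is not taken from the text.

```r
library(ggplot2)

# <DATA> = faithful, <GEOM> = geom_histogram, <MAPPINGS> = x = eruptions
p <- ggplot(data = faithful) +
    geom_histogram(mapping = aes(x = eruptions), bins = 30)

p  # printing the plot object draws the plot
```

Storing the plot in a variable is optional; it makes it possible to add further layers to p later.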
 
Subsetting and Extracting Components 
The subset operator [ can be used to extract elements by index:
month.abb
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
month.abb[1 : 3]
## [1] "Jan" "Feb" "Mar"
Subsetting can also be based on a logical expression that returns TRUE or FALSE for each element:
(starts_with_J <- substr(month.abb, 1, 1) == "J")
##  [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
month.abb[starts_with_J]
## [1] "Jan" "Jun" "Jul"
The value of an assignment operation is the right hand side value.
Ordinarily this value is not printed. 
Placing the assignment expression in parentheses causes it to be printed. 
 
 
Individual elements can be extracted using the element operator  [[:
month.abb[[3]]
## [1] "Mar"
Components of named lists, like data frames, can be extracted with the $ operator:
names(faithful)
## [1] "eruptions" "waiting"
head(faithful, 4)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
head(faithful$eruptions, 4)
## [1] 3.600 1.800 3.333 2.283
The element operator can be used as well:
head(faithful[["eruptions"]], 4)
## [1] 3.600 1.800 3.333 2.283 
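The difference between [ and [[ is easiest to see on a data frame. The sketch below also shows negative indices and row selection by a logical condition, which are standard base R features not covered in the text; the variable names are illustrative.

```r
# [ keeps the container type; [[ extracts the element itself.
one_col <- faithful["eruptions"]     # a data frame with one variable
values  <- faithful[["eruptions"]]   # a numeric vector

# Negative indices drop elements.
no_jan <- month.abb[-1]

# A logical condition on one variable selects rows of the data frame.
short <- faithful[faithful$eruptions < 3, ]
```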
Functions 
Simple Functions 
All computations in R are carried out by functions.
Defining a function allows you to avoid cutting and pasting code.
A simple function:
ms <- function(x) list(mean = mean(x), sd = sd(x))
ms(faithful$eruptions)
## $mean
## [1] 3.487783
## 
## $sd
## [1] 1.141371 
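Functions can also have arguments with default values. The following is a hypothetical variant of ms(), here called ms2, that passes an na.rm argument on to mean() and sd(); neither the name nor the extra argument comes from the text.

```r
# Hypothetical variant of ms() with an na.rm argument (default FALSE).
ms2 <- function(x, na.rm = FALSE) {
    list(mean = mean(x, na.rm = na.rm),
         sd = sd(x, na.rm = na.rm))
}

with_na <- c(1, 3, NA, 5)
res_na <- ms2(with_na)                 # both components are NA
res    <- ms2(with_na, na.rm = TRUE)   # mean 3, sd 2
```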
Generic Functions and Object-Oriented Programming 
R supports several mechanisms for object-oriented programming based on generic functions .
The most commonly used mechanism, called S3, allows a function to dispatch  to a method  based on the class  of its first argument.
plot is a very simple generic function.
plot
## function (x, y, ...) 
## UseMethod("plot")
## <bytecode: 0x562822b00028>
## <environment: namespace:base>
For example, the plot method for linear model fit objects produces a set of 4 plots commonly used to assess regression fits.
plot(fit)
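To make the dispatch mechanism concrete, here is a minimal sketch of a user-defined S3 generic with two methods. The generic describe and its methods are hypothetical names made up for this illustration.

```r
# A hypothetical generic: UseMethod() dispatches on the class of x.
describe <- function(x, ...) UseMethod("describe")

# Fallback method used when no class-specific method exists.
describe.default <- function(x, ...)
    paste("an object of class", class(x)[1])

# Method for objects of class "lm".
describe.lm <- function(x, ...)
    paste("a linear model with", length(coef(x)), "coefficients")

fit <- lm(waiting ~ eruptions, data = faithful)
d_fit <- describe(fit)   # dispatches to describe.lm
d_num <- describe(1:3)   # falls back to describe.default
```

A method is an ordinary function whose name has the form generic.class, which is how S3 finds it.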
 
Lazy and Non-Standard Evaluation 
An unusual but useful feature of R is that function arguments are not evaluated until their value is needed, so they may not be evaluated at all.
This is called lazy evaluation .
log("A")
## Error in log("A"): non-numeric argument to mathematical function
f <- function(x) NULL
f(log("A"))
## NULL
Functions can also capture the expression of the arguments they were called with:
f <- function(x) deparse(substitute(x))
f(a + b)
## [1] "a + b"
Together these features allow functions to evaluate their arguments in non-standard ways.
This is most commonly used to allow values for variables in arguments to be found in a provided data frame.
The with() function is a simple example:
mean(eruptions)
## Error in eval(expr, envir, enclos): object 'eruptions' not found
with(faithful, mean(eruptions))
## [1] 3.487783
Non-standard evaluation of this type is used extensively in the tidyverse.
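The two ingredients, capturing an expression and evaluating it later, can be combined in a few lines. The helper eval_in below is a hypothetical, simplified stand-in for with(), written only to illustrate the idea; it is not how with() is actually implemented.

```r
# Hypothetical helper: a simplified stand-in for with().
eval_in <- function(data, expr) {
    e <- substitute(expr)           # capture the unevaluated expression
    eval(e, data, parent.frame())   # look up variables in data first
}

m <- eval_in(faithful, mean(eruptions))
```

Because of lazy evaluation, mean(eruptions) is never evaluated in the caller's environment, where eruptions does not exist.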
 
 
The Tidyverse 
The dplyr package provides a grammar for data manipulation .
A simple example: computing means and standard deviations for the waiting times after the short (less than 3 minutes) and the long (3 minutes or more) eruptions:
library(dplyr)
tmp <- mutate(faithful,
              type = ifelse(eruptions < 3,
                            "short",
                            "long"))
head(tmp)
##   eruptions waiting  type
## 1     3.600      79  long
## 2     1.800      54 short
## 3     3.333      74  long
## 4     2.283      62 short
## 5     4.533      85  long
## 6     2.883      55 short
summarize(group_by(tmp, type),
          mean = mean(waiting),
          sd = sd(waiting))
## # A tibble: 2 × 3
##   type   mean    sd
##   <chr> <dbl> <dbl>
## 1 long   80.0  5.99
## 2 short  54.5  5.84
Tidyverse functions like to work with an enhanced form of data frame called a tibble .
 
A computation like this can be viewed as a transformation pipeline  consisting of three stages:
mutation (adding a new variable) 
grouping (splitting by type) 
summarizing within groups. 
 
Tidyverse code often uses the forward pipe operator  %>% provided by the magrittr package to express such a pipeline.
R 4.1.0 and later also provides a native pipe operator  |>.
The pipe operator allows a call f(x) to be written as
x |> f()
The left hand value is passed implicitly as the first argument to the function called on the right.
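A small sketch of the native pipe using only base R functions, so it does not depend on any package. It requires R 4.1.0 or later; the variable names are illustrative.

```r
# Requires R 4.1.0 or later for the native pipe |>.
x <- c(1, 3, 5)

a <- x |> mean()              # same as mean(x)
b <- x |> rev() |> head(2)    # same as head(rev(x), 2)
```

Additional arguments go in the call on the right, so head(2) here receives rev(x) as its first argument and 2 as its second.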
Using the pipe operator, the code for computing means and standard deviations can be written as
faithful |>
    mutate(type = ifelse(eruptions < 3, "short", "long")) |>
    group_by(type) |>
    summarize(mean = mean(waiting),
              sd = sd(waiting))
## # A tibble: 2 × 3
##   type   mean    sd
##   <chr> <dbl> <dbl>
## 1 long   80.0  5.99
## 2 short  54.5  5.84
There are trade-offs:
Manipulation pipelines expressed this way are often more compact than ones using intermediate variables and/or nested calls.
With pipe notation there is no need to come up with intermediate variable names.
Pipe notation obscures the function calls that are actually happening and this can make debugging harder.
 
 
Contrast to Point-and-Click Interfaces 
Even simple tasks require learning some of the R language.
Once you can do simple tasks, you have learned some of the R language.
More complicated tasks become easier.
Even very complicated tasks become possible.
 
 
R and Reproducibility 
Analyses in R are carried out by running code describing the tasks to perform.
This code can be
audited to make sure the analysis is right;
replayed to make sure the results are reproducible;
reused after changes in the data or on new data.
 
Literate data analysis  tools like Rmarkdown provide support for this.
 
Finding Out More 
Getting Help on Functions 
help(mean) will show help for the function mean. This can be abbreviated as ?mean.
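When the exact function name is not known, a few other base R tools can help. The sketch below uses apropos() and args(), which are not mentioned in the text; the commented lines are meant to be tried interactively.

```r
# Names of visible objects whose name starts with "mean"
hits <- apropos("^mean")

# Argument list of a function, returned as a function with an empty body
a <- args(sd)

# To try interactively:
# help.search("regression")   # search help pages; same as ??regression
# example(mean)               # run the examples from the help page
```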
 
 
Some R Introductions and Tutorials 
 
Introductions to the Tidyverse
 
 
Interactive Tutorial 
An interactive learnr tutorial is available.
You can run the tutorial with
STAT4580::runTutorial("Rintro")
You can install the current version of the STAT4580 package with
remotes::install_gitlab("luke-tierney/STAT4580")
You may need to install the remotes package from CRAN first.
 
Exercises 
Compute the mean of the numbers 1, 3, 5, 8. 
 
What is the mean of the eruptions variable in the faithful data frame? 
 
Find the average of the first 50 eruption durations in the faithful data frame. 
 
Use the median function to modify the pipe example in the tidyverse section  to include medians. 
 
 
