Introduction

class: center, middle, title-slide

.title[
# Introduction
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2023-05-19
]

---

layout: true

## Syllabus and Background
---

### Basics

Review the [course syllabus](https://homepage.divms.uiowa.edu/~luke/classes/STAT7400-2023//syllabus.pdf).

Collect some info on you:

- name
- field
- statistics background
- computing background

Please put this information in your GitLab repo's `README` file.

---

### Homework

Some problems will cover ideas not covered in class.

Working together is OK.

Try to work on your own.

Your write-up must be your own.

Do not use solutions from previous years.

Submission will be by [_Icon_](https://icon.uiowa.edu/) or by
[_UI GitLab_](https://research-git.uiowa.edu).

---

### Project

Find a topic you are interested in.

Written report plus possibly some form of presentation.

### Ask Questions

Ask questions if you are confused or think a point needs more
discussion.

Questions can lead to interesting discussions.

---
layout: true
## Computational Tools
---

### Computers and Operating Systems

We will use software available on the Linux workstations in the
Mathematical Sciences labs (Schaeffer 346 in particular).

Most things we will do can be done remotely by using
`ssh` to log into one of the machines in Schaeffer 346 using
`ssh`.

These machines are

`l-lnx2<xy>.stat.uiowa.edu`

with `<xy>` = 00, 01, 02, ..., 19.

---

You can also access the CLAS Linux systems using a browser at

<https://fastx.divms.uiowa.edu/>

This connects you to one of several servers.

* It is OK to run small jobs on these servers.

* For larger jobs you should log into one of the lab machines.

Most of the software we will use is available free for installing on
any Linux, Mac OS X, or Windows computer.

You are free to use any computer you like, but I will be more likely
to be able to help you resolve problems you run into if you are using
the lab computers.

---

### Git and GitLab

Git is a _version control system_ that is very useful for
keeping track of revision history and for collaboration.

We will be using the University's GitLab server.

**_Today_** you should log into the page
<https://research-git.uiowa.edu> with your HawkID.

I will then create a repository for you to use within the [class
group](https://research-git.uiowa.edu/STAT7400-Spring-2023).

A brief introduction to Git is [available](https://homepage.divms.uiowa.edu/~luke/classes/STAT7400-2023//git.html).

---

### What You Will Need

You will need to know how to

* run R
  * Compile and run C programs

Other Tools you may need:

* text editor
  * command shell
  * `make`, `grep`, etc.

Many people like to use [RStudio](https://posit.co) for working
with R as well as C.

---

### Class Web Pages

The [class web page](https://homepage.divms.uiowa.edu/~luke/classes/STAT7400-2023/) contains some pointers to available
tools and documentation.  It will be updated throughout the semester.

Reading assignments and homework will be posted on the class web
pages.

### Computing Account Setup: Do This Today!

Make sure you are able to log into the CLAS Linux systems with
your HawkID and password.

The [resources page](https://homepage.divms.uiowa.edu/~luke/classes/STAT7400-2023//resources.html) provides some
pointers on how to do this.  **If you cannot, please let me know
immediately.**

If you have not done so already, log into the page
<https://research-git.uiowa.edu> with your HawkID to activate your
GitLab account.

---
layout:false
## Computational Statistics, Statistical Computing, and Data Science

.pull-left[
**Computational Statistics:** Statistical procedures that depend
heavily on computation.

* Statistical graphics
* Bootstrap
* MCMC
* Smoothing
* Machine lerning
* ...
{{content}}
]
--

**Statistical Computing:** Computational tools for data analysis.

* Numerical analysis
* Optimization
* Design of statistical languages
* Graphical tools and methods
* ...

.pull-right[
**Data Science:** A more recent term, covering areas like

* Accessing and cleaning data
* Working with big data
* Working with complex and non-standard data
* Machine learning methods
* Graphics and visualization
* ...
{{content}}
]
--

**Overlap:** The division is not sharp; some consider the these terms
to be equivalent.

---
layout: true
## Course Topics
---

The course will cover, in varying levels of detail, a selection from
these topics in _Computational Statistics_, _Statistical Computing_,
and _Data Science_:

* basics of computer organization
  * data technologies
  * graphical methods and visualization
  * random variate generation
  * design and analysis of simulation experiments
  * bootstrap
  * Markov chain Monte Carlo
  * basics of computer arithmetic
  * numerical linear algebra
  * optimization algorithms for model fitting
  * smoothing
  * machine learning and data mining
  * parallel computing in statistics
  * symbolic computation
  * use and design of high level languages

Some topics will be explored in class, some in homework assignments.

Many could fill an entire course; we will only scratch the surface.

---

Your project is an opportunity to go into more depth on one or more of
these areas.

The course will interleave statistical computing with computational
statistics and data science; progression through the topics covered
will not be linear.

Working computer assignments and working on the project are the most
important part.

Class discussions of issues that arise in working problems can be very
valuable, so raise issues for discussion.

Class objectives:

* Become familiar with some ideas from computational statistics,
    statistical computing, and data science.
  * Develop skills and experience in using the computer as a
    research tool.

---
layout: true
## Thumbnail Sketch of R
---

R is a language for statistical computing and graphics.

R is related to the S language developed at Bell Labs.

R is a _high level language_:

* somewhat functional in nature;

* has some object-oriented features;

* interactive;

* can use compiled C or FORTRAN code.

R has many built-in features and tools

R has a well developed extension mechanism (packages):

* tools for writing packages;

* many contributed packages are available.

---

### Some examples

.pull-left[
Fitting a linear regression to simulated data:

```r
x <- c(1, 2, 3, 4, 3, 2, 1)
y <- rnorm(length(x), x + 2, 0.2)
lm(y ~ x)
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##       2.237        0.909
```
]
--
.pull-right[
A function to sum the values in a vector

```r
mysum <- function(x) {
    s <- 0
    for (y in x) s <- s + y
    s
}
mysum(1:10)
## [1] 55
```
]

---
layout: true
## Thumbnail Sketch of C
---

C is a low level language originally developed for systems programming.

Originally developed at Bell Labs for programming UNIX.

Can be used to write very efficient code.

Can call libraries written in C, FORTRAN, etc. on most systems.

A reasonable book on C is [_Practical C Programming, 3rd
Edition_](https://www.oreilly.com/catalog/pcp3/), by Steve Oualline.
There are many other good books available.

A simple example program is available [at the class web
site](https://homepage.divms.uiowa.edu/~luke/classes/STAT7400-2023//examples/hello/).

---

### Example: Summing the Numbers in a Vector
.pull-left.small-code[
```C
#include <stdio.h>

#define N 1000000
#define REPS 1000

double x[N];

double sum(int n, double *x)
{
    double s;
    int i;

s = 0.0;
    for (i = 0; i < N; i++) {
        s = s + x[i];
    }
    return s;
}
```
]
--
.pull-right.small-code[
A `main` program:
```C
int main()
{
    double s;
    int i, j;

for (i = 0; i < N; i++)
        x[i] = i + 1;

for (j = 0; j < REPS; j++)
        s = sum(N, x);

printf("sum = %f\n", s);
    return 0;
}
```
]

---
layout: true
## Speed Comparisons
---

Consider two simple problems:

* computing the sum of a vector of numbers
  * computing the dot product of two vectors

Code for these problems in C, Lisp-Stat, and R is available
[here](https://homepage.divms.uiowa.edu/~luke/classes/STAT7400-2023//examples/speed/).

Timings are obtained with commands like

```shell
time ddot
```

for the C versions, 
--
and

```r
x <- as.double(1:1000000)
system.time(for (i in 1:1000) ddot(x, x))
```

for R.

---

The results for sum:

|Sum        | Time (sec)| base = C| base = C -O2|
|:----------|----------:|--------:|------------:|
|C sum      |       2.33|     1.00|         2.21|
|C sum -O2  |       1.05|     0.45|         1.00|
|R sum      |       0.81|     0.35|         0.77|
|R mysum    |      21.42|     9.21|        20.40|
|C sumk     |       7.92|     3.41|         7.54|
|C sumk -O2 |       4.20|     1.81|         4.00|
|R mysumk   |      83.15|    35.76|        79.19|

Notes:

* R sum means built-in `sum`.
* `sumk` and `mysumk` use [_Kahan summation_](https://en.wikipedia.org/wiki/Kahan_summation_algorithm).

---

The results for dot product:

|Dot Product | Time (sec)| base = C| base = C -O2|
|:-----------|----------:|--------:|------------:|
|C ddot      |       2.34|     1.00|         2.25|
|C ddot -O2  |       1.04|     0.44|         1.00|
|R ddot      |      47.85|    20.45|        46.01|
|R crossp    |       1.46|     0.62|         1.40|

<!--
  Dot Product & Time (sec) & base = C & base = C -O2 \\

C ddot & 2.34 & 1.00 & 2.25 \\ 
  C ddot -O2 & 1.04 & 0.45 & 1.00 \\ 
  R ddot & 47.85 & 20.47 & 46.01 \\ 
  R crossp & 1.46 & 0.63 & 1.40 \\ 
-->

Notes:

* R crossp means `crossprod`.

---

Some conclusions and comments:

* Low level languages like C _can_ produce much faster code.

* It is much easier to develop code in an interactive, high level
  language.

* Usually the difference is _much_ less.

* Improvements in high level language runtime systems (e.g. byte
  compilation, runtime code generation) can make a big difference.

* Using the right high level language function (e.g. `sum`)
  can eliminate the difference.

* High level language functions may be able to take advantage of
  multiple cores.

* Speed isn't everything: accuracy is most important!

//adapted from Emi Tanaka's gist at //https://gist.github.com/emitanaka/eaa258bb8471c041797ff377704c8505