Chapter 1 Introduction

1.1 Syllabus and Background

1.1.1 Basics

Review the course syllabus. Fill out info sheets:

name
field
statistics background
computing background

1.1.2 Homework

Some problems will cover ideas not covered in class.

Working together is OK.

Try to work on your own.

Your write-up must be your own.

Do not use solutions from previous years.

Submission by Icon or by GitLab.

1.1.3 Project

Find a topic you are interested in.

Written report plus possibly some form of presentation.

1.1.4 Ask Questions

Ask questions if you are confused or think a point needs more discussion.

Questions can lead to interesting discussions.

1.2 Computational Tools

1.2.1 Computers and Operating Systems

We will use software available on the Linux workstations in the Mathematical Sciences labs (Schaeffer 346 in particular).

Most things we will do can be done remotely by using ssh to log into one of the machines in Schaeffer 346. These machines are

l-lnx2<xy>.stat.uiowa.edu

with <xy> = 00, 01, 02, …, 19.

You can also access the CLAS Linux systems using a browser at

http://fastx.divms.uiowa.edu/

This connects you to one of several servers.

It is OK to run small jobs on these servers.
For larger jobs you should log into one of the lab machines.

Most of the software we will use is available free for installing on any Linux, Mac OS X, or Windows computer.

You are free to use any computer you like, but I am more likely to be able to help with problems you run into if you use the lab computers.

1.2.2 Git and GitLab

Git is a version control system that is very useful for keeping track of revision history and for collaboration.

We will be using the University’s GitLab server. Today you should log into the page http://research-git.uiowa.edu with your HawkID.

I will then create a repository for you to use within the class group.

A brief introduction to Git is available.

1.2.3 What You Will Need

You will need to know how to

run R
compile and run C programs

Other tools you may need:

text editor
command shell
make, grep, etc.

Many people like to use RStudio for working with R as well as C.

1.2.4 Class Web Pages

The class web page contains some pointers to available tools and documentation. It will be updated throughout the semester.

Reading assignments and homework will be posted on the class web pages.

1.2.5 Computing Account Setup: Do This Today!

Make sure you are able to log into the CLAS Linux systems with your HawkID and password.

The resources page provides some pointers on how to do this. If you cannot log in, please let me know immediately.

If you have not done so already, log into the page http://research-git.uiowa.edu with your HawkID to activate your GitLab account.

1.3 Computational Statistics, Statistical Computing, and Data Science

Computational Statistics: Statistical procedures that depend heavily on computation.

Statistical graphics
Bootstrap
MCMC
Smoothing
Machine learning
…

Statistical Computing: Computational tools for data analysis.

Numerical analysis
Optimization
Design of statistical languages
Graphical tools and methods
…

Data Science: A more recent term, covering areas like

Accessing and cleaning data
Working with big data
Working with complex and non-standard data
Machine learning methods
Graphics and visualization
…

Overlap: The division is not sharp; some consider these terms to be equivalent.

1.4 Course Topics

The course will cover, in varying levels of detail, a selection from these topics in Computational Statistics, Statistical Computing, and Data Science:

basics of computer organization
data technologies
graphical methods and visualization
random variate generation
design and analysis of simulation experiments
bootstrap
Markov chain Monte Carlo
basics of computer arithmetic
numerical linear algebra
optimization algorithms for model fitting
smoothing
machine learning and data mining
parallel computing in statistics
symbolic computation
use and design of high level languages

Some topics will be explored in class, some in homework assignments.

Many could fill an entire course; we will only scratch the surface.

Your project is an opportunity to go into more depth on one or more of these areas.

The course will interleave statistical computing with computational statistics and data science; progression through the topics covered will not be linear.

Working through the computer assignments and working on the project are the most important parts of the course.

Class discussions of issues that arise in working problems can be very valuable, so raise issues for discussion.

Class objectives:

Become familiar with some ideas from computational statistics, statistical computing, and data science.
Develop skills and experience in using the computer as a research tool.

1.5 Thumbnail Sketch of R

R is a language for statistical computing and graphics.

R is related to the S language developed at Bell Labs.

R is a high level language:

somewhat functional in nature;
has some object-oriented features;
interactive;
can use compiled C or FORTRAN code.
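As a minimal sketch of the last point, a C function meant to be called through R's .C interface might look like the following. The function name csum and the file name csum.c are hypothetical. The .C interface passes every argument as a pointer and expects a void return type, so the result is written back through an argument.

```c
/*
 * Hypothetical example: sum the n[0] values in x and store the
 * result in result[0]. All arguments are pointers and the return
 * type is void, as R's .C interface requires.
 */
void csum(double *x, int *n, double *result)
{
    double s = 0.0;
    int i;

    for (i = 0; i < *n; i++)
        s += x[i];
    *result = s;
}
```

On a Linux system this could be compiled with R CMD SHLIB csum.c, loaded in R with dyn.load("csum.so"), and called as .C("csum", as.double(x), as.integer(length(x)), result = double(1))$result.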

R has many built-in features and tools.

R has a well developed extension mechanism (packages):

tools for writing packages;
many contributed packages are available.

1.5.1 Some examples

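Fitting a linear regression to simulated data (the estimates change from run to run because y is random):

# predictor values
x <- c(1, 2, 3, 4, 3, 2, 1)
# responses: normal with mean x + 2 and standard deviation 0.2
y <- rnorm(length(x), x + 2, 0.2)
lm(y ~ x)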
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      1.9396       0.9934

A function to sum the values in a vector:

mysum <- function(x) {
    s <- 0
    for (y in x) s <- s + y
    s
}
mysum(1:10)
## [1] 55

1.6 Thumbnail Sketch of C

C is a low level language originally developed for systems programming.

It was developed at Bell Labs for programming UNIX.

Can be used to write very efficient code.

Can call libraries written in C, FORTRAN, etc. on most systems.

A reasonable book on C is Practical C Programming, 3rd Edition, by Steve Oualline. There are many other good books available.

A simple example program is available at the class web site.

1.6.1 Example: Summing the Numbers in a Vector

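#include <stdio.h>

#define N 1000000   /* length of the vector */
#define REPS 1000   /* number of repetitions, for timing */

double x[N];

/* Return the sum of the first n elements of x. */
double sum(int n, double *x)
{
    double s;
    int i;

    s = 0.0;
    for (i = 0; i < n; i++) {
        s = s + x[i];
    }
    return s;
}

int main(void)
{
    double s = 0.0;
    int i, j;

    /* Fill x with 1, 2, ..., N. */
    for (i = 0; i < N; i++)
        x[i] = i + 1;

    /* Repeat the computation so the run is long enough to time. */
    for (j = 0; j < REPS; j++)
        s = sum(N, x);

    printf("sum = %f\n", s);
    return 0;
}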

1.7 Speed Comparisons

Consider two simple problems:

computing the sum of a vector of numbers
computing the dot product of two vectors

Code for these problems in C, Lisp-Stat, and R is available here.

Timings are obtained with commands like

time ddot

for the C versions, and

x <- as.double(1:1000000)
system.time(for (i in 1:1000) ddot(x, x))

for R.
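The timed code itself is not reproduced in these notes. As a rough sketch only, the core of a C dot product written in the style of the sum example in Section 1.6.1 might look like this. The name ddot matches the timing command above, but the signature and body are assumptions, not the course's actual code.

```c
/*
 * Hypothetical sketch: dot product of the first n elements
 * of the vectors x and y.
 */
double ddot(int n, const double *x, const double *y)
{
    double s = 0.0;
    int i;

    for (i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}
```

A driver like the main function in Section 1.6.1, calling ddot(N, x, x) in a loop of REPS repetitions, would give a program that can be timed with the time command.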

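The results, with times also shown relative to the unoptimized C version (base = C) and to the C version compiled with -O2 (base = C -O2):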

Sum	Time (sec)	base = C	base = C -O2
C sum	2.33	1.00	2.21
C sum -O2	1.05	0.45	1.00
R sum	0.81	0.35	0.77
R mysum	21.42	9.21	20.40
C sumk	7.92	3.41	7.54
C sumk -O2	4.21	1.81	4.00
R mysumk	83.15	35.76	79.19

and

Dot Product	Time (sec)	base = C	base = C -O2
C ddot	2.34	1.00	2.25
C ddot -O2	1.04	0.44	1.00
R ddot	47.85	20.45	46.01
R crossp	1.46	0.62	1.40

Notes:

R sum means built-in sum; R crossp means crossprod.
sumk and mysumk use Kahan summation.
Some of the R speeds may improve by about 30% in the next R release.
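For reference, Kahan (compensated) summation carries a running correction term that recovers low-order bits lost when a small value is added to a large running sum. The following is a minimal sketch of the technique. The name sumk is taken from the tables above, but the signature and body are assumptions rather than the timed code.

```c
/*
 * Sketch of Kahan (compensated) summation of the first n
 * elements of x.
 */
double sumk(int n, const double *x)
{
    double s = 0.0;   /* running sum */
    double c = 0.0;   /* running compensation for lost low-order bits */
    int i;

    for (i = 0; i < n; i++) {
        double y = x[i] - c;   /* next term, corrected by the compensation */
        double t = s + y;      /* low-order bits of y may be lost here */
        c = (t - s) - y;       /* rounding error just made in forming t */
        s = t;
    }
    return s;
}
```

The extra arithmetic in each iteration is consistent with the sumk timings being several times larger than the plain sum timings. Compiler options that relax floating point semantics, such as -ffast-math in GCC, may simplify the compensation away, so this kind of code is best compiled without them.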

Some conclusions and comments:

Low level languages like C can produce much faster code.
It is much easier to develop code in an interactive, high level language.
Usually the speed difference is much smaller than in these examples.
Improvements in high level language runtime systems (e.g. byte compilation, runtime code generation) can make a big difference.
Using the right high level language function can eliminate the difference, as the timings for the built-in sum and crossprod above show.
High level language functions may be able to take advantage of multiple cores.
Speed isn’t everything: accuracy is most important!