class: center, middle, title-slide

.title[
# Basic Computer Architecture
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2023-05-19
]

---

<link rel="stylesheet" href="stat7400.css" type="text/css" />

## Typical Machine Layout

<img src="img/arch.png" width="80%" style="display: block; margin: auto;" />

Figure based on M. L. Scott, _Programming Language Pragmatics_, Figure 5.1, p. 205

---
layout: true

## Structure of Lab Workstations

---

### Processor and Cache

.pull-left.tiny-code.width-55[

```shell
luke@l-lnx200 ~% lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   24
  On-line CPU(s) list:    0-23
Vendor ID:                GenuineIntel
  Model name:             12th Gen Intel(R) Core(TM) i9-12900K
    CPU family:           6
    Model:                151
    Thread(s) per core:   2
    Core(s) per socket:   16
    Socket(s):            1
    Stepping:             2
    CPU(s) scaling MHz:   59%
    CPU max MHz:          5200.0000
    CPU min MHz:          800.0000
    BogoMIPS:             6374.40
    Flags:                ...
Virtualization features:
  Virtualization:         VT-x
Caches (sum of all):
  L1d:                    640 KiB (16 instances)
  L1i:                    768 KiB (16 instances)
  L2:                     14 MiB (10 instances)
  L3:                     30 MiB (1 instance)
...
```
]

--

.pull-right.width-40[
* There is a single _16-core_ processor with _hyperthreading_ that acts like 24 separate processors.
{{content}}
]

--

* _Hyperthreading_ is enabled, which makes a core behave, to some extent, like two processors; on this chip only the 8 performance cores support it, so there are `\(8 \times 2 + 8 = 24\)` logical CPUs.
{{content}}

--

* The total L3 cache is 30 MiB.

---

### Memory and Swap Space

.pull-left.tiny-code.width-60[

```shell
luke@l-lnx200 ~% free -m
       total   used    free  shared  buff/cache  available
Mem:   31924   1144   20258      25       10521      30258
Swap:  24255      0   24255
```
]

--

.pull-right.width-40[
* The workstations have about 32G of memory.
{{content}}
]

--

* The swap space is about 24G.

--

### Disk Space

.pull-left.tiny-code.width-60[
Using the `df` command produces:

```shell
luke@l-lnx200 ~% df -BG
Filesystem                 1G-blocks  Used  Available  Use%  Mounted on
...
/dev/nvme0n1p2                    1G    1G         1G   16%  /boot
/dev/mapper/vg00-tmp              8G    1G         8G    1%  /tmp
/dev/mapper/vg00-var             47G   15G        31G   33%  /var
...
/dev/mapper/vg00-scratch         26G    1G        26G    1%  /var/scratch
...
clasnetappvm...:/students       300G  144G       157G   48%  /mnt/nfs/clasnetappvm/students
...
clasnetappvm...:/shared          82G   20G        63G   24%  /mnt/nfs/clasnetappvm/shared
...
clasnetappvm...:/grad          1536G  560G       977G   37%  /mnt/nfs/clasnetappvm/grad
...
```
]

--

.pull-right.width-40[
* Local disks are large but mostly unused.
{{content}}
]

--

* Space in `/var/scratch` can be used for temporary storage.
{{content}}

--

* User space is on network disks.
{{content}}

--

* Network speed can be a bottleneck.
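---

To check how much space your own files use, and how much is free where they live, something like the following should work (a sketch; `du` can be slow on large network volumes):

```shell
du -sh ~    # total size of your home directory
df -h ~     # free space on the file system holding it
```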
---

### Performance Monitoring

.pull-left.tiny-code.width-55[
Using the `top` command produces:

```shell
top - 11:06:34 up 4:06, 1 user, load average: 0.00, 0.01, 0.05
Tasks: 127 total, 1 running, 126 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 99.8%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:  16393524k total,   898048k used, 15495476k free,  268200k buffers
Swap: 18481148k total,        0k used, 18481148k free,  217412k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1445 root      20   0  445m  59m  23m S  2.0  0.4   0:11.48 kdm_greet
    1 root      20   0 39544 4680 2036 S  0.0  0.0   0:01.01 systemd
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
    5 root       0 -20     0    0    0 S  0.0  0.0   0:00.00 kworker/0:0H
    6 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kworker/u:0
    7 root       0 -20     0    0    0 S  0.0  0.0   0:00.00 kworker/u:0H
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    9 root      RT   0     0    0    0 S  0.0  0.0   0:00.07 watchdog/0
   10 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1
   12 root       0 -20     0    0    0 S  0.0  0.0   0:00.00 kworker/1:0H
   13 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/1
   14 root      RT   0     0    0    0 S  0.0  0.0   0:00.10 watchdog/1
   15 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2
   17 root       0 -20     0    0    0 S  0.0  0.0   0:00.00 kworker/2:0H
   18 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/2
...
```
]

--

.pull-right.width-40[
* Interactive options allow you to kill or _renice_ (change the priority of) processes you own.
{{content}}
]

--

* The command `htop` may be a little nicer to work with.
{{content}}

--

* A GUI tool, `System Monitor`, is available from one of the menus. From the command line this can be run as `gnome-system-monitor`.

---

Another useful command is `ps` (process status):

```shell
luke@l-lnx200 ~% ps -u luke
  PID TTY          TIME CMD
 4618 ?        00:00:00 sshd
 4620 pts/0    00:00:00 tcsh
 4651 pts/0    00:00:00 ps
```

There are many options; see `man ps` for details.

---
layout: true

## Processors

---

### Basics

Processors execute a sequence of instructions.

--

Each instruction requires some of:

* decoding the instruction
* fetching operands from memory
* performing an operation (add, multiply, ...)
* etc.

--

Older processors would carry out one of these steps per clock cycle and then move on to the next.

--

Most modern processors use _pipelining_ to carry out some operations in parallel.

---

### Pipelining

A simple example:

`\(s \leftarrow 0\)`<br>
**for** `\(i = 1\)` **to** `\(n\)` **do**<br>
`\(s \leftarrow s + x_i y_i\)`<br>
**end**

<!--
Maybe use something like this:
https://gist.github.com/cderv/5479d8e526db635dd666293e25405b42
-->

--

Simplified view: each step has two parts,

* Fetch `\(x_i\)` and `\(y_i\)` from memory
* Compute `\(s = s + x_i y_i\)`

---

Suppose the computer has two functional units that can operate in parallel,

* An _Integer_ unit that can fetch from memory
* A _Floating Point_ unit that can add and multiply

--

If each step takes roughly the same amount of time, a pipeline can speed up the computation by a factor of two:

<img src="img/pipeline.png" width="80%" style="display: block; margin: auto;" />

---

* Floating point operations are much slower than this.

--

* Modern chips contain many more separate functional units.

--

* Pipelines can have 10 or more stages.

--

* Some operations take more than one clock cycle.

--

* The compiler or the processor orders operations to keep the pipeline busy.

--

* If this fails, then the pipeline _stalls_.
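---

To get some feel for this, here is a rough R illustration (a sketch: exact timings vary by machine, and much of the gap is interpreter overhead rather than pipelining alone). The vectorized form runs in compiled loops that the compiler can schedule to keep the pipeline busy:

```r
n <- 1e7
x <- runif(n)
y <- runif(n)

## dot product as an explicit interpreted loop
dotLoop <- function(x, y) {
    s <- 0
    for (i in seq_along(x))
        s <- s + x[i] * y[i]
    s
}

system.time(dotLoop(x, y))  # one interpreted step per element
system.time(sum(x * y))     # compiled inner loops keep functional units busy
```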
---

### Superscalar Processors, Hyper-Threading, and Multiple Cores

Some processors have enough functional units to have more than one pipeline running in parallel.

--

Such processors are called _superscalar_.

--

In some cases there are enough functional units per processor to allow one physical processor to pretend to be two (somewhat simpler) logical processors. This approach is called _hyper-threading_.

--

* Hyper-threaded processors on a single physical chip share some resources, in particular cache.

--

* Benchmarks suggest that hyper-threading produces about a 20% speed-up in cases where dual physical processors would produce a factor of 2 speed-up.

---

It is now possible to fully replicate processors within one chip; these are _multi-core_ processors.

* Multi-core machines are effectively full multi-processor machines (at least for most purposes).

--

* Dual-core processors are now ubiquitous.

--

* The department research machine `r-lnx404` has two 16-core processors.

--

* Our lab machines have a single 16-core processor.

--

* Processors with even more cores are available.

--

Many processors support some form of vectorized operations, e.g. SSE2 (Single Instruction, Multiple Data, Extensions 2) on Intel and AMD processors.

--

GPUs provide even more parallelism but require specialized programming.

---

### Implications

Modern processors achieve high speed through a collection of clever tricks.

--

Most of the time these tricks work extremely well.

--

Every so often a small change in code may cause pipelining heuristics to fail, resulting in a pipeline stall.

--

These small changes can then cause large differences in performance.

--

The chances are that a "small change" in code that causes a large change in performance was not in fact such a small change after all.

--

Processor speeds have not been increasing very much recently.

--

The `arm64` family (Apple M1), though, has produced some significant speedups while reducing power consumption.

--

Many believe that speed improvements will need to come from increased use of explicit parallel programming.

--

More details are available in a talk at

> https://www.infoq.com/presentations/click-crash-course-modern-hardware/

---
layout: true

## Memory

---

### Basics

.pull-left[
Data and program code are stored in memory.
{{content}}
]

--

Memory consists of [_bits_ (binary integers)](https://en.wikipedia.org/wiki/Bit).
{{content}}

--

On most computers:

* Bits are collected into groups of eight, called a [_byte_](https://en.wikipedia.org/wiki/Byte).
{{content}}

--

* There is a natural _word size_ of `\(W\)` bits.
{{content}}

--

* The most common value of `\(W\)` used to be 32; it is probably now 64; 16 also occurs.

--

.pull-right[
* Bytes are numbered consecutively, `\(0, 1, 2, \dots, N - 1\)`, with `\(N = 2^W\)`.
{{content}}
]

--

* An _address_ for code or data is a number between `\(0\)` and `\(N - 1\)` representing a location in memory, usually in bytes.
{{content}}

--

* `\(2^{32} = 4,294,967,296 = 4\text{GB}\)`.
{{content}}

--

* The maximum amount of memory a 32-bit process can address is 4 gigabytes.
{{content}}

--

* Some 32-bit machines can use more than 4G of memory, but each process gets at most 4G.
{{content}}

--

* Most hard disks are _much_ larger than 4G.

---

### Memory Layout

A process can conceptually access up to `\(2^W\)` bytes of address space.

--

The operating system usually reserves some of the address space for things it does on behalf of the process.

--

On 32-bit Linux the upper 1GB is reserved for the operating system kernel.

--

Only a portion of the usable address space has memory allocated to it.
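--

On Linux you can see which parts of a process's address space currently have memory mapped: each line of `/proc/<pid>/maps` describes one region (a quick look; more on the `proc` file system later):

```shell
head -4 /proc/$$/maps   # first few mapped regions of the current shell
wc -l /proc/$$/maps     # number of mapped regions
```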
--

Standard 32-bit Linux memory layout:

<img src="img/lnxmem.png" width="80%" style="display: block; margin: auto;" />

---

The standard heap can only grow to 1G.

--

`malloc` implementations can allocate more using memory mapping.

--

Obtaining large amounts of contiguous address space can be hard.

--

Memory allocation can slow down when memory mapping is needed.

--

Other operating systems differ in detail only.

--

64-bit machines are much less limited.

---

The design matrix for `\(n\)` cases and `\(p\)` variables stored in double precision needs `\(8np\)` bytes of memory:

|             | `\(p = 10\)` | `\(p = 100\)` | `\(p = 1000\)` |
|-------------|-----------:|------------:|------------:|
| n = 100     | 8,000      | 80,000      | 800,000     |
| n = 1,000   | 80,000     | 800,000     | 8,000,000   |
| n = 10,000  | 800,000    | 8,000,000   | 80,000,000  |
| n = 100,000 | 8,000,000  | 80,000,000  | 800,000,000 |

---

### Virtual and Physical Memory

To use address space, a process must ask the kernel to map physical space to the address space.

--

.pull-left[
There is a hierarchy of physical memory:

<img src="img/memhier.png" width="50%" style="display: block; margin: auto;" />
]

--

.pull-right[
The hardware/OS hides the distinction.
{{content}}
]

--

Caches are usually on or very near the processor chip and very fast.
{{content}}

--

RAM usually needs to be accessed via the bus.
{{content}}

--

The hardware/OS try to keep recently accessed memory and nearby locations in cache.

---

A simple example:

.pull-left.small-code[

```r
msum <- function(x) {
    nr <- nrow(x)
    nc <- ncol(x)
    s <- 0
    for (i in 1 : nr)
        for (j in 1 : nc)
            s <- s + x[i, j]
    s
}

m <- matrix(0, nrow = 5000000, ncol = 2)

system.time(msum(m))
##    user  system elapsed
##   1.712   0.000   1.712
```
{{content}}
]

--

```r
fix(msum) ## reverse the order of the sums
system.time(msum(m))
##    user  system elapsed
##   0.836   0.000   0.835
```

--

.pull-right[
* Matrices are stored in _column major order_.
{{content}}
]

--

* This effect is more pronounced in low-level code.
{{content}}

--

* Careful code tries to preserve _locality of reference_.

---

### Registers

Registers are storage locations on the processor that can be accessed very fast.

--

Most basic processor operations operate on registers.

--

Most processors have separate sets of registers for integer and floating point data.

--

On some processors, including i386 and x64, the floating point registers have _extended precision_.

--

<!--
The i386 architecture has few registers, 8 floating point, 8 integer
data, 8 address; some of these have dedicated purposes. Not sure about
x86_64 (our lab computers). RISC processors usually have 32 or more of
each kind.
-->

Optimizing compilers work hard to keep data in registers.

--

Small code changes can cause dramatic speed changes in optimized code because they make it easier or harder for the compiler to keep data in registers.

--

If enough registers are available, then some function arguments can be passed in registers.

--

Vector support facilities, like SSE2, provide additional registers that compilers may use to improve performance.

---
layout: true

## Processes and Shells

---

A _shell_ is a command line interface to the computer's operating system.

--

Common shells on Linux and MacOS are `bash` and `tcsh`.

--

You can now set your default Linux shell at <https://hawkid.uiowa.edu/>.

--

Shells are used to interact with the file system and to start _processes_ that run _programs_.

--

You can set process limits and environment variables in the shell.

--

Programs run from shells take command line arguments.
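---

A small `bash` sketch of the last two points; `myscript.R` and the values here are made up for illustration:

```shell
ulimit -t 60                # limit processes started from this shell to 60 CPU seconds
export OMP_NUM_THREADS=4    # an environment variable inherited by child processes
Rscript myscript.R alpha 7  # 'alpha' and '7' are command line arguments
```

Inside R, `commandArgs(trailingOnly = TRUE)` returns the arguments as a character vector, here `c("alpha", "7")`.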
---

### Some Basic `bash`/`tcsh` Commands

`hostname` prints the name of the computer the shell is running on.

--

`pwd` prints the current working directory.

--

`ls` lists the files in a directory:

* `ls` lists files in the current directory.
* `ls foo` lists files in a sub-directory `foo`.

--

`cd` changes the working directory:

* `cd` or `cd ~` moves to your home directory;
* `cd foo` moves to the sub-directory `foo`;
* `cd ..` moves up to the parent directory.

--

`mkdir foo` creates a new sub-directory `foo` in your current working directory.

--

`rm` and `rmdir` can be used to remove files and directories; **BE VERY CAREFUL WITH THESE!!!**

---

### Standard Input, Standard Output, and Pipes

Programs can also be designed to read from _standard input_ and write to _standard output_.

--

.pull-left[
Shells can _redirect_ standard input and standard output.
{{content}}
]

--

Shells can also connect processes into _pipelines_.
{{content}}

--

On multi-core systems pipelines can run in parallel.
{{content}}

--

A simple example using the `bash` shell script [`P1.sh`](https://homepage.divms.uiowa.edu/~luke/classes/STAT7400-2023/examples/pipes/P1.sh):

```shell
#!/bin/bash

while true; do echo $1; done
```

--

.pull-right[
This can be run, using the `rev` program, as

```shell
bash P1.sh fox
bash P1.sh fox > /dev/null
bash P1.sh fox | rev
bash P1.sh fox | rev > /dev/null
bash P1.sh fox | rev | rev > /dev/null
```
{{content}}
]

--

Examples are available [here](https://homepage.divms.uiowa.edu/~luke/classes/STAT7400-2023/examples/pipes).

---

### The `proc` File System

The `proc` file system allows you to view many aspects of a computer and a process.
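--

For example, these are ordinary text files you can read (the output varies by machine):

```shell
head /proc/cpuinfo   # one block of processor details per logical CPU
head /proc/meminfo   # current memory and swap usage
cat /proc/$$/status  # summary information for the current shell process
```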