class: center, middle, title-slide

.title[
# Basic Computer Architecture
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2023-05-19
]

---

<link rel="stylesheet" href="stat7400.css" type="text/css" />

## Typical Machine Layout

<img src="img/arch.png" width="80%" style="display: block; margin: auto;" />

Figure based on M. L. Scott, _Programming Language Pragmatics_, Figure 5.1, p. 205

---
layout: true

## Structure of Lab Workstations

---

### Processor and Cache

.pull-left.tiny-code.width-55[

```shell
luke@l-lnx200 ~% lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   24
  On-line CPU(s) list:    0-23
Vendor ID:                GenuineIntel
  Model name:             12th Gen Intel(R) Core(TM) i9-12900K
    CPU family:           6
    Model:                151
    Thread(s) per core:   2
    Core(s) per socket:   16
    Socket(s):            1
    Stepping:             2
    CPU(s) scaling MHz:   59%
    CPU max MHz:          5200.0000
    CPU min MHz:          800.0000
    BogoMIPS:             6374.40
    Flags:                ...
Virtualization features:
  Virtualization:         VT-x
Caches (sum of all):
  L1d:                    640 KiB (16 instances)
  L1i:                    768 KiB (16 instances)
  L2:                     14 MiB (10 instances)
  L3:                     30 MiB (1 instance)
...
```
]

--

.pull-right.width-40[
* There is a single _16-core_ processor with _hyperthreading_ that acts like 24 separate processors.
{{content}}
]

--

* _Hyperthreading_ is enabled, which makes a core behave, to some extent, like two processors; on this chip only the 8 performance cores support it, so there are `\(8 \times 2 + 8 = 24\)` logical CPUs.
{{content}}

--

* The total L3 cache is 30 MiB.

---

### Memory and Swap Space

.pull-left.tiny-code.width-60[

```shell
luke@l-lnx200 ~% free -m
       total   used    free  shared  buff/cache  available
Mem:   31924   1144   20258      25       10521      30258
Swap:  24255      0   24255
```
]

--

.pull-right.width-40[
* The workstations have about 32G of memory.
{{content}}
]

--

* The swap space is about 24G.

--

### Disk Space

.pull-left.tiny-code.width-60[
Using the `df` command produces:

```shell
luke@l-lnx200 ~% df -BG
Filesystem                 1G-blocks  Used  Available  Use%  Mounted on
...
/dev/nvme0n1p2                    1G    1G         1G   16%  /boot
/dev/mapper/vg00-tmp              8G    1G         8G    1%  /tmp
/dev/mapper/vg00-var             47G   15G        31G   33%  /var
...
/dev/mapper/vg00-scratch         26G    1G        26G    1%  /var/scratch
...
clasnetappvm...:/students       300G  144G       157G   48%  /mnt/nfs/clasnetappvm/students
...
clasnetappvm...:/shared          82G   20G        63G   24%  /mnt/nfs/clasnetappvm/shared
...
clasnetappvm...:/grad          1536G  560G       977G   37%  /mnt/nfs/clasnetappvm/grad
...
```
]

--

.pull-right.width-40[
* Local disks are large but mostly unused.
{{content}}
]

--

* Space in `/var/scratch` can be used for temporary storage.
{{content}}

--

* User space is on network disks.
{{content}}

--

* Network speed can be a bottleneck.
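---

To check how much space your own files use, and how much is free where they live, something like the following should work (a sketch; `du` can be slow on large network volumes):

```shell
du -sh ~    # total size of your home directory
df -h ~     # free space on the file system holding it
```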
---

### Performance Monitoring

.pull-left.tiny-code.width-55[
Using the `top` command produces:

```shell
top - 11:06:34 up 4:06, 1 user, load average: 0.00, 0.01, 0.05
Tasks: 127 total, 1 running, 126 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 99.8%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:  16393524k total,   898048k used, 15495476k free,  268200k buffers
Swap: 18481148k total,        0k used, 18481148k free,  217412k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1445 root      20   0  445m  59m  23m S  2.0  0.4   0:11.48 kdm_greet
    1 root      20   0 39544 4680 2036 S  0.0  0.0   0:01.01 systemd
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
    5 root       0 -20     0    0    0 S  0.0  0.0   0:00.00 kworker/0:0H
    6 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kworker/u:0
    7 root       0 -20     0    0    0 S  0.0  0.0   0:00.00 kworker/u:0H
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    9 root      RT   0     0    0    0 S  0.0  0.0   0:00.07 watchdog/0
   10 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1
   12 root       0 -20     0    0    0 S  0.0  0.0   0:00.00 kworker/1:0H
   13 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/1
   14 root      RT   0     0    0    0 S  0.0  0.0   0:00.10 watchdog/1
   15 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2
   17 root       0 -20     0    0    0 S  0.0  0.0   0:00.00 kworker/2:0H
   18 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/2
...
```
]

--

.pull-right.width-40[
* Interactive options allow you to kill or _renice_ (change the priority of) processes you own.
{{content}}
]

--

* The command `htop` may be a little nicer to work with.
{{content}}

--

* A GUI tool, `System Monitor`, is available from one of the menus. From the command line this can be run as `gnome-system-monitor`.

---

Another useful command is `ps` (process status):

```shell
luke@l-lnx200 ~% ps -u luke
  PID TTY          TIME CMD
 4618 ?        00:00:00 sshd
 4620 pts/0    00:00:00 tcsh
 4651 pts/0    00:00:00 ps
```

There are many options; see `man ps` for details.

---
layout: true

## Processors

---

### Basics

Processors execute a sequence of instructions.

--

Each instruction requires some of:

* decoding the instruction
* fetching operands from memory
* performing an operation (add, multiply, ...)
* etc.

--

Older processors would carry out one of these steps per clock cycle and then move on to the next.

--

Most modern processors use _pipelining_ to carry out some operations in parallel.

---

### Pipelining

A simple example:

`\(s \leftarrow 0\)`<br>
**for** `\(i = 1\)` **to** `\(n\)` **do**<br>
`\(s \leftarrow s + x_i y_i\)`<br>
**end**

<!--
Maybe use something like this:
https://gist.github.com/cderv/5479d8e526db635dd666293e25405b42
-->

--

Simplified view: each step has two parts,

* Fetch `\(x_i\)` and `\(y_i\)` from memory
* Compute `\(s = s + x_i y_i\)`

---

Suppose the computer has two functional units that can operate in parallel,

* An _Integer_ unit that can fetch from memory
* A _Floating Point_ unit that can add and multiply

--

If each step takes roughly the same amount of time, a pipeline can speed up the computation by a factor of two:

<img src="img/pipeline.png" width="80%" style="display: block; margin: auto;" />

---

* Floating point operations are much slower than this.

--

* Modern chips contain many more separate functional units.

--

* Pipelines can have 10 or more stages.

--

* Some operations take more than one clock cycle.

--

* The compiler or the processor orders operations to keep the pipeline busy.

--

* If this fails, then the pipeline _stalls_.
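---

To get some feel for this, here is a rough R illustration (a sketch: exact timings vary by machine, and much of the gap is interpreter overhead rather than pipelining alone). The vectorized form runs in compiled loops that the compiler can schedule to keep the pipeline busy:

```r
n <- 1e7
x <- runif(n)
y <- runif(n)

## dot product as an explicit interpreted loop
dotLoop <- function(x, y) {
    s <- 0
    for (i in seq_along(x))
        s <- s + x[i] * y[i]
    s
}

system.time(dotLoop(x, y))  # one interpreted step per element
system.time(sum(x * y))     # compiled inner loops keep functional units busy
```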
---

### Superscalar Processors, Hyper-Threading, and Multiple Cores

Some processors have enough functional units to have more than one pipeline running in parallel.

--

Such processors are called _superscalar_.

--

In some cases there are enough functional units per processor to allow one physical processor to pretend to be two (somewhat simpler) logical processors. This approach is called _hyper-threading_.

--

* Hyper-threaded processors on a single physical chip share some resources, in particular cache.

--

* Benchmarks suggest that hyper-threading produces about a 20% speed-up in cases where dual physical processors would produce a factor of 2 speed-up.

---

It is now possible to fully replicate processors within one chip; these are _multi-core_ processors.

* Multi-core machines are effectively full multi-processor machines (at least for most purposes).

--

* Dual-core processors are now ubiquitous.

--

* The department research machine `r-lnx404` has two 16-core processors.

--

* Our lab machines have a single 16-core processor.

--

* Processors with even more cores are available.

--

Many processors support some form of vectorized operations, e.g. SSE2 (Single Instruction, Multiple Data, Extensions 2) on Intel and AMD processors.

--

GPUs provide even more parallelism but require specialized programming.

---

### Implications

Modern processors achieve high speed through a collection of clever tricks.

--

Most of the time these tricks work extremely well.

--

Every so often a small change in code may cause pipelining heuristics to fail, resulting in a pipeline stall.

--

These small changes can then cause large differences in performance.

--

The chances are that a "small change" in code that causes a large change in performance was not in fact such a small change after all.

--

Processor speeds have not been increasing very much recently.

--

The `arm64` family (Apple M1), though, has produced some significant speedups while reducing power consumption.

--

Many believe that speed improvements will need to come from increased use of explicit parallel programming.

--

More details are available in a talk at

> https://www.infoq.com/presentations/click-crash-course-modern-hardware/

---
layout: true

## Memory

---

### Basics

.pull-left[
Data and program code are stored in memory.
{{content}}
]

--

Memory consists of [_bits_ (binary integers)](https://en.wikipedia.org/wiki/Bit).
{{content}}

--

On most computers:

* Bits are collected into groups of eight, called a [_byte_](https://en.wikipedia.org/wiki/Byte).
{{content}}

--

* There is a natural _word size_ of `\(W\)` bits.
{{content}}

--

* The most common value of `\(W\)` used to be 32; it is probably now 64; 16 also occurs.

--

.pull-right[
* Bytes are numbered consecutively, `\(0, 1, 2, \dots, N - 1\)`, with `\(N = 2^W\)`.
{{content}}
]

--

* An _address_ for code or data is a number between `\(0\)` and `\(N - 1\)` representing a location in memory, usually in bytes.
{{content}}

--

* `\(2^{32} = 4,294,967,296 = 4\text{GB}\)`.
{{content}}

--

* The maximum amount of memory a 32-bit process can address is 4 gigabytes.
{{content}}

--

* Some 32-bit machines can use more than 4G of memory, but each process gets at most 4G.
{{content}}

--

* Most hard disks are _much_ larger than 4G.

---

### Memory Layout

A process can conceptually access up to `\(2^W\)` bytes of address space.

--

The operating system usually reserves some of the address space for things it does on behalf of the process.

--

On 32-bit Linux the upper 1GB is reserved for the operating system kernel.

--

Only a portion of the usable address space has memory allocated to it.
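--

On Linux you can see which parts of a process's address space currently have memory mapped: each line of `/proc/<pid>/maps` describes one region (a quick look; more on the `proc` file system later):

```shell
head -4 /proc/$$/maps   # first few mapped regions of the current shell
wc -l /proc/$$/maps     # number of mapped regions
```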
--

Standard 32-bit Linux memory layout:

<img src="img/lnxmem.png" width="80%" style="display: block; margin: auto;" />

---

The standard heap can only grow to 1G.

--

`malloc` implementations can allocate more using memory mapping.

--

Obtaining large amounts of contiguous address space can be hard.

--

Memory allocation can slow down when memory mapping is needed.

--

Other operating systems differ in detail only.

--

64-bit machines are much less limited.

---

The design matrix for `\(n\)` cases and `\(p\)` variables stored in double precision needs `\(8np\)` bytes of memory:

|             | `\(p = 10\)` | `\(p = 100\)` | `\(p = 1000\)` |
|-------------|-----------:|------------:|------------:|
| n = 100     | 8,000      | 80,000      | 800,000     |
| n = 1,000   | 80,000     | 800,000     | 8,000,000   |
| n = 10,000  | 800,000    | 8,000,000   | 80,000,000  |
| n = 100,000 | 8,000,000  | 80,000,000  | 800,000,000 |

---

### Virtual and Physical Memory

To use address space, a process must ask the kernel to map physical space to the address space.

--

.pull-left[
There is a hierarchy of physical memory:

<img src="img/memhier.png" width="50%" style="display: block; margin: auto;" />
]

--

.pull-right[
The hardware/OS hides the distinction.
{{content}}
]

--

Caches are usually on or very near the processor chip and very fast.
{{content}}

--

RAM usually needs to be accessed via the bus.
{{content}}

--

The hardware/OS try to keep recently accessed memory and nearby locations in cache.

---

A simple example:

.pull-left.small-code[

```r
msum <- function(x) {
    nr <- nrow(x)
    nc <- ncol(x)
    s <- 0
    for (i in 1 : nr)
        for (j in 1 : nc)
            s <- s + x[i, j]
    s
}

m <- matrix(0, nrow = 5000000, ncol = 2)

system.time(msum(m))
##    user  system elapsed
##   1.712   0.000   1.712
```
{{content}}
]

--

```r
fix(msum) ## reverse the order of the sums
system.time(msum(m))
##    user  system elapsed
##   0.836   0.000   0.835
```

--

.pull-right[
* Matrices are stored in _column major order_.
{{content}}
]

--

* This effect is more pronounced in low-level code.
{{content}}

--

* Careful code tries to preserve _locality of reference_.

---

### Registers

Registers are storage locations on the processor that can be accessed very fast.

--

Most basic processor operations operate on registers.

--

Most processors have separate sets of registers for integer and floating point data.

--

On some processors, including i386 and x64, the floating point registers have _extended precision_.

--

<!--
The i386 architecture has few registers, 8 floating point, 8 integer
data, 8 address; some of these have dedicated purposes. Not sure about
x86_64 (our lab computers). RISC processors usually have 32 or more of
each kind.
-->

Optimizing compilers work hard to keep data in registers.

--

Small code changes can cause dramatic speed changes in optimized code because they make it easier or harder for the compiler to keep data in registers.

--

If enough registers are available, then some function arguments can be passed in registers.

--

Vector support facilities, like SSE2, provide additional registers that compilers may use to improve performance.

---
layout: true

## Processes and Shells

---

A _shell_ is a command line interface to the computer's operating system.

--

Common shells on Linux and MacOS are `bash` and `tcsh`.

--

You can now set your default Linux shell at <https://hawkid.uiowa.edu/>.

--

Shells are used to interact with the file system and to start _processes_ that run _programs_.

--

You can set process limits and environment variables in the shell.

--

Programs run from shells take command line arguments.
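---

A small `bash` sketch of the last two points; `myscript.R` and the values here are made up for illustration:

```shell
ulimit -t 60                # limit processes started from this shell to 60 CPU seconds
export OMP_NUM_THREADS=4    # an environment variable inherited by child processes
Rscript myscript.R alpha 7  # 'alpha' and '7' are command line arguments
```

Inside R, `commandArgs(trailingOnly = TRUE)` returns the arguments as a character vector, here `c("alpha", "7")`.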
---

### Some Basic `bash`/`tcsh` Commands

`hostname` prints the name of the computer the shell is running on.

--

`pwd` prints the current working directory.

--

`ls` lists the files in a directory:

* `ls` lists files in the current directory.
* `ls foo` lists files in a sub-directory `foo`.

--

`cd` changes the working directory:

* `cd` or `cd ~` moves to your home directory;
* `cd foo` moves to the sub-directory `foo`;
* `cd ..` moves up to the parent directory.

--

`mkdir foo` creates a new sub-directory `foo` in your current working directory.

--

`rm` and `rmdir` can be used to remove files and directories; **BE VERY CAREFUL WITH THESE!!!**

---

### Standard Input, Standard Output, and Pipes

Programs can also be designed to read from _standard input_ and write to _standard output_.

--

.pull-left[
Shells can _redirect_ standard input and standard output.
{{content}}
]

--

Shells can also connect processes into _pipelines_.
{{content}}

--

On multi-core systems pipelines can run in parallel.
{{content}}

--

A simple example using the `bash` shell script [`P1.sh`](https://homepage.divms.uiowa.edu/~luke/classes/STAT7400-2023/examples/pipes/P1.sh):

```shell
#!/bin/bash

while true; do echo $1; done
```

--

.pull-right[
This can be run, using the `rev` program, as

```shell
bash P1.sh fox
bash P1.sh fox > /dev/null
bash P1.sh fox | rev
bash P1.sh fox | rev > /dev/null
bash P1.sh fox | rev | rev > /dev/null
```
{{content}}
]

--

Examples are available [here](https://homepage.divms.uiowa.edu/~luke/classes/STAT7400-2023/examples/pipes).

---

### The `proc` File System

The `proc` file system allows you to view many aspects of a computer and a process.
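--

For example, these are ordinary text files you can read (the output varies by machine):

```shell
head /proc/cpuinfo   # one block of processor details per logical CPU
head /proc/meminfo   # current memory and swap usage
cat /proc/$$/status  # summary information for the current shell process
```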