22C:116, Lecture Notes, Lecture 7, Fall 1999

Douglas W. Jones
University of Iowa Department of Computer Science

Memory
What if the aggregate memory requirements of the system plus applications excede the available main memory. If there is auxiliary memory available, for example, on disk, there are at least three ways of solving the problem:
- Overlays
- Swapping
- Paging
Overlays are sections of an application's code or data that are stored on secondary memory when not in use. It is up to the application program to bring overlays into main memory when they are needed by that application, and it is up to the application to return overlays to disk, if necessary, when they are unneeded.
Typically, each overlay contains a procedure or function. The main program must load the overlay before calling the routine, and the routine in the overlay may call any routines that were loaded with the main program. In addition, an overlay may contain secondary routines called only from within that overlay, and it may load other overlays.
In the most primitive overlay systems, memory to hold the overlays is allocated statically. In more advanced overlay systems, memory to hold overlays is allocated dynamically. In such a system, a call to a routine contained in an overlay might be coded as follows:
```
	{ /* load overlay X and then call function x within it */
		int X;       /* file descriptor */
		int Xsiz;    /* size of file */
		void * Xpt;  /* pointer to overlay */

		/* first find out about the file */
		X = open("X", ...);
		Xsiz = lseek( X, 0, L_XTND );

		/* second, make space for the contents */
		Xpt = malloc( Xsiz  );

		/* third, load the file in memory */
		Xsiz = lseek( X, 0, L_SET );
		read( X, Xpt, Xsiz);

		/* then, call the function */
		(*Xpt)(params)

		/* finally, dispose of the memory it occupied */
		free( Xpt )
	}
```
The above C code assumes that the overlay can be loaded into memory by a simple call to read(). This will be true if the file contains position independent code, but a more complicated loader is required if any relocation or linkage needs to be performed.
A common form of overlay is the load-on-call procedure. This purpose of this is to hide the details of overlay management from the user of the overlay, packaging these details in a special procedure called an overlay stub. The only part of the procedure initially loaded in main memory is the load-on-call stub; a call to this loads the procedure into main memory and then transfers control to it. Thus, the act of calling the procedure causes it to be loaded.
The word stub is widely used in discussions of program development, dynamic linkage and distributed software, so it is important to make it clear that this word is not just technical jargon, but also common English.
A stub is the remainder of a branch or limb that has been cut or broken off of a tree or animal. In the case of software, the tree that is of interest is the abstract tree describing the history of procedure and function calls during the execution of a program. The root of this tree is the main program, and each routine called from the root corresponds to a branch. A routine that calls no other routines is a leaf of the tree, and if some branch of the tree is run on a different machine, or loaded by a different loader, or involves code that has not yet been written, the code at the base of that branch is referred to as a stub.
Consider the procedure X. In a program calling X where the memory is not sufficient for all routines needed by that program, a call to X might be linked to a stub instead of being linked to X. The following stub is designed to mimic the external interface of X, and when it is called, it loads X into memory, calls it, returns the results of X, and then deallocates the memory used so that it may be used for other routines.
```
	Xstub(params)
		Xpt = malloc( sizeof(X) )
		load( X, Xpt )
		(*Xpt)(params)
		free( Xpt )
		return
```
Here, sizeof(f) returns the size of the memory region needed by the object file f, and load(f,p) is a system routine that loads object file f into the memory region pointed to by p.
More intelligent overlay management stubs can swap in a routine only if that routine was not already in memory. Consider the following stub for X that shares memory with routines Y and Z:
```
        Xstub(params)
            if Xpt is null then
                Xpt = malloc( sizeof(X) )
                if Xpt is null then -- the allocate failed
                    if Ypt is not null then
                        free(Ypt)
                    else if Zpt is not null then
                        free(Zpt)
                    endif
                    Xpt = malloc( sizeof(X) )
                endif
                load( X, Xpt )
            endif
            (*Xpt)(params)
            return
```
Here, when X is first called, the stub will allocate space and load the code. On subsequent calls, the code will remain loaded so no new allocation is needed. If, however, space is needed for routines Y or Z, X may be deallocated. If X is needed ans space is unavailable, the stub for X will deallocate Y or Z to make space.
The above scheme rests on the assumption that the stubs for X, Y and Z cooperate to share memory, and it rests on the assumption that the code for X can be deallocated safely when Y or Z need the space! This assumption can be met if no procedure called from X calls Y or Z, no procedure called from Y calls X or Z, and no procedure called from X cally X or Y.
By the late 1960's, most computer manufacturers offered elaborate linkage editors that would analyze the procedure call graph of a program to determine what procedures could operate as peers in overlay management schemes such as the above. Today, such overlay managers are almost obsolete because of the widespread use of virtual memory.
Swapping involves the process scheduler. When swapping is used, memory assigned to ready processes is considered preemptable. If memory is needed, a ready process is selected to be swapped out. Swapping a process out of main memory involves copying the data segments of that process to a swapping device, usually a disk. When a process is selected to be run, if that process is swapped out, the system must first find or make space in main memory, then swap that process in, before running it.
This leads to a process state diagram such as the following:
```
         preempt
           or         swap
       relinquish     out
         ---->--     ---->-
       /         \ /        \
   RUNNING      READY     SWAPPED
       \         / \        /
         --<----     -<----
        schedule      swap
                       in
```
Swapping systems are at their best if main memory can be divided into at least two partitions, each able to hold one process. In this case, the process in one partition can be running while processes in other partitions are being copied to or from disk. The classic IBM mainframe operating systems of the 1960's operated this way, as did early timesharing systems for the DEC PDP-8 and PDP-10 (a minicomputer and a mainframe, respectively). The first versions of UNIX, on the DEC PDP-9 and PDP-11 were also based on this idea.
If memory on a uniprocessor was divided into N partitions, there could be at most N-1 ready processes, but typically, there were N-2 ready processes because at any given instant, the process in one partition was typically in transit to or from disk.
Virtual memory based on Paging and the related technique of Segmenting involves hardware address translation mechanisms that allow individual pages of the main memory being used by a program to be preemptively claimed by the system on behalf of other programs.
Virtual Memory
The subject of Virtual memory is on the interface between operating systems and computer architecture
```
  CPU <---> MMU <---> Memory
	   /
	  Address
          translation
          hardware
```
Addresses issued by the CPU are considered to be virtual addresses. The memory management unit translates these to physical addresses or it complains (with a trap request) that the translation is impossible or that the proposed access is not allowed.
Hardware Aspects of Address Translation
- Is the architecture paged or segmented?
- A page is a fixed size small block of memory
- A segment is a variable sized large block of memory
We will concentrate on paged memory, but there are many segmented machines. Notably, the 8086 family of microprocessors support a segmented memory model that is most clearly expressed in the 80286. This was not effectively exploited, and later 80x86 machines allowed users to largely ignore segmenting.
Pages are typically used without regard to the logical arrangement of memory -- that is, data structures are not typically arranged in any particular way with respect to page boundaries. Programmers on paged systems do not usually pay any attention to the presence of pages.
In a segmented system, each data structure is typically assigned to a particular segment. Thus, a segments might be dedicated to each procedure or to each activation record on the stack in a fine- grained segmented system. In a coarse-grained use of segmenting, one segment might be reserved for the stack, another for the code, and a third for the global data of a program.
On some systems, large segments are divided into pages.
On paged systems, the virtual address is viewed as a single integer that indexes a word (or byte) in a single linear virtual address space.
On segmented systems, the virtual address space may be a single linear space, or it may be divided into separate spaces. The 8086 is an example of the latter, where the segment being addressed is determined by the context. PUSH and POP reference the stack segment. Most loads and stores reference the data segment, and instruction fetches reference the code segment. (This default behavior may be overridden by special instruction prefixes).

Paged Address Translation

Here is a diagram of the data flow through the hardware used to translate a virtual address to a physical address in a simple paged virtual address translation memory management unit.

    Virtual Address from CPU
     ______________________
    |__________|___________|
         | Page      |______________________
         | ___ _________       Word in Page |
         |  | |         |                   |
    array|__| |  Page   |         __________|
    index   | |   Table |        |
            V |         |        |
           -->| |       |        |
              |_|_______|        |
               |    |Frame       |
             __|    |__          |
            |          |         |
            v     _____v_________v______
         Access  |__________|___________|
         Control
                 Physical Address to Memory

The picture shown here shows the logical function of the memory management unit, but there are many possible physical organizations -- if the page table is large, it is not practical to store it in the MMU!

The Contents of the Page Table

The format of one entry in the page table will typically be something similar to the following:

    _ _ _  _______ ______________________
  | _ _ _ |_______|______________________|
      |       |              |
      |       |            Frame -- where is the
      |       |                     page stored.
      |     Access
      |     Control -- what actions are allowed,
      |                by hardware on this page.
    Soft                         || invalid
    fields -- used by software   || read-only
              to record other    || read-write
  (optional)  attributes of the  || execute-only
              page.              || dirty

Some hardware leaves space for software use in each page-table entry. This allows the system software to maintain additional attributes of pages that are not known to the hardware. On other systems, the sofware may have to allocate space for any such fields in a parallel array. In any case, the hardware does not interpret the soft fields in any way, and on some systems, the software may not need such fields.

Brute Force Page Table Storage
On systems with a small virtual address space, the page table can be stored in hardware registers within the memory management unit.
Consider a machine with: 16 bits per virtual address, 8 bits to indicate which word in page, 8 bits to indicate which page.
This requires 256 registers in the MMU to hold one page table.
Thus, Even with special hardware, context switching may be quite slow compared to a machine where only the registers in the CPU are saved and restored.
The above example is based on the MODCOMP Classic series of computers first sold in around 1973 (www.modcomp.com still exists). This were extensively used by NASA (in the space shuttle ground support system) and the oil industry (in real-time control of oil refineries and drilling operations). The machine had 16 registers of 16 bits each in the CPU, usable as 8 registers of 32 bits each.
Alternatives for Page Table Storage
The table can be stored in main memory. In this case, the MMU takes 2 memory cycles per memory reference, unless the table is store in main memory with a cache located in the MMU to speed access.
Cache-based paging systems are simplest if the cache size is the same as the number of page frames in main memory.
The Ferranti Atlas computer, introduced in 1960, was the very first machine to use this scheme; the page table was stored in main memory, and there was one cache entry per page frame so that there was no significant run-time penalty for address translation.
```
Cache Entry
   _______________________________
  |____________|__________________|
	| Page         | Table
	| Number       | Entry
	|              | 
   Subject to     Returned by
   Associative    Successful
   Search         Search
```
In its simplest form, the associative search involved in searching the cache involves a simple parallel comparison of the page number in each cache entry with the page number issued by the CPU. This can be done very quickly with appropriate hardware. Cache hardware to support fully associative searches is expensive, but the alternatives are the subject of an architecture course and are will not be discussed here.
If the cache size is the same as the number of page frames in main memory, a cache miss signals a page fault, and the software can be put in charge of replacing the cache entry as it replaces the page in the page frame. This approach results in the simplest possible cache hardware, since the hardware need not manage cache misses and replacement.
The Software Hardware Interface
The page fault trap service routine gains control when the MMU detects access to a page that is not in main memory.
A trap is very similar to an interrupt. The traditional distinction is that interrupts are caused by asynchronous events, while traps are caused by the actions of the CPU. Other traps include such things as arithmetic exception traps (caused by the values of operands to arithmetic instructions), and bus-fault traps (caused by attempts to access physical addresses to which no physical memory is attached).
On some machines, there is an additional convention -- When an interrupt is requested, the current instruction is allowed to complete before the PC is saved and the interrupt service routine is entered. When a trap occurs, the current instruction is aborted and the trap service routine is entered directly.
For it to be possible to serve a page fault, the instruction must be aborted before it has any side effects, or (as on the PDP-11) all side effects must be recorded so that they can be undone by the trap service routine prior to any attempt to restart the instruction.
The exception to the above rule is on machines such as the Intel 8086 family, where some instructions are designed to be interruptable -- typically, this applies to block copy instructions. These instructions, if interrupted, save their state in CPU registers in such a way that they can be restarted from the point of interruption instead of restarting from the beginning.
Page Fault Traps
On entry to the trap service routine, the saved PC is the address of the instruction that caused the trap
Before return, the trap service routine must adjust the page table so that re-executing the instruction that caused the trap will not cause the same page fault.
On the PDP-11, the saved PC was not the correct one, but sufficient information was saved about the partial exectuion of the instruction the the correct PC could be reconstructed by the trap service routine.
A single instruction can cause many page faults. Consider a memory to memory move double word instruction. The instruction itself will typically be more than one word long, and it may span a page boundary. Both the source and destination addresses may span page boundaries. Thus, an attempt to execute this single instruction might cause as many as 6 page faults.
Clearly, if there are fewer than 6 page frames on the machine, this instruction cannot be executed. In general, the instruction set of a virtual machine must be examined to determine the maximum number of pages a single instruction can execute, and then that number of page frames must be guaranteed.
Each entry to the page fault trap service routine must bring in one of the pages needed to complete an instruction. It need not bring in all of the required pages, since it is frequently faster to simply retry the instruction to cause the next fault than it is to use software to try to determine what other pages the instruction might need.