7. Virtual Memory

Part of the 22C:116 Lecture Notes for Fall 2002
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Primary and Secondary Memory

What if the aggregate memory requirements of the system plus applications exceed the available main memory? If there is auxiliary memory available, for example, on disk, there are at least three ways of using this as a secondary memory or backing storage:

Overlays

Overlays are sections of an application's code or data that are stored on secondary memory when not in use. It is up to the application to read overlays into main memory when they are needed, and it is up to the application to write overlays back to secondary memory, if necessary, when they are unneeded.

Typically, each overlay contains a procedure or function. The main program must load the overlay before calling the routine, and the routine in the overlay may call any routines that were loaded with the main program. In addition, an overlay may contain secondary routines called only from within that overlay, and it may load other overlays.

In the most primitive overlay systems, memory to hold the overlays is allocated statically. In more advanced overlay systems, memory to hold overlays is allocated dynamically. In such a system, a call to a routine contained in an overlay might be coded as follows:

	{ /* load overlay X and then call function x within it */
		int X;               /* file descriptor */
		int Xsiz;            /* size of file */
		void (*Xpt)();       /* pointer to the overlay's entry point */

		/* first find out about the file */
		X = open( "X", O_RDONLY );
		Xsiz = lseek( X, 0, SEEK_END );

		/* second, make space for the contents */
		Xpt = (void (*)()) malloc( Xsiz );

		/* third, load the file into memory */
		lseek( X, 0, SEEK_SET );
		read( X, (void *) Xpt, Xsiz );
		close( X );

		/* then, call the function */
		(*Xpt)( params );

		/* finally, dispose of the memory it occupied */
		free( (void *) Xpt );
	}

The above C code assumes that the overlay can be loaded into memory by a simple call to read(). This will be true if the file contains position independent code, but a more complicated loader is required if any relocation or linkage needs to be performed.

A common form of overlay is the load-on-call procedure. The purpose of this is to hide the details of overlay management from the user of the overlay, packaging these details in a special procedure called an overlay stub. The only part of the procedure initially loaded in main memory is the load-on-call stub; a call to this stub loads the procedure into main memory and then transfers control to it. Thus, the act of calling the procedure causes it to be loaded.

The word stub is widely used in discussions of program development, dynamic linkage and distributed software, so it is important to make it clear that this word is not just technical jargon, but also common English.

A stub is the remainder of a branch or limb that has been cut or broken off of a tree or animal. In the case of software, the tree that is of interest is the abstract tree describing the history of procedure and function calls during the execution of a program. The root of this tree is the main program, and each routine called from the root corresponds to a branch. A routine that calls no other routines is a leaf of the tree, and if some branch of the tree is run on a different machine, or loaded by a different loader, or involves code that has not yet been written, the code at the base of that branch is referred to as a stub.

Consider the procedure X. In a program calling X where the memory is not sufficient for all routines needed by that program, a call to X might be linked to a stub instead of being linked to X. The following stub is designed to mimic the external interface of X, and when it is called, it loads X into memory, calls it, returns the results of X, and then deallocates the memory used so that it may be used for other routines.
	Xstub(params)
		Xpt = malloc( sizeof(X) )
		load( X, Xpt )
		(*Xpt)(params)
		free( Xpt )
		return

Here, sizeof(f) returns the size of the memory region needed by the object file f, and load(f,p) is a system routine that loads object file f into the memory region pointed to by p.

More intelligent overlay management stubs can swap in a routine only if that routine was not already in memory. Consider the following stub for X that shares memory with routines Y and Z:

        Xstub(params)
            if Xpt is null then
                Xpt = malloc( sizeof(X) )
                if Xpt is null then -- the allocate failed
                    if Ypt is not null then
                        free(Ypt)
                    else if Zpt is not null then
                        free(Zpt)
                    endif
                    Xpt = malloc( sizeof(X) )
                endif
                load( X, Xpt )
            endif
            (*Xpt)(params)
            return

Here, when X is first called, the stub will allocate space and load the code. On subsequent calls, the code will remain loaded so no new allocation is needed. If, however, space is needed for routines Y or Z, X may be deallocated. If X is needed and space is unavailable, the stub for X will deallocate Y or Z to make space.

The above scheme rests on the assumption that the stubs for X, Y and Z cooperate to share memory, and it rests on the assumption that the code for X can be deallocated safely when Y or Z need the space! This assumption can be met if no procedure called from X calls Y or Z, no procedure called from Y calls X or Z, and no procedure called from Z calls X or Y.

By the late 1960's, most computer manufacturers offered elaborate linkage editors that would analyze the procedure call graph of a program to determine what procedures could operate as peers in overlay management schemes such as the above. Today, such overlay managers are almost obsolete because of the widespread use of virtual memory.

Overlays allow a single program to run in a main memory region smaller than that program. The next technique we will discuss allows multiple processes to run in a memory that cannot hold all of them at once. Note, however, that it is almost always possible to break a single program into multiple communicating processes, where the purpose of the breakup is not to achieve parallelism, but merely to reduce the size of the address space needed by any single component of the original program. Therefore, we can also use the next idea as an alternative to overlays!

Swapping

Swapping involves the process scheduler. The term was originally based on the common English verb to swap meaning to exchange, because the basic idea involved exchanging a process (or process image) stored on disk for one stored in main memory. When swapping is used, memory assigned to ready processes is considered preemptable. If memory is needed, a ready process is selected to be swapped out. Swapping a process out of main memory involves copying the data segment(s) of that process to a swapping device, usually a disk. When a process is selected to be run, if that process is swapped out, the system must first find or make space in main memory, then swap that process in, before running it.

This leads to a new process state swapped. Ready processes become swapped when their memory resources are moved to secondary memory, and swapped processes become ready when their memory resources are moved back to primary memory. Swapped processes cannot be run because the CPU is unable to execute instructions from secondary memory and unable to load or store operands of machine instructions in secondary memory!

         preempt
           or         swap
       relinquish     out
         ---->--     ---->-
       /         \ /        \
   RUNNING      READY     SWAPPED
       \         / \        /
         --<----     -<----
        schedule      swap
                       in

Swapping systems are at their best if main memory can be divided into at least two partitions, each able to hold one process. In this case, the process in one partition can be running while processes in other partitions are being copied to or from disk. The classic IBM mainframe operating systems of the 1960's operated this way, as did early timesharing systems for the DEC PDP-8 and PDP-10 (a minicomputer and a mainframe, respectively). The first versions of UNIX, on the DEC PDP-7 and PDP-11, were also based on this idea.

It is worth noting that, on these systems of the 1960's, swapping was frequently done to a drum memory, so early system descriptions frequently speak of the swapping drum.

If memory on a uniprocessor was divided into N partitions, there could be at most N-1 ready processes and one running process. If the total number of processes that could be run was greater than N, there would typically be only N-2 ready processes because, at any given instant, the process in one of the partitions was in transit to or from disk.

Paging

Virtual memory based on paging and the related technique of segmenting involves hardware address translation mechanisms that allow individual pages of the main memory used by a program to be preemptively claimed by the system for use by other programs.

The subject of virtual memory lies on one of the most important interfaces between operating systems and computer architecture because it depends on the nature of the memory management unit hardware, and the memory management unit hardware exists primarily to support virtual memory. The memory management unit sits between the processor and main memory:

         _____               data               _____
        |     |  |<------------------------>|  |main |
        | CPU |--|          _____           |--| RAM |
        |_____|  |-------->|     |--------->|  |_____|
                   virtual | MMU | physical
                   address |_____| address

Addresses issued by the CPU are considered to be virtual addresses. The memory management unit translates these to physical addresses or it complains (with a trap request) that the translation is impossible or that the proposed access is not allowed.

Two primary models of address translation have been explored by the designers of memory management units: Paged memory management units operate in terms of fixed-size pages of memory, while segmented memory management units operate in terms of variable-size segments of memory. Mixed models are common in modern hardware; most of these are based on variable sized segments where each segment is composed of one or more fixed size pages.

We will concentrate on paged memory, but there are many segmented machines. Notably, the 80x86 family of microprocessors supports a segmented memory model that is most clearly expressed in the 80286. This model was not effectively exploited by most operating systems developed for the x86 family.

Pages are typically used without regard to the logical arrangement of memory -- that is, data structures are not typically arranged in any particular way with respect to page boundaries. Programmers on paged systems do not usually pay any attention to the presence of pages. The exception to this is typically the heap manager; better heap managers know of the presence of pages in the heap and enlarge the heap by requesting extra pages when it fills, and also sometimes relinquish unneeded pages in the middle of large free blocks.

In a segmented system, each data structure is typically assigned to a particular segment. Thus, in a fine-grained segmented system, a segment might be dedicated to each procedure or to each activation record on the stack. In a coarse-grained use of segmenting, one segment might be reserved for the stack, another for the code, and a third for the heap and global data of a program.

On paged systems, users generally view the virtual address as a single integer that indexes a word (or byte) in a single linear virtual address space. In contrast, on segmented systems, the virtual address space may be a single linear space, or it may be divided into separate spaces. The 80x86 is an example of the latter, where the segment being addressed is determined by the context. By default, PUSH and POP reference the stack segment, most loads and stores reference the data segment, and instruction fetches reference the code segment, but this default behavior may be overridden by special instruction prefixes.

Paged Address Translation

Here is a diagram of the data flow through the memory management unit hardware used to translate a virtual address to a physical address in a simple paged virtual address space.

    Virtual Address from CPU
     ______________________
    |__________|___________|   Word in Page
         | Page      |___________
         | ___ _________         |
         |  | | |       |        |
    array|__| |  Page   |        |
    index   | |   Table |        |
            V |         |        |
           -->| |       |        |
              |_|_______|        |
               |    |Frame       |
             __|    |__          |
            |          |         |
            v     _____v_________v______
         Access  |__________|___________|
         Control
                 Physical Address to Memory

The picture shown here shows the logical function of the memory management unit, but there are many possible physical organizations -- if the page table is large, it is not practical to store it entirely in a table within the MMU! However the page table is stored, the virtual address is divided into two fixed size fields, where the most significant field gives the page number in the address space, and the least significant gives the word number in that page (or byte-in-page, for byte addressable memory). The physical address is similarly divided into a frame number field and a word-in-frame field. The translation function, done by a lookup in the page table, only deals with translating the page number to a frame number; the least significant bits of the address are not changed.

The Contents of a Page Table Entry

The format of one entry in the page table will typically be something similar to the following:

    _ _ _  _______ ______________________
  | _ _ _ |_______|______________________|
      |       |              |
      |       |            Frame -- where is the
      |       |                     page stored.
      |     Access
      |     Control ------- what actions are allowed,
      |                     by hardware on this page:
    Soft                         || invalid
    fields -- used by software   || read-only
              to record other    || read-write
  (optional)  attributes of the  || execute-only
              page.              || dirty

Some hardware allows extra bits in each page-table entry where those bits have no hardware function; these bits are available for use by software for any purpose the software may have, so they are called the soft fields of the page table entry. If the software needs extra space beyond what the hardware may have provided, it is always possible to allocate this in a parallel array unknown to the MMU hardware but also indexed by the page number.

Page Table Storage Options

On systems with a small virtual address space, the page table can be stored entirely in hardware registers within the memory management unit.

Consider the MODCOMP IV computer, a minicomputer first sold in around 1973. The MODCOMP Classic series of computers descended from the MODCOMP IV remained in production for many years, and in fact, www.modcomp.com still exists and still offers some support for their Classic line of processors. The MODCOMP IV had a 16-bit virtual address, 8 bits giving the page number and 8 bits giving the 16-bit word in page. The address map for this machine therefore occupied 256 internal registers in the MMU. This allowed very fast address translation, at the cost of slow context switching. Loading or storing one address map took as much as 256 memory cycles! The designers of this machine built the MMU with 1K internal registers, so that the MMU could hold 4 complete address maps. As a result, so long as the number of ready processes remained small, it was rarely necessary to load and store maps when context switching between processes. All that was needed was to change a small map-select register.

If the page table is too large to store in special purpose registers inside the MMU, it can be stored in main memory. In this case, the MMU needs to contain a page-table base register plus a memory data register. Context switching between virtual address spaces can then be very fast: all that is needed is a change to the page-table base register. Unfortunately, this has its cost! For each address to be translated, the MMU must read a page-table entry from main memory into the internal memory data register, and then use this plus information from the virtual address to compute a physical address. Thus, each memory cycle of the CPU takes two memory cycles, one to translate the address and one to do useful work.

Despite this overhead, this solution is very common! The key to efficient implementation of this is to add a special cache memory called the translation lookaside buffer. This is a cache memory dedicated to memory address translation and integrated into the memory management unit. So long as the address being translated is in this cache, no extra memory cycles are needed and the MMU operates as quickly as a simple MMU holding the entire page table in internal registers. From a software perspective, context switching is simple, just change the page-table base register in the MMU, but this has a run-time cost because each time this register is changed, the entire contents of the cache are invalidated by hardware.

The Ferranti Atlas computer, introduced in 1962, was the very first machine to support paged address translation, and the MMU scheme it used was remarkably close to a TLB based scheme, although the MMU on the Atlas was actually implemented in a distributed manner, distributed over all of the physical memory modules of the machine.

Each entry in the TLB of a typical modern machine has the following form:

Cache Entry
   _______________________________
  |____________|__________________|
	| Page         | Page Table
	| Number       | Entry
	|              | 
   Subject to     Returned by
   Associative    Successful
   Search         Search

In its simplest form, the associative search involves a simple parallel comparison of the page number in each cache entry with the page number issued by the CPU. This can be done very quickly with appropriate hardware. Cache hardware to support fully associative searches is expensive, but the alternatives are the subject of an architecture course and will not be discussed here.

If the cache size is the same as the number of page frames in main memory, a cache miss signals a page fault, and the software can be put in charge of replacing the cache entry as it replaces the page in the page frame. This is what was done on the Atlas. This approach results in the simplest possible cache hardware, since the hardware need not manage cache misses and replacement. Several modern RISC processors use variations on this simple model, but most MMU designs from the 1970's and early 1980's (including that of the Pentium) rest on hardware cache management.

The Page-Fault Trap Service Routine

The page-fault trap service routine gains control when the MMU detects an attempt to use a page-table entry that is invalid or an attempt to perform an operation that is not permitted by the access-control bits in the page-table entry.

A trap is very similar to an interrupt. The traditional distinction is that interrupts are caused by asynchronous events, while traps are caused by the actions of the CPU. Other traps include such things as arithmetic exception traps (caused by the values of operands to arithmetic instructions), and bus-fault traps (caused by attempts to access physical addresses to which no physical memory is attached).

On some machines, there is an additional convention -- When an interrupt is requested, the current instruction is allowed to complete before the PC is saved and the interrupt service routine is entered. When a trap occurs, the current instruction is aborted and the trap service routine is entered directly.

For it to be possible to serve a page fault, the instruction must be aborted before it has any side effects, or (as was done on the DEC PDP-11) all side effects must be recorded so that they can be undone by the trap service routine prior to any attempt to restart the instruction.

The exception to the above rule is on machines such as the Intel 80x86 family, where some instructions are designed to be interruptible -- typically, this applies to block copy instructions. These instructions, if interrupted, save their state in CPU registers in such a way that they can be restarted from the point of interruption instead of restarting from the beginning.

The memory management unit hardware requests a page-fault trap when an untranslatable virtual address is encountered. The response to this trap is entirely under the control of the operating system software, specifically the page fault trap service routine.

For all interrupt and trap service routines, the hardware and software cooperate to save the program counter and other key registers as control is transferred to the trap service routine. The details of how this is done vary considerably from machine to machine. On completion of the entry to the page-fault trap service routine, the saved PC is the address of the instruction that caused the trap.

Before return, the trap service routine must adjust the page table so that re-executing the instruction that caused the trap will not cause the same page fault. On return, the state of the CPU is returned exactly as it was prior to executing the instruction that caused the trap. As a result, from the perspective of the program, it will be as if there was no page fault trap!

The PDP-11, sold from 1970 into the 1980's, had a complex instruction set where some instructions could have several side effects before they computed a memory address that could cause a trap. The memory management unit on this machine included a special register that recorded these side effects (what registers had been incremented or decremented by how much); on entry to the page-fault service routine, the first computation required was to undo these side effects, restoring the CPU state to what it had been before the instruction that caused the trap.

A single instruction can potentially cause many page faults. Consider a memory to memory move double word instruction on a machine like the PDP-11 or the Motorola M68000. The instruction itself will typically be more than one word long, so the instruction may span a page boundary. Furthermore, both the source and destination addresses may reference double words that span page boundaries. Thus, an attempt to execute this single instruction could cause as many as 6 page faults!

Clearly, if there are fewer than 6 page frames on the machine, this instruction cannot be executed. In general, the instruction set of the machine must be examined to determine the maximum number of pages a single instruction can reference, and then that number of page frames must be guaranteed.

Each entry to the page fault trap service routine must bring in one of the pages needed to complete an instruction. It need not bring in all of the required pages, since it is frequently faster to simply retry the instruction again and again so that the CPU can compute the successive addresses and let the MMU check each to see if it causes a fault.