22C:116, Lecture Notes, Lecture 8, Fall 1999

Douglas W. Jones
University of Iowa Department of Computer Science

Page Faults
When a program references a page that the memory management unit is unable to handle, or when the reference requires access rights that are not allowed by the memory management unit, the memory management unit reports a page fault by forcing the CPU to enter a trap service routine, usually a part of the operating system, but there are systems such as Mach where page faults may be handled by application-level code.
The instruction that was executing at the time of the trap is aborted! On entry to any trap service routine or interrupt service routine, a properly designed CPU will cooperate with the prologue (the introductory sequence of instructions) of the service routine to save the values of all of the registers of the interrupted program, including the PC. In the case of trap handlers, the saved values should be the values that were in effect immediately prior to the instruction that caused the trap.
Note: On some machines, the values for some saved registers include changes made by the instruction that caused the trap. For example, the program counter may have been incremented as a side effect of the instruction fetch prior to the memory cycle that caused a page fault, or the stack pointer may have been decremented prior to the memory cycle that caused the fault. If such a machine is to be used to support demand-paged virtual memory, there must be a way to determine what registers have been changed so that the prologue on the page fault trap service routine can undo these changes!
In theory, given the PC and registers prior to the instruction that caused the trap, the trap service routine can simulate the execution of the next instruction to determine all of the memory addresses it might use; for each address, it can check to see which of these might have caused the trap. This is simple enough that on some machines, the memory management units do not save the actual virtual address that was used to cause the trap; in this case, the address must be computed by the prologue to the trap service routine. The software is simplified, however, if the MMU saves the virtual address that caused the trap, thus saving the trap service routine from having to simulate an instruction cycle.
Here, in outline form, is a generic page fault service routine:
```
page fault service routine:
    -- prologue
    save state prior to instruction that caused the trap

    -- find necessary information about cause of fault
    va = virtual address that was referenced
    op = operation that was attempted (read or write)

    -- is this an access violation?
    if map[va.pageno].rights = read_only
    and if op = write
        handle as user access violation (bus trap!)
    else

        -- find a free page frame
        if there are free page frames
            pa = free frame address
        else

            -- page replacement policy
            pa = address of page to replace
            va' = virtual address of page in that frame
            write( pa, diskaddress( va' ))
            update virtual address map, etc to reflect change
        endif

        -- bring in the required page
        read( pa, diskaddress( va ))
        update virtual address map, etc to reflect change

        -- epilogue
        restore saved state
        return from page fault service routine
    endif
```
This service routine faces two key choices; what page(s) should be copied into main memory from disk, and, assuming that main memory is full, what page(s) should be copied out from main memory to disk in order to make space.
What page should be copied into main memory?
- demand paging -- when a page fault occurs, only the page containing the memory location that was referenced to cause the page fault is moved into main memory. This was shown in the above code.
- anticipatory paging -- when a page fault occurs, the software tries to anticipate other pages that might be needed soon, and brings them into memory at the same time. In addition, some systems provide mechanisms that allow user programs to explicitly initiate the transfer of a page into memory even though that page has yet to be referenced.
What page in main memory should be replaced? In the above code, this issue was glossed over with one line saying ``address of page to replace''. Here are some choices
- Random if you do worse than this, shame!
- FIFO curiously, as bad as random.
- LIFO obviously horrible!
- LRU the best practical algorithm.
- Clock almost as good as LRU, commonly used.
- Belady's optimal policy impossible to use.
- Random page replacement requires no intelligence and works moderately well. When a page frame is needed, the random algorithm simply selects any page frame at random, dumps the page from that frame onto disk, and then uses it. If the memory reference pattern of the application program is very irregular, for example, as occurs during some list processing applications, it may even be the best solution.
- FIFO or first-in first-out replacement is an obvious policy, where the page that has been in main memory the longest is the one replaced. In practice, with most applications, this turns out to be as bad as random replacement. The fact that a page has been in memory for a long time does not always imply that it will not be needed again!
- LIFO or last-in first-out replacement is included in this list only for the sake of completeness. This policy is among the worst possible policies! The page most recently brought into main memory is frequently the one that will be needed next, and sending it back to disk is almost always a very bad idea.
- LRU or least-recently-used replacement is usually the best choice. This policy selects for replacement the page in main memory that has not been used for the longest time. If a page has not been recently used, it is reasonably likely it will not be needed anytime soon. Unfortunately, LRU replacement requires that the MMU perform some fairly complex record keeping, for example, by keeping a constant record of when each page was accessed.
- Clock replacement is almost as good as LRU replacement. This scheme divides the set of pages in main memory into those that have recently been used, and those that are less recently used, and replaces one of the less recently used pages, using an algorithm to be discussed later. The MMU needs to maintain only one bit per page frame to support this algorithm.
- Belady's optimal page replacement policy is unimplementable because it requires predicting the future. This policy orders all pages in main memory by how far into the future they will be needed, and replaces that page that will not be needed for the longest time into the future. Retrospective analysis of execution traces of real programs shows that LRU replacement usually comes surprisingly close to Belady's scheme. That is, how recently a page has been used is generally a good predictor of how likely it is to be needed again.
LRU Replacement
The best practical page replacement policy is to replace the Least Recently Used page, that is, the page which has not been referenced by the CPU for the longest time in the past.
Hardware Support Needed for LRU
This requires special hardware to track every memory access, for example, to put a timestamp on each page frame.
```
      A Page Table Entry
       _________________________________
      |__________|______|_______________|
      |  Serial  |Rights      Frame     |
      |  Number  |                      |
      |          |                      |
      |   added  |   original fields    |
```
We use the serial number (time-stamp) to record when each entry is used, and when there is a page fault, we search for the smallest (or oldest) valid page-table entry (that is, one for a page in main memory) and select that for replacement. Either or both of the hardware and software required for this are likely to be expensive.
Why does LRU work well?
Demand paged virtual memory works, that is, it offers performance comparable to main memory performance, at a price comparable to secondary memory prices because programs exhibit what is called a working set. The working set of a program is the set of pages it needs, at some instant, to make progress. If enough page frames are available to store the entire working set, the program will execute almost as quickly as it would if it was executing out of main memory with no paging.
In practice, most programs exhibit a well defined working set that changes only slowly with time. That is, they exhibit strong locality of reference. The locality property involves two components: In programs that exhibit strong temporal locality, the fact that a memory location has recently been referenced strongly suggests that it will soon be referenced again. In programs that exhibit strong spatial locality, the fact that some memory location has recently been referenced strongly suggests that nearby locations will soon be needed. Squential execution of programs and sequential processing of arrays lead to spatial locality, while iterative control structures lead temporal locality for both code and data.
The LRU policy directly addresses the issue temporal locality, and the organization of memory into pages that contain more than one consecutive memory location addresses spatial locality. Programs with good locality have well defined working sets, while programs with poor locality have poorly defined working sets.
Belady's optimal page replacement algorithm is optimal because it examines the future to make page replacement decisions. The LRU policy, in contrast, uses the predictions it can make by assuming that the program has good locality, and usually, these are good predictions.
LRU is usually considered to be optimal among practical algorithms, although there are contexts where some anticipatory paging can lead to marginal improvements.
MULTICS on the GE 600 used true LRU replacement, back in the late 1960's, but few systems since that time have used true LRU. The MULTICS implementation used a special high-speed core memory to implement its cache of all currently available page frames. The Following sequential algorithm was used by the address translation unit, implemented in fast hardware:
```
   I = 0
   Fetch entry(I) into temp
   while I < cache_size
   and temp.page /= virtual_address.page do
       I = I + 1;
       exchange temp with entry(I)
   endloop
   if temp.page = virtual_address.page
       Store temp into entry(0)
   else
       Report a page fault for virtual_address.page
       Report that temp.frame is the least recently used page frame
       Copy the map entry from main memory into cache entry(0).
```
This keeps the entries in the cache ordered by recency of use. This ordering guarantees that the cache will be fast for references to recently used frames (those that are likely to be frequently used) while it is slow for frames that weren't recently used (those that are unlikely to be frequently used).
The MULTICS scheme is an elegant historical curiosity, of only limited practical value today because it is predicated on core memory technology.

MULTICS was the first design for a large-scale timesharing system, developed with federal funds at Project MAC at MIT starting in the mid 1960's. The original development involved a consortium involving MIT, GE and Bell Labs, but Bell Labs dropped out of the project, and Honeywell bought GE's computer business. In the 1970's, MULTICS became one of two standard operating systems offered on Honeywell mainframes.
MULTICS was originally conceived of as a computer utility. The idea was that such a utility would be operated in much the same way as the telephone and power utilities, but instead of selling power or audio communication services, the computer utility would sell CPU time and disk storage to its customers. Today, this isn't a radical idea -- there are many computer utilities, ranging from national services like AOL to smaller local providers. The concept was a radical one in the mid 1960's, and this idea drove a number of technical developments.
Clock page replacement
The clock algorithm allows a good approximation of LRU replacement, although if there are two few page frames available, it degenerates to FIFO replacement.
Clock replacement requires that there be one bit per page frame to record whether that page has been recently referenced. This bit is set by hardware when the page frame is referenced, and it is reset by the page-fault service routine, in the process of deciding what page frame to replace.
To select a page frame to replace, the page fault service routine software sweeps through the page frames, resetting the referenced bit on each page frame and stopping when it finds a page frame with this bit already reset. The page in this frame is the one that is replaced (copied out to disk to make an empty frame).
The clock algorithm is usually pictured with a clock face, where each minute on the clock face corresponds to a page frame. The clock hand remains fixed except during the execution of the page fault service routine. When a page fault occurs, the clock hand begins its sweep. The hand stops as soon as it sees an unmarked page frame. The clock hand removes the marks from each page frame it sweeps over.
Data Structures and code for the Clock Algorithm
The Clock Algorithm requires a frame table, that is, a table indexed by physical memory page frame number and containing, for each page frame, the virtual address or page number of the page in that frame, along with at least one status bit, the mark bit.
```
                          Frame Table
                        ________________
                       |_X__|___________|
                       |____|___________|
  Clock Hand -         |____|___________|
               \       |____|___________|
                 ----->|____|___________|
                    |  |_X__|___________|
     Sweep Direction|  |_X__|___________|
                    |  |____|___________|
                    V  |_X__|___________|
                       |mark| what page |


	find page to replace:
           while frame_table[clock_hand].mark is set do
              reset frame_table[clock_hand].mark
              clock_hand = (clock_hand + 1) mod table_size
           -- now, frame_table[clock_hand].what_page
           -- is the page number to be written back to disk,
           -- and clock_hand is the frame number to use with
           -- the page being brought in from disk.
```
Modified Clock Replacement
It turns out that the clock algorithm can be modified to work without a hardware supported referenced bit. The first large-scale use of this modification was in the DEC VAX family of computers, but many modern RISC systems also take advantage of this to simplify their memory management units.
As the clock hand sweeps by a page frame, all page table entries referencing that frame are marked as invalid. In fact, the page is still in memory, but the memory management unit has been told that it is not there. This combination, where the page is in memory but the page-table entry for that page is marked as if it is not, is used to mean "not referenced".
When the CPU references an invalid page, there is a page fault. If the page is actually in memory, this is called a soft page fault. The page table entry is simply changed from invalid to valid. This combination, page in memory and marked as such, is considered to mean "referenced".
Modified Clock Page Table Entry
```
       _____________________________
      |______|______|_______________|
      | Real |Rights      Frame     |
      |Rights|                      |
      |      |                      |
      | soft |   hardware fields    |
      | bits |                      |
```
Real Rights = Rights = accessable
Marked. This page is not a candidate for clock replacement; it has been recently used.
Real Rights = inaccessible; Rights = accessable
Not Marked. A candidate for replacement. Faults involving this page are "soft"; this means, the page should be marked.
Real Rights = Rights = inaccessable
Paged out! Faults involving this page are "hard" -- page replacement and disk I/O are needed to resolve this fault.
Real Rights = accessible; Rights = inaccessable Strange! This combination could perhaps be used by software for something, but it is not clear what. Access to this page is legal and will not cause a fault.
Dirty Pages
A dirty page is a page that has been modified since it was copied from disk to main memory. Dirty pages must be copied back to disk when the frames that hold them are preempted to hold other pages. Clean pages, those that are not dirty, need not be copied back to disk if original copy of the page on disk is still available. This can cut the time needed to service some page faults in half.
Tracking clean and dirty pages requires one bit per page frame; typically, we assume that this bit is set to mark the page as dirty whenever there is a write operation on that page, and we let the software reset this bit whenever it writes the contents of a page frame to disk. The latter operation is called cleaning the page frame.
This can be combined with the clock algorithm as follows:
As the clock hand sweeps by dirty page frames they are scheduled for cleaning; the actual write operations to disk take place in parallel with computation and are typically done in whatever order is convenient for the disk interface hardware and software.
The clock hand only stops its sweep when it encounters a page frame that is both unreferenced and clean.
This scheme (modified clock replacement) is typical of modern virtual memory operating systems. This scheme requires minimal hardware support and yet delivers performance comparable to LRU replacement.
If the computer does not include hardware to record the fact that a page is dirty, this may be done by software by marking clean pages as read-only. When a write operation is made to a read-only page, the resulting soft page fault marks the page as read-write, which is interpreted to mean that it is a dirty page.
Some implementations of such schemes are enhanced with anticipatory paging -- for example, when there is a page fault for page i, they might bring in pages i, i+1 and i+2. This assumes that a program that references page i is highly likely to use the following pages. Since this assumption is not always a good one, such systems usually provide a way to turn off the anticipation logic in the page replacement algorithm. Anticipatory paging works very well for programs that use numerous large arrays and it provides few benefits for programs dominated by complex linked lists jumbled into the heap.
Exercise: Outline a procedure to be called from the page-fault service routine that will search the set of page frames and select a page to replace. Do this for the simple clock algorithm, the clock algorithm with clean-dirty page management, and each of the above with memory management unit hardware that does not mark used and dirty pages. Before you do this, you may want to solve the following:
Subsidiary Exercise: What are the states of a page-frame.
Subsidiary Exercise: The clock algorithm cyclicaly sweeps through the table of all page frames. This is not the same as the address map of the virtual address space! Describe the relationship between these two data structures!
A digression into architecture
Memory management units are fairly simple to design only if the entire page table is held in the MMU and if the MMU does not support the mark and dirty bits needed by the clock algorithm. The latter fact explains why, starting with the DEC VAX architecture, many modern systems have omitted hardware support for the clock and dirty bits, forcing the use of software solutions to these problems.
If the entire page table cannot fit into the MMU, it is common for the MMU to use a cache memory, so that the most recently used page table entries are in the MMU while the entire page table is in main memory. This is quite effective, except that cache design is itself a fairly messy business. As a result, some MMU designs in use today take this one step further; the MMU cache, called the translation lookaside buffer, is simplified so that, if the desired page table entry is not in the TLB, the MMU issues a page fault! Thus, the page-fault service routine is responsible for both MMU TLB management and page replacement in main memory.
This introduces a new kind of soft page fault, where the desired page is in main memory but the desired page table entry is not in the TLB, so the page fault service routine merely loads the page table entry into the MMU and returns, with no actual paging activity required.
You can find an example MMU specification illustrating this extremity in the fictional Hawk architecture, described in
http://homepage.cs.uiowa.edu/~dwjones/arch/hawk/overview.html#mmu