The key questions for any resource management system are:
This is not trivial! On most systems, resources are allocated to processes, but this is not universal!
If the process is the object to which all allocation is attributed, then the process becomes a very heavy object, and changing processes becomes expensive. As a result, on systems where this is done, the system's notion of process is frequently augmented with lightweight processes, to which no resources other than the CPU are allocated. These are called threads.
On the old Berkeley Timesharing System, in the late 1960's, resources were allocated to users, and each interactive user could have many processes running on their behalf, all in the same virtual address space (this is the earliest known implementation of the idea we now know as threads. This has its own disadvantages -- processes that disagree about the use of some location in the address space they share will not function correctly.
Aside: The word process originally always meant a locus of control, or some entity in a system that could execute a sequential program. Another good definition (though not usually used) is that each process runs on a virtual CPU created by the scheduler.
This definition has been largely forgotten in the UNIX community, where it is common to use the word process to refer to the unit of activity to which allocated resources are bound, even if that unit contains many lightweight threads, each of which is technically a process in the older sense of the word. For now, we must live with the resulting confusion.
Memory can be allocated by the byte, the word, the page, or the segment. In any particular context, the system typically only provides one unit of allocation, and it is up to the user or the applications library to manufacture the desired units of memory out of those allocated by the system.
Bytes and Words are typically the smallest addressable units of memory. Allocating by the byte or word is quite rare, except in systems where the word is sufficiently large to accomodate two pointers. In that case, randomly allocated words are sufficient to build arbitrary binary-linked data structures, which in turn, can be used to mimic any other data structure. This is the foundation of the LISP worldview.
Pages are fixed sized blocks of memory, with the size typically determined by some characteristic of the addressing hardware on the machine. Because the page size is determined by the hardware and not by the application, it is generally necessary to construct large logical objects out of multiple pages, or to divide single pages into multiple logical objects. Typically, the addressing hardware allows multiple pages to occupy logically contiguous addresses, so that large objects can easily span many pages.
Segments are variable sized blocks of memory, with the size typically determined by some characteristic of the application. Segments are typically supported by hardware, while otherwise identical blocks of memory that are purely software artifacts are usually referred to simply as Blocks. Each block or segment typically contains one logical object, but sometimes, the economics of allocation are such that the application sub-allocates a block or segment.
Blocks of memory requested by some application are allocated in main memory? Is this right? No!
Only in the simplest of systems is it common for blocks of memory requested by users to be allocated directly from main memory. More usually, only internal data structures of the operating system are allocated there. Instead, users allocate their data structures in memory segments allocated to them by the system.
Why? Because protection mechanisms frequently limit the system in what it can give the user, because users are executing programs in a virtual address space, not physical, and because the system's goal of reclaiming memory from the user is simplified if the user is given one big chunk instead of many little ones.
This question has another interpretation: Within any given address space from which memory is being allocated, where should it be taken. The answer is trivial if all but the requested amount has already been allocated. Another trivial case is when each block of allocatable memory is identical and interchangable with all others, as is the case when the unit of allocation is the page or any other fixed size block.
A final aspect of this question arises in machines with non-uniform memory access times (so-called NUMA architectures). On such machines, it is important to try to allocate memory "near" the primary user of that memory, where distance is measured in terms of expected access time.
Allocation for all time may be different from temporary allocation, and some systems allow preemption of memory, either with the cooperation of the user to which the memory was allocated, or without notice.
Almost all preemptable resources are used for implementing virtual resources. Preemptive schedulers create one virtual CPU per process. Demand paged storage managers preempt page frames with every page fault in order to create virtual address spaces. Window managers preempt regions on the screen to support different overlapped virtual screens (windows).
Non preemptable resources exist at the bottom level in each such virtual resource hierarchy. The lowest level backing store in a virtual memory environment cannot easily be preempted. The (possibly virtual) memory used to implement a window manager is similar, as is the memory used to hold the states of the ready and waiting processes under a process manager.
Note that there are many different types of virtual resources in the above example, but all are implemented in terms of the same underlying non preemptable resource. This is a common pattern.
Why not! It is up to the user to decide why they want a resource. Operating systems have no basis for asking this question, although it should be asked by program designers.
Can the user wait? Can the user give advance notice? Must the user give notice? If the resource is allocated on demand, deadlock may result.
With preemptable memory resources, such as page frames, demand allocation is common, but performance can frequently be improved by either forcing some users to wait before their demands are met, or by asking users to provide advanced notice of their expected demand.
Consider the simple case of allocating blocks of physical memory to system or user code on a system with no protection mechanisms. The same algorithms used in this context are frequently used for new and dispose (Pascal) or malloc and free (C), to allocate and deallocate space in a heap that has been allocated by the system to a single application program.
There are a huge number of algorithms for solving this problem; only two will be given here, binary buddy system and a boundary tag method.
Note that memory management was once a central problem on operating systems, when user processes shared real memory with each other. In modern uniprocessors, it may seem far less critical because user code runs in virtual memory, and allocation and preemption of page frames is relatively trivial.
The use of virtual memory may seem to push the burden of memory management into the user processes, where it shows up either as an application programming problem or as part of the language run-time system, for example, in the implementation of new and dispose or malloc and free in Pascal and C, or as the garbage collector underlying an implementation of LISP, Smalltalk or Simula.
Of course, systems must manage the backing store, but this is typically allocated in units of interchangable disk tracks or sectors. There are performance penalties for allocating disk space poorly, but the problem is one of performance, not absolute necessity. Simple allocation heuristics such as allocating that free sector closest to the other allocated sectors of the same file (or virtual address space) can give excellent results.
In fact, much of the message of the above paragraphs is incorrect. When an application chooses the wrong allocation algorithm, it can paralyze the system. On some UNIX systems, for example, the system must be periodically rebooted because the X-windows window manager, an applications program, consumes all of the available virtual memory. The problem is severe external fragmentation within the memory allocated to the window manager by the system.
Fragmentation is the term used to describe how the free space in a pool from which memory is allocated is slowly broken up into many small pieces. There may be hundreds of words of free space, and yet a request to allocate a ten word block of memory may fail because there are no ten consecutive free words!
Overhead is the term used to describe additional memory that must be allocated, above and beyond that requested by the users, in order to provide for management of the pool of memory. For example, it may require a few words of memory to describe the location of each allocated memory block.
The consequence of fragmentation and overhead is that, although the total user demand for memory may be less than the number of words available, the system may be unable to meet that demand because the largest available block of free memory is smaller than the smallest user demand that has yet to be satisfied.
Fragments of free space, and fragmentation in general, comes from two sources:
Internal fragments are fragments of free space that are allocated to users even though the user does not request them. For example, on many modern RISC processors, each allocated block of memory must be aligned on a word boundary. If a user wants an odd number of bytes, this will require rounding up the user request to the next word boundary, allocating space to that user that the user does not want.
External fragments are fragments of free space between allocated blocks. External fragmentation can be eliminated by a compacting garbage collector; one that can move blocks of memory around in the heap in order to consolidate the fragments.
Allocate fixed size blocks with a size greater than the maximum possible user request.
This leads to awful internal fragmentation unless all user requests are for the same sized block. Allocation and deallocation are very fast, requiring a single linked list, free-list, to keep track of free blocks.
If there are a variety of block sizes being requested, this algorithm can be improved by using a separate free list for each popular block size. The problem with this is that it introduces a new fragmentation problem, there may be plenty of free space, but only in sizes that don't match the user's request. If blocks of one size can be combined to make larger blocks or broken up to make smaller blocks, this problem is solved. The result is a class of algorithms known as buddy systems.
In the simplest of the buddy systems, all block sizes are powers of two. Any large block may be split in half to make two smaller blocks, and any two blocks that were originally split from the same parent may be combined to reconstruct that parent.
The central data structure used in all buddy systems is an array of free-lists:
Array of heads of free lists ______ ______ ________ ---->|_____o----->|______| 128K |________| | | 64k | | 64k | 64K |_______o---- | | | | 32K |_______o------ |______| |______| 16K |________| | ______ ______ 8K |_______o---- -->|_____o----->|_____o--- 4K |________| | | 32k | | 32k | | 2K |________| | |______| |______| | 1K |________| | ______ ______ | 512 |________| | |______|<-----o_____|<- 256 |________| | | 32k | | 32k | | |______| |______| | ______ ______ ---->|_____o----->|______| |___8k_| |___8k_|Note that the above picture shows the state of a buddy system storage allocator with an array of free lists that allows for blocks of size up to 128k bytes. The system currently has two blocks of size 64k bytes, 3 blocks of size 32k bytes, and 2 blocks of 8k bytes. When a block is in use, the block is entirely available to the user. When the block is free, the first few bytes of the block are used to hold a pointer to the next free block of the same size.
The minimum block size is determined by the size of a pointer. If a user asks for less than this, there will be internal fragmentation.
The number of table entries required is determined by the maximum and minimum block sizes. In the above example, the maximum was 128k, 217, and the minimum on a 32 bit machine would typically be 4 (22); in this case, the table itself requires 17-2 or 15 table entries. This is the storage overhead of the binary buddy system.
to allocate a block of size n: pick i such that 2(i-1) < n <= 2i if freelist[i] not empty return the first element of freelist[i] to the user and set freelist[i] to point to the next element else allocate a block of size 2(i+1) (this is a recursive) and split it in half. Return half to the user and put the other half on freelist[i]. end
Aside: From this it can be seen that if blocks are allocated with random sizes, typically 3/4 of each block will be used and 1/4 will be wasted to internal fragmentation.
to deallocate a block of size n: pick i such that 2(i-1) < n <= 2i compute b, the buddy of the deallocated block. if b is already in freelist[i] remove b from freelist[i]. combine b and its buddy, and deallocate the resulting block of size 2(i+1) (this is recursive) else add the deallocated block to freelist[i].
Note: The above shows only one option, called the prompt merger variant of the buddy system. In the absence of real-time requirements, heap managers sometimes perform better if merger of free blocks is deferred until it becomes necessary. In that case, deallocate is trivial, but in allocate, if no free block larger than the desired size is found, all smaller free lists must be scanned to combine buddies in the hopes of building a large enough free block. This variant is called a lazy heap manager.
Finding the buddy of a block i of size s is easy in the binary buddy system:
buddy(i) = i xor s
The Fibonacci buddy system is an alternative where block sizes are Fibonacci numbers. This reduces internal fragmentation by providing for a larger variety of free block sizes.
The Fibonacci numbers are 1 1 2 3 5 8 13 21 ....
Finding buddies of a block to merge is harder with the Fibonacci buddy system.
Boundary tags are data structures on the boundary between blocks in the heap from which storage is allocated. The use of such tags allows blocks of arbitrary size to be used -- the tag describes, for each block, how big it is, its status, and its neighbors:
HEAP________ | used | | | | | |________| | used | |________| | free | | | |________| | used | |________| | free | |________|Aside The consequence of allocating arbitrary sized blocks on the heap is that the bookkeeping needed to split blocks, merge adjacent free blocks, find the right size block, and find the neighbors of a block all become more complex.
The decision to use a boundary tag method poses many choices. Among these are the lazy versus prompt merger decision that was present with the buddy systems. Another important choice is that between best fit and first fit allocation: Should the first block found be allocated to the user, or should a more exhaustive search be made for the best fit.
Intution suggests that best-fit allocation should reduce fragmentation, but in fact, if an exact fit is unavailable and there is a wide variety of commonly requested block sizes, empirical evidence shows that first fit generally works better because it produces larger fragments that are more likely to prove useful later.
It is very important to note that in most heap management environments, an exact fit is highly likely to be available! Most applications repeatedly allocate and deallocate objects of only a few sizes, and a good storage manager should take advantage of this. On the other hand, it is important to remember that other applications frequently encounter genuinely random sequences of allocations. One example of such a random sequence of requests comes when opening a succession of windows to view images, where each image was cropped (trimmed to size) to eliminate distracting or irrelevant content. Some web pages are full of such randomly sized images, and viewing such pages can lead to significant heap fragmentation.
Much of the early research in the area of heap management was done by the Burroughs Corporation in the early 1960's, but research in this area continues. Some of the best recent papers in the area were published in a the Proceedings of the ACM SIGPLAN International Symposium on Memory Management, published as SIGPLAN Notices, Vol 34, No 3, March 1999. The most relevant paper in this proceedings is "The Memory Fragmentation Problem: Solved?" by Johnstone and Wilson, pages 26-48.
|_______| |_______| <-- | | |-- Block boundary tags | | | give | block | | 1) Block status | | | (used or free) | | | 2) Block size |_______| | 3) Addresses of neighbors |_______| <-- | | (2 & 3 are correlated!)Aside: With the buddy system, the heap manager needed no record of the blocks allocated to users, but the arbitrary sized blocks used in boundary tag methods require that the manager keep a record of each block in the heap, whether it is allocated or free. These records are easily maintained in the heap itself, in the form of a prefix or suffix on each block. From the user's point of view, these prefixes and suffixes are not part of the block -- they are part of the overhead of the storage manager.
When the heap is viewed as a sequence of blocks, the suffix on one block is adjacent to the prefix on the next block, and every boundary between blocks, whether they are used blocks or free blocks, contains a prefix-suffix pair. This pair is the boundary tag from which the entire class of storage allocation methods takes its name.
For each block, the boundary tag must contain the status of that block so that, at the appropriate time, it can be determined whether the block is free, available for allocation or merger with another free block, or in use, and therefore unavailable. Another piece of necessary information is the size of the block, so that the allocation algorithm can determine whether the block is big enough to satisfy some allocation request. Finally, when it is necessary to merge free blocks, it must be possible to find the neighbors of a free block in order to see whether they are free.
The size of a block and the addresses of the neighbors are redundant, since if you know the address of the following block and the size of the boundary tag fields, you can compute the size of a block, and if you know the size of a block, you can compute the address of either end if you are given the address of the other end.
First Alternative Second Alternative | | | ^ |________| | | | | |__size__| ->->|________| | | | | |_status_| | | block | | |__next_o-- |________| | |__prev_o---- |__size__| | | | | |_status_| | | | v | |
There are two popular ways of storing the information needed in the boundary tags. In the left method shown above, the boundary tags are arranged into a doubly linked list. Each tag contains a pointer to the next tag and to the previous tag, and each tag contains the status of the block it immediately precedes. As shown in the illustrations, all pointers point to the first word of a block in the heap, so positive offsets from the pointer are indices into the block and negative offsets are indices into the boundary tag. This convention is convenient but not essential.
The right method shown above records the size of each block as a prefix and suffix on that block, and includes the status in either the prefix or suffix or, on some systems, both (above, it is shown in the prefix).
^ | | In the worst case, | | free | free blocks must | |________| be arranged into --o_______| a doubly-linked -----o_______| free list. This | ^ |////////| tag allows fast | | | | allocation and eager | | | used | merger. | | |________| | | |////////| tag For some other | | | | choices of allocation | | | free | method, a simpler | | |________| free list structure | --o_______| will suffice. -> --o_______| | |////////| tag v | |
The doubly linked free list is maintained in the free blocks themselves, so the minimum free block size is two pointers, typically one word each. The boundary tag itself requires two pointers or two block sizes (typically, the block size is the same size as a pointer), and one status bit (typically a whole word), so in order to allocate a two word block on the heap, it is typically necessary to find a total of five free words.
The above argument implies that boundary tag methods cannot eliminate internal fragmentation! When a user requests a block of size X and the only block available is size X + E (E is the extra part), then the available block must be split into a block of size exactly X and a block of size E - T (where T is the size of a boundary tag). If E - T is smaller than needed to store the pointers needed to link it into the free-list, then the split cannot be done. For the example data structure shown above, E must be at least 5 words, so there will be occasional internal fragmentation involving fragments of 4 or fewer words.
In the extreme case, no free list is needed. When a user wishes to do allocation, a linear search over the list of boundary tags will always manage to find a free block, but if the heap is largely full, this search will be unnecessarily slow as it examines all of the allocated blocks in the heap in the search for a free block of the right size.
If free block merger is lazy, then the merger can be done by a linear scan of the entire heap, rebuilding the free list as free blocks are found and merged. In this case, the free list can be singly linked.
If free block merger is done on an eager basis (at deallocate time) or on an eager basis (at allocate time) during the scan of the free list, where the free list is randomly linked with no attention to the addresses of blocks in the heap, then double linking is needed because when two adjacent free blocks are merged, one has to be removed from the free list and without a double link, it's successor in the list cannot be found.
The so-called fastest fit storage allocation algorithm uses a simple heap manager with a singly linked free list as its foundation and lazy block mergers; free blocks of popular sizes are stored in secondary free lists, one for each popular block size. Allocations for commonly used sizes check the appropriate secondary free list first, before using the underlying heap manager. Deallocations of common block sizes simply place the blocks in the secondary list (uncommon sizes can be returned to the main free list).
Only when an allocation request to the underlying heap manager cannot be satisfied are free blocks merged -- this is done by scanning the entire heap and rebuilding the main free list, leaving all the secondary free lists empty.
The fastest fit algorithm gives what may be the best average-case performance, but its worst case performance is not good. For the worst case, first-fit is better! This is one of many examples of situations where the best algorithm for real-time use is not the same as the best algorithm for minimizing the total run-time!