What is a System Call

Part of 22C:169, Computer Security Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

General

If you look at the Unix programmer's reference manual, you will find that the documentation for system calls and for subroutines in the standard library looks substantially the same. Consider the sbrk system call:

NAME
       sbrk - change data segment size

SYNOPSIS
       #include <unistd.h>
       void *sbrk(intptr_t increment);

Contrast this with the interface to malloc in the standard C/C++ library:

NAME
       malloc - a memory allocator

SYNOPSIS
       #include <stdlib.h>
       void *malloc(size_t size);

By design, C and C++ programmers can call either of these routines with essentially identical syntax. The following fragment illustrates this:

       void * a = malloc( 64 );
       void * b = sbrk( 64 );

Here, the variables a and b are declared identically, and the results of the two calls are two practically interchangeable 64-byte blocks of memory. The only difference is that the block allocated by malloc can be freed later, returning it to the heap, while the other block is outside the heap manager's control and thus difficult to reclaim for other purposes. Mixing the two can lead to other problems because the heap manager may not be able to expand the heap if the programmer has also directly allocated storage using sbrk.

In point of fact, malloc() is an example of middleware, part of the C standard library, while sbrk() is a system call. Every byte of memory that is returned to the application using malloc() was originally allocated by a call to sbrk() made within the system. The middleware adds functionality. Specifically, storage allocated by sbrk() is always added at the end of the static segment. It can be deallocated by shrinking the static segment, but this means that the net effect is LIFO allocation, akin to using a stack. In contrast, storage allocated by malloc() may be deallocated using free(), and allocated memory blocks may be deallocated in any order. A pool of storage allowing this is called a heap, and the software that manages the heap, malloc() and free() in this case, is called a heap manager. It is up to the heap manager to maintain a data structure to organize the free space under its control (usually called the free list) and to efficiently reuse previously freed memory so that the total size of the static segment remains small.
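
The LIFO behavior of sbrk() can be observed directly. The following is only a sketch, assuming a Unix-like system that still provides sbrk() (some systems mark it as a legacy interface) and a single-threaded program in which nothing else moves the end of the segment between the calls. The helper name grow_and_shrink is hypothetical.

```c
#define _DEFAULT_SOURCE   /* ask glibc to declare sbrk() */
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: grow the data segment by n bytes, then shrink
   it again.  Returns 1 if the end of the segment is back where it
   started, 0 if anything went wrong. */
int grow_and_shrink(intptr_t n)
{
    void *before = sbrk(0);        /* current end of the data segment */
    void *block  = sbrk(n);        /* returns the OLD end: start of the new block */

    if (block == (void *) -1)
        return 0;                  /* the system refused to grow the segment */
    if (block != before)
        return 0;                  /* new storage always begins at the old end */
    if (sbrk(-n) == (void *) -1)   /* give the block back, last in, first out */
        return 0;
    return sbrk(0) == before;
}
```

Only the most recently added storage can be returned this way. Releasing a block from the middle of the segment requires a heap manager such as malloc() and free().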

But the question we are interested in here is not "what functionality does malloc() add?" Rather, we are interested in how a call to a middleware routine in the system library differs from a system call. Put simply, a routine in the system library runs in the same protection domain as its caller, while a system call transfers control into a different protection domain.

Protection domain: The set of all operations on resources accessible to a particular entity (such as a user, program, or subroutine). The domain may include such things as the right to read a particular file, to write a particular variable, to call a particular subroutine, or manage a particular resource. In programming language semantics, the concept of a protection domain is closely related to the concept of scope.

User programs have very limited protection domains, while parts of the operating system need unlimited access to various system resources. The file system needs unlimited access to the disk in order to create files. The memory manager needs unlimited access to physical memory and to the memory management unit in order to create address spaces for user processes, and so on.

Kernel versus User Modes

The simplest and most widespread hardware support for protection mechanisms involves adding a single flipflop to the CPU, the protection mode flipflop. The state of this flipflop defines two operating modes for the CPU:

User mode or unprivileged state: In this state, any attempt to manipulate the memory management unit or access input/output devices or to change the CPU's protection mode will result in a trap. Thus, a program in user mode cannot alter its protection domain.

Kernel mode or privileged state: In this state, all operations are legal. A program operating in kernel mode can freely alter its protection domain.

On systems implementing this model, when the hardware detects an interrupt or trap condition, it saves not only the program counter and other registers, but also the value of the protection mode flipflop, sometimes called the protection state. On return from trap or return from interrupt, the CPU provides a way to restore not only the program counter and other registers but also the protection mode. On such a machine, therefore, all traps that occur while a program is in user state cause a change of protection domain.

Some of these traps, of course, signal program errors. Illegal instruction traps might be the result of programs that accidentally attempt to execute random data. Memory addressing traps are a typical result of attempting to follow undefined pointers.
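
The trap and return-from-trap behavior described above can be modeled in a few lines of C. This is a simulation under simplifying assumptions, not a description of any real CPU: the structure, its field names, and the single save slot for the old program counter and mode are all hypothetical (real hardware typically saves this state on a stack or in dedicated registers).

```c
#include <stdbool.h>
#include <stdint.h>

enum mode { USER_MODE, KERNEL_MODE };

/* Hypothetical simulated CPU state; names are illustrative only. */
struct cpu {
    uint32_t  pc;          /* program counter */
    enum mode mode;        /* the protection mode flipflop */
    uint32_t  saved_pc;    /* where the trap hardware saves the pc */
    enum mode saved_mode;  /* where it saves the protection state */
};

/* What the hardware does on any trap or interrupt. */
void trap(struct cpu *c, uint32_t handler_address)
{
    c->saved_pc   = c->pc;
    c->saved_mode = c->mode;
    c->mode = KERNEL_MODE;         /* handlers run in kernel mode */
    c->pc   = handler_address;
}

/* Return from trap: restore both the pc and the protection mode. */
void return_from_trap(struct cpu *c)
{
    c->pc   = c->saved_pc;
    c->mode = c->saved_mode;
}

/* A privileged operation, such as touching the memory management unit.
   It is carried out in kernel mode and causes a trap in user mode.
   Returns true if the operation was carried out. */
bool privileged_op(struct cpu *c, uint32_t violation_handler)
{
    if (c->mode == USER_MODE) {
        trap(c, violation_handler);
        return false;
    }
    return true;
}
```

Note that nothing in this model lets user code set the mode directly. The only path into kernel mode is through trap(), which also forces the program counter to a handler address chosen by the system.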

Implementing System Calls

Once we have a trap mechanism such as was described above, we have a mechanism that can be used to transfer control from a user program, running in user mode, to part of the operating system, running in kernel mode.

If you think of the protection mode bit in the CPU (and its associated semantics) as creating a fence between the current user's protection domain and the operating system's protection domain, then the trap mechanism can be thought of as creating a gate through this fence.

Gate Crossing Mechanism: A gate crossing mechanism is a mechanism allowing a process to pass through the gate between one protection domain and another. The existence of such mechanisms is essential to implementing system calls but their existence also creates the potential for serious security breaches.

The most common implementation of system calls involves setting aside some particular trap or traps and using them as system calls. Consider these examples:

Illegal instruction trap as a system call: If the machine has any illegal instructions, simply define one of these, arbitrarily, as a system call. So, if the opcode F3₁₆ is not defined by hardware, let the illegal instruction trap handler check the opcode of the instruction that caused the trap and, if it detects this particular opcode, interpret it as a system call.
Illegal memory address as a system call: If some page of the address space is guaranteed never to be mapped into the virtual address space of user processes, simply define all memory references to that page as system calls. Since it is extremely desirable to trap references to memory location zero, the referent of null pointers, it is natural to leave page zero of the address space unmapped. Loads and stores to fields of objects pointed to by null pointers should not cause system calls, but jumps to fields of objects are essentially never done, so the illegal memory address trap handler can inspect the opcode and the address that caused the trap, and interpret it as a system call if the opcode was a jump and the address was a nonzero address in page zero.
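
The decisions these two trap handlers must make can be sketched as follows. This is a simplified model, not code for any real machine: the function names, the one-byte opcode, and the 4096-byte page size are assumptions made for illustration. Only the opcode F3₁₆ comes from the example above.

```c
#include <stdbool.h>
#include <stdint.h>

#define SYSCALL_OPCODE 0xF3     /* the undefined opcode chosen above */
#define ZERO_PAGE_SIZE 4096u    /* assumed size of the unmapped page zero */

enum trap_result { IS_SYSTEM_CALL, IS_PROGRAM_ERROR };

/* Hypothetical illegal instruction trap handler.  memory models the
   user's address space, trap_pc is the address of the instruction
   that caused the trap. */
enum trap_result illegal_instruction_handler(const uint8_t *memory,
                                             uint32_t trap_pc)
{
    uint8_t opcode = memory[trap_pc];
    if (opcode == SYSCALL_OPCODE)
        return IS_SYSTEM_CALL;
    return IS_PROGRAM_ERROR;    /* any other illegal opcode is a bug */
}

/* Hypothetical illegal memory address trap handler.  was_jump says
   whether the trapping instruction was a jump, address is the address
   it referenced. */
enum trap_result illegal_address_handler(bool was_jump, uint32_t address)
{
    /* jumps to nonzero addresses in page zero are system calls */
    if (was_jump && address != 0 && address < ZERO_PAGE_SIZE)
        return IS_SYSTEM_CALL;
    /* loads, stores, jumps to zero, and all other addresses are errors */
    return IS_PROGRAM_ERROR;
}
```

In both handlers, the test that separates a system call from a program error runs on every such trap.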

System calls are conceptually subroutine calls, so once the trap handler has determined that a system call is taking place, it simply calls the code of the appropriate system call, and then, on return, does a return from trap.

The system call mechanisms outlined above require significant bits of code to determine whether or not a particular trap is a system call or just an error on the part of the application program. Some (but not all) hardware designers have been aware of this problem since the 1960s and have invented a variety of mechanisms to speed up system calls:

A reserved system call instruction: Sometimes, the hardware designers pick some particular illegal opcode and reserve it for system calls. Other illegal opcodes simply raise the generic illegal instruction trap, but the designated system call opcode raises a different trap, the system call trap. This eliminates the need for special code to inspect the opcode in order to determine if there is a system call.
Automatic system call vectoring: Most systems need many system calls (the list of kernel calls under Unix or Linux is long), so it is common for the system call instruction to include a parameter. Response to system calls is slowed by the need to interpret this parameter in order to determine which system call to make, so some systems include system call trap mechanisms where the hardware automatically checks this parameter and uses it to select a trap handler out of a vector of trap handlers, a unique handler for each system call.
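
Vectoring amounts to indexing a table of handlers with the parameter of the system call instruction. The sketch below does this in software; hardware with automatic vectoring performs the equivalent lookup itself. The two handlers are placeholders invented for this example and do not correspond to any real system call.

```c
/* Every handler in this sketch takes one argument and returns a result. */
typedef long (*syscall_handler)(long arg);

/* Placeholder handlers, invented for illustration. */
static long sys_double(long arg)  { return 2 * arg; }
static long sys_add_one(long arg) { return arg + 1; }

/* The vector of handlers, one entry per system call number. */
static const syscall_handler syscall_vector[] = {
    sys_double,     /* system call 0 */
    sys_add_one,    /* system call 1 */
};

#define N_SYSCALLS (sizeof syscall_vector / sizeof syscall_vector[0])

/* Select a handler using the system call number.  An out-of-range
   number must be rejected, since the number comes from the user.
   This sketch returns -1 in that case; a real system would use a
   distinct error indication. */
long dispatch(unsigned long number, long arg)
{
    if (number >= N_SYSCALLS)
        return -1;
    return syscall_vector[number](arg);
}
```

The bounds check matters for security: the system call number is supplied by untrusted code, and an unchecked index would let a user program steer the kernel to an arbitrary address.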

User and System in the same Address Space

In the 1960s, the developers of Multics invented an alternative way to do system calls. In their model, each page of the address space was marked with its protection level. In a simplified two-level version of this, pages are marked as either user pages or system pages. In user state, programs may not access system pages. This allows the operating system and the user program to share the same address space. When the user attempts to use or modify data in a system page, there is a trap.

The developers of Multics added one more bit to each page of the address space (actually, to each page table entry). If this bit was set, it marked that page as a gateway. User programs could not read or write data from system pages, nor could they jump to arbitrary locations in system pages, but if a user program executed a call instruction to address zero in a system page that was marked as a gateway, the call was permitted. This allows each system call to be implemented as a call to a different gateway into the operating system.

Calls to gateway pages must push not only the return address but also the protection state of the CPU onto the stack, so that a return from the gateway will return to the caller's protection state.

The developers of Multics went overboard with their design, using a 4-bit protection state that they called a ring number. Memory references were legal if the current protection state was less than or equal to the ring number marked on the page. So, ring 0000 was the innermost and most secure level, while level 1111 was the outermost or least secure level.
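
The access rules described in this section can be written down as two small predicates. This is a model of the rules as stated in these notes, not of the actual Multics hardware. The structure and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-page protection marking. */
struct page {
    unsigned ring;      /* ring number marked on the page */
    bool     gateway;   /* gateway bit from the page table entry */
};

/* A memory reference is legal if the current ring number is less than
   or equal to the ring number marked on the page. */
bool may_access(unsigned current_ring, const struct page *p)
{
    return current_ring <= p->ring;
}

/* A call into a page is legal if ordinary access is legal, or if the
   page is marked as a gateway and the call is to address zero within
   the page. */
bool may_call(unsigned current_ring, const struct page *p, uint32_t offset)
{
    if (may_access(current_ring, p))
        return true;
    return p->gateway && offset == 0;
}
```

With these rules, code running in the outermost ring can enter a ring 0 gateway page only at its first location, while code in ring 0 can touch any page.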

Cloaking the System Call Mechanism

Back to the original question: How is it that calls to routines in the standard library and system calls look the same in C or C++ programs? Calls to user routines use simple subroutine call instructions, while calls to system routines must do strange things for gate crossing.

The answer is that the developers of Unix and all systems descended from Unix have created a special library of regular subroutines, one per system call. When you call, say, sbrk, this is a call to a library routine that gets the parameters, puts them in the right places to pass to the actual system call, then does whatever is required to make the system call occur. The trap handler then follows through, figuring out what system call is involved, getting the parameters for that call from the user, and then using a regular call instruction to call the real sbrk routine in the operating system.

This ingenious approach to the problem of system calls hides the complexity from most users and from most system programs. Only the caller-side stub routines in the library and the code of the trap handler that calls the real system call need to be written in assembly language with an understanding of the eccentric mechanisms of the particular CPU being used. Everything else is just normal code.
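
A caller-side stub can be imitated from C on systems that provide the generic syscall() library routine, which itself hides the machine-dependent gate crossing. The sketch below assumes Linux with glibc, where syscall() is declared in unistd.h and the system call number SYS_getpid in sys/syscall.h. The stub name my_getpid is hypothetical, and the real library stubs are not necessarily written this way.

```c
#define _GNU_SOURCE        /* ask glibc to declare syscall() */
#include <sys/syscall.h>   /* SYS_getpid, the system call number */
#include <sys/types.h>     /* pid_t */
#include <unistd.h>        /* syscall() and the standard getpid() stub */

/* Hypothetical stub: to its caller this is an ordinary C function,
   but its body does nothing except request the system call by number. */
pid_t my_getpid(void)
{
    return (pid_t) syscall(SYS_getpid);
}
```

A caller cannot tell my_getpid() from the standard getpid(): both are called with ordinary subroutine call instructions and both return the same process ID.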

References

General concepts covered here are covered in several other places. Google finds many copies of lecture notes online that cover this material, but the following is more likely to remain available:

IBM's Tutorial on the system call mechanism in their AIX flavor of Unix is very specific to one system but otherwise covers the right concepts.

General definitions of many of the terms used here can be found in or are linked from:

The Wikipedia entry for System call includes pointers to a number of relevant definitions