The Unix/Linux System Interface

Part of 22C:169, Computer Security Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

General

On any Unix or Linux system, the command man xxx will give you the "man page" for the xxx command, subroutine or system call. This is a page from the Unix Programmer's Reference Manual as customized for that system.

The manual is divided into sections. You can restrict your search to one section of the manual using commands such as the following:

man 1 xxx -- looks for xxx as a command
man 2 xxx -- looks for xxx as a system call
man 3 xxx -- looks for xxx in the standard library

Here, we're concerned with section 2 of the manual, documenting the system calls.

Modifying a Process's Memory Resources

A process can modify its memory resources using the following kernel calls:

char * brk( const char *addr )

Given addr, a memory address within the range of virtual addresses assigned to the data segment: if addr is outside the currently allocated data segment, enlarge the segment to include addr; if addr is inside the currently allocated data segment, shrink the segment so that it just includes addr and no more. As declared here, this returns a pointer to the new end of the data segment, which is generally not equal to addr because of rounding to a page boundary. Declarations vary between systems; on Linux, for example, the library's brk() returns 0 on success and -1 on error.

char * sbrk( int incr )

Enlarge the data segment by incr bytes. This returns a pointer to the first byte of the block of memory that resulted from enlarging the data segment. sbrk(0) therefore returns a pointer to the current end of the data segment.

Note that most users never call sbrk(); rather, users use some kind of heap manager in the standard library for whatever programming language they are using. This manager is expected to call sbrk() when it needs to enlarge the heap. For example, C and C++ programmers may call malloc() and free() to allocate and deallocate space for objects on the heap. Usually, C++ programmers don't even do this, because the C++ new operator allocates the space, typically by calling malloc(), before the constructor for the object's class is run.
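As a minimal sketch of what a heap manager does underneath, the following calls sbrk() directly to enlarge the data segment. The helper names grow_heap and heap_growth are hypothetical, and sbrk() is deprecated on some current systems, so this is an illustration and not recommended practice.

```c
#define _DEFAULT_SOURCE		/* asks glibc to declare sbrk() */
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: enlarge the data segment by nbytes and return
 * a pointer to the start of the new block, or NULL on failure. */
void *grow_heap(intptr_t nbytes)
{
	void *block = sbrk(nbytes);	/* returns the old end of the segment */

	if (block == (void *) -1) return NULL;
	return block;
}

/* Hypothetical helper: report how many bytes the end of the data
 * segment moved as a result of one call to grow_heap(), or -1. */
long heap_growth(intptr_t nbytes)
{
	char *before = sbrk(0);		/* current end of the data segment */
	char *block = grow_heap(nbytes);
	char *after = sbrk(0);		/* end of the segment after growing */

	if (block == NULL) return -1;
	return (long) (after - before);
}
```

Calling heap_growth(4096) in a single-threaded program should report that the end of the data segment moved by exactly the number of bytes requested.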

int execve( const char * path, char * const argv[], char * const envp[] )

This kernel call causes the process to abandon its former code, data and stack segments. The file named by the character string path is opened; if this file is an executable and begins with an object file header, the code segment of that file is mapped into memory as the new read-only executable code segment, and the data segment of that file is used as the initial value of the new read-write data segment. A new empty stack segment is then allocated and initialized with the strings in the arrays argv and envp, and new arrays of pointers to these copies are created and pointers to these two arrays are passed as parameters to the main program in the new code segment.

If the open file is executable and begins with the "magic" characters #!, the file is interpreted as a file that is supposed to be submitted to an interpreter. The bytes immediately following the #! characters, up until the next blank or end of line, are taken as the name of an interpreter, and this interpreter is executed. The interpreter gets the name of the execved file as its first argument, so it may open that file and execute it. Other arguments provided by the caller to execve are shifted over to allow for this.

If the executed file has the SUID or SGID bits set in its mode, the process's effective user ID and/or effective group ID are set to the user ID and group ID of the file, respectively.

execve() only returns to the caller if there was an error, for example, if the indicated file was not an executable object file, the indicated interpreter could not be found, or the file did not begin with the necessary magic characters that signified the start of an interpreter or loadable object file.

Note that, originally, Unix supported a simple exec call that didn't deal with parameters or environment variables. As the system evolved, such older interfaces were eventually superseded and abandoned. In some cases, standard library routines have been added that call the new system interface routines in order to provide support for old interfaces, but this was only done if it was found that there were user programs that needed these. Few user programs directly call any of the exec services.

In fact, all of the above services can boil down to sequences of primitive executable instructions plus calls to the following two services. Note, however, that these services are relatively late additions to the Unix system interface. The original interface did not include these general mechanisms, and the original interface did not assume the availability of a memory management unit sufficiently flexible to implement these.

void * mmap( void * addr, size_t len, int prot, int flags, int fd, off_t offset)

This causes consecutive pages of the address space continuing for at most len bytes to be mapped so that they reference the contents of the open file fd starting at byte offset of the file. Because of the use of paged virtual memory and the sector structure of files, the actual length of the mapped memory segment will vary. addr is a hint to the system suggesting where in the address space the mapped region should fall; the actual starting address of the mapped range of addresses will be returned. The system is not obligated to take the addressing hint, but if addr is aligned on a page boundary and the next len bytes of the address space are currently unused, taking the hint is certainly easy.

The access rights will be set to prot, which can be some combination of PROT_READ, PROT_WRITE or PROT_EXEC, using the or operator to combine the desired rights.

The flags option controls how the region is shared. MAP_FILE and MAP_ANON are alternatives, depending on whether you want to map pages of a file or just pages with no connection to a file. MAP_PRIVATE and MAP_SHARED are alternatives, depending on whether you want your changes to the mapped region to be seen by other users or you want those changes to be private to you. Obviously, this is irrelevant unless you have PROT_WRITE access to the segment. MAP_FIXED forces the segment to start at addr, which had better be the address of the first byte of some memory page.

On many systems, the code segment is implemented by giving access to the executable file as PROT_READ+PROT_EXEC, MAP_SHARED. The stack and static segments, in contrast, are PROT_READ+PROT_WRITE, MAP_PRIVATE.

int munmap( void * addr, size_t len )

Undoes any mapping of pages from addresses addr to addr+len.
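To make the mmap() and munmap() descriptions concrete, here is a minimal sketch that maps a file read-only and scans it as if it were an array in memory. The helper name count_newlines is hypothetical, and the use of fstat() to find the file length is an assumption of this sketch and not something these notes cover.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count newline characters in a file by mapping it
 * into memory instead of calling read().  Returns -1 on any error. */
long count_newlines(const char *path)
{
	struct stat st;
	long count = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0) { perror("open"); return -1; }
	if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return -1; }
	if (st.st_size == 0) { close(fd); return 0; }	/* cannot map 0 bytes */

	/* addr is NULL: no hint is given, so the system picks the address */
	char *p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);	/* the mapping stays valid after the descriptor is closed */
	if (p == MAP_FAILED) { perror("mmap"); return -1; }

	for (off_t i = 0; i < st.st_size; i++)
		if (p[i] == '\n') count++;

	munmap(p, (size_t) st.st_size);	/* undo the mapping */
	return count;
}
```

Because the file is only read, MAP_PRIVATE versus MAP_SHARED makes no observable difference here; MAP_PRIVATE was chosen arbitrarily.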

Modifying a Process's File Resources

Just as the memory addresses in a program are essentially integers in the range from 0 to some maximum (usually 2³²-1), open files in a program are referenced by integer file descriptors in the range 0 up to some maximum (usually 31). The files numbered 0, 1 and 2 correspond to standard input, standard output and standard error, respectively.

fd = int open( const char * path, int flags, mode_t mode )

Associates a new file descriptor fd with the file named by path. The number assigned to fd will be the lowest numbered file descriptor for the current process that is currently not in use. The argument flags indicates the maximum desired access for the file; this must include exactly one of O_RDONLY, O_WRONLY or O_RDWR, combined using the or operator with any of a number of other options, such as O_CREAT to create the file if it does not already exist.

The mode argument is used if the open command has to create the file. This gives the access rights for the file, as it is allocated on disk. Once a file is opened, the flags argument determines what operations may actually be performed on the file. An attempt to open a file for which the user has insufficient access will result in failure.

If the process's effective user ID matches the user ID of the file, the maximum rights the process may have are set by the owner rights. If this is not the case, but the process's effective group ID matches the group ID of the file, then the process's maximal rights are determined by the group rights. Finally, if neither is the case, the process is entitled to the rights determined by the file's other rights.
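The rule just described can be sketched as a small function. The name applicable_rights is hypothetical, and the sketch follows only the simplified rule given in the text: real kernels also consult supplementary groups and treat the superuser specially, both of which are ignored here.

```c
#include <sys/types.h>

/* Hypothetical helper following the rule in the text: select the owner,
 * group or other triple from mode and return it as a 3-bit value
 * (4 = read, 2 = write, 1 = execute).  Supplementary groups and the
 * superuser are deliberately ignored in this sketch. */
unsigned applicable_rights(mode_t mode, uid_t file_uid, gid_t file_gid,
	uid_t euid, gid_t egid)
{
	if (euid == file_uid) return (mode >> 6) & 7;	/* owner rights */
	if (egid == file_gid) return (mode >> 3) & 7;	/* group rights */
	return mode & 7;				/* other rights */
}
```

Note that the first matching category wins. With mode 0047, the owner gets no rights at all, even though others may read, write and execute the file.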

On success, the open() call returns fd; this may be passed to the read() or write() system calls to directly read or write the file, or it may be used as an argument to mmap to map the file into the memory address space of the process.
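The following sketch shows open() used both to create a file and to reopen an existing one. The helper name write_then_read is hypothetical, and the flags O_CREAT and O_TRUNC are among the "other options" for the flags argument that these notes do not list.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create (or truncate) path with owner read/write
 * permission, write text to it, then reopen it read-only and read the
 * text back into buf.  Returns the number of bytes read, or -1 on error. */
ssize_t write_then_read(const char *path, const char *text,
	char *buf, size_t buflen)
{
	/* mode 0600 matters only if open() has to create the file */
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0) return -1;
	if (write(fd, text, strlen(text)) < 0) { close(fd); return -1; }
	close(fd);

	fd = open(path, O_RDONLY);	/* no mode needed without O_CREAT */
	if (fd < 0) return -1;
	ssize_t n = read(fd, buf, buflen);
	close(fd);
	return n;
}
```

The descriptor opened with O_WRONLY could not have been used for the read() step, which is why the sketch closes and reopens the file instead of reusing it.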

int fchmod( int fd, mode_t mode )

int chmod( const char * path, mode_t mode )

The fchmod() kernel call changes the access rights for an already opened file to mode. The kernel call chmod() is equivalent to opening the file, then using fchmod() on the open file, then closing it.

The mode is a 12 bit string, arranged as follows:

special           owner    group    other
SUID SGID SVTX    R W X    R W X    R W X

In the above, the R, W and X access rights correspond to the rights to read, write or execute the file, on behalf of the file owner, the users in the group associated with the file, or others who are neither the owner nor members of the group. Each file has, associated with it, an owner (by default, the user ID of the user who created the file), and a group (by default, the group ID taken from the directory in which the file was created).
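To illustrate the bit layout, here is a minimal sketch that decodes a mode value. Both helper names are hypothetical. The octal constant 07000 covers the three special bits, and the low 9 bits hold the three R W X triples.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: render the 9 permission bits of mode as the
 * familiar "rwxr-x---" style string.  out must hold 10 characters. */
void mode_to_string(mode_t mode, char out[10])
{
	static const char letters[] = "rwxrwxrwx";

	for (int i = 0; i < 9; i++) {
		/* bit 8 is owner read, bit 0 is other execute */
		mode_t bit = (mode_t) 1 << (8 - i);
		out[i] = (mode & bit) ? letters[i] : '-';
	}
	out[9] = '\0';
}

/* Nonzero if any of the three special bits (SUID, SGID, SVTX) is set. */
int has_special_bits(mode_t mode)
{
	return (mode & 07000) != 0;
}
```

For example, mode 0644 decodes to rw-r--r-- and mode 04755 has the SUID bit set.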

fd = int dup( int oldd )

fd is set to the lowest numbered unused file descriptor, as with open(). Given that the file descriptor oldd refers to an open file, the new file descriptor fd is set to refer to the same resource as is referred to by oldd. The two descriptors fd and oldd are therefore aliases for the same actual object, so that, for example, reading or writing either of them will have the identical effect.

If successful, dup() returns fd.
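One consequence of the aliasing is that both descriptors share a single file position. The following sketch, with the hypothetical name dup_shares_position, writes three bytes through each descriptor and reports the resulting file size; fstat() is used here only to obtain that size.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical demonstration: write through a descriptor and through its
 * dup() alias, then report the file size.  Because both descriptors refer
 * to the same open file, the second write lands after the first instead
 * of overwriting it.  Returns the size in bytes, or -1 on error. */
long dup_shares_position(const char *path)
{
	struct stat st;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0) return -1;

	int alias = dup(fd);
	if (alias < 0) { close(fd); return -1; }

	if (write(fd, "abc", 3) != 3 || write(alias, "def", 3) != 3
	    || fstat(alias, &st) < 0)
		st.st_size = -1;	/* some step failed */

	close(alias);
	close(fd);
	return (long) st.st_size;
}
```

Had the file been opened twice with open() instead, each descriptor would have had its own position, and the second write would have overwritten the first.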

int pipe( int * filedes )

This creates a pipe object, a bounded FIFO buffer, where write() operations append to the buffer and read() operations consume data from the buffer.

The two lowest numbered unused file descriptors are allocated as readfd and writefd, referring respectively to the read and write ends of the new pipe object. These are returned in the array filedes so that filedes[0]=readfd and filedes[1]=writefd.
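As a minimal sketch, the following pushes a short message through a pipe within a single process. The helper name pipe_roundtrip is hypothetical, and the message is assumed to be small enough to fit in the pipe's buffer; a larger message would block the write(), since nothing reads until the write has finished.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical demonstration: send a short message through a pipe and
 * read it back.  Returns the number of bytes read, or -1 on error. */
ssize_t pipe_roundtrip(const char *msg, char *buf, size_t buflen)
{
	int filedes[2];
	if (pipe(filedes) < 0) return -1;

	/* filedes[1] is the write end, filedes[0] is the read end */
	ssize_t n = write(filedes[1], msg, strlen(msg));
	close(filedes[1]);	/* no writers remain */
	if (n >= 0) n = read(filedes[0], buf, buflen);
	close(filedes[0]);
	return n;
}
```

In practice a pipe is usually created before a fork(), so that one process holds the write end and the other holds the read end.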

int close( int fd )

Detaches the file descriptor from any object to which it has been attached, for example by open().

ssize_t read( int fd, void * buffer, size_t nbytes )

ssize_t write( int fd, const void * buffer, size_t nbytes )

These two operations are the fundamental operations on files, reading and writing from a stream of bytes.
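One detail not stated above is that a single write() may transfer fewer bytes than requested, so careful code loops until everything has been written. The following is a sketch of such a loop; the helper name write_all is hypothetical.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all nbytes have been
 * written, since one call may transfer fewer bytes than requested.
 * Returns 0 on success, -1 on error. */
int write_all(int fd, const void *buffer, size_t nbytes)
{
	const char *p = buffer;

	while (nbytes > 0) {
		ssize_t n = write(fd, p, nbytes);
		if (n < 0) {
			if (errno == EINTR) continue;	/* interrupted: retry */
			return -1;
		}
		p += n;			/* skip past what was written */
		nbytes -= (size_t) n;
	}
	return 0;
}
```

A matching loop for read() would also have to stop when read() returns zero, which signals the end of the stream.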

Creating and Destroying Processes

pid_t fork( void )

Creates a new process with the process id pid. That process will initially appear identical to the parent process except for the value returned by fork(). The child will get a return value of zero, while the parent will get pid, the identity of the child.

All files that were open in the parent process at the time of the fork will be open in the child, and these files will all be shared between the parent and child.

All memory that was mapped into the address space of the parent will be mapped into the address space of the child. The read-only program segment will be shared. The read-write data and the stack segments will be copied, so that the parent and child have separate copies of all variables. Any files mapped into the address space by mmap will be mapped into both address spaces. If the file was inserted into the address space with the MAP_SHARED attribute, then the segment mapped to that file will be fully shared by the parent and child.

void _exit(int status)

Terminates the process. Most applications call the standard library routine exit(), without the leading underscore, because this does nice things like cleanly closing open files before it calls _exit(). In either case, the low order 8 bits of status are used as the exit status of the process, usually either EXIT_SUCCESS (zero) or EXIT_FAILURE (some nonzero value).

pid_t wait(int * status)

When a process calls wait() it is blocked until either some child process terminates or until a signal is delivered indicating that something else occurred. In the event that a child terminates, the process ID of the child is returned. If status is not NULL, the integer variable pointed to by status is set to hold the 8-bit exit status of the child, plus a considerable amount of other information coded in the other bits of the value. The macro WEXITSTATUS, applied to that integer, returns the 8-bit exit status extracted from this value. Other macros are available to determine exactly how the child process terminated or was terminated.
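A minimal sketch of the _exit() and wait() pairing follows. The helper name child_exit_status is hypothetical, and WIFEXITED is one of the other macros alluded to above; it tests whether the child terminated normally.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits immediately with the given
 * code, and return the exit status the parent recovers, or -1 on error. */
int child_exit_status(int code)
{
	int status;
	pid_t pid = fork();

	if (pid < 0) return -1;
	if (pid == 0) _exit(code);	/* child: terminate at once */

	/* parent: keep waiting if some other child terminated first
	 * or if the call was interrupted by a signal */
	pid_t w;
	do {
		w = wait(&status);
	} while (w != pid && (w >= 0 || errno == EINTR));
	if (w != pid) return -1;

	/* only a normally terminated child has a meaningful exit status */
	if (!WIFEXITED(status)) return -1;
	return WEXITSTATUS(status);
}
```

Because only the low order 8 bits of the status survive, child_exit_status(259) should report 3, the same as child_exit_status(3).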

An Example

This little chunk of C code allows the running program to execute a program called myprogram and wait for it to terminate before the calling program continues.

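{
	/* assumes <stdio.h>, <stdlib.h>, <unistd.h> and <sys/wait.h> */
	pid_t pid;
	char * const argv[] = { "myprogram", NULL };
	char * const envp[] = { NULL };

	pid = fork();
	if (pid < 0) {
		/* fork failed; no child process was created */
		perror( "fork" );
	} else if (pid > 0) {
		/* parent process with nonzero pid */

		/* wait for child to terminate */
		while (pid != wait( NULL )) /* do nothing */;
	} else {
		/* child process with returned pid set to zero */
		(void) execve( "myprogram", argv, envp );

		/* execve only returns if it failed */
		_exit( EXIT_FAILURE );
	}
}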
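The same pattern can be packaged as a function that also reports how the child finished. This is only a sketch: the name run_and_wait is hypothetical, and the value 127 used when execve() fails is a convention borrowed from the shell, not something these notes specify.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run the program at path with the given argument
 * and environment arrays, wait for it, and return its exit status.
 * Returns -1 if fork fails or the child did not terminate normally. */
int run_and_wait(const char *path, char *const argv[], char *const envp[])
{
	int status;
	pid_t pid = fork();

	if (pid < 0) return -1;
	if (pid == 0) {
		/* child: replace this program with the one at path */
		execve(path, argv, envp);
		_exit(127);	/* reached only if execve failed */
	}

	/* parent: wait until this particular child has terminated */
	pid_t w;
	do {
		w = wait(&status);
	} while (w != pid && (w >= 0 || errno == EINTR));
	if (w != pid || !WIFEXITED(status)) return -1;
	return WEXITSTATUS(status);
}
```

For example, assuming /bin/sh exists, passing it the argument array { "sh", "-c", "exit 3", NULL } and an empty environment should make run_and_wait() return 3. Both arrays must end with a NULL pointer.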
