2. UNIX

Part of the 22C:169 Lecture Notes for Spring 2006
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Introduction

Unix and its clones, such as Linux, illustrate many operating system features that are critically important to system security. Here, we summarize these features, looking particularly at the kernel interface, that is, the set of services that an application process may call. Most users never directly use these kernel services; rather, they call various middleware routines that then call the kernel services for them.

To look up the documentation for a kernel service on a Unix or Linux system, use the man command. For example, to get documentation on the kernel service for reading from a file, the read service, type man 2 read. The numeral 2 is essential here! It names the section of the programmer's reference manual you want to look in. Section 2 is the kernel interface, which is what we care about here, while section 1 is shell commands and section 3 is the standard library.

Note that the man page for each service mentioned here ends with a see also section that references other services. The services listed here only scratch the surface of the full Unix/Linux system, and you can follow chains of references from the see-also lists to get the big picture.

The Unix Process Model

A Unix (or Linux) process has direct access to the following resources:

memory resources
    code segment      read only
    stack segment     read write
    data segment      read write
    other segments    ...
open file resources
    standard input    read only
    standard output   write only
    standard error    write only
    other files       ...

Note that the definition of a segment under Unix has no necessary relationship to the use of the term segment by the designers of the memory management unit used on the system. Unix defines a segment as a range of consecutive virtual addresses that are seen by the application program as being consecutive, irrespective of how the memory management unit manages this job. Of course, whoever writes the virtual memory software for a particular memory management unit must figure out how to implement the Unix memory model using whatever tools the host memory management unit provides, and they may well opt to use one hardware-defined segment for each Unix segment, but many Unix implementations don't do this.

In any case, all memory resources of the process are accessed through the memory management unit, while all file resources are accessed through the file system software in the kernel.

A Unix process has a user ID (by default, the ID of the user who ran the process) and a group ID (by default, the ID of the group with which that user is associated).

Modifying a Process's Memory Resources

A process can modify its memory resources using the following kernel calls:

char * brk( const char *addr )
Given addr, a memory address within the range of virtual addresses assigned to the data segment, if addr is outside of the currently allocated data, enlarge the data segment to include addr, or if addr is in the currently allocated data segment, shrink the segment so that it just includes addr and no more. This returns a pointer to the new end of the data segment, which is generally not equal to addr because of rounding to a page boundary.

char * sbrk( int incr )
Enlarge the data segment by incr bytes. This returns a pointer to the first byte of the block of memory that resulted from enlarging the data segment. sbrk(0) therefore returns a pointer to the current end of the data segment.

Note that most users never call sbrk(); rather, users use some kind of heap manager in the standard library for whatever programming language they are using. This manager is expected to call sbrk() when it needs to enlarge the heap. For example, C and C++ programmers may call malloc() and free() to allocate and deallocate objects on the heap.
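To make the behavior of sbrk() concrete, here is a minimal sketch that enlarges the data segment, measures how far its end moved, and then shrinks it again. The function name grow_data_segment is hypothetical, and the feature-test macro at the top is an assumption about the C library in use (glibc hides sbrk() under strict standard modes).

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* assumption: asks glibc to declare sbrk() */
#endif
#include <stdint.h>
#include <unistd.h>

/* Grow the data segment by incr bytes and report how far its end moved.
 * Returns -1 if sbrk() fails. The function name is hypothetical. */
long grow_data_segment(intptr_t incr)
{
	char *before = sbrk(0);          /* current end of the data segment */
	if (sbrk(incr) == (void *) -1)   /* enlarge; returns the old end */
		return -1;
	char *after = sbrk(0);           /* new end of the data segment */
	sbrk(-incr);                     /* give the memory back */
	return (long) (after - before);
}
```

Calling grow_data_segment(4096) should report 4096, provided nothing else, such as the heap manager, moves the end of the data segment between the calls.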

int execve( const char * path, char * const argv[], char * const envp[] )
This kernel call causes the process to abandon its former code, data and stack segments. The file named by the character string path is opened; if this file is an executable and begins with an object file header, the code segment of that file is mapped into memory as the new read-only executable code segment, and the data segment of that file is used as the initial value of the new read-write data segment. A new empty stack segment is then allocated and initialized with the strings in the arrays argv and envp, and new arrays of pointers to these copies are created and pointers to these two arrays are passed as parameters to the main program in the new code segment.

If the open file is executable and begins with the "magic" characters #!, the file is interpreted as a file that is supposed to be submitted to an interpreter. The bytes immediately following the #! characters, up until the next blank or end of line, are taken as the name of an interpreter, and this interpreter is executed. The interpreter gets the name of the execved file as its first argument, so it may open that file and execute it. Other arguments provided by the caller to execve are shifted over to allow for this.

If the executed file has the SUID or SGID bits set in its mode, the process's effective user ID and/or effective group ID are set to the user ID and group ID of the file.

execve() only returns to the caller if there was an error, for example, if the indicated file was not an executable object file, the indicated interpreter could not be found, or the file did not begin with the necessary magic characters that signified the start of an interpreter or loadable object file.
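The fact that execve() returns only on failure can be sketched as follows. The function name try_exec is hypothetical, and the one-element argument list and empty environment are assumptions made to keep the example short.

```c
#include <errno.h>
#include <unistd.h>

/* Attempt to replace the current program with the one at path.
 * On success this function never returns, because the calling code is gone.
 * On failure it returns the errno value set by execve().
 * The function name is hypothetical. */
int try_exec(const char *path)
{
	char *const argv[] = { (char *) path, NULL };  /* argv[0] is the program name by convention */
	char *const envp[] = { NULL };                 /* empty environment */

	execve(path, argv, envp);
	return errno;   /* reached only if execve() failed */
}
```

Given a path that does not exist, try_exec() returns ENOENT; given a valid executable, the caller is replaced and nothing after the call runs.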

In fact, all of the above services can boil down to sequences of primitive executable instructions plus calls to the following two services:

void * mmap( void * addr, size_t len, int prot, int flags, int fd, off_t offset)
This causes consecutive pages of the address space continuing for at most len bytes to be mapped so that they reference the contents of the open file fd starting at byte offset of the file. Because of the use of paged virtual memory and the sector structure of files, the actual length of the mapped memory segment will vary. addr is a hint to the system suggesting where in the address space the mapped region should fall; the actual starting address of the mapped range of addresses will be returned.

The access rights will be set to prot, which can be some combination of PROT_READ, PROT_WRITE or PROT_EXEC, using the or operator to combine the desired rights.

The flags option controls how the region is shared. MAP_FILE and MAP_ANON are alternatives, depending on whether you want to map pages of a file or just pages with no connection to a file. MAP_PRIVATE and MAP_SHARED are alternatives, depending on whether you want your changes to the mapped region to be seen by other users or you want those changes to be private to you. Obviously, this is irrelevant unless you have PROT_WRITE access to the segment. MAP_FIXED forces the segment to start at addr, which had better be the address of the first byte of some memory page.

int munmap( void * addr, size_t len )
Undoes any mapping of pages from addresses addr to addr+len.
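As an illustration of the two mapping calls, the sketch below maps a region of anonymous memory, stores and reloads one byte, and then unmaps the region. The function name anon_map_roundtrip is hypothetical, the feature-test macro is an assumption about the C library, and passing -1 as the file descriptor for an anonymous mapping is a common convention rather than something stated above.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* assumption: asks glibc to define MAP_ANON */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Map len bytes of anonymous memory, store a byte, read it back, and unmap.
 * Returns the byte read back, or -1 on failure. The name is hypothetical. */
int anon_map_roundtrip(size_t len, unsigned char value)
{
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	                        MAP_PRIVATE | MAP_ANON, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	p[0] = value;            /* the region is readable and writable */
	int seen = p[0];
	if (munmap(p, len) != 0) /* undo the mapping */
		return -1;
	return seen;
}
```

Passing NULL as addr leaves the choice of address entirely to the system, which is usually what an application wants unless it has a specific layout in mind.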

Modifying a Process's File Resources

Just as the memory addresses in a program are essentially integers in the range from 0 to some maximum (usually 2^32 - 1), open files in a program are referenced by integer file descriptors in the range 0 up to some maximum (usually 31). The files numbered 0, 1 and 2 correspond to standard input, standard output and standard error, respectively.

int open( const char * path, int flags, mode_t mode )
Associates a new file descriptor fd with the file named by path. The number assigned to fd will be the lowest numbered file descriptor for the current process that is currently not in use. The argument flags indicates the maximum desired access for the file; this may be some combination of: O_RDONLY, O_WRONLY, O_RDWR, or a number of other options, combined with the or operator as needed.

The mode argument is used if the open command has to create the file. This gives the access rights for the file, as it is allocated on disk. Once a file is opened, the flags argument determines what operations may actually be performed on the file. An attempt to open a file for which the user has insufficient access will result in failure.

If the process's effective user ID matches the user ID of the file, the maximum rights the process may have are set by the owner rights. If this is not the case, but the process's effective group ID matches the group ID of the file, then the process's maximal rights are determined by the group rights. Finally, if neither of these is the case, the process is entitled to the rights determined by the file's other rights.

On success, the open() call returns fd; this may be passed to the read() or write() system calls to directly read or write the file, or it may be used as an argument to mmap to map the file into the memory address space of the process.
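A short sketch of open() in use: it creates a scratch file, writes a message, reopens the file read-only, and checks that the same bytes come back. The function name write_then_read is hypothetical, and unlink(), used here only to remove the scratch file, is a kernel call not described in these notes.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Create path, write msg to it, reopen it read-only, and read it back.
 * Returns 0 if the bytes read match msg, -1 otherwise. Name is hypothetical. */
int write_then_read(const char *path, const char *msg)
{
	char buf[128];
	size_t len = strlen(msg);
	if (len > sizeof buf)
		return -1;

	/* mode 0600 is used only if O_CREAT actually creates the file */
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	ssize_t put = write(fd, msg, len);
	close(fd);
	if (put != (ssize_t) len)
		return -1;

	fd = open(path, O_RDONLY);   /* no mode needed: the file exists */
	if (fd < 0)
		return -1;
	ssize_t got = read(fd, buf, sizeof buf);
	close(fd);
	unlink(path);                /* remove the scratch file */

	return (got == (ssize_t) len && memcmp(buf, msg, len) == 0) ? 0 : -1;
}
```

For example, write_then_read("/tmp/demo.txt", "hello") should return 0 on a system where /tmp is writable.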

int fchmod( int fd, mode_t mode )
int chmod( const char * path, mode_t mode )
The fchmod() kernel call changes the access rights for an already opened file to mode. The kernel call chmod() is equivalent to opening the file, then using fchmod() on the open file, then closing it.

The mode is a 12-bit string, arranged as follows:

    special            owner    group    other
    SUID SGID SVTX     R W X    R W X    R W X

In the above, the R, W and X access rights correspond to the rights to read, write or execute the file, on behalf of the file owner, the users in the group associated with the file, or others who are neither the owner nor members of the group. Each file has an associated owner (by default, the user ID of the user who created the file) and a group (by default, the group ID taken from the directory in which the file was created).
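The mode layout, combined with the owner, group, other rule described under open(), can be sketched as a small pure function. This is a simplified model, not the kernel's actual code: the function name applicable_rights is hypothetical, and the sketch ignores the superuser and any supplementary groups.

```c
#include <sys/types.h>

/* Return the 3-bit RWX field (R=4, W=2, X=1) that applies to a process,
 * checking owner first, then group, then other. The name is hypothetical;
 * the superuser and supplementary groups are deliberately ignored. */
int applicable_rights(mode_t mode, uid_t file_uid, gid_t file_gid,
                      uid_t euid, gid_t egid)
{
	if (euid == file_uid)
		return (int) ((mode >> 6) & 7);   /* owner bits */
	if (egid == file_gid)
		return (int) ((mode >> 3) & 7);   /* group bits */
	return (int) (mode & 7);              /* other bits */
}
```

Note that the first matching class wins: with mode 0077, the owner gets no rights at all even though everyone else has full access.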

int dup( int oldd )
fd is set to the lowest numbered unused file descriptor, as with open(). Given that the file descriptor oldd refers to an open file, the new file descriptor fd is set to refer to the same resource as is referred to by oldd. The two descriptors fd and oldd are therefore aliases for the same actual object, so that, for example, reading or writing either of them will have the identical effect.

If successful, dup() returns fd.
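One consequence of the aliasing is that both descriptors share a single position in the file. The sketch below, with the hypothetical name dup_shares_position, writes through each descriptor in turn and checks that the second write lands after the first rather than on top of it. As before, unlink() is used only to remove the scratch file.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write "ab" through fd and "cd" through a dup() of it, then read the
 * file back. Returns 0 if the file holds "abcd". Name is hypothetical. */
int dup_shares_position(const char *path)
{
	char buf[8];
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	int alias = dup(fd);   /* lowest numbered unused descriptor */
	if (alias < 0) {
		close(fd);
		return -1;
	}
	int ok = write(fd, "ab", 2) == 2 && write(alias, "cd", 2) == 2;
	close(fd);
	close(alias);
	if (!ok)
		return -1;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	ssize_t got = read(fd, buf, sizeof buf);
	close(fd);
	unlink(path);          /* remove the scratch file */
	return (got == 4 && memcmp(buf, "abcd", 4) == 0) ? 0 : -1;
}
```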

int pipe( int * filedes )
This creates a pipe object, a bounded FIFO buffer, where write() operations append to the buffer and read() operations consume data from the buffer.

The two lowest numbered unused file descriptors are allocated as readfd and writefd, referring respectively to the read and write ends of the new pipe object. These are returned in the array filedes so that filedes[0]=readfd and filedes[1]=writefd.
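The FIFO behavior can be seen even inside a single process. In this sketch, the function name pipe_roundtrip is hypothetical, and the message is assumed to be short and non-empty so that neither write() nor read() blocks.

```c
#include <string.h>
#include <unistd.h>

/* Push msg through a pipe inside one process and read it back.
 * Returns 0 if the bytes come out as they went in. Name is hypothetical.
 * msg must be non-empty and small enough to fit in the pipe's buffer. */
int pipe_roundtrip(const char *msg)
{
	int filedes[2];
	char buf[64];
	size_t len = strlen(msg);
	if (len == 0 || len > sizeof buf || pipe(filedes) != 0)
		return -1;
	ssize_t put = write(filedes[1], msg, len);        /* write end */
	ssize_t got = read(filedes[0], buf, sizeof buf);  /* read end */
	close(filedes[0]);
	close(filedes[1]);
	return (put == (ssize_t) len && got == put &&
	        memcmp(buf, msg, len) == 0) ? 0 : -1;
}
```

In practice a pipe is usually created before fork(), so that one process holds the write end and the other holds the read end.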

int close( int fd )
Detaches the file descriptor from any object to which it has been attached, for example by open().

int read( int fd, void * buffer, int nbytes )
int write( int fd, void * buffer, int nbytes )
These two operations are the fundamental operations on files, reading and writing from a stream of bytes.
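Both calls may transfer fewer bytes than requested, so code that moves a whole stream normally loops. The following is a minimal sketch of such a copy loop; the names write_all and copy_fd are hypothetical, and the 4096-byte buffer size is an arbitrary choice.

```c
#include <errno.h>
#include <unistd.h>

/* Write all n bytes of buf to fd, retrying after short writes.
 * Returns 0 on success, -1 on error. The name is hypothetical. */
static int write_all(int fd, const char *buf, size_t n)
{
	while (n > 0) {
		ssize_t w = write(fd, buf, n);
		if (w < 0) {
			if (errno == EINTR)
				continue;   /* interrupted by a signal: try again */
			return -1;
		}
		buf += w;
		n -= (size_t) w;
	}
	return 0;
}

/* Copy bytes from in to out until read() reports end of file.
 * Returns the number of bytes copied, or -1 on error. Name is hypothetical. */
long copy_fd(int in, int out)
{
	char buf[4096];
	long total = 0;
	for (;;) {
		ssize_t r = read(in, buf, sizeof buf);
		if (r == 0)
			return total;   /* end of file */
		if (r < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (write_all(out, buf, (size_t) r) != 0)
			return -1;
		total += r;
	}
}
```

For instance, copy_fd(0, 1) copies standard input to standard output, which is roughly what the cat command does when given no arguments.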

Creating and Destroying Processes

pid_t fork( void )
Creates a new process with the process id pid. That process will initially appear identical to the parent process except for the value returned by fork(). The child will get a return value of zero, while the parent will get pid, the identity of the child.

All files that were open in the parent process at the time of the fork will be open in the child, and these files will all be shared between the parent and child.

All memory that was mapped into the address space of the parent will be mapped into the address space of the child. The read-only program segment will be shared. The read-write data and the stack segments will be copied, so that the parent and child have separate copies of all variables. Any files mapped into the address space by mmap will be mapped into both address spaces. If the file was inserted into the address space with the MAP_SHARED attribute, then the segment mapped to that file will be fully shared by the parent and child.

void _exit(int status)
Terminates the process. Most applications call the standard library routine exit(), without the leading underscore, because this does nice things like cleanly closing open files before it calls _exit(). In either case, the low order 8 bits of status are used as the exit status of the process, usually either EXIT_SUCCESS (zero) or EXIT_FAILURE (some nonzero value).

pid_t wait(int * status)
When a process calls wait() it is blocked until either some child process terminates or until a signal is delivered indicating that something else occurred. In the event that a child terminates, the process ID of the child is returned. If status is not NULL, the integer variable pointed to by status is set to hold the 8-bit exit status of the child, plus a considerable amount of other information coded in the other bits of the value. The macro WEXITSTATUS(status) returns the 8-bit exit status extracted from this value. Other macros are available to determine exactly how the child process terminated or was terminated.
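The interplay of fork(), _exit() and wait() can be sketched as below. The function name child_exit_status is hypothetical. The child changes its own copy of a variable and exits; the parent collects the exit status and confirms that its own copy of the variable is untouched. For simplicity, the sketch treats any failure of wait(), including an interruption by a signal, as an error.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that changes its own copy of a variable and exits with code.
 * Returns the child's exit status as seen by the parent, or -1 on failure.
 * The function name is hypothetical. */
int child_exit_status(int code)
{
	int value = 1;
	int status;
	pid_t done;
	pid_t pid = fork();

	if (pid < 0)
		return -1;       /* fork failed, no child exists */
	if (pid == 0) {
		value = 2;       /* changes only the child's copy */
		_exit(code);     /* only the low order 8 bits reach the parent */
	}
	do {
		done = wait(&status);
	} while (done != pid && done != -1);

	if (done != pid || !WIFEXITED(status) || value != 1)
		return -1;
	return WEXITSTATUS(status);
}
```

Because only the low order 8 bits survive, child_exit_status(259) yields 3, the same result as child_exit_status(3).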

An Example

This little chunk of C code allows the running program to execute a program called myprogram and wait for it to terminate before the calling program continues.

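{
	/* assumes <sys/types.h>, <sys/wait.h>, <unistd.h> and <stdlib.h> */
	pid_t pid;
	char * const argv[] = { "myprogram", NULL }; /* argv[0] is the program name */
	char * const envp[] = { NULL };              /* empty environment */

	if ((pid = fork()) > 0) {
		/* parent process: pid is the process ID of the child */

		/* wait for that particular child to terminate */
		while (pid != wait( NULL )) /* do nothing */;
	} else if (pid == 0) {
		/* child process: fork() returned zero */
		(void) execve( "myprogram", argv, envp );
		_exit( EXIT_FAILURE ); /* reached only if execve() failed */
	} else {
		/* fork() failed, so no child was created */
	}
}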
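The same pattern can be packaged as a complete function that also reports how the child finished. This is a sketch under stated assumptions: the function name run_shell_command is hypothetical, and /bin/sh is assumed to exist and to accept the -c option, which stands in here for myprogram so that the example can run anywhere.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run "/bin/sh -c command" in a child process and wait for it.
 * Returns the child's exit status, or -1 if anything went wrong.
 * The function name and the use of /bin/sh are assumptions. */
int run_shell_command(const char *command)
{
	char *const argv[] = { "sh", "-c", (char *) command, NULL };
	char *const envp[] = { NULL };   /* empty environment */
	int status;
	pid_t done;
	pid_t pid = fork();

	if (pid < 0)
		return -1;                   /* fork failed, no child exists */
	if (pid == 0) {
		execve("/bin/sh", argv, envp);
		_exit(127);                  /* reached only if execve() failed */
	}
	do {
		done = wait(&status);        /* block until some child terminates */
	} while (done != pid && done != -1);

	if (done != pid || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

The call to _exit() after execve() matters: without it, a child whose execve() failed would fall through and continue running a second copy of the parent's code.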