`J.BACH
`
`PRENTICE-HALL SOFTWARE SERIES
`
`CSCO-1013
`Page 1 of 8
`
`
`
`AT&T
`
`THE DESIGN OF THE UNIX®
`OPERATING SYSTEM
`Maurice J. Bach
`
`PRENTICE-HALL, INC., Englewood Cliffs, New Jersey 07632
`
`Page 2 of 8
`
`
`
`- - - - - - - -
`
`Copyright© 1986 by Bell Telephone Laboratories, Incorporated.
`
`Published by Prentice-Hall, Inc.
`A division of Simon & Schuster
`Englewood Cliffs, New Jersey 07632
`Prentice-Hall Software Series
`Brian W. Kernighan, Advisor
`Cover Design Consultant: Sarah Bach
`
`UNIX® is a registered trademark of AT&T.
`DEC, PDP, and VAX are trademarks of Digital Equipment Corp.
`Series 32000 is a trademark of National Semiconductor Corp.
`®Ada is a registered trademark of the U.S. Government (Ada Joint Program Office).
`UNIVAC is a trademark of Sperry Corp.
`This document was set on an AUTOLOGIC, Inc. APS-5 phototypesetter driven by the
`TROFF formatter operating under the UNIX system on an AT&T 3B20 computer.
`
`The Publisher offers discounts on this book when ordered in bulk
`quantities. For more information write:
`
`Special Sales/College Marketing
`Prentice-Hall, Inc.
`College Technical and Reference Division
`Englewood Cliffs, New Jersey 07632
`
`The author and publisher of this book have used their best efforts in preparing
`this book. These efforts include the development, research, and testing of the
`theories and programs to determine their effectiveness. The author and
`publisher make no warranty of any kind, expressed or implied, with regard to
`these programs or the documentation contained in this book. The author and
`publisher shall not be liable in any event for incidental or consequential
`damages in connection with, or arising out of, the furnishing, performance, or
`use of these programs.
`
`All rights reserved. No part of this book may be
`reproduced, in any form or by any means,
`without permission in writing from the publisher.
`
`Printed in the United States of America
`
`20 19 18 17 16 15 14 13 12 11
`
`ISBN 0-13-201799-7 025
`
`Prentice-Hall International (UK) Limited, London
`Prentice-Hall of Australia Pty. Limited, Sydney
`Prentice-Hall Canada Inc., Toronto
`Prentice-Hall Hispanoamericana, S.A., Mexico
`Prentice-Hall of India Private Limited, New Delhi
`Prentice-Hall of Japan, Inc., Tokyo
`Prentice-Hall of Southeast Asia Pte. Ltd., Singapore
`Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
`
`Page 3 of 8
`
`
`
`22
`
`INTRODUCTION TO THE KERNEL
`
`There are several forms of interprocess communication, ranging from asynchronous
`signaling of events to synchronous transmission of messages between processes.
`Finally, the hardware control is responsible for handling interrupts and for
`communicating with the machine. Devices such as disks or terminals may interrupt
`the CPU while a process is executing. If so, the kernel may resume execution of
`the interrupted process after servicing the interrupt: Interrupts are not serviced by
`special processes but by special functions in the kernel, called in the context of the
`currently running process.
`
`2.2 INTRODUCTION TO SYSTEM CONCEPTS
`
`This section gives an overview of some major kernel data structures and describes
`the function of modules shown in Figure 2.1 in more detail.
`
`2.2.1 An Overview of the File Subsystem
`
`The internal representation of a file is given by an inode, which contains a
`description of the disk layout of the file data and other information such as the file
`owner, access permissions, and access times. The term inode is a contraction of the
`term index node and is commonly used in literature on the UNIX system. Every
`file has one inode, but it may have several names, all of which map into the inode.
`Each name is called a link. When a process refers to a file by name, the kernel
`parses the file name one component at a time, checks that the process has
`permission to search the directories in the path, and eventually retrieves the inode
`for the file. For example, if a process calls
`
`open ("/fs2/mjb/rje/sourcefile", 1);
`
`the kernel retrieves the inode for "/fs2/mjb/rje/sourcefile". When a process
`creates a new file, the kernel assigns it an unused inode. Inodes are stored in the
`file system, as will be seen shortly, but the kernel reads them into an in-core' inode
`table when manipulating files.
`The kernel contains two other data structures, the file table and the user file
`descriptor table. The file table is a global kernel structure, but the user file
`descriptor table is allocated per process. When a process opens or creats a file, the
`kernel allocates an entry from each table, corresponding to the file's inode. Entries
`in the three structures - user file descriptor table, file table, and inode table -
`maintain the state of the file and the user's access to it. The file table keeps track
`of the byte offset in the file where the user's next read or write will start, and the
`
`1. The term core refers to primary memory of a machine, not to hardware technology.
`
`Page 4 of 8
`
`
`
`2.2
`
`INTRODUCTION TO SYSTEM CONCEPTS
`
`23
`
`User
`File Descriptor
`Table
`
`File
`Table
`
`I node
`Table
`
`-- ' - - :: ::
`
`...
`
`Figure 2.2. File Descriptors, File Table, and Inode Table
`
`access rights allowed to the opening process. The user file descriptor table
`identifies all open files for a process. Figure 2.2 shows the tables and their
`relationship to each other. The kernel returns a file descriptor for the open and
`creat system calls, which is an index into the user file descriptor table. When
`executing read and write system calls, the kernel uses the file descriptor to access
`the user file descriptor table, follows pointers to the file table and inode table
`entries, and, from the inode, finds the data in the file. Chapters 4 and 5 describe
`these data structures in great detail. For now, suffice it to say that use of three
`tables allows various degrees of sharing access to a file.
`The UNIX system keeps regular files and directories on block devices such as
`tapes or disks. Because of the difference in access time between the two, few, if
`any, UNIX system installations use tapes for their file systems. In coming years,
`diskless work stations will be common, where files are located on a remote system
`and accessed via a network (see Chapter 13). For simplicity, however, the ensuing
`text assumes the use of disks. An installation may have several physical disk units,
`each containing one or more file systems. Partitioning a disk into several file
`systems makes it easier for administrators to manage the data stored there. The
`kernel deals on a logical level with file systems rather than with disks, treating each
`one as a logical device identified by a logical device number. The conversion
`between logical device (file system) addresses and physical device (disk) addresses
`is done by the disk driver. This book will use the term device to mean a logical
`device unless explicitly stated otherwise.
`A file system consists of a sequence of logical blocks, each containing 512, 1024,
`2048, or any convenient multiple of 512 bytes, depending on
`the system
`implementation. The size of a logical block is homogeneous within a file system but
`may vary between different file systems in a system configuration. Using large
`logical blocks increases the effective data transfer rate between disk and memory,
`
`Page 5 of 8
`
`
`
`24
`
`INTRODUCTION TO THE KERNEL
`
`because the kernel can transfer more data per disk operation and therefore make
`fewer time-consuming operations. For example, reading lK bytes from a disk in
`one read operation is faster than reading 512 bytes twice. However, if a logical
`block is too large, effective storage capacity may drop, as will be shown in Chapter
`5. For simplicity, this book will use the term "block" to mean a logical block, and
`it will assume that a logical block contains lK bytes of data unless explicitly stated
`otherwise.
`
`boot
`block
`
`super
`block
`
`··········--------------i
`___ _._ ___ ........ ,__J
`data blocks
`inode list
`
`Figure 2.3. File System Layout
`
`A file system has the following structure (Figure 2.3).
`
`• The boot block occupies the beginning of a file system, typically the first sector,
`and may contain the bootstrap code that is read into the machine to boot, or
`initialize, the operating system. Although only one boot block is needed to boot
`the system, every file system has a (possibly empty) boot block.
`• The super block describes the state of a file system -
`how large it is, how
`many files it can store, where to find free space on the file system, and other
`information.
`• The inode list is a list of inodes that follows the super block in the file system.
`Administrators specify the size of the inode list when configuring a file system.
`The kernel references in ode!? by index into the in ode list. One in ode is the root
`inode of the file system: it ig the inode by which the directory structure of the
`file system is accessible after executibn of the mount system call (Section 5.14).
`• The data blocks start at the end of the inode list and contain file data and
`administrative data. An allocated data block can belong to one and only one
`file in the file system.
`
`2.2.2 Processes
`It describes the
`This section examines the process subsystem more closely.
`structure of a process and some process data structures used for memory
`management. Then it gives a preliminary view of the process state diagram and
`considers various issues involved in some state transitions.
`A process is the execution of a program and consists of a pattern of bytes that
`the CPU interprets as machine instructions (called "text"), data, and stack. Many
`processes appear to execute simultaneously as the kernel schedules them for
`execution, and several processes may be instances of one program. A process
`
`Page 6 of 8
`
`
`
`4.0
`
`INTERNAL REPRESENTATION OF FILES
`
`61
`
`:r
`
`Lower Level File System Algorithms
`
`namei
`
`iget
`
`iput
`
`bmap
`
`alloc
`
`free
`
`ialloc ifree
`
`buffer allocation algorithms
`
`getblk
`
`brelse
`
`bread
`
`breada bwrite
`
`Figure 4.1. File System Algorithms
`
`name to an inode, using the algorithms iget, iput, and bmap. Algorithms a//oc and
`free allocate and free disk blocks for files, and algorithms ia//oc and ifree assign
`and free inodes for files.
`
`4.1 INODES
`
`4.1.1 Definition
`
`!nodes exist in a static form on disk, and the kernel reads them into an in-core
`inode to manipulate them. Disk inodes consist of the following fields:
`
`• File owner identifier. Ownership is divided between an individual owner and a
`"group" owner and defines the set of users who have access rights to a file. The
`superuser has access rights to all files in the system.
`• File type. Files may be of type regular, directory, character or block special, or
`FIFO (pipes).
`• File access permissions. The system protects files according to three classes:
`the owner and the group owner of the file, and other users; each class has access
`rights to read, write and execute the file, which can be set individually. Because
`directories cannot be executed, execution permission for a directory gives the
`right to search the directory for a file name.
`• File access times, giving the time the file was last modified, when it was last
`accessed, and when the inode was last modified.
`
`Page 7 of 8
`
`
`
`62
`
`INTERNAL REPRESENTATION OF FILES
`
`• Number of links to the file, representing the number of names the file has in the
`directory hierarchy. Chapter 5 explains file links in detail.
`• Table of contents for the disk addresses of data in a file. Although users treat
`the data in a file as a logical stream of bytes, the kernel saves the data in
`discontiguous disk blocks. The inode identifies the disk blocks that contain the
`file's data.
`• File size. Data in a file is addressable by the number of bytes from the
`beginning of the file, starting from byte offset 0, and the file size is 1 greater
`than the highest byte offset of data in the file. For example, if a user creates a
`file and writes only 1 byte of data at byte offset 1000 in the file, the size of the
`file is 1001 bytes.
`The inode does not specify the path name(s) that access the file.
`
`owner mjb
`
`group os
`
`type regular file
`
`perms rwxr-xr-x
`
`accessed Oct 23 1984 1:45 P.M.
`
`modified Oct 22 1984 10:30 A.M.
`inode Oct 23 1984 1:30 P.M.
`
`size 6030 bytes
`disk addresses
`
`Figure 4.2. Sample Disk lnode
`
`Figure 4.2 shows the disk inode of a sample file. This inode
`is that of a
`regular file owned by "mjb," which contains 6030 bytes. The system permits
`"mjb" to read, write, or execute the file; members of the group "os" and all other
`users can only read or execute the file, not write it. The last time anyone read the
`file was on October 23, 1984, at 1:45 in the afternoon, and the last time anyone
`wrote the file was on October 22, 1984, at 10:30 in the morning. The inode was
`last changed on October 23, 1984, at 1:30 in the afternoon, although the data in
`the file was not written at that time. The kernel encodes the above information in
`the inode. Note the distinction between writing the contents of an inode to disk
`and writing the contents of a file to disk. The contents of a file change only when
`writing it. The contents of an inode change when changing the contents of a file.or
`when changing its owner, permission, or link settings. Changing the contents of a
`
`Page 8 of 8
`
`