`
`539
`
`9.8 I
`
`Designing an 1/0 System
`
The art of I/O is finding a design that meets goals for cost and variety of devices
while avoiding bottlenecks to I/O performance. This means that components
must be balanced between main memory and the I/O device, because performance
(and hence effective cost/performance) can only be as good as the
weakest link in the I/O chain. The architect must also plan for expansion so that
customers can tailor the I/O to their applications. This expandability, both in
numbers and types of I/O devices, has its costs in longer backplanes, larger
power supplies to support I/O devices, and larger cabinets.
In designing an I/O system, analyze performance, cost, and capacity using
varying I/O connection schemes and different numbers of I/O devices of each
type. Here is a series of six steps to follow in designing an I/O system. The
answers in each step may be dictated by market requirements or simply by
cost/performance goals.
`
1. List the different types of I/O devices to be connected to the machine, or
list the standard buses that the machine will support.
`
2. List the physical requirements for each I/O device. This includes volume,
power, connectors, bus slots, expansion cabinets, and so on.
`
3. List the cost of each I/O device, including the portion of the cost of any
controller needed for this device.
`
4. Record the CPU resource demands of each I/O device. This should include:

Clock cycles for instructions used to initiate an I/O, to support operation
of an I/O device (such as handling interrupts), and to complete an I/O

CPU clock stalls due to waiting for I/O to finish using the memory, bus, or
cache

CPU clock cycles to recover from an I/O activity, such as a cache flush
`
5. List the memory and I/O bus resource demands of each I/O device. Even
when the CPU is not using memory, the bandwidths of main memory and the
I/O bus are limited.
`
6. The final step is establishing the performance of the different ways to organize
these I/O devices. Performance can only be properly evaluated with
simulation, though it may be estimated using queuing theory.
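The bookkeeping in steps 1 through 5 can be sketched as a small table-driven worksheet. Every device name and number below is an illustrative placeholder, not a value from the text:

```python
# Hypothetical worksheet for the design steps above; all figures are
# illustrative placeholders, not values from the text.
io_devices = [
    # steps 1-3: device type, physical needs (bus slots), cost per device;
    # step 4: CPU clock cycles spent per I/O (initiate, interrupts, recover);
    # step 5: bytes moved over memory and the I/O bus per I/O
    {"type": "disk",    "slots": 1, "cost": 3000, "cycles_per_io": 12_000,
     "bytes_per_io": 8_192, "ios_per_sec": 40},
    {"type": "network", "slots": 1, "cost": 1500, "cycles_per_io": 6_000,
     "bytes_per_io": 1_500, "ios_per_sec": 100},
]

def fits(devices, slots_available):
    """Step 2: do the devices fit in the available bus slots?"""
    return sum(d["slots"] for d in devices) <= slots_available

def total_cost(devices):
    """Step 3: device (plus controller) cost."""
    return sum(d["cost"] for d in devices)

def demanded_bus_mb_per_sec(devices):
    """Step 5: aggregate bus bandwidth the devices demand."""
    return sum(d["bytes_per_io"] * d["ios_per_sec"] for d in devices) / 1e6

print(fits(io_devices, 20), total_cost(io_devices),
      f"{demanded_bus_mb_per_sec(io_devices):.2f} MB/sec")
```

Step 6, establishing performance, is what the examples that follow work through by hand.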
`
You then select the best organization, given your performance and cost goals.
Cost and performance goals affect the selection of the I/O scheme and the
physical design. Performance can be measured either in megabytes per second
or in I/Os per second, depending on the needs of the application. For high
performance, the only limits should be the speed of the I/O devices, the number
of I/O devices, and the speed of memory and CPU. For low cost, the only expenses should be those
`
`
for the I/O devices themselves and for cabling to the CPU. Cost/performance
design, of course, tries for the best of both worlds.
To make these ideas clearer, let's go through several examples.
`
`Example
`
First, let's look at the impact on the CPU of reading a disk page directly into the
cache. Make the following assumptions:

Each page is 8 KB and the cache-block size is 16 bytes.

The addresses corresponding to the new page are not in the cache.

The CPU will not access any of the data in the new page.

90% of the blocks that were displaced from the cache will be read in again,
and each will cause a miss.

The cache uses write back, and 50% of the blocks are dirty on average.

The I/O system buffers a full cache block before writing to the cache (this is
called a speed-matching buffer, matching the transfer bandwidth of the I/O
system and memory).

The accesses and misses are spread uniformly to all cache blocks.

There is no other interference between the CPU and I/O for the cache slots.

There are 15,000 misses every one million clock cycles when there is no I/O.

The miss penalty is 15 clock cycles, plus 15 more cycles to write the block if
it was dirty.
`
Assuming one page is brought in every one million clock cycles, what is the
impact on performance?

Answer

Each page fills 8192/16 or 512 blocks. I/O transfers do not cause cache misses
on their own because entire cache blocks are transferred. However, they do
displace blocks already in the cache. If half of the displaced blocks are dirty, it
takes 256*15 clock cycles to write them back to memory. There are also misses
from 90% of the blocks displaced in the cache because they are referenced later,
adding another 90%*512, or 461 misses. Since this data was placed into the
cache from the I/O system, all these blocks are dirty and will need to be written
back when replaced. Thus, the total is 256*15 + 461*30 more clock cycles than
the original 1,000,000 + 15,000*15. This turns into a 1.4% decrease in
performance:

(256*15 + 461*30) / (1,000,000 + 15,000*15) = 17,670/1,225,000 = 0.014
`
Now let's look at the cost/performance of different I/O organizations. A
simple way to perform this analysis is to look at maximum throughput, assuming
`
`
`that resources can be used at 100% of their maximum rate without side effects
`from interference. A later example takes a more realistic view.
`
`Example
`
Given the following performance and cost information:

a 50-MIPS CPU costing $50,000

an 8-byte-wide memory with a 200-ns cycle time

an 80-MB/sec I/O bus with room for 20 SCSI buses and controllers

SCSI buses that can transfer 4 MB/sec and support up to 7 disks per bus
(these are also called SCSI strings)

a $2500 SCSI controller that adds 2 milliseconds (ms) of overhead to perform
a disk I/O

an operating system that uses 10,000 CPU instructions for a disk I/O

a choice of a large disk containing 4 GB or a small disk containing 1 GB,
each costing $3 per MB

both disks rotate at 3600 RPM, have a 12-ms average seek time, and can
transfer 2 MB/sec

the storage capacity must be 100 GB, and

the average I/O size is 8 KB
`
Evaluate the cost per I/O per second (IOPS) of using small or large drives.
Assume that every disk I/O requires an average seek and average rotational
delay. Use the optimistic assumption that all devices can be used at 100% of
capacity and that the workload is evenly divided among all disks.
`
`Answer
`
I/O performance is limited by the weakest link in the chain, so we evaluate the
maximum performance of each link in the I/O chain for each organization to
determine the maximum performance of that organization.
Let's start by calculating the maximum number of IOPS for the CPU, main
memory, and I/O bus. The CPU I/O performance is determined by the speed of
the CPU and the number of instructions to perform a disk I/O:
`
`.
`50MIPS
`Maximum IOPS for CPU= lOOOO .
`.
`l/O = 5000
`mstructions per
`
The maximum performance of the memory system is determined by the memory
cycle time, the width of the memory, and the size of the I/O transfers:

Maximum IOPS for main memory = (1/200 ns) * 8 bytes / 8 KB per I/O ≈ 5000
`
`
The I/O bus maximum performance is limited by the bus bandwidth and the size
of the I/O:

Maximum IOPS for the I/O bus = 80 MB/sec / 8 KB per I/O ≈ 10,000
`
Thus, no matter which disk is selected, the CPU and main memory limit the
maximum performance to no more than 5000 IOPS.
Now it's time to look at the performance of the next link in the I/O chain, the
SCSI controllers. The time to transfer 8 KB over the SCSI bus is
`
SCSI bus transfer time = 8 KB / 4 MB/sec = 2 ms

Adding the 2-ms SCSI controller overhead means 4 ms per I/O, making the
maximum rate per controller
`
Maximum IOPS per SCSI controller = 1 / 4 ms = 250 IOPS
`
All the organizations will use several controllers, so 250 IOPS is not the limit for
the whole system.
The final link in the chain is the disks themselves. The time for an average
disk I/O is

I/O time = 12 ms + 0.5 / 3600 RPM + 8 KB / 2 MB/sec = 12 + 8.3 + 4 = 24.3 ms
`
so the disk performance is

Maximum IOPS (using average seeks) per disk = 1 / 24.3 ms ≈ 41 IOPS
`
The number of disks in each organization depends on the size of each disk: 100
GB can be either 25 4-GB disks or 100 1-GB disks. The maximum number of
I/Os for all the disks is:

Maximum IOPS for 25 4-GB disks = 25 * 41 = 1025
Maximum IOPS for 100 1-GB disks = 100 * 41 = 4100
`
Thus, provided there are enough SCSI strings, the disks become the new limit to
maximum performance: 1025 IOPS for the 4-GB disks and 4100 for the 1-GB
disks.
While we have determined the performance of each link of the I/O chain, we
still have to determine how many SCSI buses and controllers to use and how
many disks to connect to each controller, as this may further limit maximum
performance. The I/O bus is limited to 20 SCSI controllers, and the SCSI
`
`
standard limits disks to 7 per SCSI string. The minimum number of controllers
for the 4-GB disks is

Minimum number of SCSI strings for 25 4-GB disks = 25/7 = 3.6, or 4
`
and for 1-GB disks

Minimum number of SCSI strings for 100 1-GB disks = 100/7 = 14.3, or 15
`
`We can calculate the maximum IOPS for each configuration:
`Maximum IOPS for 4 SCSI strings = 4 * 250 = 1000 IOPS
`Maximum IOPS for 15 SCSI strings = 15 * 250 = 3750 IOPS
`
The maximum performance of this number of controllers is slightly lower
than the disk I/O throughput, so let's also calculate the number of controllers so
they don't become a bottleneck. One way is to find the number of disks they can
support per string:

Number of disks per SCSI string at full bandwidth = 250/41 = 6.1, or 6
`
`and then calculate the number of strings:
`
Number of SCSI strings for full-bandwidth 4-GB disks = 25/6 = 4.2, or 5

Number of SCSI strings for full-bandwidth 1-GB disks = 100/6 = 16.7, or 17
`
This establishes the performance of four organizations: 25 4-GB disks with 4
or 5 SCSI strings and 100 1-GB disks with 15 to 17 SCSI strings. The maximum
performance of each option is limited by the bottleneck (in boldface):

4-GB disks, 4 strings = Min(5000, 5000, 10000, 1025, 1000) = 1000 IOPS

4-GB disks, 5 strings = Min(5000, 5000, 10000, 1025, 1250) = 1025 IOPS

1-GB disks, 15 strings = Min(5000, 5000, 10000, 4100, 3750) = 3750 IOPS

1-GB disks, 17 strings = Min(5000, 5000, 10000, 4100, 4250) = 4100 IOPS
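The five-way Min above is just a minimum over per-link rates. A sketch using this example's figures (the text's rounded limits come out exactly if 8 KB is treated as 8000 bytes):

```python
# Maximum IOPS of each link in the I/O chain, per the example above.
CPU_IOPS = 50e6 / 10_000            # 50 MIPS / 10,000 instr per I/O = 5000
MEM_IOPS = (1 / 200e-9) * 8 / 8000  # 40 MB/sec of memory bandwidth = 5000
BUS_IOPS = 80e6 / 8000              # 80-MB/sec I/O bus = 10000
DISK_IOPS = 41                      # 1 / 24.3 ms, rounded as in the text
STRING_IOPS = 250                   # 1 / (2 ms transfer + 2 ms overhead)

def max_iops(n_disks, n_strings):
    """The weakest link limits the whole organization."""
    return min(CPU_IOPS, MEM_IOPS, BUS_IOPS,
               n_disks * DISK_IOPS, n_strings * STRING_IOPS)

for disks, strings in [(25, 4), (25, 5), (100, 15), (100, 17)]:
    print(f"{disks} disks, {strings} strings: {max_iops(disks, strings):.0f} IOPS")
```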
`
`We can now calculate the cost for each organization:
`
`
4-GB disks, 4 strings = $50,000 + 4*$2,500 + 25*(4096*$3) = $367,200

4-GB disks, 5 strings = $50,000 + 5*$2,500 + 25*(4096*$3) = $369,700

1-GB disks, 15 strings = $50,000 + 15*$2,500 + 100*(1024*$3) = $394,700

1-GB disks, 17 strings = $50,000 + 17*$2,500 + 100*(1024*$3) = $399,700
`
Finally, the cost per IOPS for each of the four configurations is $367, $361,
$105, and $97, respectively. Calculating the maximum number of average I/Os
per second, assuming 100% utilization of the critical resources, the best
cost/performance is the organization with the small disks and the largest number
of controllers. The small disks have 3.4 to 3.8 times better cost/performance than
the large disks in this example. The only drawback is that the larger number of
disks will affect system availability unless some form of redundancy is added
(see pages 520-521).
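The cost totals and cost-per-IOPS figures above can be reproduced mechanically:

```python
# Cost and cost per IOPS for the four organizations above.
CPU, CONTROLLER, PER_MB = 50_000, 2_500, 3

orgs = [  # (disk MB, number of disks, strings, delivered IOPS)
    (4096, 25, 4, 1000), (4096, 25, 5, 1025),
    (1024, 100, 15, 3750), (1024, 100, 17, 4100),
]
for mb, disks, strings, iops in orgs:
    cost = CPU + strings * CONTROLLER + disks * mb * PER_MB
    print(f"{disks} x {mb // 1024}-GB disks, {strings} strings: "
          f"${cost:,} and ${cost / iops:.0f} per IOPS")
```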
`
The example above assumed that resources can be used at 100%. It is
instructive to see what the bottleneck is in each organization.
`
`Example
`
`For the organizations in the last example, calculate the percentage of utilization
`of each resource in the computer system.
`
`Answer
`
`Figure 9.31 gives the answer.
`
Resource     4-GB disks,   4-GB disks,   1-GB disks,   1-GB disks,
             4 strings     5 strings     15 strings    17 strings
CPU          20%           21%           75%           82%
Memory       20%           21%           75%           82%
I/O bus      10%           10%           38%           41%
SCSI buses   100%          82%           100%          96%
Disks        98%           100%          91%           100%

FIGURE 9.31 The percentage of utilization of each resource given the four
organizations in the previous example. Either the SCSI buses or the disks are the
bottleneck.
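Each entry in Figure 9.31 is just the delivered IOPS divided by that resource's maximum IOPS, which a few lines can reproduce:

```python
# Reproducing Figure 9.31: utilization = delivered IOPS / resource maximum.
for disks, strings, iops in [(25, 4, 1000), (25, 5, 1025),
                             (100, 15, 3750), (100, 17, 4100)]:
    util = {
        "CPU":        iops / 5000,
        "Memory":     iops / 5000,
        "I/O bus":    iops / 10000,
        "SCSI buses": iops / (strings * 250),
        "Disks":      iops / (disks * 41),
    }
    line = " ".join(f"{r}={u:.0%}" for r, u in util.items())
    print(f"{disks} disks, {strings} strings: {line}")
```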
`
In reality, buses cannot deliver close to 100% of bandwidth without a severe
increase in latency and reduction in throughput due to contention. A variety of
rules of thumb have evolved to guide I/O designs:
`
`
No I/O bus should be utilized more than 75% to 80%;
`
`No disk string should be utilized more than 40%;
`
`No disk arm should be seeking more than 60% of the time.
`
`Example
`
`Recalculate performance in the example above using these rules of thumb, and
`show the utilization of each component. Are there other organizations that
`follow these guidelines and improve performance?
`
`Answer
`
Figure 9.31 shows that the I/O bus is far below the suggested guidelines, so we
concentrate on the utilization of the seek time and of the SCSI buses. The
utilization of seek time per disk is

Seek utilization = Time of average seek / Time between I/Os
                 = 12 ms / (1/41 IOPS) = 12/24.4 ≈ 50%
`
which is below the rule of thumb. The biggest impact is on the SCSI bus:

Suggested IOPS per SCSI string = 1 / 4 ms * 40% = 100 IOPS
`
`With this data we can recalculate IOPS for each organization:
`
4-GB disks, 4 strings = Min(5000, 5000, 7500, 1025, 400) = 400 IOPS

4-GB disks, 5 strings = Min(5000, 5000, 7500, 1025, 500) = 500 IOPS

1-GB disks, 15 strings = Min(5000, 5000, 7500, 4100, 1500) = 1500 IOPS

1-GB disks, 17 strings = Min(5000, 5000, 7500, 4100, 1700) = 1700 IOPS
`
Under these assumptions, the small disks have about 3.0 to 4.2 times the
performance of the large disks.
Clearly, the string bandwidth is the bottleneck now. The number of disks per
string that would not exceed the guideline is

Number of disks per SCSI string at full bandwidth = 100/41 = 2.4, or 2
`
and the ideal number of strings is

Number of SCSI strings for full-bandwidth 4-GB disks = 25/2 = 12.5, or 13

Number of SCSI strings for full-bandwidth 1-GB disks = 100/2 = 50
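Applying these rules of thumb mechanically amounts to capping two of the links before taking the same minimum as before:

```python
# Recomputing maximum IOPS under the rules of thumb: the I/O bus is
# capped at 75%, and each SCSI string at 40% of its 250-IOPS maximum.
BUS_CAP = 0.75 * 10000       # 7500 IOPS
STRING_CAP = 0.40 * 250      # 100 IOPS per string

def throttled_iops(n_disks, n_strings):
    return min(5000, 5000, BUS_CAP, n_disks * 41, n_strings * STRING_CAP)

for disks, strings in [(25, 4), (25, 5), (100, 15), (100, 17)]:
    print(f"{disks} disks, {strings} strings: "
          f"{throttled_iops(disks, strings):.0f} IOPS")
```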
`
9.9 Putting It All Together: The IBM 3990 Storage Subsystem
`
The IBM 360/370 I/O architecture has evolved over a period of 25 years.
Initially, the I/O system was general purpose, and no special attention was paid
to any particular device. As it became clear that magnetic disks were the chief
consumers of I/O, the IBM 360 was tailored to support fast disk I/O. IBM's
dominant philosophy is to choose latency over throughput whenever it makes a
difference. IBM almost never uses a large buffer outside the CPU; their goal is
to set up a clear path from main memory to the I/O device so that when a device
is ready, nothing can get in the way. Perhaps IBM followed a corollary to the
quote on page 526: you can buy bandwidth, but you need to design for latency.
As a secondary philosophy, the CPU is unburdened as much as possible to allow
the CPU to continue with computation while others perform the desired I/O
activities.
The example for this section is the high-end IBM 3090 CPU and the 3990
Storage Subsystem. The IBM 3090, models 3090/100 to 3090/600, can contain
one to six CPUs. This 18.5-ns-clock-cycle machine has a 16-way interleaved
memory that can transfer eight bytes every clock cycle on each of two
(3090/100) or four (3090/600) buses. Each 3090 processor has a 64-KB, 4-way
set-associative, write-back cache, and the cache supports pipelined access taking
two cycles. Each CPU is rated about 30 IBM MIPS (see page 78), giving at
most 180 MIPS to the IBM 3090/600. Surveys of IBM mainframe installations
suggest a rule of thumb of about 4 GB of disk storage per MIPS of CPU power
(see Section 9.12).
`It is only fair warning to say that IBM terminology may not be self-evident,
`although the ideas are not difficult. Remember that this I/O architecture has
`evolved since 1964. While there may well be ideas that IBM wouldn't include if
`they were to start anew, they are able to make this scheme work, and make it
`work well.
`
The 3990 I/O Subsystem Data-Transfer Hierarchy and Control Hierarchy
`
The I/O subsystem is divided into two hierarchies:

1. Control: This hierarchy of controllers negotiates a path through a maze of
possible connections between the memory and the I/O device and controls
the timing of the transfer.

2. Data: This hierarchy of connections is the path over which data flows
between memory and the I/O device.

After going over each of the hierarchies, we trace a disk read to help understand
the function of each component.
For simplicity, we begin by discussing the data-transfer hierarchy, shown in
Figure 9.33 (page 548). This figure shows one section of the hierarchy that
contains up to 64 large IBM disks; using 64 of the recently announced IBM 3390
disks, this piece could connect to over one trillion bytes of storage! Yet this
`
96 channels to a 3090/600. Because they are "multiprogrammed," channels can
actually service several disks. For historical reasons, IBM calls this block
multiplexing.
Channels are connected to the 3090 main memory via two speed-matching
buffers, which funnel all the channels into a single port to main memory. Such
buffers simply match the bandwidth of the I/O device to the bandwidth of the
memory system. There are two 8-byte buffers per channel.
The next level down the data hierarchy is the storage director. This is an
intermediary device that allows the many channels to talk to many different I/O
devices. Four to sixteen channels go to the storage director depending on the
model, and two or four paths come out the bottom to the disks. These are called
two-path strings or four-path strings in IBM parlance. Thus, each storage
director can talk to any of the disks using one of the strings. At the top of each
string is the head of string, and all communication between disks and control
units must pass through it.
At the bottom of the datapath hierarchy are the disk devices themselves. To
increase availability, disk devices like the IBM 3380 provide four paths to
connect to the storage director; if one path fails, the device can still be
connected.
The redundant paths from main memory to the I/O device not only improve
availability but also can improve performance. Since the IBM philosophy is to
avoid large buffers, the path from the I/O device to main memory must remain
connected until the transfer is complete. If there were a single hierarchical path
from devices to the speed-matching buffer, only one I/O device in a subtree
could transfer at a time. Instead, the multiple paths allow multiple devices to
transfer simultaneously through the storage director and into memory.
The task of setting up the datapath connection falls to the control hierarchy.
Figure 9.34 shows both the control and data hierarchies of the 3990 I/O
subsystem. The new device is the I/O processor. The 3090 channel controller
and I/O processor are load/store machines similar to DLX, except that there is no
memory hierarchy. In the next subsection we see how the two hierarchies work
together to read a disk sector.
`
Tracing a Disk Read in the IBM 3990 I/O Subsystem
`
The 12 steps below trace a sector read from an IBM 3380 disk. Each of the 12
steps is labeled on a drawing of the full hierarchy in Figure 9.34 (page 550).

1. The user sets up a data structure in memory containing the operations that
should occur during this I/O event. This data structure is termed an I/O control
block, or IOCB, which also points to a list of channel control words (CCWs).
This list is called a channel program. Normally, the operating system provides
the channel program, but some users write their own. The operating system
checks the IOCB for protection violations before the I/O can continue.
`
`
Location   CCW            Comment
CCW1:      Define Extent  Transfers a 16-byte parameter to the storage director.
                          The channel sees this as a write data transfer.
CCW2:      Locate Record  Transfers a 16-byte parameter to the storage director as
                          above. The parameter identifies the operation (read in
                          this case) plus seek, sector number, and record ID. The
                          channel again sees this as a write data transfer.
CCW3:      Read Data      Transfers the desired disk data to the channel and then
                          to main memory.

FIGURE 9.35 A channel program to perform a disk read, consisting of three channel
command words (CCWs). The operating system checks the CCWs for virtual memory
access violations by simulating them. These instructions are linked so that only one START
SUBCHANNEL instruction is needed.
`
`3. The I/O processor uses the control wires of one of the channels to tell the
`storage director which disk is to be accessed and the disk address to be read. The
`channel is then released.
`
`4. The storage director sends a SEEK command to the head-of-string controller
`and the head-of-string controller connects to the desired disk, telling it to seek to
`the appropriate track, and then disconnects. The disconnect occurs between
`CCW2 and CCW3 in Figure 9.35.
`
Upon completion of these first four steps of the read, the arm on the disk
seeks the correct track on the correct IBM 3380 disk drive. Other I/O operations
can use the control and data hierarchy while this disk is seeking and the data is
rotating under the read head. The I/O processor thus acts like a multiprogrammed
system, working on other requests while waiting for an I/O event to
complete.
An interesting question arises: When there are multiple users of a single disk,
what prevents another seek from screwing up the works before the original
request can continue with the I/O event in progress? The answer is that the disk
appears busy to the programs in the 3090 between the time a START
SUBCHANNEL instruction starts a channel program (step 2) and the end of that
channel program. An attempt to execute another START SUBCHANNEL
instruction would receive busy status from the channel or from the disk device.
After both the seek completes and the disk rotates to the desired point relative
to the read head, the disk reconnects to a channel. To determine the rotational
position of the 3380 disk, IBM provides rotational positional sensing (RPS), a
feature that gives early warning of when the data will rotate under the read head.
IBM essentially extends the seek time to include some of the rotation time,
thereby tying up the datapath as little as possible. Then the I/O can continue:
`
`5. When the disk completes the seek and rotates to the correct position, it
`contacts the head-of-string controller.
`
`
6. The head-of-string controller looks for a free storage director to send the
signal that the disk is on the right track.

7. The storage director looks for a free channel so that it can use the control
wires to tell the I/O processor that the disk is on the right track.

8. The I/O processor simultaneously contacts the storage director and the I/O
device (the IBM 3380 disk) to give the OK to transfer data, and tells the channel
controller where to put the information in main memory when it arrives at the
channel.
`
There is now a direct path between the I/O device and memory, and the
transfer can begin:
`
`9. When the disk is ready to transfer, it sends the data at 3 megabytes per
`second over a bit-serial line to the storage director.
`
`10. The storage director collects 16 bytes in one of two buffers and sends the
`information on to the channel controller.
`
`11. The channel controller has a pair of 16-byte buffers per storage director and
`sends 16 bytes over a 3-MB or 4.5-MB per second, 8-bit-wide datapath to the
`speed-matching buffers.
`
12. The speed-matching buffers take the information coming in from all
channels. There are two 8-byte buffers per channel that send 8 bytes at a time to
the appropriate locations in main memory.
`
Since nothing is free in computer design, one might expect there to be a cost
in anticipating the rotational delay using RPS. Sometimes a free path cannot be
established in the time available due to other I/O activity, resulting in an RPS
miss. An RPS miss means the 3990 I/O Subsystem must either:

• Wait another full rotation (16.7 ms) before the data is back under the head,
or

• Break down the hierarchical datapath and start all over again!

Lots of RPS misses can ruin response times.
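The cost of RPS misses can be sketched with a simple expected-value model. The 24.3-ms base I/O time is borrowed from the earlier disk example, and the miss rates are purely illustrative:

```python
# Illustrative expected-value model of RPS misses: each miss costs one
# extra 16.7-ms rotation, and each retry may itself miss again, giving a
# geometric series with expected extra rotations p / (1 - p).
def expected_io_ms(base_ms, miss_rate, rotation_ms=16.7):
    return base_ms + rotation_ms * miss_rate / (1 - miss_rate)

for p in (0.0, 0.1, 0.2):
    print(f"miss rate {p:.0%}: {expected_io_ms(24.3, p):.1f} ms")
```

Even a modest miss rate noticeably stretches the average I/O time, which is why the utilization rules of thumb below matter.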
As mentioned above, the IBM I/O system evolved over many years, and
Figure 9.36 shows the change in response time for a few of those changes. The
first improvement concerns the path for data after reconnection. Before the
System/370-XA, the datapath through the channels and storage director (steps 5
through 12) had to be the same as the path taken to request the seek (steps 1
through 4). The 370-XA allows the path after reconnection to be different; this
option is called dynamic path reconnection (DPR). This change reduced the
time waiting for the channel path and the time waiting for disks (queueing
delay), yielding a reduction in the total average response time of 17%. The
second change in Figure 9.36 involved a new disk design. Improvements to the
`
• A large number of I/O devices at a time
`
`• High performance
`
`• Low latency
`
Substantial expandability and low latency are hard to get at the same time.
IBM channel-based systems achieve the third and fourth goals by utilizing
hierarchical datapaths to connect a large number of devices. The many devices
and parallel paths allow simultaneous transfers and, thus, high throughput. By
avoiding large buffers and providing enough extra paths to minimize delay from
congestion, channels offer low-latency I/O as well. To maximize use of the
hierarchy, IBM uses rotational positional sensing to extend the time that other
tasks can use the hierarchy during an I/O operation.
Therefore, a key to the performance of the IBM I/O subsystem is the number
of rotational positional misses and the congestion on the channel paths. A rule of
thumb is that the single-path channels should be no more than 30% utilized and
the quad-path channels should be no more than 60% utilized, or too many
rotational positional misses will result. This I/O architecture dominates the
industry, yet it would be interesting to see what, if anything, IBM would do
differently if given a clean slate.
`
9.10 Fallacies and Pitfalls
`
Fallacy: I/O plays a small role in supercomputer design.

The goal of the Illiac IV was to be the world's fastest computer. It may not have
achieved that goal, but it showed I/O as the Achilles' heel of high-performance
machines. In some tasks, more time was spent in loading data than in computing.
Amdahl's Law demonstrated the importance of high performance in all the parts
of a high-speed computer. (In fact, Amdahl made his comment in reaction to
claims for performance through parallelism made on behalf of the Illiac IV.) The
Illiac IV had a very fast transfer rate (60 MB/sec), but very small, fixed-head
disks (12-MB capacity). Since they were not large enough, more storage was
provided on a separate computer. This led to two ways of measuring I/O
overhead:
`
Warm start: Assuming the data is on the fast, small disks, I/O overhead is
the time to load the Illiac IV memory from those disks.

Cold start: Assuming the data is on the other computer, I/O overhead
must include the time to first transfer the data to the Illiac IV fast disks.
`
Figure 9.37 shows ten applications written for the Illiac IV in 1979. Assuming
warm starts, the supercomputer was busy 78% of the time and waiting for I/O
22% of the time; assuming cold starts, it was busy 59% of the time and waiting
for I/O 41% of the time.
`
`
`The most telling example comes from the IBM 360. It was decided that the
`performance of the ISAM system, an early database system, would improve if
`some of the record searching occurred in the disk controller itself. A key field
`was associated with each record, and the device searched each key as the disk
`rotated until it found a match. It would then transfer the desired record. For the
`disk to find the key, there had to be an extra gap in the track. This scheme is
`applicable to searches through indices as well as data.
The speed at which a track can be searched is limited by the speed of the disk
and the number of keys that can be packed on a track. On an IBM 3330 disk the
key is typically 10 characters, but the total gap between records is equivalent to 191
characters if there is a key. (The gap is only 135 characters if there is no key,
since there is no need for an extra gap for the key.) If we assume the data is also
10 characters and the track has nothing else on it, then a 13165-byte track can
contain

13165 / (191 + 10 + 10) = 62 key-data records
`
This performance is

16.7 ms (1 revolution) / 62 ≈ .27 ms per key search
`
In place of this scheme, we could put several key-data pairs in a single block and
have smaller inter-record gaps. Assuming there are 15 key-data pairs per block
and the track has nothing else on it, then

13165 / (135 + 15*(10+10)) = 13165/435 = 30 blocks of key-data pairs

The revised performance is then

16.7 ms (1 revolution) / (30*15) ≈ .04 ms per key search
Yet as CPUs got faster, the CPU time for a search became trivial. While the
strategy made early machines faster, programs that use the search-key operation
in the I/O processor run six times slower on today's machines!
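The track arithmetic in this pitfall is easy to verify:

```python
# IBM 3330 track-search arithmetic from the pitfall above.
TRACK, REV_MS = 13165, 16.7
KEY = DATA = 10

records = TRACK // (191 + KEY + DATA)        # one record per gap: 62
blocks = TRACK // (135 + 15 * (KEY + DATA))  # 15 pairs per block: 30

print(records, f"{REV_MS / records:.2f} ms per key search")
print(blocks, f"{REV_MS / (blocks * 15):.2f} ms per key search")
```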
`
Fallacy: Comparing the price of media versus the price of the packaged
system.

This happens most frequently when new memory technologies are compared to
magnetic disks. For example, comparing the DRAM-chip price to the magnetic-disk
packaged price in Figure 9.16 (page 518) suggests the difference is less than a
factor of 10, but it's much greater when the price of packaging DRAM is
included. A common mistake with removable media is to compare the media
cost without the cost of the drive needed to read the media. For example, optical media cost
`
`only $1 per MB in 1990, but including the cost of the optical drive may bring the
`price closer to $6 per MB.
`
`Fallacy: The time of an average seek of a disk in a computer system is the
`time for a seek of one-third the number of cylinders.
`
This fallacy comes from confusing the way manufacturers market disks with the
expected performance, and from the false assumption that seek times are linear in
distance. The one-third-distance rule of thumb comes from calculating the distance
of a seek from one random location to another random location, not including the
current cylinder and assuming there is a large number of cylinders. In the past,
manufacturers listed the seek of this distance to offer a consistent basis for
comparison. (As mentioned on page 516, today they calculate the "average" by
timing all seeks and dividing by the number.) Assuming (incorrectly) that seek
time is linear in distance, and using the manufacturer's reported minimum and
"average" seek times, a common technique to predict seek time is:
`
Time_seek = Time_minimum + (Distance / Distance_average) * (Time_average - Time_minimum)
`
`The fallacy concerning seek time is twofold. First, seek time is not linear
`with distance; the arm must accelerate to overcome inertia, reach its maximum
`traveling speed, decelerate as it reaches the requested position, and then wait to
allow the arm to stop vibrating (settle time). Moreover, in recent disks the
arm must sometimes pause to control vibrations. Figure 9.38 (page 558)
`plots time versus seek distance for an example disk. It also shows the error in
`the simple seek-time formula above. For short seeks, the acceleration phase
`plays a larger role than the maximum traveling speed, and this phase is typically
`modeled as the square root of the distance. Figure 9.39 (page 558) shows
`accurate formulas used to model the seek time versus distance for two disks.
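The two kinds of model can be contrasted in a few lines. The linear interpolation follows the formula above; the coefficients in the square-root model are illustrative placeholders, not the fitted values from Figure 9.39:

```python
import math

def naive_seek_ms(distance, t_min, t_avg, d_avg):
    """The fallacious linear interpolation from the formula above."""
    return t_min + (distance / d_avg) * (t_avg - t_min)

def sqrt_seek_ms(distance, a=0.3, b=0.003, c=2.0):
    """Short seeks are acceleration-dominated, so a square-root term
    dominates; a, b, c are illustrative per-drive coefficients, not
    measured values."""
    return a * math.sqrt(distance) + b * distance + c

for d in (10, 100, 1000):
    print(d, f"{naive_seek_ms(d, 3.0, 12.0, 300):.1f} ms",
          f"{sqrt_seek_ms(d):.1f} ms")
```

The two models diverge most for short seeks, which, as the locality data below shows, are exactly the common case.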
The second problem is that the average in the product specification would only
be true if there were no locality to disk activity. Fortunately, there is both temporal
and spatial locality (page 403 in Chapter 8): disk blocks get used more than once,
and disk blocks near the current cylinder are more likely to be used than those
`farther away. For example, Figure 9.40 (page 559) shows sample measurements
`of seek distances for two workloads: a UNIX timesharing workload and a
`business-processing workload. Notice the high percentage of disk accesses to the
`same cylinder, labeled distance 0 in the graphs, in both workloads.
`Thus, this fallacy couldn't be more misleading. The Exercises debunk this
`fallacy in more