Central processing unit
From Wikipedia, the free encyclopedia

"CPU" redirects here. For other uses, see CPU (disambiguation).

A Central Processing Unit (CPU) is a machine that can execute computer programs. This broad definition can easily be applied to many early computers that existed long before the term "CPU" ever came into widespread usage. The term itself and its initialism have been in use in the computer industry at least since the early 1960s (Weik 1961). The form, design and implementation of CPUs have changed dramatically since the earliest examples, but their fundamental operation has remained much the same.

Early CPUs were custom-designed as a part of a larger, sometimes one-of-a-kind, computer. However, this costly method of designing custom CPUs for a particular application has largely given way to the development of mass-produced processors that are suited for one or many purposes. This standardization trend generally began in the era of discrete transistor mainframes and minicomputers and has rapidly accelerated with the popularization of the integrated circuit (IC). The IC has allowed increasingly complex CPUs to be designed and manufactured to tolerances on the order of nanometers. Both the miniaturization and standardization of CPUs have increased the presence of these digital devices in modern life far beyond the limited application of dedicated computing machines. Modern microprocessors appear in everything from automobiles to cell phones to children's toys.

Die of an Intel 80486DX2 microprocessor (actual size: 12×6.75 mm) in its packaging.
Contents

1 History of CPUs
    1.1 Discrete transistor and IC CPUs
    1.2 Microprocessors
2 CPU operation
3 Design and implementation
    3.1 Integer range
    3.2 Clock rate
    3.3 Parallelism
        3.3.1 Instruction level parallelism
        3.3.2 Thread level parallelism
        3.3.3 Data parallelism
4 See also
5 Notes
6 References
7 External links
EDVAC, one of the first electronic stored-program computers.
History of CPUs

Main article: History of general purpose CPUs
Prior to the advent of machines that resemble today's CPUs, computers such as the ENIAC had to be physically rewired in order to perform different tasks. These machines are often referred to as "fixed-program computers," since they had to be physically reconfigured in order to run a different program. Since the term "CPU" is generally defined as a software (computer program) execution device, the earliest devices that could rightly be called CPUs came with the advent of the stored-program computer.

The idea of a stored-program computer was already present during ENIAC's design, but was initially omitted so the machine could be finished sooner. On June 30, 1945, before ENIAC was even completed, mathematician John von Neumann distributed the paper entitled "First Draft of a Report on the EDVAC." It outlined the design of a stored-program computer that would eventually be completed in August 1949 (von Neumann 1945). EDVAC was designed to perform a certain number of instructions (or operations) of various types. These instructions could be combined to create useful programs for the EDVAC to run. Significantly, the programs written for EDVAC were stored in high-speed computer memory rather than specified by the physical wiring of the computer. This overcame a severe limitation of ENIAC, which was the large amount of time and effort it took to reconfigure the computer to perform a new task. With von Neumann's design, the program, or software, that EDVAC ran could be changed simply by changing the contents of the computer's memory.[1]

While von Neumann is most often credited with the design of the stored-program computer because of his design of EDVAC, others before him, such as Konrad Zuse, had suggested similar ideas. Additionally, the so-called Harvard architecture of the Harvard Mark I, which was completed before EDVAC, also utilized a stored-program design using punched paper tape rather than electronic memory. The key difference between the von Neumann and Harvard architectures is that the latter separates the storage and treatment of CPU instructions and data, while the former uses the same memory space for both. Most modern CPUs are primarily von Neumann in design, but elements of the Harvard architecture are commonly seen as well.
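The organizational difference can be sketched in a few lines of Python; the class names and memory sizes below are illustrative assumptions, not a description of any real machine.

    # Illustrative sketch of the two memory organizations (invented names and sizes).

    class VonNeumannMachine:
        """Instructions and data share a single memory space."""
        def __init__(self, size=256):
            self.memory = [0] * size                   # holds both program and data

    class HarvardMachine:
        """Instructions and data are stored and addressed separately."""
        def __init__(self, prog_size=256, data_size=256):
            self.instruction_memory = [0] * prog_size  # e.g. punched tape or ROM
            self.data_memory = [0] * data_size         # e.g. working storage

    # In the von Neumann organization, a program is changed simply by writing new
    # instruction words into `memory`; in the Harvard organization, instruction
    # storage is treated as a separate space from data storage.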
Being digital devices, all CPUs deal with discrete states and therefore require some kind of switching elements to differentiate between and change these states. Prior to commercial acceptance of the transistor, electrical relays and vacuum tubes (thermionic valves) were commonly used as switching elements. Although these had distinct speed advantages over earlier, purely mechanical designs, they were unreliable for various reasons. For example, building direct current sequential logic circuits out of relays requires additional hardware to cope with the problem of contact bounce. While vacuum tubes do not suffer from contact bounce, they must heat up before becoming fully operational and eventually stop functioning altogether.[2] Usually, when a tube failed, the CPU would have to be diagnosed to locate the failing component so it could be replaced. Therefore, early electronic (vacuum tube based) computers were generally faster but less reliable than electromechanical (relay based) computers.

Tube computers like EDVAC tended to average eight hours between failures, whereas relay computers like the (slower, but earlier) Harvard Mark I failed very rarely (Weik 1961:238). In the end, tube-based CPUs became dominant because the significant speed advantages afforded generally outweighed the reliability problems. Most of these early synchronous CPUs ran at low clock rates compared to modern microelectronic designs (see below for a discussion of clock rate). Clock signal frequencies ranging from 100 kHz to 4 MHz were very common at this time, limited largely by the speed of the switching devices they were built with.
Discrete transistor and IC CPUs

The design complexity of CPUs increased as various technologies facilitated building smaller and more reliable electronic devices. The first such improvement came with the advent of the transistor. Transistorized CPUs during the 1950s and 1960s no longer had to be built out of bulky, unreliable, and fragile switching elements like vacuum tubes and electrical relays. With this improvement more complex and reliable CPUs were built onto one or several printed circuit boards containing discrete (individual) components.
During this period, a method of manufacturing many transistors in a compact space gained popularity. The integrated circuit (IC) allowed a large number of transistors to be manufactured on a single semiconductor-based die, or "chip." At first only very basic non-specialized digital circuits such as NOR gates were miniaturized into ICs. CPUs based upon these "building block" ICs are generally referred to as "small-scale integration" (SSI) devices. SSI ICs, such as the ones used in the Apollo guidance computer, usually contained transistor counts numbering in multiples of ten. To build an entire CPU out of SSI ICs required thousands of individual chips, but still consumed much less space and power than earlier discrete transistor designs. As microelectronic technology advanced, an increasing number of transistors were placed on ICs, thus decreasing the quantity of individual ICs needed for a complete CPU. MSI and LSI (medium- and large-scale integration) ICs increased transistor counts to hundreds, and then thousands.

CPU, core memory, and external bus interface of a DEC PDP-8/I.
In 1964 IBM introduced its System/360 computer architecture, which was used in a series of computers that could run the same programs with different speed and performance. This was significant at a time when most electronic computers were incompatible with one another, even those made by the same manufacturer. To facilitate this improvement, IBM utilized the concept of a microprogram (often called "microcode"), which still sees widespread usage in modern CPUs (Amdahl et al. 1964). The System/360 architecture was so popular that it dominated the mainframe computer market for decades and left a legacy that is still continued by similar modern computers like the IBM zSeries. In the same year (1964), Digital Equipment Corporation (DEC) introduced another influential computer aimed at the scientific and research markets, the PDP-8. DEC would later introduce the extremely popular PDP-11 line that originally was built with SSI ICs but was eventually implemented with LSI components once these became practical. In stark contrast with its SSI and MSI predecessors, the first LSI implementation of the PDP-11 contained a CPU composed of only four LSI integrated circuits (Digital Equipment Corporation 1975).
Transistor-based computers had several distinct advantages over their predecessors. Aside from facilitating increased reliability and lower power consumption, transistors also allowed CPUs to operate at much higher speeds because of the short switching time of a transistor in comparison to a tube or relay. Thanks to both the increased reliability as well as the dramatically increased speed of the switching elements (which were almost exclusively transistors by this time), CPU clock rates in the tens of megahertz were obtained during this period. Additionally, while discrete transistor and IC CPUs were in heavy usage, new high-performance designs like SIMD (Single Instruction Multiple Data) vector processors began to appear. These early experimental designs later gave rise to the era of specialized supercomputers like those made by Cray Inc.
Microprocessors

Main article: Microprocessor

The introduction of the microprocessor in the 1970s significantly affected the design and implementation of CPUs. Since the introduction of the first microprocessor (the Intel 4004) in 1970 and the first widely used microprocessor (the Intel 8080) in 1974, this class of CPUs has almost completely overtaken all other central processing unit implementation methods. Mainframe and minicomputer manufacturers of the time launched proprietary IC development programs to upgrade their older computer architectures, and eventually produced instruction set compatible microprocessors that were backward-compatible with their older hardware and software. Combined with the advent and eventual vast success of the now ubiquitous personal computer, the term "CPU" is now applied almost exclusively to microprocessors.

Previous generations of CPUs were implemented as discrete components and numerous small integrated circuits (ICs) on one or more circuit boards. Microprocessors, on the other hand, are CPUs manufactured on a very small number of ICs; usually just one. The overall smaller CPU size as a result of being implemented on a single die means faster switching time because of physical factors like decreased gate parasitic capacitance. This has allowed synchronous microprocessors to have clock rates ranging from tens of megahertz to several gigahertz.
`
`
Additionally, as the ability to construct exceedingly small transistors on an IC has increased, the complexity and number of transistors in a single CPU has increased dramatically. This widely observed trend is described by Moore's law, which has proven to be a fairly accurate predictor of the growth of CPU (and other IC) complexity to date.

Intel 8742, an 8-bit microcontroller that includes a CPU running at 12 MHz, 128 bytes of RAM, 2048 bytes of EPROM, and I/O in the same chip.
While the complexity, size, construction, and general form of CPUs have changed drastically over the past sixty years, it is notable that the basic design and function has not changed much at all. Almost all common CPUs today can be very accurately described as von Neumann stored-program machines. As the aforementioned Moore's law continues to hold true, concerns have arisen about the limits of integrated circuit transistor technology. Extreme miniaturization of electronic gates is causing the effects of phenomena like electromigration and subthreshold leakage to become much more significant. These newer concerns are among the many factors causing researchers to investigate new methods of computing such as the quantum computer, as well as to expand the usage of parallelism and other methods that extend the usefulness of the classical von Neumann model.
CPU operation

The fundamental operation of most CPUs, regardless of the physical form they take, is to execute a sequence of stored instructions called a program. The program is represented by a series of numbers that are kept in some kind of computer memory. There are four steps that nearly all von Neumann CPUs use in their operation: fetch, decode, execute, and writeback.

Intel 80486DX2 microprocessor in a ceramic PGA package.
The first step, fetch, involves retrieving an instruction (which is represented by a number or sequence of numbers) from program memory. The location in program memory is determined by a program counter (PC), which stores a number that identifies the current position in the program. In other words, the program counter keeps track of the CPU's place in the current program. After an instruction is fetched, the PC is incremented by the length of the instruction word in terms of memory units.[3] Often the instruction to be fetched must be retrieved from relatively slow memory, causing the CPU to stall while waiting for the instruction to be returned. This issue is largely addressed in modern processors by caches and pipeline architectures (see below).
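As a rough sketch in Python (using an invented one-word-per-instruction toy machine, not any real instruction set), the fetch step amounts to reading the word the program counter points at and then advancing the counter:

    # Toy fetch step: read the instruction at the PC, then advance the PC.
    program_memory = [0x10, 0x21, 0x32]   # invented instruction words
    pc = 0                                 # program counter

    def fetch():
        global pc
        instruction = program_memory[pc]   # retrieve the current instruction
        pc += 1                            # instruction length is 1 memory unit here
        return instruction

    print(fetch(), pc)                     # 16 1 -> first word fetched, PC advanced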
The instruction that the CPU fetches from memory is used to determine what the CPU is to do. In the decode step, the instruction is broken up into parts that have significance to other portions of the CPU. The way in which the numerical instruction value is interpreted is defined by the CPU's instruction set architecture (ISA).[4] Often, one group of numbers in the instruction, called the opcode, indicates which operation to perform. The remaining parts of the number usually provide information required for that instruction, such as operands for an addition operation. Such operands may be given as a constant value (called an immediate value), or as a place to locate a value: a register or a memory address, as determined by some addressing mode. In older designs the portions of the CPU responsible for instruction decoding were unchangeable hardware devices. However, in more abstract and complicated CPUs and ISAs, a microprogram is often used to assist in translating instructions into various configuration signals for the CPU. This microprogram is sometimes rewritable so that it can be modified to change the way the CPU decodes instructions even after it has been manufactured.
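To make the opcode/operand split concrete, the sketch below decodes a made-up 8-bit instruction format (a 4-bit opcode and two 2-bit register fields); the layout is purely illustrative and is not drawn from any real instruction set.

    # Decode a hypothetical 8-bit instruction:
    #   bits 7-4 = opcode, bits 3-2 = destination register, bits 1-0 = source register.

    def decode(instruction):
        opcode = (instruction >> 4) & 0xF   # which operation to perform
        dst    = (instruction >> 2) & 0x3   # operand: destination register number
        src    =  instruction       & 0x3   # operand: source register number
        return opcode, dst, src

    # 0b0001_01_10 -> opcode 1 (say, an ADD), destination r1, source r2
    print(decode(0b00010110))               # (1, 1, 2)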
After the fetch and decode steps, the execute step is performed. During this step, various portions of the CPU are connected so they can perform the desired operation. If, for instance, an addition operation was requested, an arithmetic logic unit (ALU) will be connected to a set of inputs and a set of outputs. The inputs provide the numbers to be added, and the outputs will contain the final sum. The ALU contains the circuitry to perform simple arithmetic and logical operations on the inputs (like addition and bitwise operations). If the addition operation produces a result too large for the CPU to handle, an arithmetic overflow flag in a flags register may also be set.
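A minimal sketch of the execute step for such an addition, assuming a hypothetical 8-bit ALU with a single overflow flag, might look like this:

    # Hypothetical 8-bit ALU addition with an overflow flag.
    def alu_add_8bit(a, b):
        full = a + b
        result = full & 0xFF            # keep only the 8 bits the hardware has
        overflow = full > 0xFF          # result too large to represent -> set flag
        return result, overflow

    print(alu_add_8bit(200, 100))       # (44, True): 300 does not fit in 8 bits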
The final step, writeback, simply "writes back" the results of the execute step to some form of memory. Very often the results are written to some internal CPU register for quick access by subsequent instructions. In other cases results may be written to slower, but cheaper and larger, main memory. Some types of instructions manipulate the program counter rather than directly produce result data. These are generally called "jumps" and facilitate behavior like loops, conditional program execution (through the use of a conditional jump), and functions in programs.[5] Many instructions will also change the state of digits in a "flags" register. These flags can be used to influence how a program behaves, since they often indicate the outcome of various operations. For example, one type of "compare" instruction considers two values and sets a number in the flags register according to which one is greater. This flag could then be used by a later jump instruction to determine program flow.
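The interplay between a compare instruction, the flags register, and a later conditional jump can be sketched as follows; the flag name and jump target are invented for illustration.

    # Compare sets a flag; a later conditional jump reads it to redirect the PC.
    flags = {"greater": False}
    pc = 0

    def compare(a, b):
        flags["greater"] = a > b        # record which value was greater

    def jump_if_greater(target):
        global pc
        if flags["greater"]:
            pc = target                 # jump: overwrite the program counter
        else:
            pc += 1                     # otherwise fall through to the next instruction

    compare(7, 3)
    jump_if_greater(40)
    print(pc)                           # 40, because 7 > 3 set the flag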
After the execution of the instruction and writeback of the resulting data, the entire process repeats, with the next instruction cycle normally fetching the next-in-sequence instruction because of the incremented value in the program counter. If the completed instruction was a jump, the program counter will be modified to contain the address of the instruction that was jumped to, and program execution continues normally. In more complex CPUs than the one described here, multiple instructions can be fetched, decoded, and executed simultaneously. This section describes what is generally referred to as the "classic RISC pipeline," which in fact is quite common among the simple CPUs used in many electronic devices (often called microcontrollers). It largely ignores the important role of CPU cache, and therefore the access stage of the pipeline.
`
Design and implementation

Main article: CPU design

Prerequisites: Computer architecture, Digital circuits
`Integer range
The way a CPU represents numbers is a design choice that affects the most basic ways in which the device functions. Some early digital computers used an electrical model of the common decimal (base ten) numeral system to represent numbers internally. A few other computers have used more exotic numeral systems like ternary (base three). Nearly all modern CPUs represent numbers in binary form, with each digit being represented by some two-valued physical quantity such as a "high" or "low" voltage.[6]
Related to number representation is the size and precision of numbers that a CPU can represent. In the case of a binary CPU, a bit refers to one significant place in the numbers a CPU deals with. The number of bits (or numeral places) a CPU uses to represent numbers is often called "word size", "bit width", "data path width", or "integer precision" when dealing with strictly integer numbers (as opposed to floating point). This number differs between architectures, and often within different parts of the very same CPU. For example, an 8-bit CPU deals with a range of numbers that can be represented by eight binary digits (each digit having two possible values), that is, 2^8 or 256 discrete numbers. In effect, integer size sets a hardware limit on the range of integers the software run by the CPU can utilize.[7]

Integer range can also affect the number of locations in memory the CPU can address (locate). For example, if a binary CPU uses 32 bits to represent a memory address, and each memory address represents one octet (8 bits), the maximum quantity of memory that CPU can address is 2^32 octets, or 4 GiB. This is a very simple view of CPU address space, and many designs use more complex addressing methods like paging in order to locate more memory than their integer range would allow with a flat address space.
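The two figures quoted above follow directly from the number of bits involved, as the short calculation below shows.

    # Distinct values representable in an 8-bit word.
    word_bits = 8
    print(2 ** word_bits)                    # 256

    # Maximum memory addressable with 32-bit addresses and one-octet (8-bit) cells.
    address_bits = 32
    addressable_octets = 2 ** address_bits
    print(addressable_octets)                # 4294967296 octets
    print(addressable_octets / 2 ** 30)      # 4.0 GiB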
Higher levels of integer range require more structures to deal with the additional digits, and therefore more complexity, size, power usage, and general expense. It is not at all uncommon, therefore, to see 4- or 8-bit microcontrollers used in modern applications, even though CPUs with much higher range (such as 16, 32, 64, even 128-bit) are available. The simpler microcontrollers are usually cheaper, use less power, and therefore dissipate less heat, all of which can be major design considerations for electronic devices. However, in higher-end applications, the benefits afforded by the extra range (most often the additional address space) are more significant and often affect design choices. To gain some of the advantages afforded by both lower and higher bit lengths, many CPUs are designed with different bit widths for different portions of the device. For example, the IBM System/370 used a CPU that was primarily 32 bit, but it used 128-bit precision inside its floating point units to facilitate greater accuracy and range in floating point numbers (Amdahl et al. 1964). Many later CPU designs use similar mixed bit width, especially when the processor is meant for general-purpose usage where a reasonable balance of integer and floating point capability is required.
`Clock rate
Main article: Clock rate
Most CPUs, and indeed most sequential logic devices, are synchronous in nature.[8] That is, they are designed and operate on assumptions about a synchronization signal. This signal, known as a clock signal, usually takes the form of a periodic square wave. By calculating the maximum time that electrical signals can move in various branches of a CPU's many circuits, the designers can select an appropriate period for the clock signal. This period must be longer than the amount of time it takes for a signal to move, or propagate, in the worst-case scenario. In setting the clock period to a value well above the worst-case propagation delay, it is possible to design the entire CPU and the way it moves data around the "edges" of the rising and falling clock signal. This has the advantage of simplifying the CPU significantly, both from a design perspective and a component-count perspective. However, it also carries the disadvantage that the entire CPU must wait on its slowest elements, even though some portions of it are much faster. This limitation has largely been compensated for by various methods of increasing CPU parallelism (see below).
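A back-of-the-envelope sketch of this relationship is shown below; the delay and margin figures are invented examples, not measurements from any real design.

    # The clock period must exceed the worst-case propagation delay, which in
    # turn caps the clock rate the design can run at.
    worst_case_delay_ns = 2.5                    # slowest signal path (assumed)
    clock_period_ns = worst_case_delay_ns * 1.2  # add some design margin
    max_clock_hz = 1e9 / clock_period_ns

    print(round(max_clock_hz / 1e6), "MHz")      # ~333 MHz for these example numbers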
However, architectural improvements alone do not solve all of the drawbacks of globally synchronous CPUs. For example, a clock signal is subject to the delays of any other electrical signal. Higher clock rates in increasingly complex CPUs make it more difficult to keep the clock signal in phase (synchronized) throughout the entire unit. This has led many modern CPUs to require multiple identical clock signals to be provided in order to avoid delaying a single signal significantly enough to cause the CPU to malfunction. Another major issue as clock rates increase dramatically is the amount of heat that is dissipated by the CPU. The constantly changing clock causes many components to switch regardless of whether they are being used at that time. In general, a component that is switching uses more energy than an element in a static state. Therefore, as clock rate increases, so does heat dissipation, causing the CPU to require more effective cooling solutions.
One method of dealing with the switching of unneeded components is called clock gating, which involves turning off the clock signal to unneeded components (effectively disabling them). However, this is often regarded as difficult to implement and therefore does not see common usage outside of very low-power designs.[9] Another method of addressing some of the problems with a global clock signal is the removal of the clock signal altogether. While removing the global clock signal makes the design process considerably more complex in many ways, asynchronous (or clockless) designs carry marked advantages in power consumption and heat dissipation in comparison with similar synchronous designs. While somewhat uncommon, entire asynchronous CPUs have been built without utilizing a global clock signal. Two notable examples of this are the ARM compliant AMULET and the MIPS R3000 compatible MiniMIPS. Rather than totally removing the clock signal, some CPU designs allow certain portions of the device to be asynchronous, such as using asynchronous ALUs in conjunction with superscalar pipelining to achieve some arithmetic performance gains. While it is not altogether clear whether totally asynchronous designs can perform at a comparable or better level than their synchronous counterparts, it is evident that they do at least excel in simpler math operations. This, combined with their excellent power consumption and heat dissipation properties, makes them very suitable for embedded computers (Garside et al. 1999).
`
MOS 6502 microprocessor in a dual in-line package, an extremely popular 8-bit design.
`Parallelism
`Main article: Parallel computing
The description of the basic operation of a CPU offered in the previous section describes the simplest form that a CPU can take. This type of CPU, usually referred to as subscalar, operates on and executes one instruction on one or two pieces of data at a time.

This process gives rise to an inherent inefficiency in subscalar CPUs. Since only one instruction is executed at a time, the entire CPU must wait for that instruction to complete before proceeding to the next instruction. As a result the subscalar CPU gets "hung up" on instructions which take more than one clock cycle to complete execution. Even adding a second execution unit (see below) does not improve performance much; rather than one pathway being hung up, now two pathways are hung up and the number of unused transistors is increased. This design, wherein the CPU's execution resources can operate on only one instruction at a time, can only possibly reach scalar performance (one instruction per clock). However, the performance is nearly always subscalar (less than one instruction per cycle).

Attempts to achieve scalar and better performance have resulted in a variety of design methodologies that cause the CPU to behave less linearly and more in parallel. When referring to parallelism in CPUs, two terms are generally used to classify these design techniques. Instruction level parallelism (ILP) seeks to increase the rate at which instructions are executed within a CPU (that is, to increase the utilization of on-die execution resources), and thread level parallelism (TLP) aims to increase the number of threads (effectively individual programs) that a CPU can execute simultaneously. Each methodology differs both in the ways in which they are implemented, as well as the relative effectiveness they afford in increasing the CPU's performance for an application.[10]
Instruction level parallelism

Main articles: Instruction pipelining and Superscalar

Model of a subscalar CPU. Notice that it takes fifteen cycles to complete three instructions.
One of the simplest methods used to accomplish increased parallelism is to begin the first steps of instruction fetching and decoding before the prior instruction finishes executing. This is the simplest form of a technique known as instruction pipelining, and is utilized in almost all modern general-purpose CPUs. Pipelining allows more than one instruction to be executed at any given time by breaking down the execution pathway into discrete stages. This separation can be compared to an assembly line, in which an instruction is made more complete at each stage until it exits the execution pipeline and is retired.

Basic five-stage pipeline. In the best case scenario, this pipeline can sustain a completion rate of one instruction per cycle.

Pipelining does, however, introduce the possibility for a situation where the result of the previous operation is needed to complete the next operation, a condition often termed data dependency conflict. To cope with this, additional care must be taken to check for these sorts of conditions and delay a portion of the instruction pipeline if this occurs. Naturally, accomplishing this requires additional circuitry, so pipelined processors are more complex than subscalar ones (though not very significantly so). A pipelined processor can become very nearly scalar, inhibited only by pipeline stalls (an instruction spending more than one clock cycle in a stage).
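The payoff of pipelining can be seen in a simple cycle count, sketched below for an idealized five-stage pipeline with no stalls (compare the fifteen cycles needed for three instructions in the subscalar model pictured above).

    # Idealized cycle counts: unpipelined vs. five-stage pipelined execution.
    STAGES = 5

    def unpipelined_cycles(n):
        return STAGES * n               # each instruction occupies every stage alone

    def pipelined_cycles(n):
        return n + (STAGES - 1)         # once full, one instruction completes per cycle

    for n in (3, 100):
        print(n, unpipelined_cycles(n), pipelined_cycles(n))
    # 3 instructions: 15 vs 7 cycles; 100 instructions: 500 vs 104 cycles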
`
Further improvement upon the idea of instruction pipelining led to the development of a method that decreases the idle time of CPU components even further. Designs that are said to be superscalar include a long instruction pipeline and multiple identical execution units (Huynh 2003). In a superscalar pipeline, multiple instructions are read and passed to a dispatcher, which decides whether or not the instructions can be executed in parallel (simultaneously). If so they are dispatched to available execution units, resulting in the ability for several instructions to be executed simultaneously. In general, the more instructions a superscalar CPU is able to dispatch simultaneously to waiting execution units, the more instructions will be completed in a given cycle.
Simple superscalar pipeline. By fetching and dispatching two instructions at a time, a maximum of two instructions per cycle can be completed.

Most of the difficulty in the design of a superscalar CPU architecture lies in creating an effective dispatcher. The dispatcher needs to be able to quickly and correctly determine whether instructions can be executed in parallel, as well as dispatch them in such a way as to keep as many execution units busy as possible. This requires that the instruction pipeline is filled as often as possible and gives rise to the need in superscalar architectures for significant amounts of CPU cache. It also makes hazard-avoiding techniques like branch prediction, speculative execution, and out-of-order execution crucial to maintaining high levels of performance. By attempting to predict which branch (or path) a conditional instruction will take, the CPU can minimize the number of times that the entire pipeline must wait until a conditional instruction is completed. Speculative execution often provides modest performance increases by executing portions of code that may or may not be needed after a conditional operation completes. Out-of-order execution somewhat rearranges the order in which instructions are executed to reduce delays due to data dependencies.
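A highly simplified sketch of the dependency check a dispatcher must make before issuing two instructions in the same cycle is shown below; modelling instructions as (destination, sources) tuples is an invented convention for illustration only.

    # Can two adjacent instructions be issued together this cycle?
    def can_dual_issue(first, second):
        dst1, _srcs1 = first
        dst2, srcs2 = second
        # The second instruction must neither read the first's result
        # (read-after-write) nor write the same destination (write-after-write).
        return dst1 not in srcs2 and dst1 != dst2

    add = ("r1", ("r2", "r3"))          # r1 = r2 + r3
    mul = ("r4", ("r1", "r5"))          # r4 = r1 * r5, depends on the add's result
    sub = ("r6", ("r7", "r8"))          # independent of the add

    print(can_dual_issue(add, mul))     # False: data dependency forces serial issue
    print(can_dual_issue(add, sub))     # True: both can be dispatched in parallel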
In the case where a portion of the CPU is superscalar and part is not, the part which is not suffers a performance penalty due to scheduling stalls. The original Intel Pentium (P5) had two superscalar ALUs which could accept one instruction per clock each, but its FPU could not accept one instruction per clock. Thus the P5 was integer superscalar but not floating point superscalar. Intel's successor to the Pentium architecture, P6, added superscalar capabilities to its floating point features, and therefore afforded a significant increase in floating point instruction performance.
Both simple pipelining and superscalar design increase a CPU's ILP by allowing a single processor to complete execution of instructions at rates surpassing one instruction per cycle (IPC).[11] Most modern CPU designs are at least somewhat superscalar, and nearly all general purpose CPUs designed in the last decade are superscalar. In later years some of the emphasis in designing high-ILP computers has been moved out of the CPU's hardware and into its software interface, or ISA. The strategy of the very long instruction word (VLIW) causes some ILP to become implied directly by the software, reducing the amount of work the CPU must perform to boost ILP and thereby reducing the design's complexity.
Thread level parallelism

Another strategy of achieving performance is to execute multiple programs or threads in parallel. This area of research is known as parallel computing. In Flynn's taxonomy, this strategy