WORLD INTELLECTUAL PROPERTY ORGANIZATION
International Bureau
INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)

(51) International Patent Classification 7: G06F 15/76

(11) International Publication Number: WO 00/62182 A2

(43) International Publication Date: 19 October 2000 (19.10.00)

(21) International Application Number: PCT/GB00/01332

(22) International Filing Date: 7 April 2000 (07.04.00)

(30) Priority Data:
9908199.4    9 April 1999 (09.04.99)    GB
9908201.8    9 April 1999 (09.04.99)    GB
9908203.4    9 April 1999 (09.04.99)    GB
9908204.2    9 April 1999 (09.04.99)    GB
9908205.9    9 April 1999 (09.04.99)    GB
9908209.1    9 April 1999 (09.04.99)    GB
9908211.7    9 April 1999 (09.04.99)    GB
9908214.1    9 April 1999 (09.04.99)    GB
9908219.0    9 April 1999 (09.04.99)    GB
9908222.4    9 April 1999 (09.04.99)    GB
9908225.7    9 April 1999 (09.04.99)    GB
9908226.5    9 April 1999 (09.04.99)    GB
9908227.3    9 April 1999 (09.04.99)    GB
9908228.1    9 April 1999 (09.04.99)    GB
9908229.9    9 April 1999 (09.04.99)    GB
9908230.7    9 April 1999 (09.04.99)    GB
`
`(71) Applicant (for all designated States except US): PIXELFU-
`SION LIMITED [GB/GB]; 2440 The Quadrant, Aztec West,
`Almondsbury, Bristol BS32 4AQ (GB).
`
`(72) Inventors; and
(75) Inventors/Applicants (for US only): STUTTARD, Dave
`[GB/GB]; 28 St Georges Avenue, St George, Bristol BS5
`8DD (GB). WILLIAMS, Dave [GB/GB]; 45 Railton Jones
`Close, Stoke Gifford, Gloucestershire BS34 8XY (GB).
`O'DEA, Eamon [IE/GB]; 32 Richmond Park Road, Clifton,
`Bristol BS8 3AP (GB). FAULDS, Gordon [GB/GB]; 21
`High Furlong, Cam, Nr Dursley, Gloucestershire GL11
`5UZ (GB). RHODES, John [GB/GB]; 40 Ormonds Close,
Bradley Stoke, Bristol BS32 0DX (GB). CAMERON, Ken
`[GB/GB]; 29 Little Meadow, Bristol BS32 8AT (GB).
`ATKIN, Phil [GB/GB]; 2440 The Quadrant, Aztec West,
`Almondsbury, Bristol BS32 4AQ (GB). WINSER, Paul
`[GB/GB]; 8 Oakleigh House, Bridge Road, Leigh Woods,
`Bristol BS8 3PB (GB). DAVID, Russell [GB/GB]; 8
`Rowan Drive, Wootton Bassett, Wiltshire SN4 7ES (GB).
`McCONNELL, Ray [GB/GB]; Flat 11, 1 Clifton Road,
`Clifton, Bristol BS8 1AE (GB). DAY, Tim [GB/GB]; 33
`Rosebery Avenue, St Werburghs, Bristol BS2 9TW (GB).
`GREER, Trey [GB/GB]; 330 Sunset Creek Circle, Chapel
`Hill, NC 27516 (US).
`
`(74) Agent: VIGARS, Christopher, Ian; Haseltine Lake & Co,
`Imperial House, 15-19 Kingsway, London WC2B 6UD
`(GB).
`
`(81) Designated States: AE, AG, AL, AM, AT, AU, AZ, BA, BB,
`BG, BR, BY, CA, CH, CN, CR, CU, CZ, DE, DK, DM,
`DZ, EE, ES, FI, GB, GD, GE, GH, GM, HR, HU, ID, IL,
`IN, IS, JP, KE, KG, KP, KR, KZ, LC, LK, LR, LS, LT, LU,
`LV, MA, MD, MG, MK, MN, MW, MX, NO, NZ, PL, PT,
`RO, RU, SD, SE, SG, SI, SK, SL, TJ, TM, TR, TT, TZ,
`UA, UG, US, UZ, VN, YU, ZA, ZW, ARIPO patent (GH,
`GM, KE, LS, MW, SD, SL, SZ, TZ, UG, ZW), Eurasian
`patent (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM), European
`patent (AT, BE, CH, CY, DE, DK, ES, FI, FR, GB, GR,
`IE, IT, LU, MC, NL, PT, SE), OAPI patent (BF, BJ, CF,
`CG, CI, CM, GA, GN, GW, ML, MR, NE, SN, TD, TG).
`
`Published
`Without international search report and to be republished
`upon receipt of that report.
`
`(54) Title: PARALLEL DATA PROCESSING APPARATUS
`
`(57) Abstract
`
`A data processing apparatus comprises a SIMD (single instruction
`multiple data) array (10) of processing elements. The processing elements are
`operably divided into a plurality of processing blocks, the processing blocks
`being operable to process respective groups of data items.
`
`2
`
`—1
`
`HOST
`SYSTEM
`INTERFACE
`
`3
`y.
`GRAPHICS
`SYSTEM
`
`_130
`
`4, 6
`
`PROCESSING
`CORE
`
`6 —
`
`VIDEO
`OUTPUT
`
`•
`4, 6
`
`14
`
`EPU
`41.
`
`LOCAL
`MEMORY
`
`12
`
`
`
`
`FOR THE PURPOSES OF INFORMATION ONLY
`
`Codes used to identify States party to the PCT on the front pages of pamphlets publishing international applications under the PCT.
`
AL Albania; AM Armenia; AT Austria; AU Australia; AZ Azerbaijan;
BA Bosnia and Herzegovina; BB Barbados; BE Belgium; BF Burkina Faso;
BG Bulgaria; BJ Benin; BR Brazil; BY Belarus; CA Canada;
CF Central African Republic; CG Congo; CH Switzerland; CI Cote d'Ivoire;
CM Cameroon; CN China; CU Cuba; CZ Czech Republic; DE Germany;
DK Denmark; EE Estonia; ES Spain; FI Finland; FR France; GA Gabon;
GB United Kingdom; GE Georgia; GH Ghana; GN Guinea; GR Greece;
HU Hungary; IE Ireland; IL Israel; IS Iceland; IT Italy; JP Japan;
KE Kenya; KG Kyrgyzstan; KP Democratic People's Republic of Korea;
KR Republic of Korea; KZ Kazakstan; LC Saint Lucia; LI Liechtenstein;
LK Sri Lanka; LR Liberia; LS Lesotho; LT Lithuania; LU Luxembourg;
LV Latvia; MC Monaco; MD Republic of Moldova; MG Madagascar;
MK The former Yugoslav Republic of Macedonia; ML Mali; MN Mongolia;
MR Mauritania; MW Malawi; MX Mexico; NE Niger; NL Netherlands;
NO Norway; NZ New Zealand; PL Poland; PT Portugal; RO Romania;
RU Russian Federation; SD Sudan; SE Sweden; SG Singapore; SI Slovenia;
SK Slovakia; SN Senegal; SZ Swaziland; TD Chad; TG Togo; TJ Tajikistan;
TM Turkmenistan; TR Turkey; TT Trinidad and Tobago; UA Ukraine;
UG Uganda; US United States of America; UZ Uzbekistan; VN Viet Nam;
YU Yugoslavia; ZW Zimbabwe.
`
`
`PARALLEL DATA PROCESSING APPARATUS
`
`-1-
`
`The present invention relates to parallel data
`processing apparatus, and in particular to SIMD (single
`instruction multiple data) processing apparatus.
`
`BACKGROUND OF THE INVENTION
`
`Increasingly, data processing systems are required to
`process large amounts of data. In addition, users of
`such systems are demanding that the speed of data
`processing is increased. One particular example of the
`need for high speed processing of massive amounts of
`data is in the computer graphics field. In computer
`graphics, large amounts of data are produced that
`relate to, for example, geometry, texture, and colour
`of objects and shapes to be displayed on a screen.
`Users of computer graphics are increasingly demanding
`more lifelike and faster graphical displays which
`increases the amount of data to be processed and
`increases the speed at which the data must be
`processed.
`
`A previously proposed processing architecture for
`processing large amounts of data in a computer system
`uses a Single Instruction Multiple Data (SIMD) array of
`processing elements. In such an array all of the
`processing elements receive the same instruction
stream, but operate on different respective data items.
Such an architecture can thereby process data in
`parallel, but without the need to produce parallel
`instruction streams. This can be an efficient and
`relatively simple way of obtaining good performance
`from a parallel processing machine.
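
By way of a purely illustrative sketch (not forming part of this
application), the following Python fragment models the SIMD principle
described above, in which one broadcast instruction is applied by every
processing element to its own data item. All names are invented for
illustration.

    # One instruction, many data items: each "processing element"
    # applies the same broadcast operation to its own local datum.
    def simd_step(instruction, data_items):
        return [instruction(item) for item in data_items]

    double = lambda x: 2 * x                  # the broadcast instruction
    print(simd_step(double, [1, 2, 3, 4]))    # -> [2, 4, 6, 8]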
`
`5
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`LG Ex. 1004, pg 3
`
`
`
`WO 00/62182
`
`PCT/GB00/01332
`
`-2-
`
`However, the SIMD architecture can be inefficient when
`a system has to process a large number of relatively
`small data item groups. For example, for a SIMD array
`processing data relating to a graphical display screen,
`for a small graphical primitive such as a triangle,
`only relatively few processing elements of the array
`will be enabled to process data relating to the
`primitive. In that case, a large proportion of the
`processing elements may remain unused while data is
`being processed for a particular group.
`
`5
`
`10
`
`It is therefore desirable to produce a system which can
`overcome or alleviate this problem.
`
`15
`
`SUMMARY OF THE INVENTION
`
`20
`
`25
`
`30
`
`According to one aspect of the present invention, there
`is provided a data processing apparatus comprising a
`SIMD (single instruction multiple data) array of
`processing elements, wherein the processing elements
`are operably divided into a plurality of processing
`blocks, the processing blocks being operable to process
`respective groups of data items.
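
As a purely illustrative sketch of this aspect (not part of the
application text; all names are invented), the array can be modelled as
processing blocks that each work on their own group of data items:

    # The PE array is divided into processing blocks; each block
    # processes its own group of data items under one instruction.
    def process_groups(instruction, groups):
        return [[instruction(item) for item in group] for group in groups]

    # Two blocks handle two independent groups (modelled here
    # sequentially), instead of one small group idling most of the array.
    print(process_groups(lambda x: x + 1, [[1, 2], [10, 20, 30]]))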
`
`According to another aspect of the present invention,
`there is provided a data processing apparatus
`comprising an array of processing elements, which are
`operable to process respective data items in accordance
`with a common received instruction, wherein the
processing elements are operably divided into a
`plurality of processing blocks having at least one
`processing element, the processing blocks being
`operable to process respective groups of data items.
`
`35
`
Various further aspects of the present invention are
exemplified by the attached claims.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`Figure 1 is a block diagram illustrating a graphics
`data processing system;
`Figure 2 is a more detailed block diagram illustrating
`the graphics data processing system of Figure 1;
`Figure 3 is a block diagram of a processing core of the
`system of Figure 2;
`Figure 4 is a block diagram of a thread manager of the
`system of Figure 3;
Figure 5 is a block diagram of an array controller of
`the system of Figure 3;
`Figure 6 is a block diagram of an instruction issue
`state machine of the channel controller of Figure 3;
`Figure 7 is a block diagram of a binning unit of the
`system of Figure 3;
`Figure 8 is a block diagram of a processing block of
`the system of Figure 3;
`Figure 9 is a flowchart illustrating data processing
`using the system of Figures 1 to 8;
`Figure 10 is a more detailed block diagram of a thread
`processor of the thread manager of Figure 4;
`Figure 11 is a block diagram of a processor unit of the
`processing block of Figure 8;
`Figure 12 is a block diagram illustrating a processing
`element interface;
`Figure 13 is a block diagram illustrating a block I/O
`interface;
`Figure 14 is a block diagram of part of the processor
`unit of Figure 11; and
`Figure 15 is a block diagram of another part of the
`processor unit of Figure 11.
`
`5
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`LG Ex. 1004, pg 5
`
`
`
`WO 00/62182
`
`PCT/GB00/01332
`
`DESCRIPTION OF THE PREFERRED EMBODIMENT
`
`-4-
`
`The data processing system described below is a
`graphics data processing system for producing graphics
`images for display on a screen. However, this
`embodiment is purely exemplary, and it will be readily
`apparent that the techniques and architecture described
`here for processing graphical data are equally
`applicable to other data types, such as video data.
`The system is of course applicable to other signal
`and/or data processing techniques and systems.
`
`An overview of the system will be given, followed by
`brief descriptions of the various functional units of
`the system. A graphics processing method will then be
`described by way of example, followed by detailed
`description of the functional units.
`
`OVERVIEW
`
`Figure 1 is a system level block diagram illustrating a
`graphics data processing system 3. The system 3
`interfaces with a host system (not shown), such as a
`personal computer or workstation, via an interface 2.
`Such a system can be provided with an embedded
`processor unit (EPU) for control purposes. For
`example, the specific graphics system 3 includes an
`embedded processing unit (EPU) 8 for controlling the
`overall function of the graphics processor and for
`interfacing with the host system. The system includes
`a processing core 10 which processes the graphical data
`for output to the display screen via a video output
`interface 14. Local memory 12 is provided for the
`graphics system 3.
`
`5
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`LG Ex. 1004, pg 6
`
`
`
`WO 00/62182
`
`PCT/GB00/01332
`
`-5-
`
Such a data processing system can be connected for
operation to a host system or could provide a stand-alone
processing system, without the need for a specific host
system. Examples of such applications include a "set-top
box" for receiving and decoding digital television and
internet signals.
`
`Figure 2 illustrates the graphics processing system in
`more detail. In one particular example, the graphics
`system connects to the host system via an advanced
`graphics port (AGP) or PCI interface 2. The PCI
`interface and AGP 2 are well known.
`
`The host system can be any type of computer system,
`for example, a PC 99 specification personal computer or
`a workstation.
`
`The AGP 2 provides a high bandwidth path from the
`graphics system to host system memory. This allows
`large texture databases to be held in the host system
`memory, which is generally larger than local memory
`associated with the graphics system. The AGP also
`provides a mechanism for mapping memory between a
`linear address space on the graphics system and a
`number of potentially scattered memory blocks in the
`host system memory. This mechanism is performed by a
`graphics address re-mapping table (GART) as is well
`known.
`
`The graphics system described below is preferably
`implemented as a single integrated circuit which
`provides all of the functions shown in Figure 1.
`However, it will be readily apparent that the system
may be provided as a separate circuit card carrying
several different components, or as a separate chipset
`provided on the motherboard of the host, or integrated
`with the host central processing unit (CPU), or in any
`suitable combination of these and other
`implementations.
`
`The graphics system includes several functional units
`which are connected to one another for the transfer of
`data by way of a dedicated bus system. The bus system
`preferably includes a primary bus 4 and a secondary bus
`6. The primary bus is used for connection of latency
`intolerant devices, and the secondary bus is used for
`connection of latency tolerant devices. The bus
`architecture is preferably as described in detail in
`the Applicant's co-pending UK patent applications,
`particularly GB 9820430.8. It will be readily
`appreciated that any number of primary and secondary
`buses can be provided in the bus architecture in the
`system. The specific system shown in Figure 2 includes
`two secondary buses.
`
`Referring mainly to Figure 2, access to the primary bus
`4 is controlled by a primary arbiter 41, and access to
`the secondary buses 6 by a pair of secondary arbiters
`61. Preferably, all data transfers are in packets of
`32 bytes each. The secondary buses 6 are connected
`with the primary bus 4 by way of respective interface
`units (SIP) 62.
`
`An auxiliary control bus 7 is provided in order to
`enable control signals to be communicated to the
`various units in the system.
`
`The AGP/PCI interface is connected to the graphics
`system by way of the secondary buses 6. This interface
can be connected to any selection of the secondary
`buses, in the example shown, to both secondary buses 6.
The graphics system also includes an embedded
`processing unit (EPU) 8 which is used to control
`operation of the graphics system and to communicate
`with the host system. The host system has direct
`access to the EPU 8 by way of a direct host access
`interface 9 in the AGP/PCI 2. The EPU is connected to
`the primary bus 4 by way of a bus interface unit (EPU
`FBI) 90.
`
`Also connected to the primary bus is a local memory
`system 12. The local memory system 12 includes a
`number, in this example four, of memory interface units
`121 which are used to communicate with the local memory
`itself. The local memory is used to store various
`information for use by the graphics system.
`
`The system also includes a video interface unit 14
`which comprises the hardware needed to interface the
`graphics system to the display screen (not shown), and
`other devices for exchange of data which may include
`video data. The video interface unit is connected to
`the secondary buses 6, via bus interface units (FBI).
`
`The graphics processing capability of the system is
`provided by a processing core 10. The core 10 is
`connected to the secondary buses 6 for the transfer of
`data, and to the primary bus 4 for the transfer of
`instructions. As will be explained in more detail
below, the secondary bus connections are made by a core
`bus interface (Core FBI) 107, and a binner bus
`interface (Binner FBI) 111, and the primary bus
`connection is made by a thread manager bus interface
`(Thread Manager FBI) 103.
`
`5
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`LG Ex. 1004, pg 9
`
`
`
`WO 00/62182
`
`PCT/GB00/01332
`
`-8-
`
`As will be explained in greater detail below, the
`processing core 10 includes a number of control units:
`thread manager 102, array controller 104, channel
controller 108, a binning unit 1068 per block and a
`microcode store 105. These control units control the
`operation of a number of processing blocks 106 which
`perform the graphics processing itself.
`
`In the example shown in Figure 2, the processing core
`10 is provided with eight processing blocks 106. It
`will be readily appreciated that any number of
`processing blocks can be provided in a graphics system
`using this architecture.
`
`5
`
`10
`
`15
`
`PROCESSING CORE
`
`Figure 3 shows the processing core in more detail. The
`thread manager 102 is connected to receive control
`signals from the EPU 8. The control signals inform the
`thread manager as to when instructions are to be
`fetched and where the instructions are to be found.
`The thread manager 102 is connected to provide these
`instructions to the array controller 104 and to the
`channel controller 108. The array and channel
`controllers 104 and 108 are connected to transfer
`control signals to the processing blocks 106 dependent
`upon the received instructions.
`
`Each processing block 106 comprises an array 1061 of
`processor elements (PEs) and a mathematical expression
`evaluator (MEE) 1062. As will be described in more
`detail below, a path 1064 for MEE coefficient feedback
`is provided from the PE memory, as is an input/output
channel 1067. Each processing block includes a binning
unit 1068 and a transfer engine 1069 for
`controlling data transfers to and from the input/output
`channel under instruction from the channel controller
`108.
`
`The array 1061 of processor elements provides a single
`instruction multiple data (SIMD) processing structure.
`Each PE in the array 1061 is supplied with the same
`instruction, which is used to process data specific to
`the PE concerned.
`
`Each processing element (PE) 1061 includes a processor
`unit 1061a for carrying out the instructions received
`from the array controller, a PE memory unit 1061c for
`storing data for use by the processor unit 1061a, and a
`PE register file 1061b through which data is
`transferred between the processor unit 1061a and the PE
`memory unit 1061c. The PE register file 1061b is also
`used by the processor unit 1061a for temporarily
`storing data that is being processed by the processor
`unit 1061a.
`
`The provision of a large number of processor elements
can result in a large die size for the manufacture of
the device in silicon. Accordingly, it is
`desirable to reduce the effect of a defective area on
`the device. Therefore, the system is preferably
`provided with redundant PEs, so that if one die area is
`faulty, another can be used in its place.
`
In particular, for a group of processing elements used
`for processing data, additional redundant processing
`elements can be manufactured. In one particular
`example, the processing elements are provided in
`"panels" of 32 PEs. For each panel a redundant PE is
provided, so that a defect in one of the PEs of the
`panel can be overcome by using the redundant PE for
`processing of data. This will be described in more
`detail below.
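
The panel redundancy scheme can be sketched as follows (purely
illustrative; the mapping logic and names are assumptions, not taken
from this application):

    # A panel has 32 logical PEs backed by 33 physical PEs; if one
    # physical PE is defective, the spare takes its place.
    PANEL_SIZE = 32

    def physical_map(defective=None):
        physical = [pe for pe in range(PANEL_SIZE + 1) if pe != defective]
        return physical[:PANEL_SIZE]      # 32 usable PEs either way

    print(physical_map())                 # spare (index 32) left unused
    print(physical_map(defective=7))      # PE 7 mapped out, spare used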
`
`5
`
`THREAD MANAGER
`
`The array of processing elements is controlled to carry
`out a series of instructions in an instruction stream.
`Such instruction streams for the processing blocks 106
`are known as "threads". Each thread works co-
`operatively with other threads to perform a task or
`tasks. The term "multithreading" refers to the use of
`several threads to perform a single task, whereas the
`term "multitasking" refers to the use of several
`threads to perform multiple tasks simultaneously. It
`is the thread manager 102 which manages these
`instruction streams or threads.
`
`There are several reasons for providing multiple
`threads in such a data processing architecture. The
`processing element array can be kept active, by
`processing another thread when the current active
`thread is halted. The threads can be assigned to any
`task as required. For example, by assigning a
`plurality of threads for handling data I/O operations
`for transferring data to and from memory, these
`operations can be performed more efficiently, by
`overlapping I/O operations with processing operations.
The latency of the memory I/O operations can
effectively be masked from the system by the use of
different threads.
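
A minimal sketch of this latency masking (illustrative only; the
cooperative threads below are an assumption, not the thread manager's
actual mechanism):

    # While one thread is blocked on a memory transfer, another thread
    # keeps the array busy; the I/O latency hides behind computation.
    def compute_thread():
        for step in range(3):
            yield f"compute step {step}"

    def io_thread():
        yield "issue memory read (thread now waits)"
        yield "memory data returned"

    def scheduler(threads):
        while threads:
            thread = threads.pop(0)          # simple round robin
            try:
                print(next(thread))
                threads.append(thread)
            except StopIteration:
                pass

    scheduler([compute_thread(), io_thread()])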
`
In addition, the system can have a faster response time
to external events. Particular threads can be assigned
to wait on different external events, so that when an
event happens, it can be handled immediately.
`
`-11-
`
`The thread manager 102 is shown in more detail in
`Figure 4, and comprises a cache memory unit 1024 for
`storing instructions fetched for each thread. The
`cache unit 1024 could be replaced by a series of first-
`in-first-out (FIFO) buffers, one per thread. The
`thread manager also includes an instruction fetch unit
`1023, a thread scheduler 1025, thread processors 1026,
`a semaphore controller 1028 and a status block 1030.
`
`Instructions for a thread are fetched from local memory
`or the EPU 8 by the fetch unit 1023, and supplied to
`the cache memory 1024 via connecting logic.
`
`The threads are assigned priorities relative to one
`another. Of course, although the example described
`here has eight threads, any number of threads can be
`controlled in this manner. At any particular moment in
`time, each thread may be assigned to any one of a
`number of tasks. For example, thread zero may be
`assigned for general system control, thread 1 assigned
`to execute 2D (two dimensional) activities, and threads
`2 to 7 assigned to executing 3D activities (such as
`calculating vertices, primitives or rastering).
`
`In the example shown in Figure 4, the thread manager
`includes one thread processor 1026 for each thread.
`The thread processors 1026 control the issuance of core
`instructions from the thread manager so as to maintain
`processing of simultaneously active program threads, so
that each of the processing blocks 106 can be active for
`as much time as possible. In this particular example
`the same instruction stream is supplied to all of the
`processing blocks in the system.
`
`5
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`LG Ex. 1004, pg 13
`
`
`
`WO 00/62182
`
`PCT/GB00/01332
`
`-12-
`
`It will be appreciated that the number of threads could
`exceed the number of thread processors, so that each
`thread processor handles control of more than one
`thread. However, providing a thread processor for each
`thread reduces the need for context switching when
`changing the active thread, thereby reducing memory
`accesses and hence increasing the speed of operation.
`
`The semaphore controller 1028 operates to synchronise
the threads with one another.
`
`Within the thread manager 102, the status block 1030
`receives status information 1036 from each of the
`threads. The status information is transferred to the
`thread scheduler 1025 by the status block 1030. The
`status information is used by the thread scheduler 1025
`to determine which thread should be active at any one
`time.
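
The scheduling decision can be sketched as follows (illustrative only;
the status fields and priority rule are assumptions based on the
description above):

    # Activate the highest-priority thread whose status reports it
    # as runnable; blocked threads are passed over.
    threads = [
        {"id": 0, "priority": 3, "runnable": False},  # waiting on event
        {"id": 1, "priority": 2, "runnable": True},
        {"id": 2, "priority": 1, "runnable": True},
    ]

    def pick_active(threads):
        runnable = [t for t in threads if t["runnable"]]
        return max(runnable, key=lambda t: t["priority"]) if runnable else None

    print(pick_active(threads)["id"])   # -> 1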
`
`Core instructions 1032 issued by the thread manager 102
`are sent to the array controller 104 and the channel
controller 108 (Figure 3).
`
`ARRAY CONTROLLER
`
`The array controller 104 directs the operation of the
`processing block 106, and is shown in greater detail in
`Figure 5.
`
`The array controller 104 comprises an instruction
`launcher 1041, connected to receive instructions from
`the thread manager. The instruction launcher 1041
`indexes an instruction table 1042, which provides
`further specific instruction information to the
`instruction launcher.
`
`5
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`LG Ex. 1004, pg 14
`
`
`
`WO 00/62182
`
`PCT/GB00/01332
`
`-13-
`
`On the basis of the further instruction information,
`the instruction launcher directs instruction
`information to either a PE instruction sequencer 1044
`or a load/store controller 1045. The PE instruction
`sequencer receives instruction information relating to
`data processing, and the load/store controller receives
`information relating to data transfer operations.
`
`The PE instruction sequencer 1044 uses received
`instruction information to index a PE microcode store
`105, for transferring PE microcode instructions to the
`PEs in the processing array.
`
`The array controller also includes a scoreboard unit
`1046 which is used to store information regarding the
`use of PE registers by particular active instructions.
The scoreboard unit 1046 is functionally divided so as
`to provide information regarding the use of registers
`by instructions transmitted by the PE instruction
`sequencer 1044 and the load/store controller 1045
`respectively.
`
`In general terms, the PE instruction sequencer 1044
`handles instructions that involve data processing in
`the processor unit 1061a. The load/store controller
`1045, on the other hand, handles instructions that
`involve data transfer between the registers of the
`processor unit 1061a and the PE memory unit 1061c. The
`load/store controller 1045 will be described in greater
`detail later.
`
The instruction launcher 1041 and the scoreboard unit
1046 maintain the appearance of serial instruction
execution whilst achieving parallel operation between
the PE instruction sequencer 1044 and the load/store
controller 1045.
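
A minimal sketch of such scoreboarding (illustrative only; the
register-set representation is an assumption, not the unit's actual
design):

    # An instruction may launch only if no earlier, still-pending
    # instruction on either path has claimed a register it touches.
    in_use = set()

    def try_launch(regs):
        if in_use & regs:                # conflict with pending work
            return False                 # stall this instruction
        in_use.update(regs)              # claim the registers
        return True

    def retire(regs):
        in_use.difference_update(regs)   # release on completion

    print(try_launch({"r1", "r2"}))      # PE-sequencer op issues -> True
    print(try_launch({"r2", "r3"}))      # load/store op stalls on r2 -> False
    retire({"r1", "r2"})
    print(try_launch({"r2", "r3"}))      # now free to launch -> True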
`
`-14-
`
`The remaining core instructions 1032 issued from the
`thread manager 102 are fed to the channel controller
`108. This controls transfer of data between the PE
`memory units and external memory (either local memory
`or system memory in AGP or PCI space).
`
`CHANNEL CONTROLLER
`
`The channel controller 108 operates asynchronously with
`respect to the execution of instructions by the array
`controller 104. This allows computation and external
`I/O to be performed simultaneously and overlapped as
`much as possible. Computation (PE) operations are
`synchronised with I/O operations by means of semaphores
`in the thread manager, as will be explained in more
`detail below.
`
`The channel controller 108 also controls the binning
`units 1068 which are associated with respective
`processing blocks 106. This is accomplished by way of
`channel controller instructions.
`
`Figure 6 shows the channel controller's instruction
`issue state machine, which lies at the heart of the
`channel controller's operation, and which will be
`described in greater detail later.
`
Each binning unit 1068 (Figure 3) is connected to the
I/O channels of its associated processing block 106.
The purpose of the binning unit 1068 is to sort
primitive data by region, since the data is generally
not provided by the host system in the correct order
for region-based processing.
`
`5
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`LG Ex. 1004, pg 16
`
`
`
`WO 00/62182
`
`PCT/GB00/01332
`
`-15--
`
The binning units 1068 provide a hardware-implemented
region sorting system (shown in Figure 7), which
`removes the sorting process from the processing
`elements, thereby releasing the PEs for data
`processing.
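
The region sorting performed by the binning units can be sketched as
follows (illustrative only; the region size and the primitive
representation are assumptions, not taken from this application):

    # Primitives arrive in arbitrary order and are sorted into bins,
    # one bin per screen region, ready for region-based processing.
    def bin_primitives(primitives, region_size=32):
        bins = {}
        for prim in primitives:
            region = (prim["x"] // region_size, prim["y"] // region_size)
            bins.setdefault(region, []).append(prim["id"])
        return bins

    prims = [{"id": "t0", "x": 5,  "y": 5},
             {"id": "t1", "x": 70, "y": 5},
             {"id": "t2", "x": 6,  "y": 7}]
    print(bin_primitives(prims))   # t0 and t2 share a region bin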
`
`MEMORY ACCESS CONSOLIDATION
`
`In a computer system having a large number of elements
`which require access to a single memory, or other
`addressed device, there can be a significant reduction
`in processing speed if accesses to the storage device
`are performed serially for each element.
`
`The graphics system described above is one example of
`such a system. There are a large number of processor
`elements, each of which requires access to data in the
`local memory of the system. Since the number of
`elements requiring memory access exceeds the number of
`memory accesses that can be made at any one time,
accesses to the local and system memory involve serial
`operation. Thus, performing memory access for each
`element individually would cause a degradation in the
`speed of operation of the processing block.
`
`In order to reduce the effect of this problem on the
`speed of processing of the system, the system of
`Figures 1 and 2 includes a memory access consolidating
`function.
`
`The memory access consolidation is also described below
with reference to Figures 12 and 13. In general,
`however, the processing elements that require access to
memory indicate that this is the case by setting an
`indication flag or mark bit. The first such marked PE
`is then selected, and the memory address to which it
`requires access is transmitted to all of the processing
`elements of the processing block. The address is
`transmitted with a corresponding transaction ID. Those
processing elements which require access (i.e. have the
`indication flag set) compare the transmitted address
`with the address to which they require access, and if
`the comparison indicates that the same address is to be
`accessed, those processing elements register the
`transaction ID for that memory access and clear the
`indication flag.
`
`When the transaction ID is returned to the processing
`block, the processing elements compare the stored
`transaction ID with the incoming transaction ID, in
`order to recover the data.
`
`Using transaction IDs in place of simply storing the
`accessed address information enables multiple memory
accesses to be carried out, and then returned in any order.
`Such a "fire and forget" method of recovering data can
`free up processor time, since the processors do not
`have to await return of data before continuing
processing steps. In addition, the use of transaction
IDs reduces the amount of information that must be
`stored by the processing elements to identify the data
`recovery transaction. Address information is generally
`of larger size than transaction ID information.
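
The consolidation scheme described above can be sketched as follows
(illustrative only; the data structures are assumptions, not the
hardware's implementation):

    # The first marked PE's address is broadcast with a transaction ID;
    # every PE wanting that address records the ID and clears its mark,
    # so a single memory access serves all of them.
    def consolidate(requests):              # {pe: address} of marked PEs
        marked = dict(requests)
        issued, pending, txn = {}, {}, 0
        while marked:
            addr = next(iter(marked.values()))   # first marked PE's address
            issued[txn] = addr                   # one shared access
            for pe in [p for p, a in marked.items() if a == addr]:
                pending[pe] = txn                # register the txn ID
                del marked[pe]                   # clear the mark bit
            txn += 1
        return issued, pending

    issued, pending = consolidate({0: 0x100, 1: 0x100, 2: 0x140, 3: 0x100})
    print(issued)    # 2 accesses instead of 4
    print(pending)   # each PE matches its data by the returning txn ID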
`
`Preferably, each memory address can store more data
`than the PEs require access to. Thus, a plurality of
`PEs can require access to the same memory address, even
`though they do not require access to the same data.
This arrangement can further reduce the number of
`memory accesses required by the system, by providing a
`hierarchical consolidation technique. For example,
`each memory address may store four quad bytes of data,
`with each PE requiring one quad byte at any one access.
`
`This technique can also allow memory write access
`consolidation for those PEs that require write access
`to different portions of the same memory address.
`
`In this way the system can reduce the number of memory
`accesses required for a processing block, and hence
`increase the speed of operation of the processing
`block.
`
`The indication flag can also be used in another
`technique for writing data to memory. In such a
`technique, the PEs having data to be written to memory
`signal this fact by setting the indication flag. Data
`is written to memory addresses for each of those PEs in
`order, starting at a base address, and stepped at a
`predetermined spacing in memory. For example, if the
`step size is set to one, then consecutive addresses are
`written with data from the flagged PEs.
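
This flagged-write technique can be sketched as follows (illustrative
only; the names and the example base address are assumptions):

    # Data from each flagged PE is written at consecutive addresses,
    # starting from a base and advancing by a fixed step.
    def flagged_write(pes, base, step=1):
        writes, addr = [], base
        for pe in pes:
            if pe["flag"]:                    # only flagged PEs take part
                writes.append((hex(addr), pe["data"]))
                addr += step                  # step to the next address
        return writes

    pes = [{"flag": True,  "data": "a"},
           {"flag": False, "data": "b"},
           {"flag": True,  "data": "c"}]
    print(flagged_write(pes, base=0x200))  # [('0x200','a'), ('0x201','c')]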
`
`PROCESSING BLOCKS
`
`One of the processing blocks 106 is shown in more
`detail in Figure 8. The processing block 106 includes
`an array of processor elements 1061 which are arranged
to operate in parallel on respective data items, but
`carrying out the same instruction (SIMD). Each
`processor element 1061 includes a processor unit 1061a,
`a PE register file 1061b and a PE memory unit 1061c.
The PE memory unit 1061c is used to store data items
`for processing by the processor unit 1061a. Each
`processor unit 1061a can transfer data to and from its
`PE memory unit 1061c via the PE register file 1061b.
`The processor unit 1061a also uses the PE register file
`1061b to store data which is being processed. Transfer
`of data items between the processor unit 1061a and the
`memory unit 1061c is controlled by the ar