PCT    WORLD INTELLECTUAL PROPERTY ORGANIZATION
       International Bureau

INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)

(51) International Patent Classification 7: G06F 15/76

(11) International Publication Number: WO 00/62182 A2

(43) International Publication Date: 19 October 2000 (19.10.00)

(21) International Application Number: PCT/GB00/01332

(22) International Filing Date: 7 April 2000 (07.04.00)

(30) Priority Data:
     9908199.4   9 April 1999 (09.04.99)   GB
     9908201.8   9 April 1999 (09.04.99)   GB
     9908203.4   9 April 1999 (09.04.99)   GB
     9908204.2   9 April 1999 (09.04.99)   GB
     9908205.9   9 April 1999 (09.04.99)   GB
     9908209.1   9 April 1999 (09.04.99)   GB
     9908211.7   9 April 1999 (09.04.99)   GB
     9908214.1   9 April 1999 (09.04.99)   GB
     9908219.0   9 April 1999 (09.04.99)   GB
     9908222.4   9 April 1999 (09.04.99)   GB
     9908225.7   9 April 1999 (09.04.99)   GB
     9908226.5   9 April 1999 (09.04.99)   GB
     9908227.3   9 April 1999 (09.04.99)   GB
     9908228.1   9 April 1999 (09.04.99)   GB
     9908229.9   9 April 1999 (09.04.99)   GB
     9908230.7   9 April 1999 (09.04.99)   GB

(71) Applicant (for all designated States except US): PIXELFUSION LIMITED [GB/GB]; 2440 The Quadrant, Aztec West, Almondsbury, Bristol BS32 4AQ (GB).

(72) Inventors; and
(75) Inventors/Applicants (for US only): STUTTARD, Dave [GB/GB]; 28 St Georges Avenue, St George, Bristol BS5 8DD (GB). WILLIAMS, Dave [GB/GB]; 45 Railton Jones Close, Stoke Gifford, Gloucestershire BS34 8XY (GB). O'DEA, Eamon [IE/GB]; 32 Richmond Park Road, Clifton, Bristol BS8 3AP (GB). FAULDS, Gordon [GB/GB]; 21 High Furlong, Cam, Nr Dursley, Gloucestershire GL11 5UZ (GB). RHODES, John [GB/GB]; 40 Ormonds Close, Bradley Stoke, Bristol BS32 0DX (GB). CAMERON, Ken [GB/GB]; 29 Little Meadow, Bristol BS32 8AT (GB). ATKIN, Phil [GB/GB]; 2440 The Quadrant, Aztec West, Almondsbury, Bristol BS32 4AQ (GB). WINSER, Paul [GB/GB]; 8 Oakleigh House, Bridge Road, Leigh Woods, Bristol BS8 3PB (GB). DAVID, Russell [GB/GB]; 8 Rowan Drive, Wootton Bassett, Wiltshire SN4 7ES (GB). McCONNELL, Ray [GB/GB]; Flat 11, 1 Clifton Road, Clifton, Bristol BS8 1AE (GB). DAY, Tim [GB/GB]; 33 Rosebery Avenue, St Werburghs, Bristol BS2 9TW (GB). GREER, Trey [GB/GB]; 330 Sunset Creek Circle, Chapel Hill, NC 27516 (US).

(74) Agent: VIGARS, Christopher, Ian; Haseltine Lake & Co, Imperial House, 15-19 Kingsway, London WC2B 6UD (GB).

(81) Designated States: AE, AG, AL, AM, AT, AU, AZ, BA, BB, BG, BR, BY, CA, CH, CN, CR, CU, CZ, DE, DK, DM, DZ, EE, ES, FI, GB, GD, GE, GH, GM, HR, HU, ID, IL, IN, IS, JP, KE, KG, KP, KR, KZ, LC, LK, LR, LS, LT, LU, LV, MA, MD, MG, MK, MN, MW, MX, NO, NZ, PL, PT, RO, RU, SD, SE, SG, SI, SK, SL, TJ, TM, TR, TT, TZ, UA, UG, US, UZ, VN, YU, ZA, ZW, ARIPO patent (GH, GM, KE, LS, MW, SD, SL, SZ, TZ, UG, ZW), Eurasian patent (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM), European patent (AT, BE, CH, CY, DE, DK, ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE), OAPI patent (BF, BJ, CF, CG, CI, CM, GA, GN, GW, ML, MR, NE, SN, TD, TG).

Published
Without international search report and to be republished upon receipt of that report.

(54) Title: PARALLEL DATA PROCESSING APPARATUS

(57) Abstract

A data processing apparatus comprises a SIMD (single instruction multiple data) array (10) of processing elements. The processing elements are operably divided into a plurality of processing blocks, the processing blocks being operable to process respective groups of data items.

[Front-page figure: block diagram of the graphics data processing system 3, showing the host system, interface 2, processing core 10, EPU 8, video output 14, local memory 12, and buses 4 and 6.]

FOR THE PURPOSES OF INFORMATION ONLY

Codes used to identify States party to the PCT on the front pages of pamphlets publishing international applications under the PCT.

AL  Albania
AM  Armenia
AT  Austria
AU  Australia
AZ  Azerbaijan
BA  Bosnia and Herzegovina
BB  Barbados
BE  Belgium
BF  Burkina Faso
BG  Bulgaria
BJ  Benin
BR  Brazil
BY  Belarus
CA  Canada
CF  Central African Republic
CG  Congo
CH  Switzerland
CI  Cote d'Ivoire
CM  Cameroon
CN  China
CU  Cuba
CZ  Czech Republic
DE  Germany
DK  Denmark
EE  Estonia
ES  Spain
FI  Finland
FR  France
GA  Gabon
GB  United Kingdom
GE  Georgia
GH  Ghana
GN  Guinea
GR  Greece
HU  Hungary
IE  Ireland
IL  Israel
IS  Iceland
IT  Italy
JP  Japan
KE  Kenya
KG  Kyrgyzstan
KP  Democratic People's Republic of Korea
KR  Republic of Korea
KZ  Kazakstan
LC  Saint Lucia
LI  Liechtenstein
LK  Sri Lanka
LR  Liberia
LS  Lesotho
LT  Lithuania
LU  Luxembourg
LV  Latvia
MC  Monaco
MD  Republic of Moldova
MG  Madagascar
MK  The former Yugoslav Republic of Macedonia
ML  Mali
MN  Mongolia
MR  Mauritania
MW  Malawi
MX  Mexico
NE  Niger
NL  Netherlands
NO  Norway
NZ  New Zealand
PL  Poland
PT  Portugal
RO  Romania
RU  Russian Federation
SD  Sudan
SE  Sweden
SG  Singapore
SI  Slovenia
SK  Slovakia
SN  Senegal
SZ  Swaziland
TD  Chad
TG  Togo
TJ  Tajikistan
TM  Turkmenistan
TR  Turkey
TT  Trinidad and Tobago
UA  Ukraine
UG  Uganda
US  United States of America
UZ  Uzbekistan
VN  Viet Nam
YU  Yugoslavia
ZW  Zimbabwe

PARALLEL DATA PROCESSING APPARATUS

The present invention relates to parallel data processing apparatus, and in particular to SIMD (single instruction multiple data) processing apparatus.

BACKGROUND OF THE INVENTION

Increasingly, data processing systems are required to process large amounts of data. In addition, users of such systems are demanding that the speed of data processing is increased. One particular example of the need for high-speed processing of massive amounts of data is in the computer graphics field. In computer graphics, large amounts of data are produced that relate to, for example, the geometry, texture, and colour of objects and shapes to be displayed on a screen. Users of computer graphics are increasingly demanding more lifelike and faster graphical displays, which increases the amount of data to be processed and the speed at which the data must be processed.

A previously proposed processing architecture for processing large amounts of data in a computer system uses a Single Instruction Multiple Data (SIMD) array of processing elements. In such an array, all of the processing elements receive the same instruction stream but operate on different respective data items. Such an architecture can thereby process data in parallel, but without the need to produce parallel instruction streams. This can be an efficient and relatively simple way of obtaining good performance from a parallel processing machine.

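By way of illustration only (this sketch is not part of the application; names such as NUM_PES and simd_step are invented), the SIMD arrangement can be modelled in C as a single broadcast instruction applied by every processing element to its own data item:

    #include <stdint.h>

    #define NUM_PES 64                      /* hypothetical array size */

    typedef enum { ADD_IMM, SHL_IMM } opcode_t;   /* the broadcast instruction */

    /* Every PE executes the SAME instruction, each on its own data item. */
    static void simd_step(uint32_t pe_data[NUM_PES], opcode_t op, uint32_t imm)
    {
        for (int pe = 0; pe < NUM_PES; pe++) {    /* conceptually parallel */
            switch (op) {
            case ADD_IMM: pe_data[pe] += imm;  break;
            case SHL_IMM: pe_data[pe] <<= imm; break;
            }
        }
    }

A serial loop stands in here for what the hardware performs simultaneously across the array.
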
However, the SIMD architecture can be inefficient when a system has to process a large number of relatively small groups of data items. For example, for a SIMD array processing data relating to a graphical display screen, only relatively few processing elements of the array will be enabled to process data relating to a small graphical primitive such as a triangle. In that case, a large proportion of the processing elements may remain unused while data is being processed for a particular group.

It is therefore desirable to produce a system which can overcome or alleviate this problem.

SUMMARY OF THE INVENTION

According to one aspect of the present invention, there is provided a data processing apparatus comprising a SIMD (single instruction multiple data) array of processing elements, wherein the processing elements are operably divided into a plurality of processing blocks, the processing blocks being operable to process respective groups of data items.

According to another aspect of the present invention, there is provided a data processing apparatus comprising an array of processing elements, which are operable to process respective data items in accordance with a common received instruction, wherein the processing elements are operably divided into a plurality of processing blocks each having at least one processing element, the processing blocks being operable to process respective groups of data items.

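As a rough model only (block and group sizes here are invented, not prescribed by the application), the division can be pictured as follows: each processing block receives the common instruction stream, but its PEs are enabled only for the group of data items assigned to that block.

    #define NUM_PES       256               /* hypothetical */
    #define NUM_BLOCKS    8                 /* e.g. eight blocks, as in Figure 2 */
    #define PES_PER_BLOCK (NUM_PES / NUM_BLOCKS)

    /* Enable mask: PE p participates only if it belongs to block b and
     * there is a data item of that block's group for it to work on. */
    static int pe_enabled(int p, int b, int group_size)
    {
        int first = b * PES_PER_BLOCK;
        return p >= first && p < first + PES_PER_BLOCK &&
               (p - first) < group_size;
    }

Several small groups can then occupy different blocks at once, instead of a single small group leaving most of one large array idle.
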
Various further aspects of the present invention are exemplified by the attached claims.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a block diagram illustrating a graphics data processing system;
Figure 2 is a more detailed block diagram illustrating the graphics data processing system of Figure 1;
Figure 3 is a block diagram of a processing core of the system of Figure 2;
Figure 4 is a block diagram of a thread manager of the system of Figure 3;
Figure 5 is a block diagram of an array controller of the system of Figure 3;
Figure 6 is a block diagram of an instruction issue state machine of the channel controller of Figure 3;
Figure 7 is a block diagram of a binning unit of the system of Figure 3;
Figure 8 is a block diagram of a processing block of the system of Figure 3;
Figure 9 is a flowchart illustrating data processing using the system of Figures 1 to 8;
Figure 10 is a more detailed block diagram of a thread processor of the thread manager of Figure 4;
Figure 11 is a block diagram of a processor unit of the processing block of Figure 8;
Figure 12 is a block diagram illustrating a processing element interface;
Figure 13 is a block diagram illustrating a block I/O interface;
Figure 14 is a block diagram of part of the processor unit of Figure 11; and
Figure 15 is a block diagram of another part of the processor unit of Figure 11.

DESCRIPTION OF THE PREFERRED EMBODIMENT

The data processing system described below is a graphics data processing system for producing graphics images for display on a screen. However, this embodiment is purely exemplary, and it will be readily apparent that the techniques and architecture described here for processing graphical data are equally applicable to other data types, such as video data. The system is of course applicable to other signal and/or data processing techniques and systems.

An overview of the system will be given, followed by brief descriptions of the various functional units of the system. A graphics processing method will then be described by way of example, followed by a detailed description of the functional units.

OVERVIEW

Figure 1 is a system-level block diagram illustrating a graphics data processing system 3. The system 3 interfaces with a host system (not shown), such as a personal computer or workstation, via an interface 2. Such a system can be provided with an embedded processor unit (EPU) for control purposes. For example, the specific graphics system 3 includes an embedded processing unit (EPU) 8 for controlling the overall function of the graphics processor and for interfacing with the host system. The system includes a processing core 10 which processes the graphical data for output to the display screen via a video output interface 14. Local memory 12 is provided for the graphics system 3.

Such a data processing system can be connected for operation to a host system, or could provide a stand-alone processing system without the need for a specific host system. Examples of such applications include a "set top box" for receiving and decoding digital television and internet signals.

Figure 2 illustrates the graphics processing system in more detail. In one particular example, the graphics system connects to the host system via an advanced graphics port (AGP) or PCI interface 2. The PCI interface and AGP 2 are well known.

The host system can be any type of computer system, for example a PC 99 specification personal computer or a workstation.

The AGP 2 provides a high-bandwidth path from the graphics system to host system memory. This allows large texture databases to be held in the host system memory, which is generally larger than the local memory associated with the graphics system. The AGP also provides a mechanism for mapping memory between a linear address space on the graphics system and a number of potentially scattered memory blocks in the host system memory. This mechanism is performed by a graphics address re-mapping table (GART), as is well known.

The graphics system described below is preferably implemented as a single integrated circuit which provides all of the functions shown in Figure 1. However, it will be readily apparent that the system may be provided as a separate circuit card carrying several different components, or as a separate chipset provided on the motherboard of the host, or integrated with the host central processing unit (CPU), or in any suitable combination of these and other implementations.

The graphics system includes several functional units which are connected to one another for the transfer of data by way of a dedicated bus system. The bus system preferably includes a primary bus 4 and a secondary bus 6. The primary bus is used for connection of latency-intolerant devices, and the secondary bus is used for connection of latency-tolerant devices. The bus architecture is preferably as described in detail in the Applicant's co-pending UK patent applications, particularly GB 9820430.8. It will be readily appreciated that any number of primary and secondary buses can be provided in the bus architecture in the system. The specific system shown in Figure 2 includes two secondary buses.

Referring mainly to Figure 2, access to the primary bus 4 is controlled by a primary arbiter 41, and access to the secondary buses 6 by a pair of secondary arbiters 61. Preferably, all data transfers are in packets of 32 bytes each. The secondary buses 6 are connected with the primary bus 4 by way of respective interface units (SIP) 62.

An auxiliary control bus 7 is provided in order to enable control signals to be communicated to the various units in the system.

The AGP/PCI interface is connected to the graphics system by way of the secondary buses 6. This interface can be connected to any selection of the secondary buses; in the example shown, to both secondary buses 6.

The graphics system also includes an embedded processing unit (EPU) 8 which is used to control operation of the graphics system and to communicate with the host system. The host system has direct access to the EPU 8 by way of a direct host access interface 9 in the AGP/PCI interface 2. The EPU is connected to the primary bus 4 by way of a bus interface unit (EPU FBI) 90.

Also connected to the primary bus is a local memory system 12. The local memory system 12 includes a number, in this example four, of memory interface units 121 which are used to communicate with the local memory itself. The local memory is used to store various information for use by the graphics system.

The system also includes a video interface unit 14 which comprises the hardware needed to interface the graphics system to the display screen (not shown), and to other devices for the exchange of data, which may include video data. The video interface unit is connected to the secondary buses 6 via bus interface units (FBI).

The graphics processing capability of the system is provided by a processing core 10. The core 10 is connected to the secondary buses 6 for the transfer of data, and to the primary bus 4 for the transfer of instructions. As will be explained in more detail below, the secondary bus connections are made by a core bus interface (Core FBI) 107 and a binner bus interface (Binner FBI) 111, and the primary bus connection is made by a thread manager bus interface (Thread Manager FBI) 103.

As will be explained in greater detail below, the processing core 10 includes a number of control units: a thread manager 102, an array controller 104, a channel controller 108, a binning unit 1068 per block, and a microcode store 105. These control units control the operation of a number of processing blocks 106 which perform the graphics processing itself.

In the example shown in Figure 2, the processing core 10 is provided with eight processing blocks 106. It will be readily appreciated that any number of processing blocks can be provided in a graphics system using this architecture.

PROCESSING CORE

Figure 3 shows the processing core in more detail. The thread manager 102 is connected to receive control signals from the EPU 8. The control signals inform the thread manager as to when instructions are to be fetched and where the instructions are to be found. The thread manager 102 is connected to provide these instructions to the array controller 104 and to the channel controller 108. The array and channel controllers 104 and 108 are connected to transfer control signals to the processing blocks 106 dependent upon the received instructions.

Each processing block 106 comprises an array 1061 of processor elements (PEs) and a mathematical expression evaluator (MEE) 1062. As will be described in more detail below, a path 1064 for MEE coefficient feedback is provided from the PE memory, as is an input/output channel 1067. Each processing block includes a binning unit 1068 and a transfer engine 1069 for controlling data transfers to and from the input/output channel under instruction from the channel controller 108.

The array 1061 of processor elements provides a single instruction multiple data (SIMD) processing structure. Each PE in the array 1061 is supplied with the same instruction, which is used to process data specific to the PE concerned.

Each processing element (PE) 1061 includes a processor unit 1061a for carrying out the instructions received from the array controller, a PE memory unit 1061c for storing data for use by the processor unit 1061a, and a PE register file 1061b through which data is transferred between the processor unit 1061a and the PE memory unit 1061c. The PE register file 1061b is also used by the processor unit 1061a for temporarily storing data that is being processed by the processor unit 1061a.

The provision of a large number of processor elements can result in a large die size when the device is manufactured in silicon. Accordingly, it is desirable to reduce the effect of a defective area on the device. Therefore, the system is preferably provided with redundant PEs, so that if one die area is faulty, another can be used in its place.

In particular, for a group of processing elements used for processing data, additional redundant processing elements can be manufactured. In one particular example, the processing elements are provided in "panels" of 32 PEs. For each panel a redundant PE is provided, so that a defect in one of the PEs of the panel can be overcome by using the redundant PE for processing of data. This will be described in more detail below.

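A minimal sketch of such remapping, assuming one spare PE per panel of 32 and a single known-defective position (the concrete mechanism is not given at this point in the text and would typically be configuration- or fuse-driven):

    #define PANEL_SIZE 32                   /* 32 PEs per panel, plus one spare */

    /* Map a logical PE index (0..31) onto a physical position (0..32),
     * skipping the defective PE; defect < 0 means the panel is fully good. */
    static int physical_pe(int logical, int defect)
    {
        if (defect < 0 || logical < defect)
            return logical;                 /* unaffected PEs keep their place */
        return logical + 1;                 /* later PEs shift into the spare */
    }
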
THREAD MANAGER

The array of processing elements is controlled to carry out a series of instructions in an instruction stream. Such instruction streams for the processing blocks 106 are known as "threads". Each thread works co-operatively with other threads to perform a task or tasks. The term "multithreading" refers to the use of several threads to perform a single task, whereas the term "multitasking" refers to the use of several threads to perform multiple tasks simultaneously. It is the thread manager 102 which manages these instruction streams or threads.

There are several reasons for providing multiple threads in such a data processing architecture. The processing element array can be kept active by processing another thread when the currently active thread is halted. The threads can be assigned to any task as required. For example, by assigning a plurality of threads for handling data I/O operations for transferring data to and from memory, these operations can be performed more efficiently, by overlapping I/O operations with processing operations. The latency of the memory I/O operations can effectively be masked from the system by the use of different threads.

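As an illustrative sketch only (the thread states and the selection rule are our assumptions, not the application's), latency masking amounts to always having another runnable thread to switch to when the active one blocks on I/O:

    #define NUM_THREADS 8

    enum thread_state { RUNNABLE, WAITING_IO, HALTED };

    /* Pick a runnable thread, treating lower indices as higher priority.
     * While one thread waits on a memory transfer, another keeps the
     * PE array busy, hiding the I/O latency. */
    static int pick_active_thread(const enum thread_state s[NUM_THREADS])
    {
        for (int t = 0; t < NUM_THREADS; t++)
            if (s[t] == RUNNABLE)
                return t;
        return -1;                          /* nothing ready: array idles */
    }
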
In addition, the system can have a faster response time to external events. Particular threads can be assigned to wait on different external events, so that when an event happens it can be handled immediately.

The thread manager 102 is shown in more detail in Figure 4, and comprises a cache memory unit 1024 for storing instructions fetched for each thread. The cache unit 1024 could be replaced by a series of first-in-first-out (FIFO) buffers, one per thread. The thread manager also includes an instruction fetch unit 1023, a thread scheduler 1025, thread processors 1026, a semaphore controller 1028 and a status block 1030.

Instructions for a thread are fetched from local memory or the EPU 8 by the fetch unit 1023, and supplied to the cache memory 1024 via connecting logic.

The threads are assigned priorities relative to one another. Although the example described here has eight threads, any number of threads can be controlled in this manner. At any particular moment in time, each thread may be assigned to any one of a number of tasks. For example, thread zero may be assigned for general system control, thread 1 assigned to execute 2D (two-dimensional) activities, and threads 2 to 7 assigned to executing 3D activities (such as calculating vertices, primitives or rastering).

In the example shown in Figure 4, the thread manager includes one thread processor 1026 for each thread. The thread processors 1026 control the issuance of core instructions from the thread manager so as to maintain processing of simultaneously active program threads, so that each of the processing blocks 106 can be active for as much time as possible. In this particular example the same instruction stream is supplied to all of the processing blocks in the system.

It will be appreciated that the number of threads could exceed the number of thread processors, so that each thread processor handles control of more than one thread. However, providing a thread processor for each thread reduces the need for context switching when changing the active thread, thereby reducing memory accesses and hence increasing the speed of operation.

The semaphore controller 1028 operates to synchronise the threads with one another.

Within the thread manager 102, the status block 1030 receives status information 1036 from each of the threads. The status information is transferred to the thread scheduler 1025 by the status block 1030. The status information is used by the thread scheduler 1025 to determine which thread should be active at any one time.

Core instructions 1032 issued by the thread manager 102 are sent to the array controller 104 and the channel controller 108 (Figure 3).

ARRAY CONTROLLER

The array controller 104 directs the operation of the processing block 106, and is shown in greater detail in Figure 5.

The array controller 104 comprises an instruction launcher 1041, connected to receive instructions from the thread manager. The instruction launcher 1041 indexes an instruction table 1042, which provides further specific instruction information to the instruction launcher.

On the basis of the further instruction information, the instruction launcher directs instruction information to either a PE instruction sequencer 1044 or a load/store controller 1045. The PE instruction sequencer receives instruction information relating to data processing, and the load/store controller receives information relating to data transfer operations.

The PE instruction sequencer 1044 uses received instruction information to index a PE microcode store 105, for transferring PE microcode instructions to the PEs in the processing array.

The array controller also includes a scoreboard unit 1046 which is used to store information regarding the use of PE registers by particular active instructions. The scoreboard unit 1046 is functionally divided so as to provide information regarding the use of registers by instructions transmitted by the PE instruction sequencer 1044 and the load/store controller 1045 respectively.

In general terms, the PE instruction sequencer 1044 handles instructions that involve data processing in the processor unit 1061a. The load/store controller 1045, on the other hand, handles instructions that involve data transfer between the registers of the processor unit 1061a and the PE memory unit 1061c. The load/store controller 1045 will be described in greater detail later.

The instruction launcher 1041 and the scoreboard unit 1046 maintain the appearance of serial instruction execution whilst achieving parallel operation between the PE instruction sequencer 1044 and the load/store controller 1045.

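In outline (the bitmask encoding and function names below are invented for illustration), a register scoreboard of this kind lets an instruction issue only when none of its registers have results outstanding:

    #include <stdint.h>
    #include <stdbool.h>

    static uint32_t busy;                   /* bit r set => register r pending */

    /* Issue only if no source or destination register is still in flight,
     * preserving the appearance of serial execution across the PE
     * instruction sequencer and the load/store controller. */
    static bool can_issue(uint32_t src_mask, uint32_t dst_mask)
    {
        return (busy & (src_mask | dst_mask)) == 0;
    }

    static void issue(uint32_t dst_mask)  { busy |= dst_mask; }  /* mark pending */
    static void retire(uint32_t dst_mask) { busy &= ~dst_mask; } /* result written */
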
The remaining core instructions 1032 issued from the thread manager 102 are fed to the channel controller 108, which controls the transfer of data between the PE memory units and external memory (either local memory or system memory in AGP or PCI space).

CHANNEL CONTROLLER

The channel controller 108 operates asynchronously with respect to the execution of instructions by the array controller 104. This allows computation and external I/O to be performed simultaneously and overlapped as much as possible. Computation (PE) operations are synchronised with I/O operations by means of semaphores in the thread manager, as will be explained in more detail below.

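A counting semaphore of the kind that could serve here is sketched below (the interface is ours; the application does not give one). An I/O thread signals when a transfer completes; a compute thread that cannot acquire the semaphore would be descheduled in favour of another thread:

    #include <stdatomic.h>

    typedef struct { atomic_int count; } semaphore_t;

    static void sem_signal(semaphore_t *s)  /* e.g. "buffer loaded", I/O side */
    {
        atomic_fetch_add(&s->count, 1);
    }

    static int sem_try_wait(semaphore_t *s) /* compute side; 0 => switch thread */
    {
        int c = atomic_load(&s->count);
        while (c > 0) {
            if (atomic_compare_exchange_weak(&s->count, &c, c - 1))
                return 1;                   /* acquired */
        }
        return 0;
    }
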
The channel controller 108 also controls the binning units 1068 which are associated with respective processing blocks 106. This is accomplished by way of channel controller instructions.

Figure 6 shows the channel controller's instruction issue state machine, which lies at the heart of the channel controller's operation, and which will be described in greater detail later.

Each binning unit 1068 (Figure 3) is connected to the I/O channels of its associated processing block 106. The purpose of the binning unit 1068 is to sort primitive data by region, since the data is generally not provided by the host system in the correct order for region-based processing.

The binning units 1068 provide a hardware-implemented region sorting system (shown in Figure 7), which removes the sorting process from the processing elements, thereby releasing the PEs for data processing.

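Purely as an illustration (the region geometry of 32-pixel tiles on an 8x8 grid and the bin layout are invented, not taken from the application), region sorting appends each incoming primitive to every region its bounding box touches:

    #include <stddef.h>

    #define REGIONS_X 8
    #define REGION_PX 32                    /* hypothetical 32x32-pixel regions */
    #define NUM_REGIONS (REGIONS_X * REGIONS_X)
    #define MAX_PER_BIN 256

    struct primitive { int min_x, min_y, max_x, max_y; };  /* bounding box */

    static size_t bins[NUM_REGIONS][MAX_PER_BIN];
    static size_t bin_count[NUM_REGIONS];

    /* Append primitive idx to the bin of every region its box overlaps
     * (coordinates assumed non-negative and on-screen). */
    static void bin_primitive(size_t idx, const struct primitive *p)
    {
        for (int ry = p->min_y / REGION_PX; ry <= p->max_y / REGION_PX; ry++)
            for (int rx = p->min_x / REGION_PX; rx <= p->max_x / REGION_PX; rx++) {
                int r = ry * REGIONS_X + rx;
                bins[r][bin_count[r]++] = idx;
            }
    }

Each region's primitives can then be processed together, in host-arrival-independent order.
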
MEMORY ACCESS CONSOLIDATION

In a computer system having a large number of elements which require access to a single memory, or other addressed device, there can be a significant reduction in processing speed if accesses to the storage device are performed serially for each element.

The graphics system described above is one example of such a system. There are a large number of processor elements, each of which requires access to data in the local memory of the system. Since the number of elements requiring memory access exceeds the number of memory accesses that can be made at any one time, accesses to the local and system memory involve serial operation. Thus, performing memory access for each element individually would cause a degradation in the speed of operation of the processing block.

In order to reduce the effect of this problem on the speed of processing of the system, the system of Figures 1 and 2 includes a memory access consolidating function.

The memory access consolidation is also described below with reference to Figures 12 and 13. In general, however, the processing elements that require access to memory indicate that this is the case by setting an indication flag or mark bit. The first such marked PE is then selected, and the memory address to which it requires access is transmitted to all of the processing elements of the processing block. The address is transmitted with a corresponding transaction ID. Those processing elements which require access (i.e. have the indication flag set) compare the transmitted address with the address to which they require access, and if the comparison indicates that the same address is to be accessed, those processing elements register the transaction ID for that memory access and clear the indication flag.

When the transaction ID is returned to the processing block, the processing elements compare the stored transaction ID with the incoming transaction ID, in order to recover the data.

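The scheme just described can be summarised in the following sketch (variable names are ours; the behaviour follows the text): one round picks the first marked PE, broadcasts its address with a transaction ID, and every PE wanting that address records the ID and clears its flag, so a single external access serves them all.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_PES 64

    static bool     need[NUM_PES];          /* indication flag / mark bit */
    static uint32_t addr[NUM_PES];          /* address each PE requires   */
    static uint8_t  tid_reg[NUM_PES];       /* registered transaction ID  */

    /* Returns false when no PE has its flag set (nothing left to do). */
    static bool consolidate_round(uint8_t next_tid)
    {
        int first = -1;
        for (int pe = 0; pe < NUM_PES; pe++)
            if (need[pe]) { first = pe; break; }
        if (first < 0)
            return false;

        uint32_t broadcast = addr[first];
        for (int pe = 0; pe < NUM_PES; pe++)  /* the compare occurs in every PE */
            if (need[pe] && addr[pe] == broadcast) {
                tid_reg[pe] = next_tid;       /* register the transaction ID */
                need[pe]    = false;          /* clear the indication flag   */
            }
        /* ...one external read for broadcast is issued, tagged next_tid;
         * when the data returns, PEs whose tid_reg matches capture it. */
        return true;
    }

Repeating rounds until no flags remain issues one access per distinct address rather than one per PE.
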
Using transaction IDs in place of simply storing the accessed address information enables multiple memory accesses to be carried out, and then returned in any order. Such a "fire and forget" method of recovering data can free up processor time, since the processors do not have to await the return of data before continuing processing steps. In addition, the use of transaction IDs reduces the amount of information that must be stored by the processing elements to identify the data recovery transaction. Address information is generally of larger size than transaction ID information.

Preferably, each memory address can store more data than the PEs require access to. Thus, a plurality of PEs can require access to the same memory address, even though they do not require access to the same data. This arrangement can further reduce the number of memory accesses required by the system, by providing a hierarchical consolidation technique. For example, each memory address may store four quad bytes of data, with each PE requiring one quad byte at any one access.

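On the reading that "four quad bytes" means four 4-byte lanes per addressed word (our assumption), sub-word selection after a consolidated access might look like:

    #include <stdint.h>

    /* Two PEs wanting different lanes of the SAME word still consolidate
     * into one access; each extracts its own lane from the returned data.
     * Lane selection from the low address bits is an assumption. */
    static uint32_t extract_lane(const uint32_t word[4], uint32_t byte_addr)
    {
        return word[(byte_addr >> 2) & 3];   /* which of the four 4-byte lanes */
    }
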
This technique can also allow memory write access consolidation for those PEs that require write access to different portions of the same memory address.

In this way the system can reduce the number of memory accesses required for a processing block, and hence increase the speed of operation of the processing block.

The indication flag can also be used in another technique for writing data to memory. In such a technique, the PEs having data to be written to memory signal this fact by setting the indication flag. Data is written to memory addresses for each of those PEs in order, starting at a base address and stepped at a predetermined spacing in memory. For example, if the step size is set to one, then consecutive addresses are written with data from the flagged PEs.

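A sketch of this flag-driven store (word-granular addressing is assumed here): data from each flagged PE is written at base, base + step, base + 2*step, and so on, in PE order.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_PES 64

    static void stepped_write(uint32_t *mem, uint32_t base, uint32_t step,
                              const bool flagged[NUM_PES],
                              const uint32_t data[NUM_PES])
    {
        uint32_t slot = 0;
        for (int pe = 0; pe < NUM_PES; pe++)
            if (flagged[pe])                 /* only PEs that set the flag */
                mem[base + slot++ * step] = data[pe];
    }

With step equal to one, this produces the consecutive-address case given in the text.
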
PROCESSING BLOCKS

One of the processing blocks 106 is shown in more detail in Figure 8. The processing block 106 includes an array of processor elements 1061 which are arranged to operate in parallel on respective data items, but carrying out the same instruction (SIMD). Each processor element 1061 includes a processor unit 1061a, a PE register file 1061b and a PE memory unit 1061c. The PE memory unit 1061c is used to store data items for processing by the processor unit 1061a. Each processor unit 1061a can transfer data to and from its PE memory unit 1061c via the PE register file 1061b. The processor unit 1061a also uses the PE register file 1061b to store data which is being processed. Transfer of data items between the processor unit 1061a and the memory unit 1061c is controlled by the ar
