(12) United States Patent
Abdallah et al.

(10) Patent No.: US 6,192,467 B1
(45) Date of Patent: Feb. 20, 2001

`(54) EXECUTING PARTIAL-WIDTH PACKED
`DATA INSTRUCTIONS
`
(75) Inventors: Mohammad A. Abdallah; Vladimir Pentkovski, both of Folsom; James Coke, Shingle Springs, all of CA (US)

(73) Assignee: Intel Corporation, Santa Clara, CA (US)

(*) Notice: Under 35 U.S.C. 154(b), the term of this patent shall be extended for 0 days.

(21) Appl. No.: 09/053,000

(22) Filed: Mar. 31, 1998

(51) Int. Cl.7: G06F 9/22; G06F 9/302

(52) U.S. Cl.: 712/222

(58) Field of Search: 712/222, 221
`
(56) References Cited

U.S. PATENT DOCUMENTS

3,675,001    7/1972   Singh
3,723,715    3/1973   Chen et al.
4,890,218   12/1989   Bram
5,210,711    5/1993   Rossmere et al.
5,311,508    5/1994   Buda et al.
5,426,598    6/1995   Hagihara
5,673,427    9/1997   Brown et al.
5,721,892    2/1998   Peleg et al.
5,793,661    8/1998   Dulong et al.
5,852,726   12/1998   Lin et al.
5,936,872    8/1999   Fischer et al.
6,018,351    1/2000   Mennemeier et al.
6,041,403    5/2000   Parker et al.

FOREIGN PATENT DOCUMENTS

9907221        10/1999   (GB)
WO 97/08608 *   3/1997   (WO)
WO 97/22921     6/1997   (WO)
WO 97/22923     6/1997   (WO)
WO 97/22924     6/1997   (WO)
`
OTHER PUBLICATIONS

Abbott et al., "Broadband Algorithms with the MicroUnity Mediaprocessor", Proceedings of COMPCON '96, 1996, pp. 349-354.
Hayes et al., "MicroUnity Software Development Environment", Proceedings of COMPCON '96, 1996, pp. 341-348.
International Search Report PCT/US99/04718, Jun. 28, 1999, 4 pages.
"TM 1000 Preliminary Data Book", Philips Semiconductors, 1997.
"21164 Alpha™ Microprocessor Data Sheet", Samsung Electronics, 1997.
"Silicon Graphics Introduces Enhanced MIPS® Architecture to lead the Interactive Digital Revolution, Silicon Graphics", Oct. 21, 1996, downloaded from Website webmaster@www.sgi.com, pp. 1-2.
`(List continued on next page.)
Primary Examiner: William M. Treat
(74) Attorney, Agent, or Firm: Blakely, Sokoloff, Taylor & Zafman LLP

(57) ABSTRACT

A method and apparatus are provided for executing scalar packed data instructions. According to one aspect of the invention, a processor includes a plurality of registers, a register renaming unit coupled to the plurality of registers, a decoder coupled to the register renaming unit, and a partial-width execution unit coupled to the decoder. The register renaming unit provides an architectural register file to store packed data operands, each of which includes a plurality of data elements. The decoder is configured to decode a first and second set of instructions that each specify one or more registers in the architectural register file. Each of the instructions in the first set of instructions specifies operations to be performed on all of the data elements stored in the one or more specified registers. In contrast, each of the instructions in the second set of instructions specifies operations to be performed on only a subset of the data elements stored in the one or more specified registers. The partial-width execution unit is configured to execute operations specified by either of the first or the second set of instructions.

43 Claims, 13 Drawing Sheets
`
[Representative drawing: instruction processing flow diagram (see FIG. 3). Drawing not reproduced.]
`
`Oracle-1007 p. 1
`Oracle v. Teleputers
`IPR2021-00078
`
`

`

OTHER PUBLICATIONS

"Silicon Graphics Introduces Compact MIPS® RISC Microprocessor Code for High Performance at a low Cost", Oct. 21, 1996, downloaded from Website webmaster@www.sgi.com, pp. 1-2.
Killian, Earl, "MIPS Extension for Digital Media", Silicon Graphics, pp. 1-10.
"MIPS V Instruction Set", pp. B1- to B-37.
"MIPS Digital Media Extension", pp. C1 to C40.
"MIPS Extension for Digital Media with 3D", MIPS Technologies, Inc., Mar. 12, 1997, pp. 1-26.
"64-Bit and Multimedia Extensions in the PA-RISC 2.0 Architecture", Hewlett Packard, downloaded from Website rblee@cup.hp.com.huck@cup.hp.com, pp. 1-18.
"The VIS™ Instruction Set", Sun Microsystems, Inc., 1997, pp. 1-2.
"ULTRASPARC™ The Visual Instruction Set (VIS™): On Chip Support for New-Media Processing", Sun Microsystems, Inc., 1996, pp. 1-7.
ULTRASPARC™ and New Media Support Real-Time MPEG2 Decode with the Visual Instruction Set (VIS™), Sun Microsystems, Inc., 1996, pp. 1-8.
ULTRASPARC™ Ultra Port Architecture (UPA): The New-Media System Architecture, Sun Microsystems, Inc., 1996, pp. 1-4.
ULTRASPARC™ Turbocharges Network Operations on New Media Computing, Sun Microsystems, Inc., 1996, pp. 1-5.
The UltraSPARC Processor-Technology White Paper, Sun Microsystems, Inc., 1995, 37 pages.
AMD-3D™ Technology Manual, Advanced Micro Devices, Feb. 1998.
Hansen, Craig, Architecture of a Broadband Mediaprocessor, MicroUnity Systems Engineering, Inc., 1996, pp. 334-354.
Levinthal, Adam, et al., Parallel Computers for Graphics Applications, Pixar, San Rafael, CA, 1987, pp. 193-198.
Levinthal, Adam; Porter, Thomas, "Chap-A SIMD Graphics Processor", Computer Graphics Project, Lucasfilm Ltd., 1984, pp. 77-82.
Wang, Mangaser, Shrinivan, A processor Architecture for 3D Graphics Calculations, Computer Motion, Inc., pp. 1-23.
Visual Instruction Set (VIS™), User's Guide, Sun Microsystems, Inc., version 1.1, Mar. 1997.

* cited by examiner
`

[Sheet 1 of 13: FIG. 1, packed ADD of operand X (X3, X2, X1, X0) and operand Y (Y3, Y2, Y1, Y0). Drawing not reproduced.]

[Sheet 2 of 13: FIG. 2A, computer system 200: processor 205 containing instruction set unit 260, decode/execution unit 275, instruction set 280 (general purpose instructions, FP instructions, partial-width packed data instructions, full-width packed data instructions) and memory unit 285; storage device 210 holding packed data routine 240; keyboard 220; display 225; network 230. Drawing not reproduced.]

[Sheet 3 of 13: FIG. 2B and FIG. 2C, sets of logical registers in memory unit 285: registers R0-R7 (291) and XMM0-XMM7 (292); labels LOW and HIGH with reference numerals 293 and 294. Drawing not reproduced.]

[Sheet 4 of 13: FIG. 3, instruction processing flow diagram. Receive instruction (310), then branch on the instruction type. Full-width packed: determine a result by performing the operation specified by the instruction on each pair of corresponding data elements. Partial-width packed: determine a portion of the result by performing the operation specified by the instruction on a subset of corresponding data elements, then fill the remaining portion of the result with one or more predetermined values. End. Reference numerals 330, 340 and 350 label the processing steps. Drawing not reproduced.]

[Sheet 5 of 13: FIG. 4, operand 410 (X3, X2, X1, X0) and operand 420 (Y3, Y2, Y1, Y0) feed execution unit 440, which produces result 430 (X3 or 0, X2 or 0, X1 or 0, Z0). Drawing not reproduced.]

[Sheet 6 of 13: FIG. 5A, FIG. 5B and FIG. 5C, circuitry for executing full-width and partial-width packed data instructions; labels include X0, IDENTITY FUNCTION and VALUE, with reference numerals 540, 570, 572, 580 and 590. Drawing not reproduced.]

[Sheet 7 of 13: FIG. 6, ADD execution unit and MUL execution unit. Drawing not reproduced.]

[Sheet 8 of 13: FIG. 7A, full-width packed ADD over time: operand X (X3, X2, X1, X0) and operand Y (Y3, Y2, Y1, Y0) give result (X3+Y3, X2+Y2, X1+Y1, X0+Y0). FIG. 7B, partial-width packed ADD: result (*, *, *, X0+Y0), where * = corresponding data element in X or Y, NaN, 0, or other predetermined value. Drawing not reproduced.]

[Sheet 9 of 13: FIG. 8A, register file 800 with port 1 and port 2 (widths labeled 128 and 64), an execution unit (ADD) and an execution unit (MUL) with 64-bit inputs; reference numerals 802, 804 and 806. Drawing not reproduced.]

[Sheet 10 of 13: FIG. 8B, timing chart for the circuitry of FIG. 8A, showing ADD and MUL operations on elements of X and Y over successive time steps. Drawing not reproduced.]

[Sheet 11 of 13: FIG. 9, out-of-order pipeline: decoding of a full-width macro instruction (X logical registers) into half-width micro instructions, register renaming (register alias table, 2X entries), scheduling, executing (execution hardware processes both half-width micro instructions), write back, and retirement register file (2X registers). Drawing not reproduced.]

[Sheet 12 of 13: FIG. 10, timing chart for the embodiment of FIG. 9. Drawing not reproduced.]

[Sheet 13 of 13: FIG. 11, block diagram of decoding logic. Drawing not reproduced.]

EXECUTING PARTIAL-WIDTH PACKED DATA INSTRUCTIONS

FIELD OF THE INVENTION

The invention relates generally to the field of computer systems. More particularly, the invention relates to a method and apparatus for efficiently executing partial-width packed data instructions, such as scalar packed data instructions, by a processor that makes use of SIMD technology, for example.

BACKGROUND OF THE INVENTION

Multimedia applications such as 2D/3D graphics, image processing, video compression/decompression, voice recognition algorithms and audio manipulation often require the same operation to be performed on a large number of data items (referred to as "data parallelism"). Each type of multimedia application typically implements one or more algorithms requiring a number of floating point or integer operations, such as ADD or MULTIPLY (hereafter MUL). By providing macro instructions whose execution causes a processor to perform the same operation on multiple data items in parallel, Single Instruction Multiple Data (SIMD) technology, such as that employed by the Pentium® processor architecture and the MMX™ instruction set, has enabled a significant improvement in multimedia application performance (Pentium® and MMX™ are registered trademarks or trademarks of Intel Corporation of Santa Clara, Calif.).

SIMD technology is especially suited to systems that provide packed data formats. A packed data format is one in which the bits in a register are logically divided into a number of fixed-sized data elements, each of which represents a separate value. For example, a 64-bit register may be broken into four 16-bit elements, each of which represents a separate 16-bit value. Packed data instructions may then separately manipulate each element in these packed data types in parallel.

Referring to FIG. 1, an exemplary packed data instruction is illustrated. In this example, a packed ADD instruction (e.g., a SIMD ADD) adds corresponding data elements of a first packed data operand, X, and a second packed data operand, Y, to produce a packed data result, Z, i.e., X0 + Y0 = Z0, X1 + Y1 = Z1, X2 + Y2 = Z2, and X3 + Y3 = Z3. Packing many data elements within one register or memory location and employing parallel hardware execution allows SIMD architectures to perform multiple operations at a time, resulting in significant performance improvement. For instance, in this example, four individual results may be obtained in the time previously required to obtain a single result.

While the advantages achieved by SIMD architectures are evident, there remain situations in which it is desirable to return individual results for only a subset of the packed data elements.

SUMMARY OF THE INVENTION

A method and apparatus are described for executing partial-width packed data instructions. According to one aspect of the invention, a processor includes a plurality of registers, a register renaming unit coupled to the plurality of registers, a decoder coupled to the register renaming unit, and a partial-width execution unit coupled to the decoder. The register renaming unit provides an architectural register file to store packed data operands, each of which includes a plurality of data elements. The decoder is configured to decode a first and second set of instructions that each specify one or more registers in the architectural register file. Each of the instructions in the first set of instructions specifies operations to be performed on all of the data elements stored in the one or more specified registers. In contrast, each of the instructions in the second set of instructions specifies operations to be performed on only a subset of the data elements stored in the one or more specified registers. The partial-width execution unit is configured to execute operations specified by either of the first or the second set of instructions.

Other features and advantages of the invention will be apparent from the accompanying drawings and from the detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is described by way of example and not by way of limitation with reference to the figures of the accompanying drawings in which like reference numerals refer to similar elements and in which:

FIG. 1 illustrates a packed ADD instruction adding together corresponding data elements from a first packed data operand and a second packed data operand.

FIG. 2A is a simplified block diagram illustrating an exemplary computer system according to one embodiment of the invention.

FIG. 2B is a simplified block diagram illustrating exemplary sets of logical registers according to one embodiment of the invention.

FIG. 2C is a simplified block diagram illustrating exemplary sets of logical registers according to another embodiment of the invention.

FIG. 3 is a flow diagram illustrating instruction execution according to one embodiment of the invention.

FIG. 4 conceptually illustrates the result of executing a partial-width packed data instruction according to various embodiments of the invention.

FIG. 5A conceptually illustrates circuitry for executing full-width packed data instructions and partial-width packed data instructions according to one embodiment of the invention.

FIG. 5B conceptually illustrates circuitry for executing full-width packed data and partial-width packed data instructions according to another embodiment of the invention.

FIG. 5C conceptually illustrates circuitry for executing full-width packed data and partial-width packed data instructions according to yet another embodiment of the invention.

FIG. 6 illustrates an ADD execution unit and a MUL execution unit capable of operating as four separate ADD execution units and four separate MUL execution units, respectively, according to an exemplary processor implementation of SIMD.

FIGS. 7A-7B conceptually illustrate a full-width packed data operation and a partial-width packed data operation being performed in a "staggered" manner, respectively.

FIG. 8A conceptually illustrates circuitry within a processor that accesses full width operands from logical registers while performing operations on half of the width of the operands at a time.

FIG. 8B is a timing chart that further illustrates the circuitry of FIG. 8A.

FIG. 9 conceptually illustrates one embodiment of an out-of-order pipeline to perform operations on operands in a "staggered" manner by converting a macro instruction into a plurality of micro instructions that each processes a portion of the full width of the operands.

FIG. 10 is a timing chart that further illustrates the embodiment described in FIG. 9.

FIG. 11 is a block diagram illustrating decoding logic that may be employed to accomplish the decoding processing according to one embodiment of the invention.
`
`DETAILED DESCRIPTION
`
A method and apparatus are described for performing partial-width packed data instructions. Herein the term "full-width packed data instruction" is meant to refer to a packed data instruction (e.g., a SIMD instruction) that operates upon all of the data elements of one or more packed data operands. In contrast, the term "partial-width packed data instruction" is meant to broadly refer to a packed data instruction that is designed to operate upon only a subset of the data elements of one or more packed data operands and return a packed data result (to a packed data register file, for example). For instance, a scalar SIMD instruction may require only a result of an operation between the least significant pair of packed data operands. In this example, the remaining data elements of the packed data result are disregarded as they are of no consequence to the scalar SIMD instruction (e.g., the remaining data elements are don't cares). According to the various embodiments of the invention, execution units may be configured in such a way to efficiently accommodate both full-width packed data instructions (e.g., SIMD instructions) and a set of partial-width packed data instructions (e.g., scalar SIMD instructions).
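
The distinction between the two instruction classes can be sketched in software. The following Python model is only an illustration under stated assumptions: a packed operand is modeled as a list of four elements with element 0 as the least significant, and the names `packed_add` and `scalar_add` and the `fill` parameter are hypothetical, not taken from the patent. The partial-width version computes only the least significant pair and fills the rest of the result either with the corresponding elements of the first operand or with a single predetermined value.

```python
from typing import List, Optional


def packed_add(x: List[float], y: List[float]) -> List[float]:
    """Full-width packed ADD: add every pair of corresponding elements."""
    if len(x) != len(y):
        raise ValueError("operands must have the same number of elements")
    return [a + b for a, b in zip(x, y)]


def scalar_add(x: List[float], y: List[float],
               fill: Optional[float] = None) -> List[float]:
    """Partial-width (scalar) packed ADD: only element 0 is computed.

    Assumption for illustration: the remaining result elements are copied
    from x when fill is None, otherwise they are set to the fill value.
    """
    if len(x) != len(y) or not x:
        raise ValueError("operands must be non-empty and of equal length")
    result = list(x) if fill is None else [fill] * len(x)
    result[0] = x[0] + y[0]
    return result
```

With `x = [1.0, 2.0, 3.0, 4.0]` and `y = [10.0, 20.0, 30.0, 40.0]`, `packed_add(x, y)` gives `[11.0, 22.0, 33.0, 44.0]`, while `scalar_add(x, y)` gives `[11.0, 2.0, 3.0, 4.0]` and `scalar_add(x, y, fill=0.0)` gives `[11.0, 0.0, 0.0, 0.0]`.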
In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be apparent, however, to one of ordinary skill in the art that these specific details need not be used to practice the invention. In other instances, well-known devices, structures, interfaces, and processes have not been shown or are shown in block diagram form.
`
Justification of Partial-Width Packed Data Instructions

Considering the amount of software that has been written for scalar architectures (e.g., single instruction single data (SISD) architectures) employing scalar operations on single precision floating point data, double precision floating point data, and integer data, it is desirable to provide developers with the option of porting their software to architectures that support packed data instructions, such as SIMD architectures, without having to rewrite their software and/or learn new instructions. By providing partial-width packed data instructions, a simple translation can transform old scalar code into scalar packed data code. For example, it would be very easy for a compiler to produce scalar SIMD instructions from scalar code. Then, as developers recognize portions of their software that can be optimized using SIMD instructions, they may gradually take advantage of the packed data instructions. Of course, computer systems employing SIMD technology are likely to also remain backwards compatible by supporting SISD instructions as well. However, the many recent architectural improvements and other factors discussed herein make it advantageous for developers to transition to and exploit SIMD technology, even if only scalar SIMD instructions are employed at first.

Another justification for providing partial-width packed data instructions is the many benefits which may be achieved by operating on only a subset of a full-width operand, including reduced power consumption, increased speed, a clean exception model, and increased storage. As illustrated below, based on an indication provided with the partial-width packed data instruction, power savings may be achieved by selectively shutting down those of the hardware units that are unnecessary for performing the current operation.

Another situation in which it is undesirable to force a packed data instruction to return individual results for each pair of data elements includes arithmetic operations in an environment providing partial-width hardware. Due to cost and/or die limitations, it is common not to provide full support for certain arithmetic operations, such as divide. By its nature, the divide operation is very long, even when full-width hardware (e.g., a one-to-one correspondence between execution units and data elements) is implemented. Therefore, in an environment that supports only full-width packed data operations while providing partial-width hardware, the latency becomes even longer. As will be illustrated further below, a partial-width packed data operation, such as a partial-width packed data divide operation, may selectively allow certain portions of its operands to bypass the divide hardware. In this manner, no performance penalty is incurred by operating upon only a subset of the data elements in the packed data operands.

Additionally, exceptions raised in connection with extraneous data elements may cause confusion to the developer and/or incompatibility between SISD and SIMD machines. Therefore, it is advantageous to report exceptions for only those data elements upon which the instruction is meant to operate. Partial-width packed data instruction support allows a predictable exception model to be achieved by limiting the triggering of exceptional conditions to those raised in connection with the data elements being operated upon, or in which exceptions produced by extraneous data elements would be likely to cause confusion or incompatibility between SISD and SIMD machines.

Finally, in embodiments where portions of the destination packed data operand are not corrupted as a result of performing a partial-width packed data operation, partial-width packed data instructions effectively provide extra register space for storing data. For instance, if the lower portion of the packed data operand is being operated upon, data may be stored in the upper portion and vice versa.
`
`An Exemplary Computer System
`
FIG. 2A is a simplified block diagram illustrating an exemplary computer system according to one embodiment of the invention. In the embodiment depicted, computer system 200 includes a processor 205, a storage device 210, and a bus 215. The processor 205 is coupled to the storage device 210 by the bus 215. In addition, a number of user input/output devices, such as a keyboard 220 and a display 225, are also coupled to bus 215. The computer system 200 may also be coupled to a network 230 via bus 215. The processor 205 represents a central processing unit of any type of architecture, such as a CISC, RISC, VLIW, or hybrid architecture. In addition, the processor 205 may be implemented on one or more chips. The storage device 210 represents one or more mechanisms for storing data. For example, the storage device 210 may include read only memory (ROM), random access memory (RAM), magnetic disk storage mediums, optical storage mediums, flash memory devices, and/or other machine-readable mediums. The bus 215 represents one or more buses (e.g., AGP, PCI, ISA, X-Bus, EISA, VESA, etc.) and bridges (also termed bus controllers). While this embodiment is described in relation to a single processor computer system, it is appreciated that the invention may be implemented in a multi-processor computer system. In addition, while the present embodiment is described in relation to a 32-bit and a 64-bit computer system, the invention is not limited to such computer systems.
FIG. 2A additionally illustrates that the processor 205 includes an instruction set unit 260. Of course, processor 205 contains additional circuitry; however, such additional circuitry is not necessary to understanding the invention. At any rate, the instruction set unit 260 includes the hardware and/or firmware to decode and execute one or more instruction sets. In the embodiment depicted, the instruction set unit 260 includes a decode/execution unit 275. The decode unit decodes instructions received by processor 205 into one or more micro instructions. The execution unit performs appropriate operations in response to the micro instructions received from the decode unit. The decode unit may be implemented using a number of different mechanisms (e.g., a look-up table, a hardware implementation, a PLA, etc.).
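
One of the mechanisms mentioned above, a look-up table, can be modeled as follows. This is a minimal sketch and not the patent's decoder: the mnemonics `PADD_FULL` and `PADD_SCALAR`, the micro instruction tuples, and the choice to split a full-width operation into a low half and a high half are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# A micro instruction is modeled as (operation, operand portion it touches).
MicroOp = Tuple[str, str]

# Hypothetical decode table: macro instruction mnemonic -> micro instructions.
DECODE_TABLE: Dict[str, List[MicroOp]] = {
    # Full-width: one micro instruction per half of the operand (assumed split).
    "PADD_FULL": [("add", "low"), ("add", "high")],
    # Partial-width: only the low portion is operated upon.
    "PADD_SCALAR": [("add", "low")],
}


def decode(macro_instruction: str) -> List[MicroOp]:
    """Return the micro instructions for a macro instruction mnemonic."""
    try:
        return list(DECODE_TABLE[macro_instruction])
    except KeyError:
        raise ValueError(f"unknown instruction: {macro_instruction}") from None
```

In this model `decode("PADD_FULL")` yields two micro instructions and `decode("PADD_SCALAR")` yields one, which is one way a decoder could signal to the execution unit how much of the operand needs processing.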
In the present example, the decode/execution unit 275 is shown containing an instruction set 280 that includes both full-width packed data instructions and partial-width packed data instructions. These packed data instructions, when executed, may cause the processor 205 to perform full-/partial-width packed floating point operations and/or full-/partial-width packed integer operations. In addition to the packed data instructions, the instruction set 280 may include other instructions found in existing microprocessors. By way of example, in one embodiment the processor 205 supports an instruction set which is compatible with Intel 32-bit architecture (IA-32) and/or Intel 64-bit architecture (IA-64).
A memory unit 285 is also included in the instruction set unit 260. The memory unit 285 may include one or more sets of architectural registers (also referred to as logical registers) utilized by the processor 205 for storing information including floating point data and packed floating point data. Additionally, other logical registers may be included for storing integer data, packed integer data, and various control data, such as a top of stack indication and the like. The terms architectural register and logical register are used herein to refer to the concept of the manner in which instructions specify a storage area that contains a single operand. Thus, a logical register may be implemented in hardware using any number of well known techniques, including a dedicated physical register, one or more dynamically allocated physical registers using a register renaming mechanism (described in further detail below), etc. In any event, a logical register represents the smallest unit of storage addressable by a packed data instruction.
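
The idea that a logical register may be backed by dynamically allocated physical registers can be illustrated with a toy register alias table. The class below is a hypothetical sketch: the name `RegisterAliasTable`, the first-free allocation policy, and the register counts are assumptions, and a real renaming mechanism also has to reclaim physical registers at retirement, which is omitted here.

```python
from typing import Dict, List, Sequence


class RegisterAliasTable:
    """Toy mapping from logical register names to physical register indices."""

    def __init__(self, logical_names: Sequence[str], num_physical: int) -> None:
        if num_physical < len(logical_names):
            raise ValueError("need at least one physical register per logical one")
        self.free: List[int] = list(range(num_physical))
        # Give every logical register an initial physical register, in order.
        self.mapping: Dict[str, int] = {
            name: self.free.pop(0) for name in logical_names
        }

    def lookup(self, name: str) -> int:
        """Physical register currently holding the value of a logical register."""
        return self.mapping[name]

    def rename(self, name: str) -> int:
        """Allocate a fresh physical register for a new write to `name`."""
        if name not in self.mapping:
            raise KeyError(name)
        if not self.free:
            raise RuntimeError("no free physical registers")
        self.mapping[name] = self.free.pop(0)
        return self.mapping[name]
```

For example, with `rat = RegisterAliasTable(["XMM0", "XMM1"], 4)`, `rat.lookup("XMM0")` is 0, and after `rat.rename("XMM0")` the same logical name resolves to physical register 2 while `XMM1` still resolves to 1. The instruction stream keeps naming `XMM0` throughout; only the backing storage changes.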
In the embodiment depicted, the storage device 210 has stored therein an operating system 235 and a packed data routine 240 for execution by the computer system 200. The packed data routine 240 is a sequence of instructions that may include one or more packed data instructions, such as scalar SIMD instructions or SIMD instructions. As discussed further below, there are situations, for reasons including speed, power consumption and exception handling, where it is desirable to perform an operation on (or return individual results for) only a subset of data elements in a packed data operand or a pair of packed data operands. Therefore, it is advantageous for processor 205 to be able to differentiate between full-width packed data instructions and partial-width packed data instructions and to execute them accordingly.
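
How a processor might execute the two kinds of instructions differently can be sketched as a dispatch on a per-instruction flag. Everything here is assumed for illustration: the `partial` flag, the `packed_divide` name, the pass-through of untouched elements from the first operand, and the use of a Python exception to stand in for a hardware exception. The point of the sketch is that the partial-width path never examines the upper elements, so a zero divisor there raises nothing.

```python
from typing import List


def packed_divide(x: List[float], y: List[float], partial: bool) -> List[float]:
    """Divide corresponding elements of x by y.

    partial=False: full-width, every element pair is divided.
    partial=True:  partial-width, only element 0 is divided; the other
                   elements bypass the divider and pass through from x
                   (an assumption made for this sketch).
    """
    if len(x) != len(y) or not x:
        raise ValueError("operands must be non-empty and of equal length")
    active = 1 if partial else len(x)
    result = list(x)
    for i in range(active):
        # Only elements that are actually operated upon can raise.
        if y[i] == 0:
            raise ZeroDivisionError(f"divide by zero in element {i}")
        result[i] = x[i] / y[i]
    return result
```

With `x = [8.0, 1.0, 1.0, 1.0]` and `y = [2.0, 0.0, 0.0, 0.0]`, the partial-width call `packed_divide(x, y, partial=True)` returns `[4.0, 1.0, 1.0, 1.0]`, whereas the full-width call raises `ZeroDivisionError` for element 1, an element the scalar computation never cared about.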
`
FIG. 2B is a simplified block diagram illustrating exemplary sets of logical registers according to one embodiment of the invention. In this example, the memory unit 285 includes a plurality of scalar floating point registers 291 (a scalar register file) and a plurality of packed floating point registers 292 (a packed data register file). The scalar floating point registers 291 (e.g., registers R0-R7) may be implemented as a stack referenced register file when floating point instructions are executed so as to be compatible with existing software written for the Intel Architecture. In alternative embodiments, however, the registers 291 may be treated as a flat register file. In the embodiment depicted, each of the packed floating point registers (e.g., XMM0-XMM7) is implemented as a single 128-bit logical register. It is appreciated, however, that wider or narrower registers may be employed to conform to an implementation that uses more or fewer data elements or larger or smaller data elements. Additionally, more or fewer packed floating point registers 292 may be provided. Similar to the scalar floating point registers 291, the packed floating point registers 292 may be implemented as either a stack referenced register file or a flat register file when packed floating point instructions are executed.
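
As a rough model of the packed register file described above, each 128-bit logical register can be represented as 16 bytes. The sketch assumes four 32-bit single precision elements per register and little-endian element order; neither that element layout nor the helper names `pack_register` and `unpack_register` come from the patent, which leaves element count and size open.

```python
import struct
from typing import Dict, List, Sequence

NUM_REGISTERS = 8      # XMM0-XMM7, as in the embodiment depicted
REGISTER_BITS = 128    # width of one packed logical register
ELEMENTS = 4           # assumed: four 32-bit floating point elements


def pack_register(values: Sequence[float]) -> bytes:
    """Pack four floats into one 128-bit register image (element 0 first)."""
    if len(values) != ELEMENTS:
        raise ValueError(f"expected {ELEMENTS} elements")
    return struct.pack("<4f", *values)


def unpack_register(raw: bytes) -> List[float]:
    """Split a 128-bit register image back into its four elements."""
    if len(raw) * 8 != REGISTER_BITS:
        raise ValueError(f"expected {REGISTER_BITS} bits")
    return list(struct.unpack("<4f", raw))


# A flat register file: every register starts out holding four zeros.
register_file: Dict[str, bytes] = {
    f"XMM{i}": pack_register([0.0] * ELEMENTS) for i in range(NUM_REGISTERS)
}
register_file["XMM0"] = pack_register([1.0, 2.0, 3.0, 4.0])
```

Changing `ELEMENTS` and the format string (for instance to two 64-bit doubles with `"<2d"`) would model one of the alternative element sizes the text allows for, with the register width unchanged.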
FIG. 2C is a simplified block diagram illustrating exemplary sets of logical registers according to another embodiment of the invention. In this example, the memory un
