Abdallah et al.

(10) Patent No.: US 6,192,467 B1
(45) Date of Patent: Feb. 20, 2001

(54) EXECUTING PARTIAL-WIDTH PACKED DATA INSTRUCTIONS

(75) Inventors: Mohammad A. Abdallah; Vladimir Pentkovski, both of Folsom; James Coke, Shingle Springs, all of CA (US)

(73) Assignee: Intel Corporation, Santa Clara, CA (US)

(*) Notice: Under 35 U.S.C. 154(b), the term of this patent shall be extended for 0 days.

(21) Appl. No.: 09/053,000

(22) Filed: Mar. 31, 1998

(51) Int. Cl.7: G06F 9/22; G06F 9/302

(52) U.S. Cl.: 712/222

(58) Field of Search: 712/222, 221
`
Primary Examiner: William M. Treat
(74) Attorney, Agent, or Firm: Blakely, Sokoloff, Taylor & Zafman LLP
(57) ABSTRACT

A method and apparatus are provided for executing scalar packed data instructions. According to one aspect of the invention, a processor includes a plurality of registers, a register renaming unit coupled to the plurality of registers, a decoder coupled to the register renaming unit, and a partial-width execution unit coupled to the decoder. The register renaming unit provides an architectural register file to store packed data operands, each of which includes a plurality of data elements. The decoder is configured to decode a first and a second set of instructions that each specify one or more registers in the architectural register file. Each of the instructions in the first set specifies operations to be performed on all of the data elements stored in the one or more specified registers. In contrast, each of the instructions in the second set specifies operations to be performed on only a subset of the data elements stored in the one or more specified registers. The partial-width execution unit is configured to execute operations specified by either the first or the second set of instructions.
`
`43 Claims, 13 Drawing Sheets
`
[Front-page drawing: instruction processing flow chart with a FULL-WIDTH PACKED branch (determine a result by performing the operation specified by the instruction on each pair of corresponding data elements) and a step that fills the remaining portion of the result with one or more predetermined values.]
`
`Oracle-1007 p. 1
`Oracle v. Teleputers
`IPR2021-00078
`
`
`
`
`
`
[Sheet 1 of 13: rotated drawing, presumably FIG. 1. The legible labels appear to be packed operand elements X0 to X3 and Y0 to Y3 feeding an ADD; the rest is not recoverable from the text extraction.]
`
`
`
[Sheet 2 of 13: FIG. 2A, block diagram of computer system 200. Labels: PROCESSOR 205; STORAGE DEVICE 210 with PACKED DATA ROUTINE 240; KEYBOARD 220; DISPLAY 225; NETWORK 230; INSTRUCTION SET UNIT 260; DECODE/EXECUTION UNIT 275; INSTRUCTION SET 280 (GENERAL PURPOSE INSTRUCTIONS, FP INSTRUCTIONS, PARTIAL-WIDTH PACKED DATA INSTR, FULL-WIDTH PACKED DATA INSTR); MEMORY UNIT 285.]
`
`
`
[Sheet 3 of 13: FIG. 2B and FIG. 2C, logical register sets. Labels: R0 to R7, XMM0 to XMM7, LOW, HIGH; reference numerals 285, 291, 292, 293, 294.]
`
`
`
[Sheet 4 of 13: FIG. 3, INSTRUCTION PROCESSING flow chart. Boxes: RECEIVE INSTRUCTION; branch on FULL-WIDTH PACKED or PARTIAL-WIDTH PACKED. Full-width branch: DETERMINE A RESULT BY PERFORMING THE OPERATION SPECIFIED BY THE INSTRUCTION ON EACH PAIR OF CORRESPONDING DATA ELEMENTS. Partial-width branch: DETERMINE A PORTION OF THE RESULT BY PERFORMING THE OPERATION SPECIFIED BY THE INSTRUCTION ON A SUBSET OF CORRESPONDING DATA ELEMENTS, then FILLING THE REMAINING PORTION OF THE RESULT WITH ONE OR MORE PREDETERMINED VALUES. END. Reference numerals shown: 310, 330, 340, 350.]
`
`
`
[Sheet 5 of 13: FIG. 4. Labels: operand elements X3, X2, X1, X0 (410) and Y3, Y2, Y1, Y0 (420); EXECUTION UNIT (440); result (430) with elements "X3 OR 0", "X2 OR 0", "X1 OR 0", and Z0.]
`
`
`
[Sheet 6 of 13: FIG. 5A, FIG. 5B, and FIG. 5C. Labels: X0, IDENTITY FUNCTION, VALUE; reference numerals 540, 570, 572, 580, 590.]
`
`
`
[Sheet 7 of 13: rotated drawing, presumably FIG. 6. The legible labels appear to be two EXECUTION UNIT blocks with repeated ADD and MUL labels; the rest is not recoverable from the text extraction.]
`
`
`
[Sheet 8 of 13: FIG. 7A and FIG. 7B. FIG. 7A: OPERAND X (X3, X2, X1, X0) and OPERAND Y (Y3, Y2, Y1, Y0), two ADD steps along a TIME axis, RESULT (X3+Y3, X2+Y2, X1+Y1, X0+Y0). FIG. 7B: OPERAND X, OPERAND Y, one ADD, RESULT (*, *, *, X0+Y0), where * = corresponding data element in X or Y, NaN, 0, or other predetermined value.]
`
`
`
[Sheet 9 of 13: FIG. 8A. Labels: REGISTER FILE, PORT 1, PORT 2, data path widths 128 and 64, EXECUTION UNIT (ADD), EXECUTION UNIT (MUL); reference numerals 800, 802, 804, 806.]
`
`
`
[Sheet 10 of 13: rotated timing chart, presumably FIG. 8B. The legible labels appear to be ADD and MUL operations on X and Y elements 0 to 3 at times T, T+1, and T+2.]
`
`
`
[Sheet 11 of 13: FIG. 9, pipeline diagram. Legible labels: DECODING; FULL WIDTH INSTRUCTION (X LOGICAL REGISTERS); HALF WIDTH MICRO INSTR. (twice); REGISTER RENAMING; REGISTER ALIAS TABLE (2X ENTRIES; R0H, R0L, R1H); SCHEDULING; EXECUTING; EXECUTION HARDWARE PROCESSES BOTH HALF WIDTH MICRO INSTRUCTIONS; WRITE BACK; PHYSICAL REGISTERS (ROB); RETIREMENT REGISTER (2X REGISTERS); PIPELINE.]
`
`
`
[Sheet 12 of 13: rotated timing chart, presumably FIG. 10. The legible labels appear to be ADD operations on the high and low halves of X and Y along a TIME axis from T to T+N.]
`
`
`
[Sheet 13 of 13: rotated block diagram, presumably FIG. 11. The legible labels appear to include several DECODER blocks and MICRO INSTRUCTION outputs; the rest is not recoverable from the text extraction.]
`
`
EXECUTING PARTIAL-WIDTH PACKED DATA INSTRUCTIONS

FIELD OF THE INVENTION

The invention relates generally to the field of computer systems. More particularly, the invention relates to a method and apparatus for efficiently executing partial-width packed data instructions, such as scalar packed data instructions, by a processor that makes use of SIMD technology, for example.

BACKGROUND OF THE INVENTION

Multimedia applications such as 2D/3D graphics, image processing, video compression/decompression, voice recognition algorithms, and audio manipulation often require the same operation to be performed on a large number of data items (referred to as "data parallelism"). Each type of multimedia application typically implements one or more algorithms requiring a number of floating point or integer operations, such as ADD or MULTIPLY (hereafter MUL). By providing macro instructions whose execution causes a processor to perform the same operation on multiple data items in parallel, Single Instruction Multiple Data (SIMD) technology, such as that employed by the Pentium® processor architecture and the MMX™ instruction set, has enabled a significant improvement in multimedia application performance (Pentium® and MMX™ are registered trademarks or trademarks of Intel Corporation of Santa Clara, Calif.).

SIMD technology is especially suited to systems that provide packed data formats. A packed data format is one in which the bits in a register are logically divided into a number of fixed-sized data elements, each of which represents a separate value. For example, a 64-bit register may be broken into four 16-bit elements, each of which represents a separate 16-bit value. Packed data instructions may then separately manipulate each element in these packed data types in parallel.

Referring to FIG. 1, an exemplary packed data instruction is illustrated. In this example, a packed ADD instruction (e.g., a SIMD ADD) adds corresponding data elements of a first packed data operand, X, and a second packed data operand, Y, to produce a packed data result, Z, i.e., X0 + Y0 = Z0, X1 + Y1 = Z1, X2 + Y2 = Z2, and X3 + Y3 = Z3. Packing many data elements within one register or memory location and employing parallel hardware execution allows SIMD architectures to perform multiple operations at a time, resulting in significant performance improvement. For instance, in this example, four individual results may be obtained in the time previously required to obtain a single result.

While the advantages achieved by SIMD architectures are evident, there remain situations in which it is desirable to return individual results for only a subset of the packed data elements.

SUMMARY OF THE INVENTION

A method and apparatus are described for executing partial-width packed data instructions. According to one aspect of the invention, a processor includes a plurality of registers, a register renaming unit coupled to the plurality of registers, a decoder coupled to the register renaming unit, and a partial-width execution unit coupled to the decoder. The register renaming unit provides an architectural register file to store packed data operands, each of which includes a plurality of data elements. The decoder is configured to decode a first and a second set of instructions that each specify one or more registers in the architectural register file. Each of the instructions in the first set specifies operations to be performed on all of the data elements stored in the one or more specified registers. In contrast, each of the instructions in the second set specifies operations to be performed on only a subset of the data elements stored in the one or more specified registers. The partial-width execution unit is configured to execute operations specified by either the first or the second set of instructions.

Other features and advantages of the invention will be apparent from the accompanying drawings and from the detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is described by way of example and not by way of limitation with reference to the figures of the accompanying drawings, in which like reference numerals refer to similar elements and in which:

FIG. 1 illustrates a packed ADD instruction adding together corresponding data elements from a first packed data operand and a second packed data operand.

FIG. 2A is a simplified block diagram illustrating an exemplary computer system according to one embodiment of the invention.

FIG. 2B is a simplified block diagram illustrating exemplary sets of logical registers according to one embodiment of the invention.

FIG. 2C is a simplified block diagram illustrating exemplary sets of logical registers according to another embodiment of the invention.

FIG. 3 is a flow diagram illustrating instruction execution according to one embodiment of the invention.

FIG. 4 conceptually illustrates the result of executing a partial-width packed data instruction according to various embodiments of the invention.

FIG. 5A conceptually illustrates circuitry for executing full-width packed data instructions and partial-width packed data instructions according to one embodiment of the invention.

FIG. 5B conceptually illustrates circuitry for executing full-width packed data and partial-width packed data instructions according to another embodiment of the invention.

FIG. 5C conceptually illustrates circuitry for executing full-width packed data and partial-width packed data instructions according to yet another embodiment of the invention.

FIG. 6 illustrates an ADD execution unit and a MUL execution unit capable of operating as four separate ADD execution units and four separate MUL execution units, respectively, according to an exemplary processor implementation of SIMD.

FIGS. 7A-7B conceptually illustrate a full-width packed data operation and a partial-width packed data operation being performed in a "staggered" manner, respectively.

FIG. 8A conceptually illustrates circuitry within a processor that accesses full width operands from logical registers while performing operations on half of the width of the operands at a time.

FIG. 8B is a timing chart that further illustrates the circuitry of FIG. 8A.

FIG. 9 conceptually illustrates one embodiment of an out-of-order pipeline to perform operations on operands in a "staggered" manner by converting a macro instruction into a plurality of micro instructions that each processes a portion of the full width of the operands.

FIG. 10 is a timing chart that further illustrates the embodiment described in FIG. 9.

FIG. 11 is a block diagram illustrating decoding logic that may be employed to accomplish the decoding processing according to one embodiment of the invention.
`
`DETAILED DESCRIPTION
`
A method and apparatus are described for performing partial-width packed data instructions. Herein the term "full-width packed data instruction" is meant to refer to a packed data instruction (e.g., a SIMD instruction) that operates upon all of the data elements of one or more packed data operands. In contrast, the term "partial-width packed data instruction" is meant to broadly refer to a packed data instruction that is designed to operate upon only a subset of the data elements of one or more packed data operands and return a packed data result (to a packed data register file, for example). For instance, a scalar SIMD instruction may require only a result of an operation between the least significant pair of data elements in the packed data operands. In this example, the remaining data elements of the packed data result are disregarded as they are of no consequence to the scalar SIMD instruction (e.g., the remaining data elements are don't cares). According to the various embodiments of the invention, execution units may be configured in such a way as to efficiently accommodate both full-width packed data instructions (e.g., SIMD instructions) and a set of partial-width packed data instructions (e.g., scalar SIMD instructions).

In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be apparent, however, to one of ordinary skill in the art that these specific details need not be used to practice the invention. In other instances, well-known devices, structures, interfaces, and processes have not been shown or are shown in block diagram form.
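The distinction between full-width and partial-width packed data instructions defined above can be sketched in software. The following is only an illustrative model, not the patented hardware: the function names and the `fill` parameter are hypothetical, operands are modeled as Python lists with element 0 as the least significant data element, and filling the "don't care" elements with a fixed value is one assumed choice among several.

```python
from typing import List, Sequence


def packed_add_full(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Full-width packed ADD: add every pair of corresponding data elements."""
    if len(x) != len(y):
        raise ValueError("operands must have the same number of data elements")
    return [a + b for a, b in zip(x, y)]


def packed_add_scalar(x: Sequence[float], y: Sequence[float],
                      fill: float = 0.0) -> List[float]:
    """Partial-width (scalar) packed ADD: add only the least significant pair.

    The remaining result elements are "don't cares"; this sketch fills them
    with `fill`, which is an assumption made for illustration.
    """
    if len(x) != len(y):
        raise ValueError("operands must have the same number of data elements")
    return [x[0] + y[0]] + [fill] * (len(x) - 1)


x = [1.0, 2.0, 3.0, 4.0]      # X0, X1, X2, X3
y = [10.0, 20.0, 30.0, 40.0]  # Y0, Y1, Y2, Y3
full_result = packed_add_full(x, y)
scalar_result = packed_add_scalar(x, y)
```

Both functions return a packed result of the same width as the operands, which mirrors the statement that a partial-width instruction still returns a packed data result.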
`
Justification of Partial-Width Packed Data Instructions

Considering the amount of software that has been written for scalar architectures (e.g., single instruction single data (SISD) architectures) employing scalar operations on single precision floating point data, double precision floating point data, and integer data, it is desirable to provide developers with the option of porting their software to architectures that support packed data instructions, such as SIMD architectures, without having to rewrite their software and/or learn new instructions. By providing partial-width packed data instructions, a simple translation can transform old scalar code into scalar packed data code. For example, it would be very easy for a compiler to produce scalar SIMD instructions from scalar code. Then, as developers recognize portions of their software that can be optimized using SIMD instructions, they may gradually take advantage of the packed data instructions. Of course, computer systems employing SIMD technology are likely to also remain backwards compatible by supporting SISD instructions as well. However, the many recent architectural improvements and other factors discussed herein make it advantageous for developers to transition to and exploit SIMD technology, even if only scalar SIMD instructions are employed at first.

Another justification for providing partial-width packed data instructions is the many benefits which may be achieved by operating on only a subset of a full-width operand, including reduced power consumption, increased speed, a clean exception model, and increased storage. As illustrated below, based on an indication provided with the partial-width packed data instruction, power savings may be achieved by selectively shutting down those hardware units that are unnecessary for performing the current operation.

Another situation in which it is undesirable to force a packed data instruction to return individual results for each pair of data elements involves arithmetic operations in an environment providing partial-width hardware. Due to cost and/or die limitations, it is common not to provide full support for certain arithmetic operations, such as divide. By its nature, the divide operation is very long, even when full-width hardware (e.g., a one-to-one correspondence between execution units and data elements) is implemented. Therefore, in an environment that supports only full-width packed data operations while providing partial-width hardware, the latency becomes even longer. As will be illustrated further below, a partial-width packed data operation, such as a partial-width packed data divide operation, may selectively allow certain portions of its operands to bypass the divide hardware. In this manner, no performance penalty is incurred by operating upon only a subset of the data elements in the packed data operands.

Additionally, exceptions raised in connection with extraneous data elements may cause confusion to the developer and/or incompatibility between SISD and SIMD machines. Therefore, it is advantageous to report exceptions for only those data elements upon which the instruction is meant to operate. Partial-width packed data instruction support allows a predictable exception model to be achieved by limiting the triggering of exceptional conditions to those raised in connection with the data elements being operated upon, so that exceptions produced by extraneous data elements do not cause confusion or incompatibility between SISD and SIMD machines.

Finally, in embodiments where portions of the destination packed data operand are not corrupted as a result of performing a partial-width packed data operation, partial-width packed data instructions effectively provide extra register space for storing data. For instance, if the lower portion of the packed data operand is being operated upon, data may be stored in the upper portion and vice versa.
`
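The exception-model and bypass arguments made in this section can be illustrated with a small software model of a packed divide. This is a sketch under stated assumptions rather than the patent's circuitry: the function names are hypothetical, Python's `ZeroDivisionError` stands in for a hardware divide-by-zero exception, and passing the upper elements of the first operand through unchanged is one assumed choice for the elements that are not operated upon.

```python
from typing import List, Sequence


def packed_div_full(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Full-width packed divide: every element pair goes through the divider.

    A zero in any element of y raises ZeroDivisionError, even if the caller
    only cares about element 0.
    """
    return [a / b for a, b in zip(x, y)]


def packed_div_scalar(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Partial-width packed divide: only element 0 is divided.

    The upper elements bypass the divide entirely (here they are copied from
    x, an assumption for illustration), so extraneous data cannot raise an
    exception.
    """
    return [x[0] / y[0]] + list(x[1:])


x = [8.0, 5.0, 6.0, 7.0]
y = [2.0, 0.0, 0.0, 0.0]   # upper elements of y hold extraneous zeros
scalar_result = packed_div_scalar(x, y)   # divides only 8.0 / 2.0
```

In this model the full-width divide on the same operands would raise an exception caused purely by extraneous data, which is the kind of confusing behavior the partial-width instruction avoids. The untouched upper elements also show how the unused portion of a register can hold other data.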
`An Exemplary Computer System
`
FIG. 2A is a simplified block diagram illustrating an exemplary computer system according to one embodiment of the invention. In the embodiment depicted, computer system 200 includes a processor 205, a storage device 210, and a bus 215. The processor 205 is coupled to the storage device 210 by the bus 215. In addition, a number of user input/output devices, such as a keyboard 220 and a display 225, are also coupled to bus 215. The computer system 200 may also be coupled to a network 230 via bus 215. The processor 205 represents a central processing unit of any type of architecture, such as a CISC, RISC, VLIW, or hybrid architecture. In addition, the processor 205 may be implemented on one or more chips. The storage device 210 represents one or more mechanisms for storing data. For example, the storage device 210 may include read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, and/or other machine-readable media. The bus 215 represents one or more buses (e.g., AGP, PCI, ISA, X-Bus, EISA, VESA, etc.) and bridges (also termed bus controllers). While this embodiment is described in relation to a single processor computer system, it is appreciated that the invention may be implemented in a multi-processor computer system. In addition, while the present embodiment is described in relation to a 32-bit and a 64-bit computer system, the invention is not limited to such computer systems.
FIG. 2A additionally illustrates that the processor 205 includes an instruction set unit 260. Of course, processor 205 contains additional circuitry; however, such additional circuitry is not necessary to understanding the invention. At any rate, the instruction set unit 260 includes the hardware and/or firmware to decode and execute one or more instruction sets. In the embodiment depicted, the instruction set unit 260 includes a decode/execution unit 275. The decode unit decodes instructions received by processor 205 into one or more micro instructions. The execution unit performs appropriate operations in response to the micro instructions received from the decode unit. The decode unit may be implemented using a number of different mechanisms (e.g., a look-up table, a hardware implementation, a PLA, etc.).
In the present example, the decode/execution unit 275 is shown containing an instruction set 280 that includes both full-width packed data instructions and partial-width packed data instructions. These packed data instructions, when executed, may cause the processor 205 to perform full-/partial-width packed floating point operations and/or full-/partial-width packed integer operations. In addition to the packed data instructions, the instruction set 280 may include other instructions found in existing microprocessors. By way of example, in one embodiment the processor 205 supports an instruction set which is compatible with Intel 32-bit architecture (IA-32) and/or Intel 64-bit architecture (IA-64).
A memory unit 285 is also included in the instruction set unit 260. The memory unit 285 may include one or more sets of architectural registers (also referred to as logical registers) utilized by the processor 205 for storing information including floating point data and packed floating point data. Additionally, other logical registers may be included for storing integer data, packed integer data, and various control data, such as a top of stack indication and the like. The terms architectural register and logical register are used herein to refer to the concept of the manner in which instructions specify a storage area that contains a single operand. Thus, a logical register may be implemented in hardware using any number of well known techniques, including a dedicated physical register, one or more dynamically allocated physical registers using a register renaming mechanism (described in further detail below), etc. In any event, a logical register represents the smallest unit of storage addressable by a packed data instruction.
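The separation between a logical register and the physical storage behind it can be sketched as a simple alias table. This is a heavily simplified illustration, not the renaming mechanism the patent describes later: the class name, its methods, and the policy of freeing the old physical register immediately on a write are all assumptions made for the example.

```python
from typing import Any, Dict, List, Sequence


class RenamedRegisterFile:
    """Maps logical register names onto dynamically allocated physical slots."""

    def __init__(self, logical_names: Sequence[str], num_physical: int) -> None:
        if num_physical <= len(logical_names):
            raise ValueError("need at least one spare physical register")
        self.physical: List[Any] = [None] * num_physical
        self.free: List[int] = list(range(num_physical))
        # Alias table: logical register name -> physical register index.
        self.alias: Dict[str, int] = {
            name: self.free.pop(0) for name in logical_names
        }

    def read(self, name: str) -> Any:
        """Read the operand currently bound to a logical register."""
        return self.physical[self.alias[name]]

    def write(self, name: str, value: Any) -> None:
        """Bind the logical register to a freshly allocated physical register."""
        new_index = self.free.pop(0)
        self.physical[new_index] = value
        old_index = self.alias[name]
        self.alias[name] = new_index
        self.free.append(old_index)  # simplification: freed immediately


regs = RenamedRegisterFile(["XMM0", "XMM1"], num_physical=4)
regs.write("XMM0", [1.0, 2.0, 3.0, 4.0])
```

An instruction only ever names `XMM0`; which physical slot holds the operand changes from write to write, which is the sense in which the logical register is a concept rather than a fixed piece of hardware.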
In the embodiment depicted, the storage device 210 has stored therein an operating system 235 and a packed data routine 240 for execution by the computer system 200. The packed data routine 240 is a sequence of instructions that may include one or more packed data instructions, such as scalar SIMD instructions or SIMD instructions. As discussed further below, there are considerations, including speed, power consumption, and exception handling, that make it desirable to perform an operation on (or return individual results for) only a subset of data elements in a packed data operand or a pair of packed data operands. Therefore, it is advantageous for processor 205 to be able to differentiate between full-width packed data instructions and partial-width packed data instructions and to execute them accordingly.
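One way to picture a processor differentiating between the two kinds of instruction is a look-up table that attaches a lane-enable mask to each opcode. The sketch below is illustrative only: the mnemonics `PADD`, `SADD`, `PMUL`, and `SMUL` are hypothetical names rather than real instruction encodings, and passing elements of the first operand through in disabled lanes is an assumed policy.

```python
import operator
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical decode table: mnemonic -> (operation, lane-enable mask).
# Bit i of the mask enables the execution lane for data element i.
DECODE_TABLE: Dict[str, Tuple[Callable[[float, float], float], int]] = {
    "PADD": (operator.add, 0b1111),  # full-width packed add
    "SADD": (operator.add, 0b0001),  # partial-width (scalar) packed add
    "PMUL": (operator.mul, 0b1111),  # full-width packed multiply
    "SMUL": (operator.mul, 0b0001),  # partial-width (scalar) packed multiply
}


def execute(mnemonic: str, x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Decode the mnemonic, then operate only in the enabled lanes."""
    op, mask = DECODE_TABLE[mnemonic]
    result = []
    for i, (a, b) in enumerate(zip(x, y)):
        if (mask >> i) & 1:
            result.append(op(a, b))  # lane enabled: perform the operation
        else:
            result.append(a)         # lane disabled: element of x passes through
    return result


full = execute("PMUL", [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])
partial = execute("SMUL", [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])
```

In hardware terms, the lanes whose mask bit is clear are the ones that could be left idle, which connects the decode-time indication to the power and speed considerations mentioned in the text.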
`
FIG. 2B is a simplified block diagram illustrating exemplary sets of logical registers according to one embodiment of the invention. In this example, the memory unit 285 includes a plurality of scalar floating point registers 291 (a scalar register file) and a plurality of packed floating point registers 292 (a packed data register file). The scalar floating point registers 291 (e.g., registers R0-R7) may be implemented as a stack-referenced register file when floating point instructions are executed so as to be compatible with existing software written for the Intel Architecture. In alternative embodiments, however, the registers 291 may be treated as a flat register file. In the embodiment depicted, each of the packed floating point registers (e.g., XMM0-XMM7) is implemented as a single 128-bit logical register. It is appreciated, however, that wider or narrower registers may be employed to conform to an implementation that uses more or fewer data elements or larger or smaller data elements. Additionally, more or fewer packed floating point registers 292 may be provided. Similar to the scalar floating point registers 291, the packed floating point registers 292 may be implemented as either a stack-referenced register file or a flat register file when packed floating point instructions are executed.
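The layout of a 128-bit packed floating point register can be modeled with Python's standard `struct` module. The sketch assumes four 32-bit single precision elements in little-endian order with element 0 in the lowest bytes; the text above fixes only the 128-bit width, so the element count, element size, byte order, and helper names are all assumptions made for illustration.

```python
import struct
from typing import List, Sequence

# Assumed layout: four little-endian 32-bit floats, element 0 in the low bytes.
XMM_FORMAT = "<4f"
XMM_BYTES = struct.calcsize(XMM_FORMAT)  # 16 bytes = 128 bits


def pack_register(elements: Sequence[float]) -> bytes:
    """Pack four data elements into one 128-bit register image."""
    return struct.pack(XMM_FORMAT, *elements)


def unpack_register(register: bytes) -> List[float]:
    """Split a 128-bit register image back into its four data elements."""
    return list(struct.unpack(XMM_FORMAT, register))


def low_element(register: bytes) -> float:
    """Read only the least significant data element (bytes 0 to 3)."""
    return struct.unpack_from("<f", register, 0)[0]


xmm0 = pack_register([1.0, 2.5, -3.0, 4.0])
```

A wider or narrower register, or a different element size, would correspond to a different format string, for example `"<2d"` for two 64-bit elements in the same 128 bits.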
FIG. 2C is a simplified block diagram illustrating exemplary sets of logical registers according to another embodiment of the invention. In this example, the memory un