`
`UNITED STATES PATENT AND TRADEMARK OFFICE
`
`__________________
`
`BEFORE THE PATENT TRIAL AND APPEAL BOARD
`
`__________________________________________________________________
`
`SONY CORPORATION; SONY MOBILE COMMUNICATIONS AB; SONY
`MOBILE COMMUNICATIONS (USA) INC; AND SONY ELECTRONICS INC.
`Petitioner
`
`
`Patent No. 7,296,121
`Issue Date: Nov. 13, 2007
`Title: REDUCING PROBE TRAFFIC IN MULTIPROCESSOR SYSTEMS
`__________________________________________________________________
`
`EXPERT DECLARATION OF PROFESSOR DANIEL J. SORIN
`
`No. IPR2015-158
`
`__________________________________________________________________
`
`
`
`
`
`
`
`Petition for Inter Partes Review of
`U.S. Pat. No. 7,296,121
`IPR2015‐00158
`EXHIBIT
`Sony‐
`
`
`
`No. IPR2015-158
`Expert Declaration of Professor Daniel J. Sorin
`
`I.
`
`INTRODUCTION
`
`1. I, Professor Daniel J. Sorin, have been retained by counsel for Sony
`
`Corporation, Sony Mobile Communications AB, Sony Mobile
`
`Communications (USA) Inc., and Sony Electronics Inc. (collectively,
`
`“Sony”).
`
`2. I submit this declaration in support of Sony’s Petition for Inter Partes Review
`
`of U.S. Pat. No. 7,296,121, No. IPR2015-158.
`
`II. QUALIFICATIONS
`
`3. I hold a Ph.D. in Electrical and Computer Engineering from the University
`
`of Wisconsin—Madison (awarded in 2002). My doctoral dissertation
`
`focused on checkpointing/recovery of multiprocessors with cache-coherent
`
`shared memory.
`
`4. I am an Associate Professor in the Department of Electrical and Computer
`
`Engineering at Duke University. Prior to being an Associate Professor, I
`
`was an Assistant Professor in the Department of Electrical and Computer
`
`Engineering at Duke University (2002-2009), a Research Assistant in the
`
`Computer Sciences Department at the University of Wisconsin—Madison
`
`(1996-2002), and a Teaching Assistant in the Department of Electrical and
`
`Computer Engineering at the University of Wisconsin—Madison (1996-
`
`1997).
`
`5. I am the author or co-author of two books: “A Primer on Memory
`
`Consistency and Cache Coherence” Synthesis Lectures on Computer Architecture,
`
`Morgan & Claypool Publishers, 2011; and “Fault Tolerant Computer
`
`Architecture” Synthesis Lectures on Computer Architecture, Morgan & Claypool
`
`Publishers, 2009.
`
`6. I have published over 70 technical articles, including over 20 related to cache
`
`coherence technology.
`
`7. I have over 16 years of experience in the design and implementation of
`
`cache coherency systems and protocols in multi-processor computer
`
`systems.
`
`8. My curriculum vitae more fully describes my education, professional
`
`experience, and relevant publications. See Sony-1014.
`
`III. MATERIALS CONSIDERED
`
`9. I have reviewed U.S. Pat. No. 7,296,121 (the “’121 Patent”) including its
`
`claims.
`
`10. I have reviewed U.S. Patent No. 7,698,509 to Koster (“Koster”). I
`
`understand Koster is prior art to the ’121 Patent.
`
`11. I have reviewed Jeffrey Kuskin, et al., The Stanford FLASH Multiprocessor,
`
`PROCEEDINGS OF THE 21ST ANNUAL INTERNATIONAL SYMPOSIUM ON
`
`COMPUTER ARCHITECTURE, IEEE (1994) (“Kuskin”). I understand Kuskin
`
`is prior art to the ’121 Patent.
`
`12. I have reviewed S. Park et al., Verification of Cache Coherence Protocols by
`
`Aggregation of Distributed Transactions, THEORY OF COMPUTING SYSTEMS 31
`
`(1998) (“Park”). I understand Park is prior art to the ’121 Patent.
`
`13. I have reviewed U.S. Patent No. 6,088,769 to Luick (“Luick”). I
`
`understand Luick is prior art to the ’121 Patent.
`
`14. I have reviewed U.S. Pat. Pub. No. 2002/0073261 (“Kosaraju”). I
`
`understand Kosaraju is prior art to the ’121 Patent.
`
`IV. PERSON OF ORDINARY SKILL IN THE ART
`
`15. Generally, the ’121 Patent is in the field of cache coherency in multi-processor
`
`computer systems.
`
`16. I understand that the focus of this Inter Partes Review is the subject matter
`
`of claims 1-3, 8, 11-12, and 14-25 of the ’121 Patent. Generally, these
`
`claims describe a “probe filtering unit” (“PFU”) which increases the
`
`efficiency of memory transactions in a point-to-point multi-processor
`
`computer system having multiple cache memories. The claims further
`
`describe how the PFU receives probes from processing nodes
`
`corresponding to memory lines, and how the PFU transmits the probes
`
`only to selected ones of the processing nodes with reference to probe
`
`filtering information representative of states associated with selected ones of
`
`the cache memories.
`
`17. In the 2002–2004 timeframe, a person with ordinary skill in the art with
`
`respect to the technology disclosed by the ’121 Patent would have a Ph.D.
`
`degree in Electrical Engineering, Computer Engineering, or Computer
`
`Science, or an M.S. degree and two to three years of industry experience in the
`
`area of cache coherency in multi-processor computer systems.
`
`18. Based upon my education and experience, I consider myself to be a person
`
`of at least ordinary skill in the field of technology disclosed by the ’121
`
`Patent.
`
`V. KOSTER
`
`19. Koster describes a point-to-point multi-processor architecture with a
`
`“snoop filter” having a “shadow tag memory” that stores the tags of data
`
`stored in the local cache memories of several microprocessors. By having
`
`the shadow tag memory, the snoop filter forwards a received broadcast for
`
`requested data (by one microprocessor) to a specific other microprocessor
`
`only if the shadow tag memory indicates that the other microprocessor has
`
`a copy of the requested data.
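
The filtering behavior described in paragraph 19 can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the class `SnoopFilter`, its method names, and the processor labels are hypothetical and do not appear in Koster, and the shadow tag memory is reduced to one set of tags per processor.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SnoopFilter:
    """Hypothetical model: one set of shadow tags per processor cache."""
    shadow_tags: Dict[str, Set[int]] = field(default_factory=dict)

    def record_fill(self, cpu: str, tag: int) -> None:
        # Mirror a cache fill: the named processor now holds this tag.
        self.shadow_tags.setdefault(cpu, set()).add(tag)

    def record_evict(self, cpu: str, tag: int) -> None:
        # Mirror an eviction: the named processor no longer holds this tag.
        self.shadow_tags.get(cpu, set()).discard(tag)

    def targets(self, requester: str, tag: int) -> List[str]:
        # Forward the request only to other processors whose shadow tags
        # indicate that they hold a copy of the requested data.
        return sorted(cpu for cpu, tags in self.shadow_tags.items()
                      if cpu != requester and tag in tags)


sf = SnoopFilter()
sf.record_fill("cpu1", 0x40)
sf.record_fill("cpu2", 0x80)
print(sf.targets("cpu0", 0x40))  # only cpu1 is listed; cpu2 is filtered out
```

In this sketch a request for a tag that no other processor holds yields an empty list, so nothing is forwarded.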
`
`20. At the time Koster was filed (July 13, 2004), it would have been obvious to
`
`one of ordinary skill in the art to implement the “snoop filter” disclosed in
`
`Koster on an integrated circuit, and more specifically, on an application-
`
`specific integrated circuit. This is so because such an implementation was
`
`the most common, most performant, and least burdensome of the known
`
`methods at the time.
`
`21. Furthermore, at the time Koster was filed (July 13, 2004), integrated circuits
`
`were necessarily created with a set of semiconductor processing masks.
`
`Accordingly, it would have been obvious to one of ordinary skill in the
`
`art at this time to use a set of semiconductor processing masks
`
`representative of at least a portion of the “snoop filter” disclosed in Koster
`
`to create an integrated circuit implementing the “snoop filter” of Koster.
`
`VI. KOSTER AND KUSKIN
`
`22. At the time Koster was filed (July 13, 2004), it would have been obvious to
`
`one of ordinary skill in the art to combine the teachings of Koster and
`
`Kuskin. Both Koster and Kuskin disclose solutions to cache
`
`coherency problems. Koster discloses a cache coherent computer system
`
`with a plurality of microprocessors that are integrated circuits. See Koster at
`
`2:26-3:5. Kuskin, similarly, discloses a cache coherency architecture, but
`
`with a “programmable protocol processor for flexibility.” Kuskin at 303.
`
`Furthermore, prior to July 13, 2004, it was known to those of ordinary skill
`
`in the art that implementing cache coherence using a programmable
`
`microprocessor afforded more flexibility than, for example, an ASIC.
`
`Accordingly, it would have been obvious to combine Koster with Kuskin in
`
`order to obtain the flexibility taught by Kuskin while retaining the increased
`
`performance taught by Koster.
`
`23. As used in claim 21 of the ’121 Patent, the term “netlist” would be
`
`understood by one of ordinary skill in the art to mean “a computer
`
`representation of a collection of logic units and how they are to be
`
`connected.”
`
`24. Verilog is a hardware description language that was well known prior to
`
`2000.
`
`25. Kuskin teaches building and simulating a cache coherency controller using a
`
`hardware description language for a tangible chip. See Kuskin at 302
`
`(describing “our Verilog code”), at 311 (“We currently have a detailed
`
`system-level simulator up and running. . . . On the hardware design front
`
`we are busily coding the Verilog description of the MAGIC chip.”). Use of
`
`a Verilog description to create a tangible chip necessarily requires the
`
`creation of a simulatable representation comprising a netlist.
`
`26. Prior to July 13, 2004, it would have been obvious to a person of ordinary
`
`skill in the art to implement the “snoop filter” of Koster using a hardware
`
`description language (as shown in Kuskin) because that was the only
`
`commonly used method in the industry for designing hardware.
`
`VII. KOSTER, KUSKIN, AND PARK
`
`27. At the time Koster was filed (July 13, 2004), it would have been obvious to
`
`combine the teachings of Kuskin with Park. Kuskin discloses the Stanford
`
`FLASH multiprocessor, which implements its own cache-coherence
`
`protocol. Kuskin at 304 (“This section presents a base cache-coherence
`
`protocol and a base block-transfer protocol we have designed for
`
`FLASH.”). Park includes a more detailed description of this same Stanford
`
`FLASH cache coherency protocol. Park at § 4. Therefore, it would have
`
`been obvious to one of ordinary skill in the art to combine the teachings of
`
`Kuskin and Park because anyone studying the FLASH cache coherency
`
`protocol disclosed in Kuskin would look to Park for a more detailed
`
`description of the same system. As previously described, it would have
`
`been obvious to combine the teachings of Koster and Kuskin. Accordingly,
`
`at the time Koster was filed (July 13, 2004), it would have been obvious to
`
`combine the teachings of Koster, Kuskin, and Park.
`
`28. Koster discloses a solution to cache coherency problems. Cache
`
`coherency problems do not exist in a “read-only” system where the data
`
`cannot be changed, i.e., it is not possible for the cached data to become
`
`incoherent. Accordingly, although Koster primarily concerns the scenario
`
`of a node requesting “read” access, anyone of ordinary skill in the art would
`
`understand the system disclosed in Koster to allow a node to request
`
`“read/write” access as well.
`
`29. In the Stanford FLASH cache coherency protocol, when a node issues a
`
`request for “read/write” access to a memory line, the protocol may use a
`
`“delayed” mode where the home node (i.e., the node with the coherence
`
`directory) accumulates invalidation acknowledgements from other nodes
`
`(having shared copies of the memory line) before sending an “exclusive
`
`copy” of the memory line to the requesting node. Park at 362. Thus, in this
`
`“delayed mode,” the home node accumulates a plurality of responses, not
`
`just one, before sending a response back to the requesting node.
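
The accumulation step in the “delayed” mode can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the FLASH protocol as specified in Park: the class `PendingWrite`, the method `on_inval_ack`, and the reply string are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class PendingWrite:
    """Hypothetical home-node bookkeeping for one read/write request."""
    requester: str
    awaiting: Set[str] = field(default_factory=set)  # sharers yet to acknowledge

    def on_inval_ack(self, sharer: str) -> Optional[str]:
        # Record one invalidation acknowledgement from a sharing node.
        self.awaiting.discard(sharer)
        # Reply to the requester only after every sharer has acknowledged.
        if not self.awaiting:
            return f"exclusive copy -> {self.requester}"
        return None


pending = PendingWrite("node0", {"node1", "node2"})
first = pending.on_inval_ack("node1")   # None: still waiting on node2
second = pending.on_inval_ack("node2")  # reply to node0 is released
print(first, second)
```

The point of the sketch is that the single reply to the requesting node is gated on a plurality of responses collected at the home node.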
`
`VIII. LUICK AND KOSARAJU
`
`30. Prior to November 4, 2001, many architectures were well known to those
`
`with ordinary skill in the art for interconnecting multiple processing nodes
`
`in a computer system. Two of these architectures are shown in Figures 1
`
`and 5 of Kosaraju:
`
`Kosaraju, Fig. 1
`
`
`
`Kosaraju, Fig. 5
`
`
`
`31. A “Bus” architecture is one in which all communications of the processing nodes
`
`flow over a common bus. The Bus architecture does not use point-to-point
`
`links between any pair of processing nodes. The advantages to using a Bus
`
`architecture are its ability to provide totally ordered broadcasts/multicasts,
`
`low latency, and a simple design. However, the disadvantage to using a Bus
`
`architecture is the lack of scalability to a large number of processing nodes.
`
`A designer typically uses a bus architecture when the computer system
`
`requires a broadcast that must reach all of the processing nodes in the
`
`system, e.g., in a computer system using a snooping coherence protocol.
`
`32. A “Fully Connected” architecture (which Kosaraju shows in Figure 5) is
`
`one in which each processing node is adapted to be connected to, and to communicate
`
`directly with, each other processing node. The Fully Connected architecture
`
`uses point-to-point links between a processing node and every other
`
`processing node. The advantages to using a Fully Connected architecture
`
`are its low-latency paths between processing nodes and its ability to carry
`
`multiple messages concurrently. However, the disadvantages to using a
`
`Fully Connected architecture are lack of scalability to a large number of
`
`processing nodes and implementation cost.
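
The scalability and cost trade-off of a Fully Connected architecture can be made concrete with simple arithmetic. The sketch below assumes one bidirectional point-to-point link per pair of endpoints, which is an assumption made here for illustration and is not stated in Kosaraju; the function name `fully_connected_links` is hypothetical.

```python
def fully_connected_links(endpoints: int) -> int:
    """Links needed so that every endpoint reaches every other directly."""
    if endpoints < 0:
        raise ValueError("endpoints must be non-negative")
    # Each unordered pair of endpoints gets exactly one link: n * (n - 1) / 2.
    return endpoints * (endpoints - 1) // 2


for n in (3, 4, 8, 16):
    print(n, fully_connected_links(n))
```

Under that assumption, three endpoints need 3 links and four need 6, while sixteen need 120, which is why the approach suits small systems but does not scale to a large number of processing nodes.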
`
`33. At the time Kosaraju was filed (December 13, 2000), it would have been
`
`obvious to interchange the bus architecture shown in Figure 1 of Luick with
`
`the fully connected point-to-point architecture shown in Figure 5 of
`
`Kosaraju. Figure 1 of Luick discloses a multi-processor computer system
`
`having three processing nodes and a “global coherence unit”:
`
`34. The three processing nodes (Node 1, Node 2, and Node 3) in Figure 1 of
`
`Luick are interconnected using a Bus architecture. However, at the time
`
`Kosaraju was filed (December 13, 2000), it would have been obvious to one
`
`of ordinary skill in the art to alternatively employ a Fully Connected
`
`architecture to interconnect the three processing nodes (and the GCU).
`
`This is so because, as described above, the Fully Connected architecture has
`
`advantages over the Bus architecture. In a system having only three
`
`processing nodes, use of the Fully Connected architecture would be the
`
`most efficient because it offers low latency for small-scale systems with
`
`relatively few processing nodes. Indeed, use of a Bus architecture in Luick
`
`results in needless broadcasting of messages that need only be sent from
`
`one node to another.
`
`35. The ’121 Patent defines the term “probe” as “[a] mechanism for eliciting a
`
`response from a node to maintain cache coherency in a system.” ’121
`
`Patent at 5:45-48.
`
`36. Luick discloses a “request” for data that is received by the GCU. Luick at
`
`7:16. The GCU then determines which node has the most current copy of
`
`the data to be read and determines which node is to respond to the request.
`
`Luick at 7:17-21. The GCU then sends the request to the node which the
`
`GCU has determined should respond. Luick at 7:23-25. This functionality
`
`of the system disclosed in Luick maintains cache coherency because it
`
`prevents a node with a stale copy of the requested data from responding to
`
`the request. Accordingly, the “request” disclosed in Luick (either received
`
`by the GCU or sent from the GCU) is a “mechanism for eliciting a
`
`response from a node to maintain cache coherency in a system.” Therefore,
`
`the “request” disclosed in Luick meets the definition of “probe” as used in
`
`the ’121 Patent.
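
The request handling described in paragraph 36 can be sketched as follows. This is a minimal illustration under stated assumptions, not Luick's implementation: the class `GlobalCoherenceUnit`, the methods `note_write` and `route_request`, and the single-owner table are hypothetical simplifications.

```python
from typing import Dict, Optional


class GlobalCoherenceUnit:
    """Hypothetical model: tracks which node holds the most current copy."""

    def __init__(self) -> None:
        self.current_owner: Dict[int, str] = {}

    def note_write(self, node: str, line: int) -> None:
        # After a write, the writing node holds the most current copy.
        self.current_owner[line] = node

    def route_request(self, line: int) -> Optional[str]:
        # Select the single node that should respond to the request.
        # None means no node is recorded as holding the line.
        return self.current_owner.get(line)


gcu = GlobalCoherenceUnit()
gcu.note_write("node1", 0x100)
gcu.note_write("node2", 0x100)   # node2 now holds the newest copy
print(gcu.route_request(0x100))  # the request goes to node2, not node1
```

In this sketch the node holding a stale copy is never selected, which mirrors the coherency-preserving role the paragraph attributes to the request.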
`
`37. At the time Luick was issued (July 11, 2000), it would have been obvious to
`
`one of ordinary skill in the art to implement the GCU disclosed in Luick on
`
`an integrated circuit, and more specifically, on an application-specific
`
`integrated circuit. This is so because such an implementation was the most
`
`common, most performant, and least burdensome of the known methods at
`
`the time.
`
`38. Furthermore, at the time Luick was issued (July 11, 2000), integrated
`
`circuits were necessarily created with a set of semiconductor processing
`
`masks. Accordingly, it would have been obvious to one of ordinary
`
`skill in the art at this time to use a set of semiconductor processing masks
`
`representative of at least a portion of the GCU disclosed in Luick to create
`
`an integrated circuit implementing the GCU of Luick.
`
`IX. LUICK, KOSARAJU, AND KUSKIN
`
`39. At the time Luick was issued (July 11, 2000), it would have been obvious to
`
`one of ordinary skill in the art to combine the teachings of Luick and
`
`Kuskin. Both Luick and Kuskin disclose solutions to cache
`
`coherency problems. Luick discloses a cache coherent computer system
`
`with a plurality of microprocessors. Luick at Fig. 1, 3:65-4:4, 5:14-35.
`
`Kuskin, similarly, discloses a cache coherency architecture, but with a
`
`“programmable protocol processor for flexibility.” Kuskin at 303.
`
`Furthermore, prior to July 11, 2000, it was known to those of ordinary skill
`
`in the art that implementing cache coherence using a programmable
`
`microprocessor afforded more flexibility than, for example, an ASIC.
`
`Accordingly, it would have been obvious to combine Luick with Kuskin in
`
`order to obtain the flexibility taught by Kuskin while retaining the increased
`
`performance taught by Luick. As previously described, it would have been
`
`obvious to combine the teachings of Luick and Kosaraju. Accordingly, at
`
`the time Luick was issued (July 11, 2000), it would have been obvious to
`
`combine the teachings of Luick, Kosaraju, and Kuskin.
`
`40. Prior to July 11, 2000, it would have been obvious to a person of ordinary
`
`skill in the art to implement the GCU of Luick in a hardware description
`
`language (as shown in Kuskin) because that was the only commonly used
`
`method in the industry for designing hardware.
`
`X. LUICK, KOSARAJU, KUSKIN, AND PARK
`
`41. At the time Luick was issued (July 11, 2000), it would have been obvious to
`
`combine the teachings of Kuskin with Park. Kuskin discloses the Stanford
`
`FLASH multiprocessor, which implements its own cache-coherence
`
`protocol. Kuskin at 304 (“This section presents a base cache-coherence
`
`protocol and a base block-transfer protocol we have designed for
`
`FLASH.”). Park includes a more detailed description of this same Stanford
`
`FLASH cache coherency protocol. Park at § 4. Therefore, it would have
`
`been obvious to one of ordinary skill in the art to combine the teachings of
`
`Kuskin and Park because anyone studying the FLASH cache coherency
`
`protocol disclosed in Kuskin would look to Park for a more detailed
`
`description of the same system. As previously described, it would have
`
`been obvious to combine the teachings of Luick, Kosaraju, and Kuskin.
`
`Accordingly, at the time Luick was issued (July 11, 2000), it would have
`
`been obvious to combine the teachings of Luick, Kosaraju, Kuskin, and
`
`Park.
`
`42. Luick discloses a solution to cache coherency problems. Cache
`
`coherency problems do not exist in a “read-only” system where the data
`
`cannot be changed, i.e., it is not possible for the cached data to become
`
`incoherent. Indeed, Luick discloses embodiments concerning the writing of
`
`data. Luick at 7:65-9:51. Accordingly, anyone of ordinary skill in the art
`
`would understand the system disclosed in Luick to allow a node to request
`
`“read/write” access as well as “read” access.
`
`I declare under penalty of perjury that the foregoing is true and correct.
`
`Dated November __ , 2014
`
`Daniel J. Sorin
`