US006864896B2

Perego

(10) Patent No.: US 6,864,896 B2
(45) Date of Patent: Mar. 8, 2005

(54) SCALABLE UNIFIED MEMORY ARCHITECTURE

(75) Inventor: Richard E. Perego, San Jose, CA (US)

(73) Assignee: Rambus Inc., Los Altos, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 396 days.

(21) Appl. No.: 09/858,836

(22) Filed: May 15, 2001

(65) Prior Publication Data
US 2002/0171652 A1, Nov. 21, 2002

(51) Int. Cl.7 ................................ G06F 15/167
(52) U.S. Cl. ..................... 345/542; 345/543; 345/544
(58) Field of Search ............... 345/421, 542, 505, 531, 532, 515, 520, 535, 543, 544; 710/269; 712/201; 711/209, 120, 170, 111, 105, 103; 341/51; 375/240.12

(56) References Cited

U.S. PATENT DOCUMENTS

5,325,493 A *  6/1994  Herrell et al. ............ 712/201
5,867,180 A *  2/1999  Katayama et al. .......... 345/542
6,104,417 A *  8/2000  Nielsen et al. ........... 345/542
6,208,273 B1 * 3/2001  Dye et al. ................ 341/51
6,690,726 B1 * 2/2004  Yavits et al. ........ 375/240.12

* Cited by examiner

Assistant Examiner—Dalip K. Singh
(74) Attorney, Agent, or Firm—Lee & Hayes, PLLC

(57) ABSTRACT

A memory architecture includes a memory controller coupled to multiple modules. Each module includes a computing engine coupled to a shared memory. Each computing engine is capable of receiving instructions from the memory controller and processing the received instructions. The shared memory is configured to store main memory data and graphical data. Certain computing engines are capable of processing graphical data. The memory controller may include a graphics controller that provides instructions to the computing engine. An interconnect on each module allows multiple modules to be coupled to the memory controller.

34 Claims, 7 Drawing Sheets

[Representative drawing — FIG. 3: CPU/memory controller subsystem 302 with a CPU and memory controller/graphics controller 310, coupled to memory modules 304, each containing a rendering engine and a shared main and graphics memory; video input 316, display interface, and I/O controller 306 also shown.]

[Sheet 1 of 7 — FIG. 1 (Prior Art): memory controller coupled to a graphics controller with dedicated graphics memory, a display interface, and an I/O controller.]

[Sheet 2 of 7 — FIG. 2 (Prior Art): CPU/memory controller subsystem 202 with a combined memory controller/graphics controller, shared main memory and graphics memory, video input, display interface, and I/O controller.]

[Sheet 3 of 7 — FIG. 3: CPU/memory controller subsystem 302 with memory controller/graphics controller, rendering engine 312, shared main and graphics memory 314, video input, display interface, and I/O controller.]

[Sheet 4 of 7 — FIG. 4: flow diagram. The memory controller/graphics controller receives a processing task; partitions the processing task into multiple portions; distributes each portion of the processing task to a rendering engine; when all rendering engines are finished, reassembles the finished portions of the task.]

[Sheet 5 of 7 — FIG. 5: a rendering surface divided into horizontal and vertical pixel bands.]

[Sheet 6 of 7 — FIGS. 6 and 7: memory modules, each containing a rendering engine, memory devices (704), and module interconnects (706, 708).]

[Sheet 7 of 7 — FIG. 8: a memory module with two rendering engines, memory devices 804 and 812, and interconnects 806 and 808.]

SCALABLE UNIFIED MEMORY ARCHITECTURE

TECHNICAL FIELD

The present invention relates to memory systems and, in particular, to scalable unified memory architecture systems that support parallel processing of data.

BACKGROUND
`
`Various systems are available for storing data in memory
`devices and retrieving data from those memory devices.
`FIG. 1 illustrates a computer architecture 100 in which a
`discrete graphics controller supports a local graphics
`memory. Computer architecture 100 includes a central pro-
`cessing unit (CPU) 102 coupled to a memory controller 104.
`Memory controller 104 is coupled to a main memory 108, an
`I/O controller 106, and a graphics controller 110. The main
`memory is used to store program instructions which are
`executed by the CPU and data which are referenced during
`the execution of these programs. The graphics controller 110
`is coupled to a discrete graphics memory 112. Graphics
`controller 110 receives video data through a video input 114
`and transmits video data to other devices through a display
`interface 116.
`
`The architecture of FIG. 1 includes two separate memo-
`ries (main memory 108 and graphics memory 112), each
`controlled by a different controller (memory controller 104
`and graphics controller 110, respectively). Typically, graph-
`ics memory 112 includes faster and more expensive memory
`devices, while main memory 108 has a larger storage
`capacity, but uses slower, less expensive memory devices.
`Improvements in integrated circuit design and manufac-
`turing technologies allow higher levels of integration,
`thereby allowing an increasing number of subsystems to be
`integrated into a single device.
`'lhis increased integration
`reduces the total number of components in a system, such as
`a computer system. As subsystems with high memory per-
`formance requirements (such as graphics subsystems) are
`combined with the traditional main memory controller, the
`resulting architecture may provide a single high-
`performance main memory interface.
Another type of computer memory architecture is referred to as a unified memory architecture (UMA). In a UMA system, the graphics memory is statically or dynamically partitioned off from the main memory pool, thereby saving the cost associated with dedicated graphics memory. UMA systems often employ less total memory capacity than systems using discrete graphics memory to achieve similar levels of graphics performance. UMA systems typically realize additional cost savings due to the higher levels of integration between the memory controller and the graphics controller.

FIG. 2 illustrates another prior art memory system 200 that uses a unified memory architecture. A CPU/Memory Controller subsystem 202 includes a CPU 208 and a memory controller and a graphics controller combined into a single device 210. The subsystem 202 represents an increased level of integration as compared to the architecture of FIG. 1. Subsystem 202 is coupled to a shared memory 204, which is used as both the main memory and the graphics memory. Subsystem 202 is also coupled to an I/O controller 206, a video input 212, and a display interface 214.

The memory controller/graphics controller 210 controls all memory access, both for data stored in the main memory portion of shared memory 204 and for data stored in the graphics memory portion of the shared memory. The shared memory 204 may be partitioned statically or dynamically. A static partition allocates a fixed portion of the shared memory 204 as "main memory" and the remaining portion is the "graphics memory." A dynamic partition allows the allocation of shared memory 204 between main memory and graphics memory to change depending on the needs of the system. For example, if the graphics memory portion is full, and the graphics controller needs additional memory, the graphics memory portion may be expanded if a portion of the shared memory 204 is not currently in use or if the main memory allocation can be reduced.
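
By way of illustration only, the following C sketch models one way a dynamic partition of the shared memory might be tracked; the structure, field names, and growth policy are assumptions, as the specification does not define a particular allocation mechanism.

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical sketch of a dynamically partitioned shared memory.
     * Names and policy are illustrative only; the patent does not
     * specify any particular data structure or allocation scheme. */
    typedef struct {
        size_t total;     /* total capacity of the shared memory        */
        size_t main_mem;  /* bytes currently allocated as "main memory" */
        size_t graphics;  /* bytes currently allocated as graphics mem  */
    } shared_partition;

    /* Grow the graphics partition if unused capacity is available,
     * mirroring the text: the graphics portion may be expanded if a
     * portion of the shared memory is not currently in use. */
    static bool grow_graphics(shared_partition *p, size_t bytes)
    {
        size_t unused = p->total - p->main_mem - p->graphics;
        if (bytes > unused)
            return false;      /* would require shrinking main memory */
        p->graphics += bytes;
        return true;
    }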
Regardless of the system architecture, graphics rendering performance is often constrained by the memory bandwidth available to the graphics subsystem. In the system of FIG. 1, graphics controller 110 interfaces to a dedicated graphics memory 112. Cost constraints for the graphics subsystem generally dictate that a limited capacity of dedicated graphics memory 112 must be used. This limited amount of memory, in turn, dictates a maximum number of memory devices that can be supported. In such a memory system, the maximum graphics memory bandwidth is the product of the number of memory devices and the bandwidth of each memory device. Device-level cost constraints and technology limitations typically set the maximum memory device bandwidth. Consequently, graphics memory bandwidth, and therefore graphics performance, are generally bound by the small number of devices that can reasonably be supported in this type of system configuration.
Unified memory architectures, such as that shown in FIG. 2, help alleviate cost constraints as described above, and generally provide lower cost relative to systems such as that shown in FIG. 1. However, memory bandwidth for the system of FIG. 2 is generally bound by cost constraints on the memory controller/graphics controller 210. Peak memory bandwidth for this system is the product of the number of conductors on the memory data interface and the communication bandwidth per conductor. The communication bandwidth per conductor is often limited by the choice of memory technology and the topology of the main memory interconnect. The number of conductors that can be used is generally bound by cost constraints on the memory controller/graphics controller package or system board design. However, the system of FIG. 2 allows theoretical aggregate bandwidth to and from the memory devices to scale linearly with system memory capacity, which is typically much larger than the capacity of dedicated graphics memory. The problem is that this aggregate bandwidth cannot be exploited due to the limiting factors described above relating to bandwidth limitations at the memory controller/graphics controller.
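
As a worked illustration of these two bandwidth products, the C sketch below computes both peak figures; the device counts, conductor counts, and per-unit bandwidths are placeholder values assumed for the example, not figures from the specification.

    #include <stdio.h>

    int main(void)
    {
        /* Dedicated graphics memory (FIG. 1): peak bandwidth is the
         * product of device count and per-device bandwidth. The
         * numbers below are placeholders, not values from the patent. */
        double devices = 4.0, per_device_gbps = 1.6;       /* GB/s, assumed */
        double graphics_bw = devices * per_device_gbps;

        /* Unified memory (FIG. 2): peak bandwidth is the product of
         * the number of data conductors and bandwidth per conductor. */
        double conductors = 64.0, per_conductor_gbps = 0.05; /* assumed */
        double uma_bw = conductors * per_conductor_gbps;

        printf("dedicated graphics memory: %.1f GB/s\n", graphics_bw);
        printf("unified memory interface : %.1f GB/s\n", uma_bw);
        return 0;
    }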
A system architecture which could offer the cost savings advantages of a unified memory architecture, while providing scalability options to higher levels of aggregate memory bandwidth (and therefore graphics performance) relative to systems using dedicated graphics memory, would be advantageous.

SUMMARY

The systems and methods described herein achieve these goals by supporting the capability of locating certain processing functions on the memory modules, while providing the capability of partitioning tasks among multiple parallel functional units or modules.

In one embodiment, an apparatus includes a memory controller coupled to one or more modules. Each installed module includes a computing engine coupled to a shared memory. The computing engine is configured to receive information from the memory controller. The shared memory includes storage for instructions or data, which includes a portion of the main memory for a central processing unit (CPU).
In another embodiment, the computing engine is a graphics rendering engine.

In a particular implementation of the system, the shared memory is configured to store main memory data and graphical data.

Another embodiment provides that the computing engine is coupled between the memory controller and the shared memory.

In a particular embodiment, the memory controller includes a graphics controller.

In a described implementation, one or more modules are coupled to a memory controller. Each installed module includes a computing engine and a shared memory coupled to the computing engine.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a prior art computer architecture in which a discrete graphics controller supports a local graphics memory.

FIG. 2 illustrates another prior art memory system that uses a unified memory architecture.

FIG. 3 illustrates an embodiment of a scalable unified memory architecture that supports parallel processing of graphical data.

FIG. 4 is a flow diagram illustrating a procedure for partitioning a processing task into multiple portions and distributing the multiple portions to different rendering engines.

FIG. 5 illustrates a graphical rendering surface divided into sixteen different sections.

FIGS. 6 and 7 illustrate different memory modules containing a rendering engine and multiple memory devices.

FIG. 8 illustrates a memory module containing two different rendering engines and multiple memory devices associated with each rendering engine.

DETAILED DESCRIPTION

`
`The architecture described herein provides one or more
`discrete memory modules coupled to a con1i11on memory
`controller. Each memory module includes a computing
`engine and a shared memory. Thus, the data processing tasks
`can be distributed among the various computing engines and
`the memory controller to allow parallel processing of dif-
`ferent portions of the various processing tasks.
`Various examples are provided herein for purposes of
`explanation. These examples include the processing of
`graphical data by one or more rendering engines on one or
`more memory modules.
`It will be understood that
`the
`systems described herein are not
`limited to processing
`graphical data. Instead, the systems described herein may be
`applied to process any type of data from any data source.
FIG. 3 illustrates an embodiment of a scalable unified memory architecture 300 that supports parallel processing of graphical data and/or graphical instructions. A CPU/memory controller subsystem 302 includes a CPU 308 coupled to a memory controller/graphics controller 310. One or more memory modules 304 are coupled to memory controller/graphics controller 310 in subsystem 302. Each memory module 304 includes a rendering engine 312 and a shared memory 314 (i.e., main memory and graphics memory). The main memory typically contains instructions and/or data used by the CPU. The graphics memory typically contains instructions and/or data used to process, render, or otherwise handle graphical images or graphical information.

Rendering engine 312 may also be referred to as a "compute engine" or a "computing engine." The shared memory 314 typically includes multiple memory devices coupled together to form a block of storage space. Each rendering engine 312 is capable of performing various memory access and/or data processing functions. For the embodiment shown in FIG. 3, memory controller/graphics controller 310 is also coupled to an I/O controller 306, which controls the flow of data into and out of the system. An optional video input port 316 provides data to memory controller/graphics controller 310, and a display interface 318 provides data output to one or more devices (such as display devices or storage devices). For systems which support video input or capture capability, a video input port on the memory controller/graphics controller 310 is one way to handle the delivery of video source data. Another means of delivering video input data to the system would be to deliver the data from a peripheral module through the I/O controller 306 to device 310.
`
`In the example of FIG. 3, CPU/memory controller sub-
`system 302 is coupled to four distinct memory modules 304.
`Each memory module includes a rendering engine and a
`shared memory. Each rendering engine 312 is capable of
`performing various data processing functions. Thus, the four
`rendering engines are capable of performing four different
`processing functions simultaneously (i.e., parallel
`processing). Further, each rendering engine 312 is capable of
`communicating with other rendering engines on other
`memory modules 304.
`The memory controller/graphics controller 310 distributes
`particular processing tasks (such as graphical processing
`tasks)
`to the different rendering engines, and performs
`certain processing tasks itself. These tasks may include data
`to be processed and/or instructions to be processed.
`Although four memory modules 304 are shown in FIG. 3,
`alternate system may contain any number of memory mod-
`ules coupled to a common memory controller/graphics con-
`troller 310. This ability to add and remove memory modules
`' 304 provides an upgradeable and scalable memory and
`computing architecture.
The architecture of FIG. 3 allows the memory controller/graphics controller 310 to issue high-level primitive commands to the various rendering engines 312, thereby reducing the volume or bandwidth of data that must be communicated between the controller 310 and the memory modules 304. Thus, the partitioning of memory among multiple memory modules 304 improves graphical data throughput relative to systems in which a single graphics controller performs all processing tasks, and reduces bandwidth contention with the CPU. This bandwidth reduction occurs because the primitive commands typically contain significantly less data than the amount of data referenced when rendering the primitive. Additionally, the system partitioning described allows aggregate bandwidth between the rendering engines and the memory devices to be much higher than the bandwidth between the controller and memory modules. Thus, effective system bandwidth is increased for processing graphics tasks.
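
The following C sketch suggests why such a primitive command is small relative to the pixel data it references; the command layout, field names, and vertex format are hypothetical, since the specification does not define a command format.

    #include <stdint.h>

    /* Illustrative sketch only: a high-level primitive command of a
     * few tens of bytes, versus the thousands of pixels a rendering
     * engine may read and write in its local shared memory while
     * executing it. All names and fields are assumptions. */
    struct vertex { float x, y, z, r, g, b; };

    struct draw_triangle_cmd {      /* sent controller -> module   */
        uint32_t opcode;            /* e.g., a DRAW_TRIANGLE code  */
        uint32_t surface_id;        /* render target in shared mem */
        struct vertex v[3];         /* 3 vertices = 72 bytes       */
    };
    /* Rendering the triangle touches far more pixel data than the
     * command itself carries, so the command/data ratio stays small. */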

FIG. 4 is a flow diagram illustrating a procedure for partitioning a processing task into multiple portions and distributing the multiple portions to different rendering engines. Initially, the memory controller/graphics controller receives a processing task (block 402). The memory controller/graphics controller partitions the processing task into multiple portions (block 404) and distributes each portion of the processing task to a rendering engine on a memory module (block 406).
When all rendering engines have finished processing their portion of the processing task, the memory controller/graphics controller reassembles the finished portions of the task (block 410). In certain situations, the memory controller/graphics controller may perform certain portions of the processing task itself rather than distributing them to a rendering engine. In other situations, the processing task may be partitioned such that one or more rendering engines perform multiple portions of the processing task. In another embodiment, the processing task may be partitioned such that one or more rendering engines are idle (i.e., they do not perform any portion of the processing task).
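
A minimal control-flow sketch of the procedure of FIG. 4 appears below in C; the types and helper functions are hypothetical stand-ins, as the specification does not define a software interface.

    #include <stddef.h>

    /* Sketch of FIG. 4: partition a task, distribute the portions,
     * wait for all engines, then reassemble. All declarations here
     * are illustrative assumptions. */
    typedef struct task task;
    typedef struct portion portion;

    extern size_t partition(task *t, portion *out, size_t max);     /* block 404 */
    extern void   dispatch_to_engine(int engine, const portion *p); /* block 406 */
    extern int    engine_done(int engine);
    extern void   reassemble(task *t, const portion *p, size_t n);  /* block 410 */

    void process_task(task *t, int num_engines)
    {
        portion parts[16];
        size_t n = partition(t, parts, sizeof parts / sizeof parts[0]);

        for (size_t i = 0; i < n; i++)            /* distribute portions */
            dispatch_to_engine((int)(i % (size_t)num_engines), &parts[i]);

        for (int e = 0; e < num_engines; e++)     /* "all engines done?" */
            while (!engine_done(e))
                ;                                 /* poll, for simplicity */

        reassemble(t, parts, n);
    }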

For typical graphics applications, primitive data (e.g., triangles or polygons) are sorted by the graphics controller or CPU according to the spatial region of the rendering surface (e.g., the x and y coordinates) covered by that primitive. The rendering surface is generally divided into multiple rectangular regions of pixels (or picture elements), referred to as "tiles" or "chunks." Since the tiles generally do not overlap spatially, the rendering of the tiles can be partitioned among multiple rendering engines.

An example of the procedure described with respect to FIG. 4 will be described with reference to FIG. 5, which illustrates a graphical rendering surface divided into sixteen different sections or "tiles" (four rows and four columns). The graphical rendering surface may be stored in, for example, an image buffer or displayed on a display device. If a particular image processing task is to be performed on the graphical rendering surface shown in FIG. 5, the memory controller/graphics controller may divide the processing tasks into different portions based on the set of tiles intersected by the primitive elements of the task. For example, four of the sixteen tiles of the surface are assigned to a first rendering engine (labeled "RE0") on a first memory module. Another four tiles of the surface are assigned to a different rendering engine (labeled "RE1"), and so on. This arrangement allows the four different rendering engines (RE0, RE1, RE2, and RE3) to process different regions of the surface simultaneously. This parallel processing significantly reduces the processing burden on the memory controller/graphics controller. In this example, a significant portion of the set of graphics rendering tasks is performed by the four rendering engines, which allows the memory controller/graphics controller to perform other tasks while these rendering tasks are being processed.

In an alternate embodiment, each of the rendering engines is assigned to process a particular set of rows or columns of the rendering surface, where these rows or columns represent a band of pixels of any size (at least one pixel wide). As shown in FIG. 5, the rendering surface is divided into two or more horizontal pixel bands (also referred to as pixel rows) and two or more vertical pixel bands (also referred to as pixel columns). The specific example of FIG. 5 shows a rendering surface divided into four horizontal pixel bands and four vertical pixel bands. However, in alternate embodiments the rendering surface may be divided into any number of horizontal pixel bands and any number of vertical pixel bands. For example, the rendering surface may be divided into four sections or "quadrants" (i.e., two columns and two rows), in which each of the rendering engines is assigned to a particular quadrant of the rendering surface.
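
One possible tile-to-engine assignment for the sixteen-tile surface of FIG. 5 is sketched below; the round-robin interleaving is an assumption made for illustration, as the specification requires only that the tiles be divided among the four rendering engines.

    /* Sketch of one tile-to-engine mapping for a 4x4 tiled surface.
     * Interleaving across engines RE0..RE3 is assumed, not mandated. */
    #define TILE_COLS 4
    #define TILE_ROWS 4
    #define ENGINES   4

    static int engine_for_tile(int row, int col)
    {
        return (row * TILE_COLS + col) % ENGINES;   /* RE0..RE3 */
    }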

The example discussed above with respect to FIG. 5 is related to graphics or image processing. However, the system described herein may be applied to any type of processing in which one or more processing functions can be divided into multiple tasks that can be performed in parallel by multiple computing engines. Other examples include floating point numerical processing, digital signal processing, block transfers of data stored in memory, compression of data, decompression of data, and cache control. For example, a task such as decompression of a compressed MPEG (Moving Picture Experts Group) video stream can begin by partitioning the video stream into multiple regions or "blocks." Each block is decompressed individually and the results of all block decompressions are combined to form a complete video image. In this example, each block is associated with a particular rendering engine.
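
A brief C sketch of this MPEG example follows; the block and region types, the decompress_block() helper, and the round-robin assignment are illustrative assumptions, not interfaces from the specification.

    #include <stddef.h>

    /* Hypothetical sketch: a compressed frame is split into
     * independently decompressible blocks, each assigned to a
     * rendering engine; the results combine into one image. */
    typedef struct { const unsigned char *data; size_t len; } block;
    typedef struct { unsigned char *pixels; } image_region;

    extern void decompress_block(int engine, const block *b,
                                 image_region *out);

    void decompress_frame(const block *blocks, size_t nblocks,
                          image_region *regions, int num_engines)
    {
        for (size_t i = 0; i < nblocks; i++)   /* round-robin (assumed) */
            decompress_block((int)(i % (size_t)num_engines),
                             &blocks[i], &regions[i]);
        /* The combined regions form the complete video image. */
    }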

FIGS. 6 and 7 illustrate different memory modules containing a rendering engine and multiple memory devices. The multiple memory devices represent the shared memory 314 shown in FIG. 3. Although the memory modules shown in FIGS. 6 and 7 each contain eight memory devices, alternate memory modules may contain any number of memory devices coupled to a rendering engine.

Referring to FIG. 6, a memory module 600 includes eight memory devices 604 coupled to a rendering engine 602. The rendering engine 602 is also coupled to a module interconnect 606, which couples the memory module 600 to an associated interconnect on another module, motherboard, device, or other system. A memory interconnect 608 couples the memory devices 604 to the rendering engine 602. The memory interconnect 608 may allow parallel access to multiple memory devices 604, and may use a multi-drop topology, a point-to-point topology, or any combination of these topologies.

Referring to FIG. 7, a memory module 700 includes multiple memory devices 704 coupled to a rendering engine 702. The rendering engine 702 is also coupled to a pair of module interconnects 706 and 708, which couple the memory module 700 to another module, motherboard, device, or system. A memory interconnect 710 couples the memory devices 704 to the rendering engine 702 and may allow parallel access to multiple memory devices 704. In an alternate embodiment, module interconnects 706 and 708 are connected to form a contiguous interconnect through the module 700, connecting the rendering engine 702 to other devices in the system.

As discussed above, each memory module has its own memory interconnect for communicating data between the memory module's rendering engine and the various memory devices. The memory interconnect on each memory module is relatively short and provides significant bandwidth for communicating with the multiple memory devices on the module. Further, the simultaneous processing and communication of data on each memory module reduces the volume or bandwidth of data communicated between the memory modules and the memory controller/graphics controller. The memory interconnects discussed herein may be a bus, a series of point-to-point links, or any combination of these types of interconnects.

FIG. 8 illustrates a memory module 800 containing two different rendering engines 802 and 810, and multiple memory devices 804 and 812 associated with each rendering engine. In this embodiment, two different memory interconnects 806 and 814 couple the memory devices to the rendering engines. The two rendering engines 802 and 810 are coupled to a common module interconnect 808. Thus, rendering engines 802 and 810 can process data and/or tasks simultaneously. In another embodiment, each rendering engine 802 and 810 is coupled to an independent module interconnect (not shown).

A particular embodiment of the memory modules shown in FIGS. 6, 7, and 8 uses multiple "Direct Rambus Channels" developed by Rambus Inc. of Mountain View, Calif. The Direct Rambus Channels connect the multiple memory devices to the associated rendering engine. In this embodiment, the memory devices are Rambus DRAMs (or RDRAMs®). However, other memory types and topologies may also be used within the scope of this invention. For example, memory types such as Synchronous DRAMs (SDRAMs), Double-Data-Rate (DDR) SDRAMs, Fast-Cycle (FC) SDRAMs, or derivatives of these memory types may also be used. These devices may be arranged in a single rank of devices or multiple ranks, where a rank is defined as a group of devices that respond to a particular class of commands, including reads and writes.

In a particular embodiment of the architecture shown in FIG. 3, processing functions that are associated with typical graphics controller interface functions, such as video input processing and display output processing, are handled by the memory controller/graphics controller 310. Processing functions that are associated with high-bandwidth rendering or image buffer creation are handled by the rendering engines 304 on the memory modules.

Although particular examples are discussed herein as having one or two rendering engines, alternate implementations may include any number of rendering engines or computing engines on a particular module. Further, a particular module may contain any number of memory devices coupled to any number of rendering engines and/or computing engines.

Thus, a system has been described that provides multiple discrete memory modules coupled to a common memory controller. Each memory module includes a computing engine and a shared memory. A data processing task from the memory controller can be partitioned among the different computing engines to allow parallel processing of the various portions of the processing task.

Although the description above uses language that is specific to structural features and/or methodological acts, it is to be understood that the invention defined in the appended claims is not limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the invention.

What is claimed is:
1. An apparatus comprising:
a memory controller configured to partition processing tasks among a plurality of computing engines; and
at least one module coupled to the memory controller, each module including:
at least one computing engine configured to receive information associated with at least one of the partitioned processing tasks from the memory controller; and
a shared memory coupled to the computing engine, wherein the shared memory includes storage for data which includes a portion of the main memory for a central processing unit (CPU).
2. An apparatus as recited in claim 1 further including an interconnect configured to couple a plurality of modules to the memory controller.
3. An apparatus as recited in claim 1 wherein the computing engine is a graphics rendering engine.
4. An apparatus as recited in claim 1 wherein the shared memory includes at least one memory device.
5. An apparatus as recited in claim 1 wherein the shared memory is configured to store main memory instructions and data in addition to graphical data.
6. An apparatus as recited in claim 1 wherein the computing engine is coupled between the memory controller and the shared memory.
7. An apparatus as recited in claim 1 wherein the memory controller is configured to receive video data.
8. An apparatus as recited in claim 1 wherein the memory controller includes a graphics controller.
9. An apparatus as recited in claim 1 wherein each module includes a plurality of computing engines.
10. An apparatus as recited in claim 9 wherein computing engines on the same module can communicate with one another.
11. An apparatus as recited in claim 1 wherein computing engines on different modules can communicate with one another.
12. An apparatus as recited in claim 1 wherein the information received by the computing engine from the memory controller includes instructions to be processed by the computing engine.
13. An apparatus as recited in claim 1 wherein the information received by the computing engine from the memory controller includes data to be processed by the computing engine, the data associated with at least one section in a graphical rendering surface.
14. An apparatus comprising:
a memory controller configured to partition processing tasks among a plurality of rendering engines;
a plurality of memory modules coupled to the memory controller, each module including:
at least one rendering engine configured to receive data to be processed from the memory controller; and
a shared memory coupled to the rendering engine, wherein the shared memory includes storage for a portion of the main memory for a central processing unit (CPU) coupled to the memory controller.
15. An apparatus as recited in claim 14 wherein each shared memory includes a plurality of memory devices.
16. An apparatus as recited in claim 14 wherein the memory controller is configured to receive video data.
17. An apparatus as recited in claim 14 wherein each partitioned processing task is associated with at least one section of a graphical rendering surface.
18. An apparatus as recited in claim 14 further including an interconnect configured to couple the plurality of modules to the memory controller.
19. An apparatus as recited in claim 14 wherein each module includes a plurality of rendering engines.
20. An apparatus as recited in claim 14 wherein rendering engines on different modules can communicate with one another.
21. A module comprising:
a rendering engine configured to receive information from a memory controller, the information associated with at least one processing task partitioned by the memory controller; and
a shared memory coupled to the rendering engine, wherein the shared memory is configured to store main memory data as well as graphical data, wherein the module is configured to be coupled to or decoupled from the memory controller.
22. A module as recited in claim 21 wherein the rendering engine and the shared memory are mounted on a common substrate.
23. A module as recited in claim 21 wherein the rendering engine is a graphics rendering engine.
24. A module as recited in claim 21 wherein the shared memory includes at least one memory device.
25. A module as recited in claim 21 further including an interconnect configured to couple the module to a plurality of other modules.
26. A module as recited in claim 21 wherein the rendering engine communicates with rendering engines on other modules.
27. A module as recited in claim 21 wherein the information received by the rendering engine from the memory controller includes instructions to be processed by the rendering engine.
28. A module as recited in claim 21 wherein the information received by the rendering engine from the memory controller includes data to be processed by the rendering engine, the data associated with at least one section in a graphical rendering surface.
29. A method comprising:
receiving a processing task at a memory controller;
partitioning, by the memory controller, the processing task into a plurality of portions; and
distributing, by the memory controller, each portion of the processing task t