`US007546567B2
`
`c12) United States Patent
`Cheon et al.
`
`(10) Patent No.:
`(45) Date of Patent:
`
`US 7,546,567 B2
`Jun.9,2009
`
`(54)
`
`METHOD AND APPARATUS FOR
`GENERATING A VARIATION-TOLERANT
`CLOCK-TREE FOR AN INTEGRATED
`CIRCUIT CHIP
`
`(75)
`
`Inventors: Yongseok Cheon, Portland, OR (US);
`Pei-Hsin Ho, Portland, OR (US)
`
`(73)
`
`Assignee: Synopsys, Inc., Mountain View, CA
`(US)
`
`( *)
`
`Notice:
`
`Subject to any disclaimer, the term ofthis
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 191 days.
`
`(21)
`
`Appl. No.: 11/652,302
`
`(22) Filed:
`
`Jan.10,2007
`
`(65)
`
`(51)
`
`(52)
`(58)
`
`(56)
`
`Prior Publication Data
`
`US 2008/0168412Al
`
`Jul. 10, 2008
`
`Int. Cl.
`G06F 17150
`(2006.01)
`U.S. Cl. .................................. 716/6; 716/7; 716/10
`Field of Classification Search ....................... None
`See application file for complete search history.
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`Minami ......................... 716/6
`Cheng et al . .................. 716/10
`Srinivasan et al .
`............ 716/10
`Auracher et al.
`.............. 716/13
`
`A *
`9/1996
`Bl*
`4/2002
`Bl*
`2/2004
`B2 * 10/2006
`
`5,557,779
`6,367,060
`6,698,006
`7,117,472
`
`7,257,782 B2 *
`2008/0216040 Al*
`
`8/2007 Ho et al. ........................ 716/2
`9/2008 Furnish et al.
`................ 716/10
`
`OTHER PUBLICATIONS
`
`Cheon et al., "Power-Aware Placement," Design Automation Con(cid:173)
`ference 2005, pp. 795-800.*
`Rajaram et al., "Reducing Clock Skew Variability via Cross Links,"
`Design Automation Conference 2004, pp. 18-23.*
`Velenis et al., "A Clock Tree Topology Extraction Algorithm for
`Improving the Tolerance of Clock Distribution Networks to Delay
`Uncertainty," 2001 IEEE, pp. 422-425.*
`
`* cited by examiner
`
`Primary Examiner-Leigh Marie Garbowski
`(74) Attorney, Agent, or Firm-Park, Vaughan & Fleming
`LLP
`
`(57)
`
`ABSTRACT
`
`One embodiment of the present invention relates to a process
`that generates a clock-tree on an integrated circuit (IC) chip.
`During operation, the process starts by receiving a placement
`for a chip layout, where the placement includes a set of
`registers at fixed locations in the chip layout. The process then
`generates a timing criticality profile for the set of registers,
`wherein the timing criticality profile specifies timing criticali(cid:173)
`ties between pairs of registers in the set of registers. Next, the
`process clusters the set of registers based on the timing criti(cid:173)
`cality profile to create a clock-tree for the set of registers. By
`clustering the registers based on the timing criticality profile,
`the process facilitates using commonly-shared clock paths in
`the clock-tree to provide clock signals to timing critical reg(cid:173)
`ister pairs.
`
`25 Claims, 6 Drawing Sheets
`
`RECEIVE A PLACEMENT
`402
`
`GENERATE A TIMING
`CRITICALITY PROFILE FROM
`THE PLACEMENT
`404
`
`PERFORM A GEOMETRIC
`LOCATION-BASED
`CLUSTERING TO GENERATE A
`TEMPORARY PARTlAL-CLOCK(cid:173)
`TREE
`406
`
`EXTRACT A SET OF
`CONSTRAINTS FROM THE
`TEMPORARY PARTIAL-CLOCK(cid:173)
`TREE
`408
`
`DISCARD THE TEMPORARY
`PARTIAL-CLOCK-TREE
`410
`
`PERFORM A TIMING
`CRITICALITY-BASED
`CLUSTERING UNDER THE SET
`OF CONSTRAINTS TO
`GENERATE A PARTIAL(cid:173)
`CLOCK-TREE
`412
`
`Page 1 of 14
`
`
`
`-....l = N
`
`0--,
`tit
`0--,
`~
`tit
`-....l
`d r.,;_
`
`O'I
`0 .....
`....
`.....
`rJJ =- ('D
`
`('D
`
`\,Ci
`0
`0
`N
`~\,Ci
`
`~ = ?
`
`~ = ~
`
`~
`~
`~
`•
`00
`~
`
`1;g
`
`Prata
`
`FIG. 1
`
`128
`
`126
`
`124
`
`122
`
`120
`
`118
`
`Design
`System ) Logic Design) Synthesis & )
`
`Test 116
`Design for
`
`Verif. 114
`and Fune.
`
`112
`
`) Resolution ) ~ask )
`
`Enhanc.
`
`Verfication
`Physical
`
`Implement. & Extract.
`) Analys~ )
`Physical
`
`Planning
`) Design )
`
`Verification
`
`Netlist
`
`160
`
`Assembly
`Pad<aging ) ~
`
`170
`Chips
`
`&
`
`) Fabrication ) )
`
`150
`
`EDA ) Tape-
`
`110
`
`Software
`
`140
`out
`
`Product ))
`
`100
`Idea
`
`Page 2 of 14
`
`
`
`U.S. Patent
`US. Patent
`
`Jun. 9, 2009
`Jun.9,2009
`
`Sheet 2 of6
`Sheet 2 of 6
`
`US 7,546,567 B2
`US 7,546,567 B2
`
`
`
`208
`
`FIG. 2
`FIG. 2
`
`Page 3 of 14
`
`Page 3 of 14
`
`
`
`U.S. Patent
`
`Jun.9,2009
`
`Sheet 3 of 6
`
`US 7,546,567 B2
`
`CLOCK 302
`
`310\ _________________ _
`
`(-3t~-------.I
`
`--:--
`:
`:
`
`I
`
`I
`I
`I
`I
`I
`I
`I
`
`I
`I
`I
`I
`I
`I
`
`- - -1
`I
`I
`I
`I
`I
`I
`I
`304
`___ J
`
`c32~
`I
`I
`I
`I
`I
`I
`I
`1314
`I
`
`FIG. 3
`
`Page 4 of 14
`
`
`
`U.S. Patent
`
`Jun.9,2009
`
`Sheet 4 of 6
`
`US 7,546,567 B2
`
`START
`
`RECEIVE A PLACEMENT
`402
`
`GENERATE A TIMING
`CRITICALITY PROFILE FROM
`THE PLACEMENT
`404
`
`PERFORM A GEOMETRIC
`LOCATION-BASED
`CLUSTERING TO GENERATE A
`.-----t1.iTEMPORARY PARTIAL-CLOCK(cid:173)
`TREE
`406
`
`EXTRACT A SET OF
`CONSTRAINTS FROM THE
`TEMPORARY PARTIAL-CLOCK(cid:173)
`TREE
`408
`
`DISCARD THE TEMPORARY
`PARTIAL-CLOCK-TREE
`410
`
`PERFORM A TIMING
`CRITICALITY-BASED
`CLUSTERING UNDER THE SET
`OF CONSTRAINTS TO
`GENERATE A PARTIAL(cid:173)
`CLOCK-TREE
`412
`
`NO
`
`END
`
`FIG. 4
`
`Page 5 of 14
`
`
`
`U.S. Patent
`
`Jun.9,2009
`
`Sheet 5 of 6
`
`US 7,546,567 B2
`
`START
`
`PRIORITIZE LEAF-LEVEL
`REGISTERS BASED ON THE
`TIMING CRITICALITIES
`502
`
`START WITH THE HIGHEST PRIORITY REGISTER
`PAIR, AND FOR EACH PRIORITIZED REGISTER PAIR
`
`CLUSTER THE REGISTER
`PAIR INTO A SAME CLUSTER
`504
`
`CHECK THE CLUSTER
`CONTAINING THE REGISTER
`PAIR AGAINST THE SET OF
`CONSTRAINTS
`506
`
`YES
`
`DO NOT INCLUDE
`THE REGISTER PAIR
`IN THE SAME
`CLUSTER
`510
`
`NO
`
`INCLUDE THE REGISTER PAIR
`IN THE SAME CLUSTER
`512
`
`GO TO THE NEXT HIGHEST PRIORITY REGISTER PAIR
`
`CLUSTER THE REMAINING
`REGISTERS USING A
`LOCATION-BASED
`CLUSTERING PROCESS
`514
`
`END
`
`FIG. 5
`
`Page 6 of 14
`
`
`
`U.S. Patent
`
`Jun.9,2009
`
`Sheet 6 of 6
`
`US 7,546,567 B2
`
`CHIP 600
`
`CLUSTER
`602
`
`CLUSTER
`606
`
`CLUSTER
`604
`
`CLUSTER
`608
`
`FIG. 6A
`
`CHIP 600
`
`610
`
`624
`
`612
`
`626
`
`614
`
`y7
`
`616
`
`FIG. 6B
`
`Page 7 of 14
`
`
`
`US 7,546,567 B2
`
`1
`METHOD AND APPARATUS FOR
`GENERATING A VARIATION-TOLERANT
`CLOCK-TREE FOR AN INTEGRATED
`CIRCUIT CHIP
`
`BACKGROUND
`
`2
`tion networks to delay uncertainty," ISCAS 2001). However,
`this technique does not use any physical proximity informa(cid:173)
`tion to guide clock-tree synthesis, and therefore suffers from
`problems such as unbalanced tree topology, larger wire length
`5 overhead, and higher power consumption.
`Hence, what is needed is a method and an apparatus for
`creating an OCV-tolerant clock-tree without the problems
`described above.
`
`SUMMARY
`
`1. Field of the Invention
`The present invention relates to techniques for designing
`clock distribution networks for integrated circuit (IC) chips. 10
`More specifically, the present invention relates to a method
`and an apparatus for generating a clock-tree on an IC chip to
`facilitate reducing the effects of on-chip variation (OCV).
`2. Related Art
`Advances in semiconductor technology presently make it
`possible to integrate large-scale systems, including hundreds
`of millions of transistors, onto a single semiconductor chip.
`Integrating such large-scale systems onto a single semicon(cid:173)
`ductor chip increases the speed at which such systems can
`operate, because signals between system components do not
`have to cross chip boundaries, and are not subject to lengthy
`chip-to-chip propagation delays.
`The speed of a system on an integrated circuit (IC) chip is
`largely determined by system clock frequency. In a typical
`synchronous IC chip, a clock distribution network (referred to
`as a "clock-tree") is used to distribute a clock signal from a
`common source to various circuit components. This clock
`signal is used to coordinate data transfers between circuit
`components. However, as
`increasing clock frequencies
`reduce the clock periods to fractions of a nanosecond, design- 30
`ing clock-trees is becoming increasingly more challenging.A
`direct result of the decreasing clock period is a shrinking
`"timing budget" between logically coupled clock sinks. This
`decreasing timing budget is requiring clock-trees to have
`minimal clock skew.
`Many sources contribute to clock skew in a clock-tree.
`Among these sources, "variations" have become one of the
`more significant challenges
`in synchronous clock-tree
`design. These variations can include: manufacturing process
`variations, operational voltage variations, and ambient tem(cid:173)
`perature variations. In particular, some of these variations
`occur within a chip boundary, and are hence referred to as
`"on-chip variations" (OCV). Due to the impact of OCV, the
`timing characteristics of instances of the same component
`may vary across the chip, thereby limiting the performance of
`the chip, and even threatening the functionality of the chip.
`Furthermore, OCV causes uncertainty in clock arrival times
`at circuit components. This uncertainty can cause clock skew
`and can thereby worsen the timing performance of the data
`paths between the clock sinks.
`In order to reduce the effects of OCV, some systems insert
`shunt connections called "cross links" into a clock-tree struc(cid:173)
`ture in a post-processing step (see A. Rajaram, J. Hu and R.
`Mahapatra, "Reducing Clock Skew Variability via Cross
`Links," IEEE Trans. Computer-Aided Design, Vol. 25, No. 6,
`pp. 1176-1182, June, 2006). These cross links can increase
`the amount of clock-path sharing between registers, thereby
`improving OCV-tolerance. However, this technique requires
`significantly more routing resources (wires) than would be
`needed by a typical clock-tree. Furthermore, the timing char(cid:173)
`acteristics of the cross links are generally difficult to analyze.
`Another disadvantage of this technique is that additional
`wires also increase overall power consumption of the chip.
`Other systems address the OCV issue by sequentially
`merging timing-critical pairs ofregisters based on a priority
`ordering (see D. Velenis, et al., "A clock tree topology extrac(cid:173)
`tion algorithm for improving the tolerance of clock distribu-
`
`One embodiment of the present invention relates to a pro(cid:173)
`cess that generates a clock-tree on an integrated circuit (IC)
`chip. During operation, the process starts by receiving a
`15 placement for a chip layout, where the placement includes a
`set of registers at fixed locations in the chip layout. The
`process then generates a timing criticality profile for the set of
`registers, wherein the timing criticality profile specifies tim(cid:173)
`ing criticalities between pairs of registers in the set of regis-
`20 ters. Next, the process clusters the set of registers based on the
`timing criticality profile to create a clock-tree for the set of
`registers. By clustering the registers based on the timing
`criticality profile, the process facilitates using commonly(cid:173)
`shared clock paths in the clock-tree to provide clock signals to
`25 timing critical register pairs.
`In a variation on this embodiment, the process obtains the
`timing criticality between a pair of registers by computing a
`timing slack between the pair of registers based on the
`received placement.
`In a further variation on this embodiment, the process
`computes the timing slack between the pair of registers by
`computing a data path delay and a clock skew between the
`pair of registers based on the received placement.
`In a variation on this embodiment, the process generates
`35 the timing criticality profile for the set of registers by con(cid:173)
`structing a graph G(Y, E), wherein Vis the set of registers and
`E is the set of edges between the set of registers weighted by
`the corresponding timing criticalities.
`In a variation on this embodiment, the process clusters the
`40 set of registers based on the timing criticality profile to create
`a clock-tree by: clustering the set of registers at a leaf-level to
`generate a disjoint set ofleaf-level clusters; assigning a clock
`buffer to each of the leaf-level clusters to generate a set of
`clock-buffers; and clustering the set of clock-buffers to gen-
`45 erate a disjoint set ofnon-leaflevel clusters.
`In a further variation on this embodiment, the process
`clusters the set of registers at the leaf-level by: prioritizing the
`pairs of registers based on the associated timing criticalities,
`wherein a more timing critical pair of registers receives a
`50 higher priority; and attempting to assign a pair of registers to
`the same cluster based on the associated priority, thereby
`substantially maximizing the inclusion of the timing critical
`register pairs into the same clusters.
`In a further variation, registers within the same cluster
`55 share clock-buffers and clock-nets along associated clock
`paths in the clock-tree.
`In a further variation, prior to clustering the set of registers
`based on the timing criticality profile, the process clusters the
`set of registers using a geometric location-based clustering
`60 process to obtain a temporary partial-clock-tree, which com(cid:173)
`prises a disjoint set of clusters. The process then extracts a set
`of constraints from the set of clusters. Next, the process
`discards the temporary partial-clock-tree.
`In a further variation, the set of constraints can include: a
`65 bounding box size for a cluster; a bounding box shape for a
`cluster; a fanout within a cluster; and a capacitance within a
`cluster.
`
`Page 8 of 14
`
`
`
`US 7,546,567 B2
`
`3
`In a further variation, the process clusters the set of regis(cid:173)
`ters based on the timing criticality profile by applying the set
`of constraints to each of the leaf-level clusters and the non(cid:173)
`leaf-level clusters to ensure that the constraints are not vio(cid:173)
`lated.
`In a variation on this embodiment, using commonly-shared
`clock paths in the clock-tree to provide clock signals to the
`timing critical register pairs improves timing performance of
`the IC chip by reducing the impact of on-chip-variations in
`creating clock skew between registers in timing critical reg(cid:173)
`ister pairs.
`In a further variation on this embodiment, the on-chip(cid:173)
`variations can include: process variations; voltage variations;
`and temperature variations.
`
`BRIEF DESCRIPTION OF THE FIGURES
`
`FIG.1 illustrates various steps in the design and fabrication
`of an integrated circuit in accordance with an embodiment of
`the present invention.
`FIG. 2 illustrates the effect ofOCV on an exemplary circuit
`in accordance with an embodiment of the present invention.
`FIG. 3 illustrates an exemplary clock-tree structure with
`leaf-level registers in accordance with an embodiment of the
`present.
`FIG. 4 presents a flowchart illustrating the process of gen(cid:173)
`erating a clock-tree to improve timing performance in accor(cid:173)
`dance with an embodiment of the present.
`FIG. 5 presents a flowchart illustrating the process of clus(cid:173)
`tering leaf-level registers based on both timing criticality
`profile and constraints from the location-based clustering in
`accordance with an embodiment of the present.
`FIG. 6A illustrates an exemplary leaf-level-clustering
`result in accordance with an embodiment of the present.
`FIG. 6B illustrates the non-leaf-level abstraction of the
`clock buffers in FIG. 6A in accordance with an embodiment
`of the present.
`
`DETAILED DESCRIPTION
`
`The following description is presented to enable any per(cid:173)
`son skilled in the art to make and use the invention, and is
`provided in the context of a particular application and its
`requirements. Various modifications to the disclosed embodi(cid:173)
`ments will be readily apparent to those skilled in the art, and
`the general principles defined herein may be applied to other
`embodiments and applications without departing from the
`spirit and scope of the present invention. Thus, the present
`invention is not limited to the embodiments shown, but is to
`be accorded the widest scope consistent with the principles
`and features disclosed herein.
`
`Integrated Circuit Design Flow
`FIG.1 illustrates various steps in the design and fabrication
`of an integrated circuit in accordance with an embodiment of
`the present invention.
`The process starts with the product idea ( step 100) which is
`realized using an EDA software design process (step 110).
`When the design is finalized, it can be taped-out ( event 140).
`After tape out, the fabrication process (step 150) and pack(cid:173)
`aging and assembly processes (step 160) are performed
`which ultimately result in finished chips (result 170).
`The EDA software design process (step 110), in tum, com(cid:173)
`prises steps 112-130, which are described below. Note that
`the design flow description is for illustration purposes only.
`This description is not meant to limit the present invention.
`For example, an actual integrated circuit design may require
`
`5
`
`20
`
`4
`the designer to perform the design steps in a different
`sequence than the sequence described below. The following
`discussion provides further details of the steps in the design
`process.
`System design (step 112): The designers describe the func-
`tionality that they want to implement. They can also perform
`what-if planning to refine functionality, check costs, etc.
`Hardware-software architecture partitioning can occur at this
`stage. Exemplary EDA software products from Synopsys,
`10 Inc. that can be used at this step include Model Architect,
`Saber, System Studio, and Design Ware® products.
`Logic design and functional verification (step 114): At this
`stage, the VHDL or Verilog code for modules in the system is
`written and the design is checked for functional accuracy.
`15 More specifically, the design is checked to ensure that it
`produces the correct outputs. Exemplary EDA software prod(cid:173)
`ucts from Synopsys, Inc. that can be used at this step include
`VCS, VERA, Design Ware®, Magellan, Formality, ESP and
`LEDA products.
`Synthesis and design for test (step 116): Here, the VHDL/
`Verilog is translated to a netlist. The netlist can be optimized
`for the target technology. Additionally, tests can be designed
`and implemented to check the finished chips. Exemplary
`EDA software products from Synopsys, Inc. that can be used
`25 at this step include Design Compiler®, Physical Compiler,
`Test Compiler, Power Compiler, FPGA Compiler, Tetramax,
`and Design Ware® products.
`Netlist verification (step 118): At this step, the netlist is
`checked for compliance with timing constraints and for cor-
`30 respondence with the VHDLNerilog source code. Exemplary
`EDA software products from Synopsys, Inc. that can be used
`at this step include Formality, Prime Time, and VCS products.
`Design planning (step 120): Here, an overall floorplan for
`the chip is constructed and analyzed for timing and top-level
`35 routing. Exemplary EDA software products from Synopsys,
`Inc. that can be used at this step include Astra and IC Com(cid:173)
`piler products.
`Physical implementation (step 122): The placement (posi-
`40 tioning of circuit elements) and routing ( connection of the
`same) occurs at this step. Exemplary EDA software products
`from Synopsys, Inc. that can be used at this step include the
`Astra and IC Compiler products.
`Analysis and extraction (step 124): At this step, the circuit
`function is verified at a transistor level, this in turn permits
`what-if refinement. Exemplary EDA software products from
`Synopsys, Inc. that can be used at this step includeAstroRail,
`PrimeRail, Primetime, and Star RC/XT products.
`Physical verification (step 126): In this step, the design is
`50 checked to ensure correctness for manufacturing, electrical
`issues, lithographic issues, and circuitry. Exemplary EDA
`software products from Synopsys, Inc. that can be used at this
`step include the Hercules product.
`Resolution enhancement (step 128): This step involves
`55 geometric manipulations of the layout to improve manufac(cid:173)
`turability of the design. Exemplary EDA software products
`from Synopsys, Inc. that can be used at this step include
`Proteus, ProteusAF, and PSMGen products.
`Mask data preparation (step 130): This step provides the
`60 "tape-out" data for production of masks to produce finished
`chips. Exemplary EDA software products from Synopsys,
`Inc. that can be used at this step include the CATS(R) family
`of products.
`Embodiments of the present invention can be used during
`65 one or more of the above described steps. Specifically, one
`embodiment of the present invention can be used during the
`physical implementation step 122.
`
`45
`
`Page 9 of 14
`
`
`
`US 7,546,567 B2
`
`5
`Overview of the Clock-Three Generation Technique
`The present invention provides a technique for improving
`overall timing performance of an IC chip in the presence of
`clock variations. Specifically, using a placed IC design as an
`input, the present invention performs a clock-tree synthesis
`(CTS) operation on the IC design to maximize commonly(cid:173)
`shared clock paths (including clock buffers and clock nets) to
`the registers in the IC design if they have tighter timing
`constraints between them.
`More specifically, the present invention first extracts tim(cid:173)
`ing information as a preprocessing step of the CTS, so that
`timing critical register pairs can be identified from the placed
`design netlist. Next, starting from leaf-level clustering, the
`present invention attempts to maximize the inclusion of the
`timing critical register pairs into the same cluster.
`Furthermore, one embodiment of the present invention
`uses a two-pass clustering procedure to minimize the negative
`effects of a pure timing-based clustering on general CTS
`qualities (such as maximum global skew, clock wire length,
`power, and insertion delay). In the first pass, a location-based 20
`general purpose clustering is performed, and a disjoint set of
`clusters is obtained. The statistics on the capacitances and
`bounding boxes of the clusters are then extracted from the set
`of clusters as guiding constraints for the second pass. In the
`second pass, the clustering is re-done based on both the tim- 25
`ing criticality profile and a set of constraints from the first
`pass, so that the set of constraints (e.g., maximum capaci(cid:173)
`tance, maximum fanout, etc.) are not violated.
`The two-pass CTS procedure is performed in a bottom-up
`manner, and can be extended to non-leaf levels. Note that 30
`each resulting cluster is driven by a single clock buffer ( or
`inverter), both at the leaf-level and at non-leaflevels.
`
`6
`Note that variations, such as processing variations (e.g.,
`width of the clock wire), voltage variations, and temperature
`variations can result in uncertainties in clock path delays C 1
`and C2 . Because these uncertainties can have both positive
`5 and negative values, a worst case scenario is typically con(cid:173)
`sidered. For example, the uncertainties can simultaneously
`cause an increase ofC 1 and a decrease ofC2 . Hence the clock
`skew ti. becomes significantly larger, and the above timing
`constraint becomes more stringent. Furthermore, if there is
`10 not a sufficient positive slack, this increase of clock skew can
`cause a violation of the timing constraint more easily which
`can lead to failures in the circuits.
`Note that if there are more commonly-shared portions in
`the clock paths (including clock buffers and clock nets) from
`15 the clock source to different clock sinks, the data path
`between the pair of clock sinks becomes less vulnerable to the
`variations, because the commonly-shared clock path up to a
`common point does not contribute to clock skew uncertain-
`ties. In one embodiment of the present invention, the effect of
`shared clock paths can be measured by using a method called
`"clock reconvergence pessimism removal" or CRPR.
`FIG. 3 illustrates an exemplary clock-tree structure 300
`with leaf-level registers in accordance with an embodiment of
`the present.
`Clock-tree structure 300 includes a main clock source 302
`at the root of clock-tree 300, two middle levels of clock(cid:173)
`buffers, and a leaf-level comprising a set of clustered regis(cid:173)
`ters. Registers 304 and 306 form a launcher/capturer pair
`which is logically coupled through a data path 308. When
`comparing the two clock paths from clock 302 to registers
`304 and 306, note that branch 310 and branch 312 are the
`non-shared "private" sections of the two clock paths, respec(cid:173)
`tively. These two private branches contain a total of four
`unshared clock buffers.
`Registers 314 and 316 form another launcher/capturer pair
`which is logically coupled through a data path 318. Note that
`the clock paths for registers 314 and 316 are mostly shared,
`and the private sections of the two clock paths are branch 320
`and branch 322, respectively. These two private branches are
`both local branches which contain no clock buffers. Note that
`register pair 314/316 achieves a significantly more clock-path
`sharing than the register pair 304/306.
`In one embodiment of the present invention, the total num(cid:173)
`ber of unshared clock-buffers in both clock paths of a
`launcher/capturer pair is used to indicate a severity of the
`effects due to OCV. In this embodiment, OCV causes a much
`more severe impact on data path 308 than on data path 318.
`Note that generally registers within a same cluster, for
`example in cluster 324, are less vulnerable to OCV than
`register pairs in different clusters, for example, between clus(cid:173)
`ter 326 and cluster 328. Based on above observation, we
`provide an OCV-aware CTS technique below.
`
`Variations Impact on Clock Skew and Timing Constraints
`Note that on-chip variation (OCV) increases the clock 35
`skew uncertainties which subsequently worsen timing con(cid:173)
`straints in the circuits. Due to these variations, a certain clock
`path (to a launcher) can be slower than expected; while
`another clock path ( to a capturer) can be faster than expected.
`A critical timing path clocked by such a clock-tree will paten- 40
`tially violate a specified timing constraint.
`FIG. 2 illustrates the effect ofOCV on an exemplary circuit
`in accordance with an embodiment of the present invention.
`FIG. 2 includes two logically coupled registers 202 and
`204, and a clock source 206. Register 202 is coupled to 45
`register 204 through a data path 208 which includes a com(cid:173)
`binational logic 210. Note that in this configuration, we refer
`to register 202 as a "launcher" and register 204 as a "cap(cid:173)
`turer", respectively. A clock signal generated by clock source
`206 is distributed to registers 202 and 204 through two dif- 50
`ferent clock paths 212 and 214, which are associated with
`clock path delays C 1 and C2 , respectively. Ideally, C 1 and C2
`are designed to be identical. In reality, there almost always
`exists a difference between them, which is referred to as the
`clock skew li.=C 1-C2 . Note that in FIG. 2, clock path (wire)
`212 is shorter than clock path (wire) 214.
`During a clock cycle, new data presented at the outputs of
`register 202 moves downstream through data path 208 and is
`latched onto the inputs of register 204. During normal circuit
`operation, the timing constraint stipulates that a delay d asso- 60
`ciated with data path 208 plus a setup time t associated with
`register 204 is less than the clock period T, i.e., d+t<T. Fur(cid:173)
`thermore, due to the non-zero clock skew resulting from C 1
`and C2 , the timing constraint becomes C 1 +d+t<T +C2 , which
`can be rewritten in terms of the clock skew as: li.+d+t<T. The 65
`difference (T-t)-(li.+d) is referred to as the "slack," and hence
`a negative slack results in a violation of the timing constraint.
`
`55
`
`Generating a OCV-Tolerant Clock-Tree
`FIG. 4 presents a flowchart illustrating the process of gen(cid:173)
`erating a clock-tree to improve timing performance in accor(cid:173)
`dance with an embodiment of the present.
`During operation, the process starts by receiving a place(cid:173)
`ment of an IC chip design (step 402). Specifically, in this IC
`chip design all the registers and the corresponding combina(cid:173)
`tional logics between registers have been placed at respective
`locations within the chip boundary, and the connectivities
`between registers are specified in a design netlist file.
`The process then generates a timing criticality profile for
`the set of registers, wherein the timing criticality profile
`specifies timing criticalities between pairs of launcher/cap-
`
`Page 10 of 14
`
`
`
`US 7,546,567 B2
`
`7
`turer in the set of registers (step 404). As a result, timing
`critical register pairs can be identified from the placed design
`netlist.
`Note that a timing criticality can be computed based on the
`slack described above, i.e., (T-t)-(ll+d), wherein T is the
`clock period, t is the setup time of the capturer, and d is the
`data path delay which can be obtained from the placed design
`netlist. However, because the clock-tree has not been estab(cid:173)
`lished at this stage, an exact clock skew is unknown. In one
`embodiment, an ideal clock skew of zero is assumed. In a
`further embodiment, a clock skew of a predetermined value is
`assumed (for example, by making a worst case estimate of the
`clock skew).
`One embodiment of the present invention represents the
`timing criticality profile as a graph G(V, E), wherein V rep(cid:173)
`resents the set of registers and Eis the set of edges between the
`set of registers weighted by their corresponding timing criti(cid:173)
`calities. Specifically, an edge between each pair oflauncher/
`capturer is assigned a weight (for example, from Oto 10)
`based on its timing criticality. For example, in FIG. 3, the edge
`between register pair 304/306 will be assigned a higher
`weight than the edge between register pair 314/316 if the
`former has a longer data path delay. In order to reduce the
`complexity of the timing graph G, only those edges associ(cid:173)
`ated with sufficiently large weights will be included in the
`graph, and considered as timing critical. Consequently, reg(cid:173)
`isters with relatively low timing criticalities are not consid(cid:173)
`ered for their timing profile during clustering.
`Note that the timing graph G can be overlaid on the placed
`chip layout using two-dimensional (2D) distance-coordi(cid:173)
`nates, and the set of edges can be displayed as straight lines
`connecting timing critical register pairs (as the end points of
`an edge).
`
`Performing a Location-Based CTS
`A conventional location-based clustering technique aims
`to minimize the clock skew, power consumption, wire length,
`etc., by clustering leaf-level registers and upper-level clock(cid:173)
`buffers based on the physical proximity of the components at
`each level in the placement configuration. For example, when
`two registers are placed far apart from each other, the tech(cid:173)
`nique tends to avoid clustering them together which would
`lead to larger clock skew. However, it is possible that these
`two physically distant registers are associated by a high tim(cid:173)
`ing criticality. This timing profile is not considered by the
`conventional location-based CTS. On the other hand, two
`physically nearby registers can be logically unrelated, and
`therefore are not required to be clustered together. However,
`this is also not considered by the conventional location-based
`CTS. In contract, the present invention takes into account
`both geometric and timing information while creating a
`clock-tree.
`Referring back to FIG. 4, the process next performs a 55
`first-pass location-based clustering on the set of registers to
`generate a temporary partial-clock-tree, which comprises a
`disjoint set the clusters (step 406). Specifically, the clustering
`is based primarily on the physical proximity of the compo(cid:173)
`nents, while also considering a set of user specified "hard"
`constraints. These user-specified constraints on a cluster can
`include but are not limited to: maximum/minimum capaci(cid:173)
`tance within a cluster and maximum/minimum fanout within
`a cluster.
`Note that the set of clusters in the temporary partial-clock- 65
`tree includes clusters from both the leaf-level and non-leaf-
`levels, and each cluster has an associated bounding box con-
`
`5
`
`8
`(registers or clock-buffers).
`taining a set of nodes
`Furthermore, each cluster is driven by a single common clock
`buffer.
`Next, the process extracts statistics from the set of clusters
`(step 408). These statistics form a new set of constraints on a
`cluster, which can include but is not limited to: a maximum/
`minimum bounding box size of a cluster, a bounding box
`shape ( e.g., a maximum aspect ratio) ofa cluster, a maximum/
`minimum capacitance within a cluster, and maximurn/mini-
`10 mum fanout