Lecture 8: Skew Tolerant Domino Clocking

Computer Systems Laboratory
Stanford University
horowitz@stanford.edu

Copyright 2001 by Mark Horowitz
(Original Slides from David Harris)

Introduction

- Domino circuits are becoming ubiquitous in high-speed digital ICs
  - Offer 30% (or more) speedup over static CMOS in raw gate delay
  - Dual-rail domino is becoming more common because many functions are nonmonotonic and area is less of an issue
- Nevertheless, traditional domino pipelines have significant overhead
  - A latch is required to hold the result while the next stage evaluates and the previous stage precharges
  - Skew budget, no time borrowing, latch delay
- Look at several ways to reduce this overhead
  - Better latches, self-timing
  - Skew-tolerant domino is a powerful new technique
- Evaluate the performance benefits of skew-tolerant domino

Domino from a System Perspective

- Domino doesn't look so attractive in the context of a traditional pipeline

[Figure: traditional domino pipeline clocked by clk and clk_b. Legend: Static = one inverting static gate; Domino = one inverting dynamic gate; Latch = inverting tristate latch]

1. Pay clock skew twice each phase
2. Balancing short phases is hard since there is no time borrowing
3. Latches become a significant fraction of the cycle time

Traditional Domino Performance Evaluation

- Let T = cycle time = 20 FO4 delays; $t_{\text{skew}} = 2$; $t_{\text{setup}} = 1$ [1]
- It is difficult to fill the cycle exactly (no time borrowing), so $t_{\text{imbalance}} = 1$
- $T_{\text{phase-logic}} = T/2 - t_{\text{skew}} - t_{\text{setup}} - t_{\text{imbalance}}$
- Baseline design:
  - $T_{\text{phase-logic}} = 20/2 - 2 - 1 - 1 = 6$
  - 40% of the phase is wasted in overhead! Slower than static!
- Optimized design:
  - Define clock domains and use $t_{\text{skew-local}} = 1$
  - Work hard to balance logic between phases: $t_{\text{imbalance}} = 0$ (optimistic)
  - $T_{\text{phase-logic}} = 20/2 - 1 - 1 - 0 = 8$
  - Still, 20% of the phase is overhead!

[1] In this situation the setup time must be large enough that the output has settled before the clock arrives, since the output might feed a dynamic gate on the next cycle and might not be monotonic.

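The phase budget above is simple arithmetic, and a minimal sketch can make it easy to try other skew or setup numbers. The function names below are illustrative and not from the slides; all quantities are in FO4 delays.

```python
def phase_logic_time(cycle, t_skew, t_setup, t_imbalance):
    """Logic time left in one phase of a traditional domino pipeline (FO4)."""
    return cycle / 2 - t_skew - t_setup - t_imbalance


def overhead_fraction(cycle, t_logic):
    """Fraction of a half-cycle phase that is not spent on useful logic."""
    half_cycle = cycle / 2
    return (half_cycle - t_logic) / half_cycle


# Baseline: global skew of 2, setup of 1, imbalance of 1.
baseline = phase_logic_time(cycle=20, t_skew=2, t_setup=1, t_imbalance=1)
# Optimized: local skew of 1 and (optimistically) perfect balance.
optimized = phase_logic_time(cycle=20, t_skew=1, t_setup=1, t_imbalance=0)

for name, t_logic in (("baseline", baseline), ("optimized", optimized)):
    print(f"{name}: {t_logic:g} FO4 of logic, "
          f"{overhead_fraction(20, t_logic):.0%} overhead")
```

With the slide's numbers this gives 6 and 8 FO4 of logic per phase, matching the 40% and 20% overhead figures.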
Early Enhancements

- Good designers have recognized this problem for years.
- The largest problem is the hard edges set by the latches. A variety of latches soften this edge:
  - Gate outputs are already _q1, so why use another clock? An SR latch will work instead.
  - Use the monotonic nature of the signal to feed it into a precharged latch stage.

[Figure: three latch circuits fed from domino: SR latch, dual-monotonic latch, and TSPC latch (clocked by φ)]

- There is still a problem if you want to use non-monotonic logic somewhere, since that logic must settle before the earliest clock, while the gate might not evaluate until a late clock.
- But if you only have monotonic gates...

Skew Tolerant Domino Clocking

- If inputs are all dual rail, then as long as the clock arrives before the data, the gate will wait and fire when the data arrives
- If the next gate fires before the current gate precharges, there is no need for a latch
  - Like the self-timed pipeline
- Can generate these properties using overlapping clocks

Skew-Tolerant Domino Circuits

- How much clock skew could we tolerate given N clock phases?
- Divide logic into N phases of T/N duration each
- Overlapping clocks eliminate the need for latches
- Extra overlap accommodates clock skew and time borrowing

[Figure: pipeline of domino gates clocked by overlapping phases φ1 and φ2]

- As with other domino techniques, budget skew on the transition from static to domino

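Skew Tolerance

- Each clock has an evaluation period $t_e$ and a precharge period $t_p$:
  - $T = t_e + t_p$
  - $t_p = t_{\text{prech}} + t_{\text{skew}}$
  - $t_e = T/N + t_{\text{skew}} + t_{\text{hold}}$
- Hence $t_{\text{skew-max}} = \left[ T(N-1)/N - t_{\text{prech}} - t_{\text{hold}} \right] / 2$

[Figure: timing diagram of clocks φ1 and φ2 with signals φ1a, φ1b, and φ2a, marking $t_p$ and the effective precharge window; annotation: $t_e$ must overlap by $t_{\text{hold}}$]
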
Numerical Example

- Let $t_{\text{prech}} = 4$, long enough to:
  - precharge the domino gate
  - make the subsequent skewed static gate fall below $V_t$
- $t_{\text{hold}}$ is slightly negative for reasonable cell libraries
  - the next phase can evaluate before precharge ripples through the static gate
  - conservatively bound $t_{\text{hold}}$ at 0

N    t_skew   t_p
2    2        6
3    3.33     7.33
4    4        8
6    4.66     8.66
8    5        9

- Sweet spots: N=2 (fewest clocks), N=4 (good tolerance, 50% duty cycle)

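The table can be checked against the skew tolerance formula $t_{\text{skew-max}} = [T(N-1)/N - t_{\text{prech}} - t_{\text{hold}}]/2$. The sketch below assumes a cycle time of T = 16 FO4; the slide does not state this value, but it is the one the table entries are consistent with. The function name is illustrative.

```python
def max_skew(cycle, n_phases, t_prech, t_hold=0.0):
    """Largest tolerable clock skew for N overlapping phases (FO4 delays)."""
    return (cycle * (n_phases - 1) / n_phases - t_prech - t_hold) / 2


CYCLE = 16.0   # assumption: inferred from the table, not stated on the slide
T_PRECH = 4.0  # precharge time from the slide
T_HOLD = 0.0   # conservative bound from the slide

print("N  t_skew  t_p")
for n in (2, 3, 4, 6, 8):
    skew = max_skew(CYCLE, n, T_PRECH, T_HOLD)
    t_p = T_PRECH + skew  # precharge period: t_p = t_prech + t_skew
    print(f"{n}  {skew:6.2f}  {t_p:5.2f}")
```

Under that assumption the computed values agree with the table; the N=6 row rounds to 4.67 and 8.67 where the slide truncates to 4.66 and 8.66.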
Global & Local Skew

- This is good, but we can do better!
- Local skew can be more tightly controlled than global skew (~1 FO4)
- Require that each phase of logic fit in a local clock domain:
  - $t_p = t_{\text{prech}} + t_{\text{skew-local}}$
  - $t_e = T/N + t_{\text{skew-global}} + t_{\text{hold}}$
- Hence $t_{\text{skew-global-max}} = T(N-1)/N - t_{\text{skew-local}} - t_{\text{prech}} - t_{\text{hold}}$
- When $t_{\text{skew-global}}$ gets huge, precharge interferes with the subsequent phase

N    t_skew-global   t_p
2    2               5
3    5.66            5
4    6               6
6    6               7.33
8    6               8

Time Borrowing

- We don't need such a large global skew tolerance!
- Use some of this time instead to allow time borrowing:
  - $t_{\text{borrow}} = T(N-1)/N - t_{\text{skew-global}} - t_{\text{skew-local}} - t_{\text{prech}} - t_{\text{hold}}$
- Intentional borrowing helps balance logic between phases
- Opportunistic time borrowing compensates for uncertainties in models, analysis tools, and processing
- If actual $t_{\text{skew-global}} = 2$ and $t_{\text{skew-local}} = 1$:

N    t_borrow   t_p
2    1          5
3    3.66       5
4    5          5
6    6.33       5
8    7          5

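A short sketch of the time borrowing budget, using the formula on this slide. As with the earlier numerical example, T = 16 FO4 is an assumption inferred from the table values rather than something the slide states; $t_{\text{prech}} = 4$ and $t_{\text{hold}} = 0$ are carried over from that example, and the function name is illustrative.

```python
def borrow_time(cycle, n_phases, t_skew_global, t_skew_local,
                t_prech, t_hold=0.0):
    """Time available for borrowing between phases (FO4 delays)."""
    return (cycle * (n_phases - 1) / n_phases
            - t_skew_global - t_skew_local - t_prech - t_hold)


CYCLE = 16.0  # assumption: inferred from the table, not stated on the slide

print("N  t_borrow")
for n in (2, 3, 4, 6, 8):
    t_borrow = borrow_time(CYCLE, n, t_skew_global=2.0, t_skew_local=1.0,
                           t_prech=4.0, t_hold=0.0)
    print(f"{n}  {t_borrow:5.2f}")
```

The borrowable time grows with the number of phases because more of the cycle is covered by overlapping evaluation windows, while the fixed skew and precharge costs stay the same.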
Other Design Issues

- State is no longer stored in the latch at the end of a phase
  - Instead, it is held by the first domino gate in the phase
  - Use a full keeper to allow stop-clock operation

[Figure: first φ2 domino gate with a weak full keeper, driven from the φ1 block]

- All systems with overlapping clocks require min-delay checks
  - Domino paths are presumably critical anyway, so there are few min-delay errors
  - 4-phase has effectively no min-delay risk
    - Overlap of all four phases is at most very small
    - A minimum of 8 gates are in the cycle anyway

Skew-Tolerant Performance Evaluation

- Evaluate the ALU self-bypass of a superscalar microprocessor (like DEC Alpha)
- 3-metal 0.6 µm process
  - FO4 delay in TT corner = 138 ps
- Compare traditional domino to 4-phase skew-tolerant domino

[Figure: ALU self-bypass loop, drawn once for the traditional design and once for the skew-tolerant design: x4 bypass mux, x2 add/sub, 64-bit adder, and result mux, with 1 mm, 1 mm, and 2 mm wires, a path to the data cache, and other ALU blocks (150 fF)]

Simulation Results

- No skew:
  - Traditional domino: latency = 13.0 FO4, cycle time = 16.6 FO4
    - Cycles are unbalanced; no time borrowing available
  - Skew-tolerant domino: latency = 11.9 FO4, cycle time = 11.9 FO4
    - Remove latches from critical path, balance pipe stages
- 1 FO4 local skew:
  - Traditional domino: latency = 15.0 FO4, cycle time = 17.6 FO4
    - Skew adds to both phases for latency
    - Unbalanced second stage already has margin in the cycle time
  - Skew-tolerant domino: latency = 11.9 FO4, cycle time = 11.9 FO4
    - Skew is tolerated

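To put these results in absolute terms, the sketch below converts the FO4 figures to picoseconds using the 138 ps FO4 delay quoted for the TT corner, and computes the cycle-time ratio between the two styles. The dictionary layout and function names are illustrative; the numbers are copied from the results above.

```python
FO4_PS = 138.0  # FO4 delay in the TT corner, in picoseconds

# (latency, cycle time) in FO4 delays for each clocking style.
RESULTS = {
    "no skew": {
        "traditional": (13.0, 16.6),
        "skew_tolerant": (11.9, 11.9),
    },
    "1 FO4 local skew": {
        "traditional": (15.0, 17.6),
        "skew_tolerant": (11.9, 11.9),
    },
}


def fo4_to_ps(delay_fo4):
    """Convert a delay in FO4 units to picoseconds."""
    return delay_fo4 * FO4_PS


def cycle_speedup(case):
    """Ratio of traditional to skew-tolerant cycle time for one case."""
    traditional_cycle = RESULTS[case]["traditional"][1]
    skew_tolerant_cycle = RESULTS[case]["skew_tolerant"][1]
    return traditional_cycle / skew_tolerant_cycle


for case, styles in RESULTS.items():
    for style, (latency, cycle) in styles.items():
        print(f"{case:17s} {style:14s} latency {fo4_to_ps(latency):7.1f} ps, "
              f"cycle {fo4_to_ps(cycle):7.1f} ps")
    print(f"{case:17s} cycle-time ratio {cycle_speedup(case):.2f}x")
```

The ratio is larger in the skewed case because the traditional cycle time grows with skew while the skew-tolerant cycle time does not.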
Summary

- Skew-tolerant domino eliminates most of the overhead found in traditional domino systems:
  - Tolerates clock skew
  - Removes latches from the critical path
  - Allows time borrowing
  - Robust
- It offers most of the benefits of self-timed designs while preserving the simplicity of a synchronous methodology.
- Clock generation & distribution becomes the key issue. However, control generation and distribution can be just as tough in self-timed designs.
- High-performance microprocessor designs have used these ideas, but they don't talk about them.