# DUAL NODE PULSE DOMINO TECHNIQUE FOR BUFFER CIRCUIT IN LOW POWER MEMORY ARRAYS

<sup>1</sup>C.Deepika, <sup>2</sup>MA.Sohana Parveen, <sup>3</sup>Poonam Swami <sup>1</sup>Assistant Professor, <sup>2</sup>Assistant Professor, <sup>3</sup>Assistant Professor Electronics and Communication Engineering Department K.G.Reddy College of Engineering and Technology, Hyderabad, India

Abstract - In wide fan-in dynamic multiplexers, the two-phase evaluate-precharge operation leads to high switching activity at the dynamic and output nodes, introducing a significant power penalty. Switching-aware design techniques are being explored to address this issue, but the existing techniques suffer from design inflexibility. Dynamic gates are also inherently less resistant to noise than static CMOS gates. In this paper, we propose a pulse domino switching-aware technique, called SSPD, that reduces the overall power consumption of a wide fan-in dynamic gate by providing static-like switching behavior at the dynamic node and at the gate input/output terminals. A conditional pulse generator is also proposed, which enables SSPD multiplexers to be easily adapted to a wide set of noise and delay specifications. Simulation results for 16-bit and 32-bit dynamic multiplexers designed in a 1.2-V 90-nm CMOS process show that the SSPD technique can reduce the average power by up to 21% and 36%, respectively, compared with the conventional footless domino technique. The circuit can easily be designed to meet a wide range of power dissipation and noise specifications.

#### Index Terms- Dynamic gates, Pulse domino, Switching behavior, Power consumption, Delay

#### I. INTRODUCTION

Register files (RFs) are performance-critical memory components in general-purpose microprocessors. They usually require multiple read/write ports to enable simultaneous access by several execution units in a superscalar architecture. This requirement, coupled with the demand for a large number of word entries per port, forces the use of wired-OR style dynamic circuits for their local and global bitlines [1]. A bitline (BL) read operation on a register file with 2<sup>N</sup> registers requires a dynamic multiplexer structure with 2<sup>N</sup> parallel inputs. Dynamic circuits offer compactness and higher speed compared with static CMOS circuits. However, aggressive scaling of device and interconnect dimensions and of power supplies in the deep-submicron region has further degraded the reliability of dynamic circuits [3]. Noise in digital integrated circuits refers to any phenomenon that causes the voltage at a node to deviate from its nominal value [1]. Noise phenomena have always existed and have always affected the performance of dynamic circuits, but technology scaling has made their effects much more severe. In the deep-submicron region, the main noise sources are crosstalk, leakage current, charge sharing and supply voltage variations. Leakage current increases exponentially as device dimensions are scaled down. Dynamic logic circuits are affected by noise much more than static CMOS circuits because they have a lower switching threshold voltage, which is equal to the threshold voltage of the pull-down NMOS devices.

In contrast, the switching threshold voltage of a static CMOS logic circuit is around half the supply voltage. Dynamic multiplexers also require a strong keeper to compensate for the cumulative leakage from the parallel evaluation paths, which increases the read access time. Therefore, the bitlines typically have a hierarchical organization in which they are partitioned into local and global bitlines (LBLs and GBLs), with the latter driving the output [3], [8]. Both the LBLs and GBLs nevertheless remain susceptible to the noise problem intrinsic to the exponential increase in sub-threshold leakage. Increasing the size of the keeper is no longer considered a viable option for improving bitline noise immunity [9], and so several alternative ways of dealing with noise have been proposed [3], [9]. Their common goal is to achieve high noise immunity.




Fig. 1. Local bitline (LBL) organization of the read port of a register file (RF) using a conventional n-bit footless dynamic multiplexer, with its input and output switching waveforms.

In addition to low noise immunity, bitline charging and discharging in wide fan-in dynamic multiplexers dissipates a significant portion of the power used by a register file, which makes it a good target for new low-power designs. While a static gate consumes switching power only when a toggling event occurs at its output, the switching power of a dynamic gate depends on its output state [9]. If the probability of a rising transition at the input is high, which is usually the case for a high fan-in structure such as a dynamic multiplexer in an RF read port, the switching activity approaches that of the clock. Because the dynamic node carries a large capacitance (Fig. 1) from the bitline interconnect loading and from the parasitic diffusion capacitances of the pull-down network, this high switching activity significantly increases the switching power. In addition, as shown in Fig. 1, dynamic operation requires all the read wordline (RWL) inputs to be driven by clocked drivers, which use more energy than static buffers.

## **II. PREVIOUS WORKS**

Switching-aware design techniques [2], [4] have previously been proposed to tackle this excessive switching and the related overheads of wide fan-in dynamic multiplexers. Limited switch dynamic logic (LSDL) [6], [9] adds a latch structure at the gate output [Fig. 2(a)]. This eliminates redundant switching, but only at the output; the dynamic node, with its large capacitive loading, still has a high switching rate. Thus LSDL fails to produce truly static switching behavior. The single-phase SP-Domino technique [Fig. 2(b)] aims to achieve static input and output characteristics. With static input characteristics, the clocked wordline drivers can be replaced by static buffers, which are more energy efficient. It has a clock-delayed [2] single-phase mode of operation, in which both pull-up and pull-down of the dynamic node occur during the evaluation phase. The reduction in switching at the dynamic and output nodes resulting from this static-like behavior saves considerable power [7].



Fig. 2. Switching-aware techniques: (a) limited switch domino and (b) single-phase SP-Domino.

However, the SP-Domino design uses the same transistor, M1, to perform both the pull-up and keeper operations. Equalizing the rise and fall delays of the gate requires M1 to have a particular width, which fixes the delay and noise design points and precludes any tuning of performance [3]. To overcome these drawbacks, we recently proposed two different dynamic logic styles in [8], [2] and verified their correct operation through transistor-level schematic simulations of individual logic gates. In this work, we make the following previously unpublished contributions: in Section III, through a discussion of the common principles of operation of the new techniques, we describe how adopting dual dynamic nodes helps to simultaneously overcome the problems of high power dissipation, sub-threshold leakage and poor noise immunity.

## III. PROPOSED DOMINO TECHNIQUE

High sensitivity to noise and large switching power are the two main limitations of wide fan-in dynamic multiplexers. We now describe two dual dynamic node bitline techniques that simultaneously achieve high noise immunity and reduced switching power while maintaining high performance when employed in the read ports of register files.

#### 3.1. Static-Switching Pulse Domino (SSPD)

In this section, we introduce the SSPD technique, which achieves a static switching factor like SP-Domino but avoids its inflexibility by offering tunable delay and noise performance. The schematic and simulation waveforms of the proposed static-switching pulse domino (SSPD) gate are shown in Fig. 3. Like an SP-Domino gate, it is a clock-delayed footless domino gate with static input/output characteristics. However, to avoid the design constraints introduced by combining the keeper and pull-up actions, we separate the pull-up transistor (M1) from the keeper transistor (M2). This enables the use of a conditional pulse generator (CPG), which turns on M1 during evaluation only if the dynamic node was discharged during the previous cycle. If the dynamic node was not discharged, M1 is not turned on and the value is maintained by the keeper transistor M2, which forms a half-latch with the output inverter. Consequently, the switching factor of the internal nodes of the pulse generator becomes output-state dependent (consuming power only when the output is in logic state '1'), which helps to reduce the power overhead of the pulse generator block.

In conventional domino design, the keeper ratio (K) is the most important design parameter in determining the gate's delay performance and noise robustness. Since SSPD has an additional transistor, M1, dedicated to the pull-up function, a second design parameter, the width of M1, must be considered together with the keeper ratio to characterize the gate's performance. We also employ a clocked isolation transistor M4 to separate the drain terminal of the pull-down network, which has a large capacitive loading (DYN2), from the main dynamic node (DYN1), which is inversely coupled to the output.

The purpose of the isolation transistor in the SSPD gate is to shield the large parasitic capacitance at DYN2 (due to the wide pull-down network) from M1 during a pull-up operation. Consider a situation where both DYN1 and DYN2 were discharged to logical ground in the previous evaluation cycle. At the start of the next clock cycle, if the pull-down network is off, the pull-up transistor M1 will evaluate DYN1 to the logical high state. In an SP-Domino gate, the pull-up device has to be sized to charge the large capacitance on the dynamic node. Here, by contrast, most of M1's initial current drive is used to quickly charge the much smaller capacitance on DYN1, because the current drained by the isolation transistor MN1 is limited by its near-zero drain-to-source voltage. Thus the sizing constraint on the pull-up device to equalize the high-to-low delay of the gate with its low-to-high delay is much relaxed. In addition, the voltage swing on DYN2 is reduced by $V_{TN}$ (the nMOS threshold voltage), leading to additional power savings. Note also that MN2 is only a minimum-sized nMOS keeper for the node DYN2.

Further, the pulse generator is made conditional by generating two additional clock phases, CLKd and CLKi. CLKd behaves as the delayed version of the clock and CLKi as the inverse of the clock only if the main dynamic node (DYN1) was discharged during an evaluate cycle, which makes a pull-up operation in the ensuing cycle probable. If, however, the dynamic node is maintained high, CLKd and CLKi are pulled down to the low logic state (using feedback from DYN1) half a clock period apart (CLKd is pulled down only at the next negative clock edge).



Fig. 3. Dynamic multiplexer implemented with the SSPD technique.

Thus, no pulse is generated at the output of the pulse generator during the next cycle. The pulse generator is therefore off, and no unnecessary switching activity is seen on its internal nodes. If the pull-down network turns on during the next cycle, it faces contention only from the keeper transistor M2 and not from the turned-off M1. Thus, the keeper ratio, as in a conventional domino gate, affects only the low-to-high delay of the gate and the noise robustness.

Now consider the case when DYN1 is evaluated to the low state during a clock cycle and then pulled high by M1 during the next cycle. Since M2 and the NMOS evaluation network are off, the speed of the pull-up is determined only by the size of MP1 (assuming that MN1, like the evaluation transistors, has a fixed size). Thus, the gate's fall delay can be tuned independently by modifying only the width of M1. The action of CLKd and CLKi also extends the pulse width to nearly the on-period of the clock during a pull-up operation. This is made possible by turning on Path 2 and turning off Path 1 in the gate G1 of the pulse generator. The extended pulse width further relaxes the design constraint on MP1.

The design of an SSPD gate can thus be accomplished in two simple steps. In the first step, M2 is sized to achieve the keeper ratio that meets a particular noise target and delay performance. In the second step, MP1 is sized to equalize the gate's high-to-low delay with the low-to-high delay (determined by K). The two steps are independent and afford the designer the flexibility of designing for a wide set of specifications. The delay and unity noise gain (UNG) of a 16-bit SSPD multiplexer were characterized for keeper ratios between 0.1 and 1.
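The cycle-level behavior described above can be illustrated with a small behavioral sketch. It is not a circuit simulation: it assumes an idealized gate in which DYN1 is either fully high or fully low at the end of each cycle, ignores DYN2 and the CLKd/CLKi timing, and uses hypothetical function names. It only counts how often the main dynamic node toggles and how often the conditional pulse generator fires, compared with a conventional footless domino gate that precharges every cycle.

```python
def conventional_dyn_transitions(pulldown_on):
    """Footless domino: the dynamic node is precharged every cycle and
    discharged whenever the pull-down network conducts, so each conducting
    cycle costs two transitions (a fall, then a precharge rise)."""
    return 2 * sum(1 for on in pulldown_on if on)


def sspd_dyn_transitions(pulldown_on):
    """Idealized SSPD-like behavior (an assumption-based sketch).

    DYN1 keeps its value between cycles. The conditional pulse generator
    fires only if DYN1 was low at the end of the previous cycle.
    Returns (number of DYN1 transitions, number of pulses generated)."""
    dyn1_high = True           # assume DYN1 starts charged
    transitions = 0
    pulses = 0
    for on in pulldown_on:
        pulse = not dyn1_high  # pulse enabled only after a discharge
        pulses += int(pulse)
        if on:
            next_high = False      # pull-down network wins any contention
        elif pulse:
            next_high = True       # pull-up transistor M1 restores DYN1
        else:
            next_high = dyn1_high  # keeper M2 holds the high level
        transitions += int(next_high != dyn1_high)
        dyn1_high = next_high
    return transitions, pulses


if __name__ == "__main__":
    # Hypothetical read pattern: 1 means the pull-down network conducts.
    pattern = [1, 1, 1, 0, 0, 1, 0, 0]
    print("conventional:", conventional_dyn_transitions(pattern))
    print("SSPD (transitions, pulses):", sspd_dyn_transitions(pattern))
```

For the example pattern, the conventional gate toggles its dynamic node 8 times, while the idealized SSPD model toggles DYN1 4 times and generates 4 pulses, which is the static-like behavior the text attributes to the technique.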

If we consider only the dynamic node capacitances, the switching power of an SSPD gate can be written as

$$\begin{aligned} P_{\rm DYN,SSPD} &= P_{\rm MUX} + P_{\rm CPG} + P_{\rm SC} + P_{\rm CLK} \\ &= \left[\tfrac{1}{2}\alpha C_{dyn2}V_{dd}^{2}f_{clk} + \tfrac{1}{2}\alpha C_{dyn1}(V_{dd} - V_{th,N})V_{dd}f_{clk}\right] + P_{\rm CPG} + \Pr\{1\}\,I_{sc,AVG}V_{dd}f_{clk} + P_{\rm CLK} \end{aligned} \qquad (1)$$

where $P_{\rm MUX}$ and $P_{\rm CPG}$ are, respectively, the power dissipated in the dynamic multiplexer (excluding the CPG) and in the pulse generator.
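As a rough numerical illustration of Eq. (1), the sketch below evaluates each term for a set of inputs. It assumes that both capacitive terms scale with the clock frequency and takes the short-circuit term literally as written; for the result to come out in watts, `i_sc_avg` would have to be read as a short-circuit charge per cycle, which is an interpretation and not something the text states. The function name, argument names and all numeric values are hypothetical.

```python
def sspd_dynamic_power(alpha, c_dyn1, c_dyn2, vdd, vth_n, f_clk,
                       p_cpg, pr_one, i_sc_avg, p_clk):
    """Evaluate the terms of Eq. (1) and return them as a dictionary.

    alpha    : switching factor of the dynamic nodes
    c_dyn1/2 : capacitances on DYN1 and DYN2
    vdd      : supply voltage, vth_n: nMOS threshold voltage
    f_clk    : clock frequency
    p_cpg    : pulse generator power (supplied by the caller)
    pr_one   : probability that the output is logic 1
    i_sc_avg : average short-circuit term, used exactly as in Eq. (1)
    p_clk    : clock power (supplied by the caller)
    """
    # Capacitive switching of the two dynamic nodes (bracketed term).
    p_mux = 0.5 * alpha * f_clk * (
        c_dyn2 * vdd ** 2 + c_dyn1 * (vdd - vth_n) * vdd
    )
    # Output-state-dependent short-circuit (contention) term.
    p_sc = pr_one * i_sc_avg * vdd * f_clk
    total = p_mux + p_cpg + p_sc + p_clk
    return {"p_mux": p_mux, "p_cpg": p_cpg, "p_sc": p_sc,
            "p_clk": p_clk, "total": total}


if __name__ == "__main__":
    # Purely illustrative numbers in normalized units, not simulation data.
    terms = sspd_dynamic_power(alpha=0.5, c_dyn1=2.0, c_dyn2=4.0, vdd=1.0,
                               vth_n=0.25, f_clk=2.0, p_cpg=0.3,
                               pr_one=0.5, i_sc_avg=0.1, p_clk=0.2)
    for name, value in terms.items():
        print(f"{name}: {value:.3f}")
```

Returning the individual terms rather than only the total makes it easier to see which contribution dominates for a given output-state probability.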

## **IV. SIMULATION RESULTS**


Sixteen-input footless dynamic multiplexers were simulated in a 1.2-V 90-nm industrial CMOS process using the Tanner EDA 14.11 tool. The average power consumption of an SP-Domino gate optimized for equal rise and fall delays is compared, for different output state probabilities, with that of a conventional footless domino gate (under equal-UNG conditions) and of the proposed SSPD gate (under equal-delay and equal-UNG conditions). To account for the overhead of clocked transistors, the power consumption of the local clock buffer is included in the power measurements, while that of the input buffer is excluded. The evaluation transistors of the pull-down network are sized equally for all three designs. Simulations are performed by varying the output state probability (*Pout*(1)) between 0.1 and 1.








For each value of *Pout*(1), the maximum possible value of the input switching factor (which is equal to the output switching factor, $\alpha$, for SP-Domino and SSPD) is chosen so as to obtain the maximum power dissipation. For example, for *Pout*(1) equal to 0.5, where $\alpha$ can assume a value of either 0.2 or 1, the input is varied to give an $\alpha$ value of 1. Similarly, $\alpha$ is 0.2 for *Pout*(1) equal to 0.1 and 0.9, 0.4 for *Pout*(1) equal to 0.2 and 0.8, and so on. Power was measured with 2FO4 and 3FO4 output loads. When *Pout*(1) is less than 0.5, the SSPD gate has a power consumption similar to that of the same-UNG conventional gate. However, the power advantage due to the static-switching behavior becomes apparent for output state probabilities greater than 0.5. For equal noise robustness and *Pout*(1) greater than 0.5, the SSPD gate offers an 18-35% power reduction for a 2FO4 load and around a 20-44% power reduction for a 3FO4 load compared with the conventional domino gate. This is because $\alpha$ is greater than *Pout*(1) when *Pout*(1) is less than 0.5 but starts decreasing for larger values of *Pout*(1). Since the capacitive power consumption of an SSPD gate depends on $\alpha$, this leads to a power reduction as well. Note also that although the $\alpha$ values are the same for *Pout*(1) equal to 0.2 and 0.8, the power demand in the latter case is higher because of the larger power consumption of the pulse generator and the contention currents, both of which are output-state dependent. Because of the reduced activity at the output, the power advantage also increases with larger output loads.

The variation of average power with keeper ratio was also examined for the SSPD gate. With increasing K, the size of MP2 and the contention current due to it increase, while the size of MP1 and the contention due to it decrease.
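The worst-case switching factors quoted above (0.2, 0.4 and 1 for *Pout*(1) of 0.1/0.9, 0.2/0.8 and 0.5) follow a simple rule, sketched below under the assumption that $\alpha$ counts output transitions per clock cycle: a node that spends a fraction p of the cycles high can toggle at most 2·min(p, 1−p) times per cycle on average. The second function is an idealized comparison that ignores the pulse generator, contention and clock power and assumes equal voltage swing, so it overstates the achievable saving. Both function names are hypothetical.

```python
def max_switching_factor(p_out_one):
    """Largest average number of output transitions per clock cycle for a
    static-like node that is high for a fraction p_out_one of the cycles."""
    if not 0.0 <= p_out_one <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # Every excursion into the rarer state costs one transition in and one out.
    return 2.0 * min(p_out_one, 1.0 - p_out_one)


def capacitive_energy_ratio(p_out_one):
    """Idealized dynamic-node switching energy of a static-switching gate
    relative to a conventional domino gate.

    Assumptions (not from the simulations): the static-like node is recharged
    alpha / 2 times per cycle, while the conventional dynamic node is
    recharged in every cycle in which the output was 1."""
    if not 0.0 < p_out_one <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return (max_switching_factor(p_out_one) / 2.0) / p_out_one


if __name__ == "__main__":
    for p in (0.1, 0.2, 0.5, 0.8, 0.9):
        print(f"Pout(1)={p:.1f}  alpha_max={max_switching_factor(p):.2f}  "
              f"ideal energy ratio={capacitive_energy_ratio(p):.2f}")
```

In this idealized view the ratio stays at 1 for *Pout*(1) up to 0.5 and falls below 1 above it, which is consistent with the trend reported in the text: similar power for low output-state probabilities and a growing advantage for higher ones.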
For lower output state probabilities, contention due to MP2 is dominant, and hence the average power increases with increasing K. For higher output state probabilities, however, contention due to MP1 becomes more frequent, and the power therefore follows a decreasing trend with increasing K. The power performance of SP-Domino is marginally better (by about 5-12%) than that of the SSPD gate. This can be explained by its use of a simpler pulse generator, which contributes around 21% of the total gate power at the highest value of *Pout*(1); the corresponding value for the SSPD gate is closer to 25%. The simulation results therefore show that while both the SP-Domino and SSPD techniques offer significant power reductions for biased output states (*Pout*(1) > 0.5, which is usually the case for high fan-in gates), the SSPD gate has the important advantage of being easily modified for a particular delay or noise performance.

The three designs are also analyzed for process variations by performing 500-point Monte Carlo simulations with the standard deviation of the threshold voltage variations set to 1%, 5% and 10% of the mean value. The average delay and its variation, and the average power values, are shown in Table 1. The variation in power is found to be negligible and is omitted. Since both the SSPD and SP-Domino gates were designed to have a sufficiently wide pulse width to account for variations, the delay spread of both techniques is similar to that of the conventional scheme, and the pulse generator is found not to increase the performance variability.

#### V. PERFORMANCE COMPARISONS OF DYNAMIC MULTIPLEXERS

Using the conventional domino and SSPD circuit techniques, we designed and simulated 16-bit and 32-bit dynamic multiplexers in a 1.2-V, low-, 65-nm CMOS process. The conventional domino multiplexers were simulated with two different keeper ratios: a small keeper (1% keeper ratio) provides a reference for high performance, and a large keeper (7% keeper ratio) provides a reference for good noise tolerance. We found that a 2% keeper can be used with the SSPD multiplexer to achieve similar robustness (the same UNG) against noise as a conventional multiplexer with a 7% keeper, which demonstrates the improved noise tolerance of the SSPD topology. Since the SP domino technique only requires keeper transistors of minimum size, it has a very small keeper ratio of about 0.07%.

## TABLE I

# Simulated Delay And Average Power Values Of 16-Bit And 32-Bit Conventional And SSPD Multiplexers At The Nominal Process Corner

|                          | 16-bit delay [ps] | 32-bit delay [ps] | 16-bit power [mW], Pr{1}=0.5 | 32-bit power [mW], Pr{1}= |
|--------------------------|-------------------|-------------------|------------------------------|---------------------------|
| Conventional (1% keeper) | 60.4              | 95.1              | 0.087                        | 0.175                     |
| Conventional (7% keeper) | 98.3              | 151.4             | 0.129                        | 0.235                     |
| SSPD (2% keeper)         | 77.5              | 85.3              | 0.105                        | 0.176                     |
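To make the Table I entries easier to compare, the short sketch below computes the relative change of the SSPD multiplexer against each conventional reference. The numbers are copied from Table I; the dictionary keys and function names are hypothetical and chosen only for this illustration.

```python
# Values transcribed from Table I (delay in ps, power in mW).
TABLE_I = {
    "conventional_1pct": {"delay16_ps": 60.4, "delay32_ps": 95.1,
                          "power16_mw": 0.087, "power32_mw": 0.175},
    "conventional_7pct": {"delay16_ps": 98.3, "delay32_ps": 151.4,
                          "power16_mw": 0.129, "power32_mw": 0.235},
    "sspd_2pct":         {"delay16_ps": 77.5, "delay32_ps": 85.3,
                          "power16_mw": 0.105, "power32_mw": 0.176},
}


def relative_change(new, ref):
    """Fractional change of `new` with respect to `ref` (negative = lower)."""
    return (new - ref) / ref


def compare(design, reference, table=TABLE_I):
    """Relative change of every metric of `design` versus `reference`."""
    return {metric: relative_change(table[design][metric],
                                    table[reference][metric])
            for metric in table[design]}


if __name__ == "__main__":
    for ref in ("conventional_1pct", "conventional_7pct"):
        print(f"SSPD (2% keeper) vs {ref}:")
        for metric, change in compare("sspd_2pct", ref).items():
            print(f"  {metric}: {change * 100:+.1f}%")
```

Against the 7% keeper reference, which the text identifies as having the same UNG, every entry for the SSPD multiplexer comes out lower; against the 1% keeper reference, only the 32-bit delay is lower.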

The read performance measurement results for the register files are summarized in Table II.

#### TABLE II

## Performance Summary Of 0.5-Kbit Conventional, FVFD And SSPD Register Files In 1.2-V 65-nm Low- CMOS Technology

|              | Keeper ratio | Maximum frequency [GHz] | Average read power [mW/GHz] |
|--------------|--------------|-------------------------|-----------------------------|
| Conventional | 0.07%        | 1.8                     | 0.8                         |
| SSPD         | 2%           | 2.5                     | 1.95                        |

Compared with a conventional register file employing an upsized keeper (7% keeper ratio), whose memory core has a read power dissipation of 2.94 mW/GHz at 1.2 V, the SSPD register file's memory core uses 71% less read power. The SSPD register file's high speed confirms the results in Section IV, where a 16-bit SSPD multiplexer was slightly slower than a conventional multiplexer with a 1% keeper but a 32-bit SSPD multiplexer was faster owing to the split bitline. Further, since we measure power over several read cycles in which the same data word is read from the same location,

# VI. CONCLUSION

In this paper, we proposed a static-switching pulse domino technique that uses a conditional pulse generator and an isolation transistor to remove the inflexibility of an SP-Domino gate while retaining the power advantages of static-switching behavior. To reduce switching power, we have introduced dual dynamic nodes into the register file bitline read-out circuits. These dual dynamic node techniques achieve high noise immunity, leakage tolerance and reduced switching power without significantly affecting performance. Measurement results from 0.5-Kb register files implemented in a 65-nm CMOS technology suggest that these techniques can be a promising choice for register files in very deep sub-micrometer technologies.

# REFERENCES

- [1] R. Singh, G.-M. Hong, and S. Kim, "Bitline techniques with dual dynamic nodes for low-power register files," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 60, no. 4, Apr. 2013.
- [2] K. L. Shepard and V. Narayanan, "Noise in deep submicron digital design," in *Proc. IEEE Int. ASIC/SOC Conf.*, 1996, pp. 524-531.
- [3] K. Bernstein, K. M. Carrig, C. M. Durham, P. R. Hansen, D. Hogenmiller, E. J. Nowak, and N. J. Rohrer, *High Speed CMOS Design Styles*, Kluwer Academic Publishers, 1999.
- [4] C. J. Akl and M. A. Bayoumi, "Single-phase SP-domino: a limited-switching dynamic circuit technique for low-power wide fan-in logic gates," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 55, no. 2, pp. 141-145, Feb. 2008.
- [5] R. Montoye, *et al.*, "A double precision floating point multiply," in *Proc. IEEE Int. Solid-State Circuits Conf.*, 2003, pp. 336-337.
- [6] J. Sivagnaname, H. C. Ngo, K. J. Nowka, R. K. Montoye, and R. B. Brown, "Wide limited switch dynamic logic implementations," in *Proc. IEEE Int. Conf. on VLSI Design*, 2006.
- [7] G. Yee and C. Sechen, "Clock-delayed domino for dynamic circuit design," *IEEE Trans. Very Large-Scale Integr. (VLSI) Syst.*, vol. 8, no. 4, pp. 425-430, Aug. 2000.
- [8] H. Mahmoodi-Meimand and K. Roy, "Diode-footed domino: a leakage-tolerant high fan-in dynamic circuit design," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 51, no. 3, pp. 495-503, Mar. 2004.
- [9] C. J. Akl and M. A. Bayoumi, "Single-phase SP-domino: a limited-switching dynamic circuit technique for low-power wide fan-in logic gates."

