Static Timing Analysis Basics

Preface#

This note introduces only the essential concepts of Static Timing Analysis (STA). It does not cover:

Asynchronous timing checks, e.g. removal and recovery
Timing exceptions, e.g. false paths and multi-cycle paths
Advanced timing topics
- POCV, MCMM, etc.

What is STA#

As the clock frequency increases, the logic in a chip can complete more operations per unit of time, so frequency is positively correlated with chip performance. Chip design requires a trade-off among power, performance, and area (PPA), so how can we know the maximum frequency at which a chip can operate correctly? This is where STA (Static Timing Analysis) comes in.

STA is used to verify whether the design can safely operate at a given clock frequency without timing violations. STA has the following characteristics:

Pros
- No need for input stimulus simulation
- Comprehensive timing checks
Cons
- Cannot handle asynchronous timing

STA Application Scenarios#

STA can be applied at multiple stages of physical design (PD), with different characteristics at each stage, such as:

Synthesis: At the logic design stage, since there is no physical information related to layout, it can be assumed that interconnects are in an ideal state. This stage focuses more on examining the logic that leads to the worst path. Another technique used at this stage is the wire load model to estimate the length of interconnects; the wire load model provides an estimated RC value based on the fan-out of logic units.
Pre-CTS: At the beginning of physical design, the clock tree is considered ideal, meaning it has zero delay. After CTS, the clock has actual propagation delay.
Pre-Route: Before actual routing, STA uses estimated parasitic RC values of the metal wires to calculate interconnect delay.

Cell#

Cells can be standard cells, IO buffers, or complex IPs like USB cores. In addition to timing information, the library cell description also includes some other attributes, such as cell area and functionality, which are unrelated to timing but are used during RTL synthesis.

Pin Capacitance#

Each input and output of a cell can specify capacitance at the pin. In most cases, only the input pins of the cell specify capacitance, while output pins do not, meaning that the output pin capacitance in most cell libraries is 0.

In the most basic format, the pin capacitance of an input pin such as INP1 is specified as a single value (for example, 0.5 units). The capacitance unit is usually picofarads (pF) and is typically specified at the beginning of the library file. The cell description can also specify separate rise_capacitance (for example, 0.5 units) and fall_capacitance (for example, 0.45 units) values, which apply when the level at pin INP1 rises and falls, respectively. The rise_capacitance and fall_capacitance values can also be specified as ranges, with lower and upper limits given in the description.

Drive Strength#

Input pin capacitance is defined in the Liberty library, while the load capacitance on an output pin is determined by all the downstream cells driven by that cell (plus the interconnect). When a CMOS cell switches state, the switching speed depends on how quickly the capacitance on the output pin can be charged and discharged.

Generally, the drive strength of a cell determines the maximum capacitive load it can drive, and the maximum capacitive load determines the maximum number of fan-outs, i.e., how many other cells it can drive. Higher output drive corresponds to lower output pull-up/pull-down resistance, allowing the cell to charge and discharge larger loads on the output pin.

The larger the drive strength, the larger the cell area, and the larger the max_cap.
The larger the drive strength, the smaller the corresponding output resistance, and the smaller the delay.

If the standard cell library only contains standard logic cells with small drive strengths, what impact does it have on timing?
- When the entire library consists of small-drive cells, every cell has weak driving capability and a relatively large output resistance.
- An inverter with a small drive strength can only drive a small maximum load capacitance. If certain nodes in the design must drive larger capacitances, such as long wires or high fan-out nets, small-drive cells may not meet the requirement: transitions become slow and delays grow, which can lead to max_transition/max_capacitance violations and, ultimately, to setup or hold violations.

Propagation Delay#

The propagation delay of a cell is defined between measurement points on the switching waveforms. These thresholds are specified as a percentage of Vdd (the supply voltage); most standard cell libraries use the 50% threshold for delay.

Propagation delay comes in two types, which are generally not equal, depending on whether the output rises or falls:

output rise delay: For an inverting cell, the delay from when the falling input crosses its threshold to when the rising output crosses its threshold.
output fall delay: The converse: the delay from when the rising input crosses its threshold to when the falling output crosses its threshold.

Slew#

The definition of slew rate is the rate of voltage change. In STA, the rise or fall waveform is usually measured based on the speed of level transitions. Slew is typically defined based on transition time, which refers to the time required for a signal to transition between two specific levels. Note that the transition time is actually the inverse of the slew rate, so the larger the transition time, the lower the slew rate, and vice versa.

A specified threshold voltage is generally used to define the starting and ending points for transition time calculations.

Note that slew and slew rate are not the same thing: in STA, slew refers to the transition time, while slew rate is its inverse.

Timing Arc#

Timing arcs describe the delay of signal transmission between cell pins and the signal transition conditions.

For combinational logic units like AND gates, OR gates, NAND gates, and adders, there is a timing arc from each input pin to each output pin.
For sequential logic units like flip-flops, in addition to the timing arc from the clock pin to the output pin, there are also timing constraints for data pins relative to the clock pin.

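Each timing arc has a specific timing sense, which indicates how the output changes in response to different transition types of the input. In a positive unate arc, a rising input causes the output to rise (or stay unchanged) and a falling input causes it to fall (or stay unchanged), as in an AND gate; in a negative unate arc the output moves in the opposite direction, as in a NAND gate or inverter. In a non-unate timing arc, the transition direction at one input pin alone cannot determine how the output will change; it also depends on the state of the other input pins, as in an XOR gate.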

Timing Model#

The timing model of a logic unit aims to provide accurate timing information for various instances of units in the design.

Each timing arc has a timing model.
The timing model is obtained from detailed circuit simulation.

For an inverter, there are two types of delays: the output rise delay $T_{r}$ and the output fall delay $T_{f}$.

The delay and output transition through the inverter mainly depend on:

Output load, i.e., the capacitive load on the inverter's output pin.
Transition time of the input signal.
Transistor layout design: negligible.

The signal input of a logic unit is like water flowing into a tank; the water first drives the blue water wheel (similar to input transition time), and then fills the tank (output capacitance) before it can drive the red water wheel (the next logic unit).

Delay is directly related to load capacitance: the larger the load capacitance, the larger the delay. In most cases, delay also increases as the input transition time increases, although this does not always hold.

NLDM#

The timing model of a logic unit can be simply understood as a function with input slew and output load as parameters, but simple linear timing models are not accurate when applied to submicron technology. Therefore, most cell libraries currently use more complex non-linear delay models (NLDM).

Most cell libraries include table models to specify delays for various timing arcs of the unit and perform timing checks. These table models are called NLDM (Non-Linear Delay Model) and can be used for delay, output slew calculations, or other timing checks. The table model provides various combinations of input transition times at the cell input pin and output load capacitance at the output pin to determine the delay through the unit.

According to the delay table, when the input falling transition time is 0.3 ns and the output load is 0.16 pF, the rise delay of the inverter is 0.1018 ns. Since a falling transition at the input causes a rising transition at the inverter output, the cell_rise delay table should be queried when the input pin sees a falling edge. Note that the table model can also be three-dimensional, for example, for a flip-flop with complementary outputs Q and QN.

The NLDM model can be used not only to calculate delays but also to calculate the transition time of the logic unit's output pin, which is also characterized by input transition time and output load capacitance.

Thus, the NLDM model can calculate:

Rise Delay
Fall Delay
Rise Slew
Fall Slew

Additionally, if there is no corresponding index in the table, interpolation can be used to calculate the result.
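The interpolation step can be sketched in Python as below. This is a minimal sketch of bilinear interpolation, not the exact fitting method of any particular tool. The table is illustrative: the entry at (0.3 ns, 0.16 pF) is chosen to agree with the 0.1018 ns value quoted above, and all other index points and delays, as well as the names `SLEW_INDEX`, `LOAD_INDEX`, `CELL_RISE`, `bracket`, and `lookup`, are assumptions made for this example.

```python
from bisect import bisect_right

# Illustrative table only. The (0.3 ns, 0.16 pF) entry agrees with the value
# quoted in the text; every other number here is an assumption.
SLEW_INDEX = [0.1, 0.3, 0.7]      # input transition time (ns)
LOAD_INDEX = [0.16, 0.35, 1.43]   # output load capacitance (pF)
CELL_RISE = [                     # rows follow SLEW_INDEX, columns follow LOAD_INDEX
    [0.0513, 0.1537, 0.5280],
    [0.1018, 0.2327, 0.6476],
    [0.1334, 0.2973, 0.7252],
]


def bracket(index, x):
    """Return (i, t): lower grid position and fractional distance to index[i + 1]."""
    i = bisect_right(index, x) - 1
    i = min(max(i, 0), len(index) - 2)  # clamp so that i + 1 is always valid
    t = (x - index[i]) / (index[i + 1] - index[i])
    return i, t


def lookup(table, slew, load):
    """Bilinear interpolation; points outside the grid are linearly extrapolated."""
    i, ts = bracket(SLEW_INDEX, slew)
    j, tl = bracket(LOAD_INDEX, load)
    low = table[i][j] * (1 - tl) + table[i][j + 1] * tl
    high = table[i + 1][j] * (1 - tl) + table[i + 1][j + 1] * tl
    return low * (1 - ts) + high * ts


on_grid = lookup(CELL_RISE, 0.3, 0.16)   # falls exactly on a table entry
between = lookup(CELL_RISE, 0.2, 0.16)   # halfway between two slew rows
```

The same `lookup` function would serve for the output slew tables, since they are indexed by the same two parameters.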

Derate#

skip it

Slew values are based on measurement threshold points specified in the library, and most previous generation libraries (0.25um or older) used 10% and 90% (corresponding to the linear portion of the waveform) as measurement threshold points for slew (or transition time).

With technological advancements, the most linear portion of the actual waveform is usually between 30% and 70%. Therefore, most new-generation timing libraries specify the slew measurement thresholds at 30% and 70% of Vdd. However, since transition times were previously measured between 10% and 90%, the transition times measured between 30% and 70% are typically doubled when the library is populated. This is indicated by the slew derate factor, usually set to 0.5: table values are multiplied by 0.5 to recover the 30%-70% transition time. Slew thresholds of 30% and 70% with a slew derate factor of 0.5 are therefore equivalent to measurement thresholds of 10% and 90%.
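The arithmetic behind the slew derate factor can be sketched as follows. This is a small illustration that assumes a perfectly linear ramp between the thresholds; the function names and the 0.3 ns measurement are hypothetical.

```python
def slew_derate(lo_pct, hi_pct, ref_lo_pct=10.0, ref_hi_pct=90.0):
    """Ratio of the measured voltage swing to the reference swing (linear ramp assumed)."""
    return (hi_pct - lo_pct) / (ref_hi_pct - ref_lo_pct)


def to_library_value(measured_slew, derate):
    """Extrapolate a threshold-to-threshold measurement to the value stored in the table."""
    return measured_slew / derate


def to_threshold_value(library_slew, derate):
    """Recover the threshold-to-threshold transition time from a table value."""
    return library_slew * derate


derate = slew_derate(30.0, 70.0)              # 40% swing over 80% swing
stored = to_library_value(0.3, derate)        # hypothetical 0.3 ns measured at 30%-70%
recovered = to_threshold_value(stored, derate)
```

With 30% and 70% thresholds the factor works out to 0.5, so a 0.3 ns measurement is stored as 0.6 ns and multiplied by 0.5 again when it is used.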

Combinational Logic Units#

For a two-input AND gate: there are four types of delays and four types of output transitions.

2 output transitions (rise and fall) × 2 input pins = 4
In an FPGA, all delay information for each logic unit is generally fixed, so each type of logic unit fits a fixed delay (e.g., LUT is 0.1ns, DSP is 1.3ns, etc.).

General Combinational Logic Block#

Consider the following general combinational logic block with three inputs and two outputs:

Such a combinational logic block can have multiple timing arcs. Typically, there is a timing arc from each input of the block to each output, so a block with three inputs and two outputs has up to 3 × 2 = 6 arcs.

Sequential Logic Units#

The timing arcs of sequential logic units are as follows:

For synchronous input signals at pins D, SI, and SE, there are the following timing arcs (both rise and fall):

Setup time check timing arc
Hold time check timing arc

For synchronous output signals at pin Q, there is the following timing arc:

CK to Q or QN propagation delay arc

For asynchronous input signals at pin CDN, there are the following timing arcs:

Removal time check timing arc
Recovery time check timing arc

Additionally, for the clock pin and asynchronous pins, there are also:

Pulse width timing check

Setup and Hold#

Setup time and hold time synchronous timing checks are used to ensure that data can be correctly propagated through the timing unit. These timing checks can verify whether the input data is in a determined logical state at the clock's active edge and that the correct data is latched at the active edge.

The setup and hold constraints are described by two-dimensional table models indexed by the transition times at the constrained pin (D) and the related pin (CK).

Setup and hold details will be introduced later.

Asynchronous Timing Checks#

SKIP

State-Dependent Timing Models#

SKIP

The timing arc between inputs and outputs depends on the logical state of other pins in the module.

Black Box Interface Timing Models#

SKIP

Advanced Timing Models#

SKIP

Non-linear delay models (NLDM) are timing models that represent delays through timing arcs based on output load capacitance and input transition time. In practice, the load on the unit's output not only includes capacitance but should also include interconnect resistance.

Since NLDM assumes that the output load is purely capacitive, interconnect resistance becomes an issue. When the impact of interconnect resistance is small, NLDM models are still used as they are. When it is not, delay calculation tools improve the result by computing an equivalent effective capacitance at the cell output, chosen so that the cell's output delay with this purely capacitive load matches its output delay with the real RC interconnect.

Since NLDM cannot handle the errors caused by interconnect resistance well, more advanced timing models such as CCS (Composite Current Source) have been proposed.

Clock#

Skew#

Skew refers to the timing difference between two or more signals (data or clock). For example, if a clock tree has 500 endpoints and has a 50ps skew, it means that the delay difference between the longest clock path and the shortest clock path is 50ps.

The starting point of the clock tree is usually the node where the clock is defined, and the endpoints of the clock tree are usually the clock pins of synchronous elements (such as flip-flops). Clock latency (source latency plus network latency) refers to the total time taken from the clock source to an endpoint, while clock skew refers to the difference in arrival time at different endpoints of the clock tree.
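As a small worked example of these two definitions, the sketch below computes latency and skew from per-endpoint delays. The endpoint names and the picosecond values are hypothetical; they are chosen so that the skew comes out to the 50 ps used in the example above.

```python
def clock_latency(source_latency, network_latency):
    """Total latency from the clock source to one endpoint."""
    return source_latency + network_latency


def clock_skew(arrivals):
    """Difference between the latest and earliest clock arrival over all endpoints."""
    return max(arrivals.values()) - min(arrivals.values())


SOURCE_PS = 300                                                       # assumed source latency
NETWORK_PS = {"UFF0/CK": 900, "UFF1/CK": 925, "UFF2/CK": 950}         # assumed tree delays

arrivals_ps = {pin: clock_latency(SOURCE_PS, d) for pin, d in NETWORK_PS.items()}
skew_ps = clock_skew(arrivals_ps)
```

Because the source latency is common to every endpoint, it cancels out of the skew; only the differences inside the clock tree remain.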

An ideal clock tree assumes that the clock source has infinite driving capability, so the clock can drive an infinite number of endpoints with no delay. Additionally, it is assumed that any logic units present in the clock tree have zero delay. In the early stages of logic design, STA typically uses an ideal clock tree, so the focus of the analysis is on the data path. The set_clock_latency command can be used to specify the expected clock tree delay for an ideal clock.

Uncertainty#

The set_clock_uncertainty command specifies a window within which a clock edge may occur. The uncertainty of clock edge timing accounts for multiple factors, such as clock period jitter and any additional margin required for timing verification. In practice, there is no ideal clock; all clocks have a certain amount of jitter, and clock period jitter should be included when specifying clock uncertainty.

Before the clock tree is implemented, clock uncertainty must also include the expected clock skew. Hold time checks are performed between launch and capture events triggered by the same clock edge, so they do not need to include clock jitter, and a smaller clock uncertainty is usually specified for hold time checks.

Actual Clock Signals#

Actual clock signals include rising and falling edges:

Combining the two clock signals results in an ideal eye diagram, where only transitions are present:

However, in reality, clock signals have different arrival times (jitter), resulting in the following eye diagram:

Additionally, clock signals can experience voltage drops and ground bounce due to power supply variations.

Ultimately, the actual clock signal is as follows:

For level fluctuations, noise margin is defined, allowing for some distortion:

The area of the clock signal without jitter is referred to as the window where data is reliable:

The region of the clock signal affected by jitter is what we call jitter. Jitter has to be accounted for in the timing reports, and we model it using one more parameter called uncertainty.

Example: Uncertainty = 90ps = 0.09ns

Clock Domain#

A clock usually drives many flip-flops, and a group of flip-flops driven by the same clock is called its clock domain. The following diagram shows two clock domains:

A key question to consider is: Are the two clock domains related or independent of each other? The answer depends on whether there is a data path that starts in one clock domain and ends in another. If there is no such path, we can confidently say that the two clock domains are independent of each other, meaning there is no timing path that starts in one clock domain and ends in another.

If there is a data path crossing clock domains (as shown in the diagram below), it must be determined whether the path is a real path. For example, when a flip-flop driven by a double-frequency clock launches data that is then captured by a flip-flop driven by a single-frequency clock, the path is a real path.

An example of a false path is when designers explicitly place clock synchronizer logic between two clock domains. In this case, even though it seems that there is a timing path from one clock domain to the next, it is not a real timing path because the data is not constrained to propagate through the synchronizer logic within one clock cycle. Such paths are called false paths (not real) because it is the clock synchronizer that ensures data is correctly passed from one clock domain to another.

False paths are timing exceptions, so they are not covered further in this note.
In a design, some paths cannot exist or can never be exercised; these paths are called false paths. False paths usually occur in asynchronous circuits and across clock domains, or where the internal logic is complex but can be shown to evaluate to a constant that never changes.

In practice, cross-clock domain situations are often bidirectional, i.e., from the USBCLK clock domain to the MEMCLK clock domain, and from the MEMCLK clock domain to the USBCLK clock domain; both situations need to be correctly understood and handled in STA.

SDC#

Correct constraints are important for analyzing STA results; only by accurately specifying the design environment can STA analysis identify all timing issues in the design. The preparation for STA includes setting clocks, specifying IO timing characteristics, and specifying false paths and multi-cycle paths.

To perform STA on such a design, it is necessary to specify the clock for the flip-flops and the timing constraints for all paths entering and exiting the design.

Specifying Clocks#

To define a clock, we need to provide the following information:

Clock source: It can be a port of the design or a pin of an internal unit in the design (usually part of clock generation logic).
Period: The period of the clock.
Duty cycle: The duration of the high level (positive phase) and the duration of the low level (negative phase).
Edge times: The moments of the rising and falling edges.

Example of creating a clock: create_clock -name SYSCLK -period 20 -waveform {0 5} [get_ports SCLK]. This clock is named SYSCLK and is defined at port SCLK. The period of SYSCLK is specified as 20 units; the time unit is usually specified in the technology library and defaults to nanoseconds otherwise. The first argument of -waveform specifies when the rising edge occurs, and the second specifies when the falling edge occurs, so here the clock is high from 0 to 5 and low from 5 to 20.
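The relationship between the period, the edge times, and the duty cycle can be checked with a few lines of Python. This is only an illustration of the arithmetic; `clock_shape` is a hypothetical helper, not part of any tool.

```python
def clock_shape(period, waveform):
    """Derive high time, low time, and duty cycle from (rise_time, fall_time)."""
    rise, fall = waveform
    high = (fall - rise) % period   # time spent at the high level per period
    return {"high": high, "low": period - high, "duty_cycle": high / period}


# Values taken from the create_clock example: period 20, waveform {0 5}.
shape = clock_shape(20, (0, 5))
```

For the SYSCLK example this gives a high time of 5, a low time of 15, and a duty cycle of 25%.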

Clock Uncertainty#

The set_clock_uncertainty constraint can be used to specify the timing uncertainty of the clock cycle, which can be used to model various factors that may reduce the effective clock cycle. These factors may include jitter and any other pessimism that may need to be considered in timing analysis.

set_clock_uncertainty -setup 0.2 [get_clocks CLK_CONFIG]; note that the clock uncertainty for setup time checks will reduce the available effective clock cycle. For hold time checks, clock uncertainty will serve as additional timing margin that needs to be satisfied.
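The effect of uncertainty on the two checks can be sketched as below. This is a simplified single-clock model under stated assumptions: launch and capture share one clock, the capture latency defaults to zero, and the period, setup time, hold time, and arrival times are hypothetical numbers. Only the 0.2 setup uncertainty comes from the command above.

```python
def setup_required(period, setup_time, uncertainty, capture_latency=0.0):
    """Latest time data may arrive: uncertainty shrinks the usable period."""
    return period + capture_latency - uncertainty - setup_time


def hold_required(hold_time, uncertainty, capture_latency=0.0):
    """Earliest time data may change: uncertainty adds to the margin to be met."""
    return capture_latency + uncertainty + hold_time


# Hypothetical path: 10 ns period, 0.2 ns setup time, 0.1 ns hold time.
latest_arrival = 9.5
earliest_arrival = 0.4

setup_slack = setup_required(10.0, 0.2, 0.2) - latest_arrival
setup_slack_ideal = setup_required(10.0, 0.2, 0.0) - latest_arrival
hold_slack = earliest_arrival - hold_required(0.1, 0.05)
```

Under these assumptions the setup slack drops from about 0.3 ns to about 0.1 ns once the 0.2 ns uncertainty is applied, which is the "reduced effective clock cycle" described above.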

Clock Latency#

Clock latency can be set using the following command, such as set_clock_latency 1.8 -rise [get_clocks MAIN_CLK].

There are two types of clock latency: network latency and source latency: the total clock latency at the clock pin of the flip-flop is the sum of source latency and network latency. After the clock tree synthesis is completed, the total clock latency from the clock source to the clock pin of the flip-flop is the source latency plus the actual delay of the clock tree from the clock definition point to the flip-flop.

Network latency refers to the delay from the clock definition point (create_clock) to the clock pin of the flip-flop.
- The specified value is ignored after CTS (once the clock is propagated); the actual clock tree delay is used instead
Source latency, also known as insertion delay: refers to the delay from the clock source to the clock definition point; source latency may represent on-chip or off-chip delay.
- Retained after CTS

An important distinction between source latency and network latency is that once a clock tree is established for the design, network latency can be ignored (assuming the set_propagated_clock command is specified).
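The distinction can be illustrated with a short sketch. The function below is a hypothetical model of how the clock arrival at a flip-flop is assembled before and after CTS; the 0.5 ns source latency and 1.6 ns tree delay are assumed values, while 1.8 echoes the set_clock_latency example above.

```python
def clock_arrival(source_latency, network_latency, propagated=False, tree_delay=None):
    """Arrival time of the clock edge at a flip-flop clock pin."""
    if propagated:
        # After CTS the estimated network latency is replaced by the real tree delay.
        if tree_delay is None:
            raise ValueError("a propagated clock needs the actual clock tree delay")
        return source_latency + tree_delay
    # Before CTS the clock is ideal and the specified network latency is used.
    return source_latency + network_latency


pre_cts = clock_arrival(0.5, 1.8)
post_cts = clock_arrival(0.5, 1.8, propagated=True, tree_delay=1.6)
```

In both cases the source latency is kept; only the network part switches from an estimate to the propagated value.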

Constraining Input Paths#

The flip-flop UFF0 is external to the design and provides data to the internal flip-flop UFF1. The data connects the two flip-flops through the input port INP1.

The clock definition for CLKA specifies the clock period, which is the total time available between the two flip-flops UFF0 and UFF1. The time required by the external logic is Tclk2q (the CK to Q delay of the launching flip-flop UFF0) plus Tc1 (the delay through the external combinational logic), so the input delay at pin INP1 specifies an external delay of Tclk2q plus Tc1.

The following are the constraints for input delays (which can be defined separately for min and max):

set Tclk2q 0.9
set Tc1 0.6
set_input_delay -clock CLKA -max [expr {$Tclk2q + $Tc1}] [get_ports INP1]
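To see what this constraint leaves for the logic inside the design, the budget can be worked out as below. The Tclk2q and Tc1 values come from the constraint above; the 5 ns period of CLKA and the 0.3 ns setup time of UFF1 are assumptions made only for this example.

```python
T_CLK2Q = 0.9   # CK-to-Q delay of the external flip-flop UFF0 (from the text)
T_C1 = 0.6      # delay of the external combinational logic (from the text)


def external_input_delay(tclk2q, tc1):
    """Value passed to set_input_delay -max."""
    return tclk2q + tc1


def internal_budget(period, input_delay, setup_time):
    """Time left for the path from the input port to the D pin of the capture flop."""
    return period - input_delay - setup_time


delay = external_input_delay(T_CLK2Q, T_C1)
budget = internal_budget(5.0, delay, 0.3)   # assumed period and setup time
```

With these numbers the external logic consumes 1.5 ns, leaving about 3.2 ns for the internal path from INP1 to UFF1.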

Constraining Output Paths#

Constraining output paths is similar to constraining input paths: the set_output_delay command specifies the external delay at an output port.

Timing Path Groups#

Timing paths in the design can be viewed as a collection of paths, each with a start point and an endpoint.

Timing paths can be classified into different timing path groups based on the clock associated with the endpoint. Therefore, each clock has a set of timing paths associated with it. There is also a default timing path group that includes all non-clock (asynchronous) paths.

External Attribute Modeling#

Although create_clock, set_input_delay, and set_output_delay are sufficient to constrain all paths used for timing analysis in the design, they are not enough to obtain the accurate timing on the module's IO pins.

For inputs, slew must be specified at the input port:

set_drive
set_driving_cell
set_input_transition

For outputs, the capacitive load at the output pin must be specified:

set_load

Drive Strength Modeling#

In summary, designers need to specify the slew value at the input to determine the delay of the first unit in the input path. In the absence of this constraint, it will be assumed to be an ideal transition value of 0, which is clearly unrealistic.

The set_drive and set_driving_cell constraints model the drive strength of the external cells that drive the input ports of the module. In the absence of these constraints, all inputs are assumed to have infinite drive strength, meaning the transition time at the input pin is 0.

set_drive explicitly specifies the drive resistance at an input pin of the DUA (design under analysis): the smaller the resistance, the higher the drive strength, and a resistance of 0 indicates infinite drive strength. The drive strength at the input port is used to calculate the transition time at the first cell. It can also be used to calculate the delay from the input port to the first cell in the presence of RC interconnect:

Delay value = (Drive resistance * Net load) + Interconnect delay
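The formula above can be evaluated directly. In this sketch the first term is the resistance given to set_drive multiplied by the net load; the units are chosen so that kΩ × pF gives ns, and all numeric values are hypothetical.

```python
def input_port_delay(drive_resistance_kohm, net_load_pf, interconnect_delay_ns=0.0):
    """Delay from an input port to the first cell: R_drive * C_net + interconnect delay."""
    return drive_resistance_kohm * net_load_pf + interconnect_delay_ns


strong = input_port_delay(0.1, 0.2, 0.05)   # low resistance, i.e. strong external driver
weak = input_port_delay(1.0, 0.2, 0.05)     # high resistance, i.e. weak external driver
ideal = input_port_delay(0.0, 0.2, 0.05)    # resistance 0, i.e. infinite drive strength
```

A resistance of 0 leaves only the interconnect delay, which matches the statement that a resistance value of 0 indicates infinite drive strength.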

The set_driving_cell constraint provides a more convenient and accurate way to describe the driving capability of a port: it specifies the type of cell that drives the input port. Note that the incremental delay of the driving cell caused by the capacitive load at the input port is included as additional delay on the input.

As an alternative to the above methods, the set_input_transition constraint provides a convenient way to represent transition time at the input port and can specify a reference clock.

Load Capacitance Modeling#

Specifying the load on the output is important because this value affects the delay of the unit driving the output. In the absence of this constraint, the load will be assumed to be 0, which is clearly unrealistic.

The set_load constraint sets the capacitive load at the output port to simulate the external load driven by the output port. By default, the capacitive load at the port is 0. The load can be explicitly specified as a capacitance value or the input pin capacitance of a unit.

DRV#

Two commonly used design rules in STA are maximum transition time (set_max_transition) and maximum capacitance (set_max_capacitance). These rules check whether all ports and pins in the design meet the specified limits for transition time and capacitance.

In addition, other design rule checks can be specified for the design, such as: set_max_fanout (specifying fanout constraints for all pins in the design) and set_max_area (for the design). However, these checks apply to synthesis rather than STA.

Delay Calculation#

Basic Concepts of Delay Calculation#

As mentioned above, each unit's input pin has pin capacitance, so each net will have capacitive load, which is the sum of the pin load capacitance of all fanouts and the parasitic capacitance of the interconnect.

Consider the following design:

For NET0, ignoring interconnect parasitics, its capacitance equals the sum of the input pin capacitance of UAND1 and UNOR2. Thus, the above diagram can be equivalently represented as:

The load capacitance at output O1 equals the output port load (unspecified by default; it can be set with set_load) plus the input pin capacitance of UNOR2 (specified in the library). At this point, specifying the slew at input I1 (or using set_drive) is enough to obtain the propagation delay and output transition of cell UAND1, and the output transition of one stage becomes the input transition of the next.

Since multi-input units have multiple timing arcs from different inputs to outputs, the value of output transition is determined by the slew merge results.
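The load bookkeeping described in this section can be sketched as a simple sum. The pin names and capacitance values below are hypothetical; in a real flow the pin capacitances come from the library and the wire capacitance from a wire load model or extraction.

```python
def net_load(fanout_pin_caps, wire_cap=0.0, port_load=0.0):
    """Capacitive load on a net: fanout pin caps + interconnect cap + external port load."""
    return sum(fanout_pin_caps.values()) + wire_cap + port_load


# NET0 drives one input of UAND1 and one input of UNOR2 (interconnect ignored).
net0_pf = net_load({"UAND1/A": 0.012, "UNOR2/B": 0.010})

# O1 sees the external load that set_load would specify, plus an input pin of UNOR2.
o1_pf = net_load({"UNOR2/A": 0.010}, port_load=0.050)
```

The resulting load is the value that, together with the input slew, would be used to look up the delay and output transition of the driving cell.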

Effective Capacitance Calculation for Unit Delay#

When the load at the unit output pin includes interconnect resistance, the NLDM model cannot be used directly. Therefore, the "effective" capacitance method is used to handle the impact of resistance.

The effective capacitance method attempts to find a single capacitance that can be used as an equivalent load, so that the timing at the cell output with this capacitive load matches the timing of the original design. This equivalent capacitance is referred to as the effective capacitance.

In practice, the resistance of interconnect parasitics cannot be ignored, and the RC interconnect can be modeled as a simplified PI model. Since NLDM only accepts capacitance, the RC network is reduced to an equivalent $C_{eff}$, allowing the NLDM lookup tables to be used to obtain the cell delay. Different algorithms exist for calculating $C_{eff}$, such as second-order AWE, the Arnoldi algorithm, etc.

Note: Although it is possible to obtain approximate unit delays, the output slew does not match the actual output waveform of the unit.

Net Delay#

From basic circuit theory, a wire can be equivalently represented by resistance and capacitance (R and C), so the delay of a signal traveling over it can be simplified to an RC delay. Overall, wire delay depends on wire width, wire length, process, and fanout. At different stages of the EDA flow, the wire delay between two pins is estimated with different models.

Logic synthesis: For example, Synopsys Design Compiler estimates the wire delay between two pins based on the wire load model (WLM). At this stage the design has not yet been placed and routed, so there are no relative positions from which to determine a routing path. WLM therefore estimates the length of a net, and hence its delay, from the number of fanouts. The error can be large; after all, a net with few fanouts may still be stretched far during placement. WLMs are usually provided by the ASIC/FPGA vendor, and designers can fine-tune them for their designs. Within one design, different hierarchy levels and different nets can use different WLMs to approximate the actual delays.
Layout: During placement, the position of each cell is known, so positional information can be used to estimate routes: first estimate how long the wire between two connected cells is, then estimate the delay from the wire length. Although a longer wire generally means a longer delay, the relationship is not entirely linear; driving from Guangzhou to Beijing takes a different time depending on how much of the trip is on highways rather than county roads. In Cadence Innovus, timing estimates before routing are generally based on Trial Route or Early Global Route: a rough routing is estimated, RC parasitics are extracted from it, and these parasitics are combined with the input pin capacitances of the load cells to obtain the wire delay. What matters most is an accurate routing estimate, which keeps the timing jump between pre-route and post-route analysis very small.
Routing: At this stage, not only the positions but also the actual metal routes are known, so RC parameters can be extracted directly and fed to the timing analysis engine.

Elmore Model for Interconnect Delay Calculation#

Elmore is a delay model used to calculate net delay under specific conditions of RC interconnect structures.
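A minimal sketch of the Elmore calculation for an RC tree is shown below. Each node is connected to its parent through a resistance and has a capacitance to ground; the delay to a node is the sum, over every resistor on the path from the driver, of that resistance times all the capacitance downstream of it. The three-node tree and its values are hypothetical, with units chosen so that kΩ × pF gives ns.

```python
def elmore_delays(parent, res, cap):
    """Elmore delay from the driver to every node of an RC tree.

    parent[n] is the upstream node of n (None if n connects to the driver),
    res[n] is the resistance between parent[n] and n, cap[n] is the cap at n.
    """
    children = {n: [] for n in parent}
    roots = []
    for node, up in parent.items():
        if up is None:
            roots.append(node)
        else:
            children[up].append(node)

    downstream = {}

    def total_cap(node):
        # Capacitance at this node plus everything below it.
        downstream[node] = cap[node] + sum(total_cap(c) for c in children[node])
        return downstream[node]

    delay = {}

    def walk(node, upstream_delay):
        # Each resistor charges all the capacitance downstream of it.
        delay[node] = upstream_delay + res[node] * downstream[node]
        for child in children[node]:
            walk(child, delay[node])

    for root in roots:
        total_cap(root)
        walk(root, 0.0)
    return delay


# Hypothetical tree: driver -> n1, which branches to n2 and n3.
PARENT = {"n1": None, "n2": "n1", "n3": "n1"}
RES = {"n1": 1.0, "n2": 2.0, "n3": 1.0}   # kOhm
CAP = {"n1": 1.0, "n2": 3.0, "n3": 2.0}   # pF

delays = elmore_delays(PARENT, RES, CAP)
```

For this tree the first resistor sees 1 + 3 + 2 = 6 pF, so the delay to n1 is 6 ns, the delay to n2 is 6 + 2 × 3 = 12 ns, and the delay to n3 is 6 + 1 × 2 = 8 ns.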

Slew Merging (TBD)#

Path Delay Calculation#

Review several concepts: timing path, timing arc

The theoretical timing path has a start point and an endpoint:

Start point: input port and clk pin
End point: d pin and output port

Thus, there are four types of timing paths: r2r, i2o, i2r, and r2o.

Timing arcs are used to describe:

The signal transmission relationship between pins (transmission delay and how it changes)
Timing constraints: setup/hold, etc.

Therefore, once timing arcs annotate the whole design, calculating path delay is simply adding all net arcs and cell arcs.

`I2O`#

The first type of timing path, from input port to output port.

The transition time from the input port to the first load cell needs special handling, i.e., the transition time (or slew) at the input of the first inverter can be specified; if no such specification is made, it is assumed to be 0 (equivalent to the ideal case).

In OpenSTA, if not specified, the load slew of the first cell is 0. The root to the first cell's load delay and load slew can be calculated when seedRootSlew is executed.

Additionally, an equivalent capacitance can be calculated based on the RC load situation at the output of the first cell, allowing for lookup to obtain the first cell's delay and output slew.

Once the output slew of the first cell is calculated, the input slew of the next-level unit can be obtained, and this process continues in a loop.

Note that, just as the first-stage input needs a slew, the last-stage output needs a load specified with set_load; otherwise, only the wire load of net N3 is used.
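The stage-by-stage loop described above can be sketched as follows. The linear delay and slew formulas are a hypothetical stand-in for the NLDM table lookup, and the coefficients, loads, and net delays are invented for illustration; the point is only that each stage's output slew feeds the next stage's input slew while the delays accumulate.

```python
# Hypothetical inverter model: coefficients are assumptions, not library data.
INV = {"d0": 0.05, "ks": 0.2, "kc": 1.0, "s0": 0.02, "ss": 0.1, "sc": 2.0}


def stage(in_slew, load, cell):
    """Stand-in for an NLDM lookup: returns (cell delay, output slew)."""
    delay = cell["d0"] + cell["ks"] * in_slew + cell["kc"] * load
    out_slew = cell["s0"] + cell["ss"] * in_slew + cell["sc"] * load
    return delay, out_slew


def path_delay(stages, input_slew=0.0):
    """Walk the path, feeding each stage's output slew into the next stage."""
    total, slew = 0.0, input_slew
    for cell, load, net_delay in stages:
        cell_delay, slew = stage(slew, load, cell)
        total += cell_delay + net_delay
    return total


# Two inverters: (cell model, load in pF, delay of the net after the cell in ns).
STAGES = [(INV, 0.01, 0.005), (INV, 0.02, 0.0)]

ideal_input = path_delay(STAGES)                     # input slew defaults to 0
real_input = path_delay(STAGES, input_slew=0.2)      # slew specified at the port
```

With this toy model, leaving the input slew at 0 gives a shorter path delay than specifying a realistic slew, which is why an unconstrained input is optimistic.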

`I2R`#

Similar calculations apply.

`R2R`#

Similar calculations apply.

Timing Graph#

STA breaks a design down into timing paths, calculates the signal propagation delay along each path, and checks for violations of timing constraints inside the design and at the input/output interface.

Timing Path#

Timing paths have start and end points, defined as follows:

Based on the start point and endpoint, timing paths can be classified into four categories:

Input port to d pin, I2R
Clk pin to output port, R2O
Clk pin to d pin, R2R
Input port to output port, I2O

Timing paths are a collection of segments of timing arcs; in addition to being classified based on start and end points, they can also be classified by signal type or timing check: Data path, Clock path, Clock-gating path, Asynchronous path.

Timing Graph#

Consider the following netlist:

Convert the above circuit to a 'Directed Acyclic Graph (DAG)' shown below:

OpenSTA Timing Graph#

The timing graph is a flat DAG, although OpenSTA has a full hierarchical netlist.

The following netlist is an example:

Convert it to a timing graph:

A vertex is defined as: Each vertex corresponds to one network pin.

It includes internal pins (not shown in the diagram).

An edge is defined as: There is one edge between each pair of pins that has a timing path between them.

Each edge has its own timing role: it represents a cell delay, a wire delay, or one of the various types of timing check.


Additionally, a set of timing arcs is stored on each edge: A timing arc set is a group of related timing arcs between a pair of cell ports. Wire timing arcs are a special set owned by the TimingArcSet class.
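The vertex/edge/arc-set relationship can be pictured with a small Python sketch. This is not OpenSTA's C++ API: the `Edge` and `TimingGraph` classes, the role strings, and the pin names are all hypothetical and only mirror the description above (one vertex per pin, one edge per connected pin pair, a set of arcs per edge).

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    src: str                 # source pin (vertex)
    dst: str                 # destination pin (vertex)
    role: str                # e.g. "wire", "cell", "setup", "hold"
    arcs: list = field(default_factory=list)  # (input transition, output transition)


@dataclass
class TimingGraph:
    vertices: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_edge(self, src, dst, role, arcs):
        """Create one edge between two pins and register both pins as vertices."""
        self.vertices.update((src, dst))
        edge = Edge(src, dst, role, list(arcs))
        self.edges.append(edge)
        return edge

    def fanout(self, pin):
        return [e for e in self.edges if e.src == pin]


graph = TimingGraph()
# Wire edges keep the transition direction; an inverter flips it.
graph.add_edge("in", "u1/A", "wire", [("rise", "rise"), ("fall", "fall")])
graph.add_edge("u1/A", "u1/Z", "cell", [("rise", "fall"), ("fall", "rise")])
graph.add_edge("u1/Z", "out", "wire", [("rise", "rise"), ("fall", "fall")])
print(sorted(graph.vertices))
```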

Timing Analysis Methods#

In terms of analysis, it can be divided into Path-Based and Block-Based, with the main difference being how the transition time of specific logic units is handled.

In practice, during the operation of the circuit, the input level transition time received by a logic unit is influenced by the preceding logic unit.


The input transition at pin C depends on the output transition of the preceding logic unit, and the output transition caused by different input pin transitions varies, thus affecting the input transition of C. How to determine the transition time at C is the difference in timing analysis algorithms.

Graph-Based#

Graph-based static timing analysis (GBA) is the default analysis mode for most tools. When reading cell delays from the standard cell library, it calculates delays using the worst-case transition time at each pin. In the example above, regardless of whether A or B transitions, the maximum transition time at C is always taken, say 12ps. So even if, on a particular timing path, the signal at A stays unchanged and only pin B switches, in which case the subsequent blue NOR gate would really see the 9ps transition caused by pin B, GBA still uses 12ps. GBA is therefore pessimistic and may report timing violations on some paths that would not occur in practice, because real switching activity is unlikely to drive every logic unit at exactly its worst-case transition time. Path-based static timing analysis (Path-Based Analysis, PBA) was introduced to reduce this pessimism and improve accuracy.

Path-Based#

PBA analyzes timing path by path.

Compared to GBA, PBA traverses the possible timing paths and, in theory, enumerates all input transition combinations along them, which yields the most accurate timing results. In the example above, if the transition occurs at pin B, PBA uses the actual 9ps transition from pin B to calculate the delay of the next-level blue NOR gate. However, because it covers many more scenarios than GBA, its runtime is much longer; in complex cases, PBA may be an order of magnitude slower than GBA.

GBA vs PBA#

For the same combinational design, GBA vs PBA is shown below:

min_delay_in_GBA <= min_delay_in_PBA
max_delay_in_GBA >= max_delay_in_PBA

In GBA (Graph-Based Analysis), instead of keeping the two delay combinations of AND gate (1) separately, i.e., (Combination_1: 0.5ns, 1.5ns; Combination_2: 0.2ns, 1.2ns), we take the extreme boundaries, i.e., min delay = 0.2ns and max delay = 1.5ns.

In PBA (Path-Based Analysis), we use the actual delay of each input-pin-to-output combination (that is, both delay combinations are kept):

Combination_1: 0.5ns, 1.5ns
Combination_2: 0.2ns, 1.2ns

You might be thinking that this is not accurate (that is, why does GBA drop two of the values?) and that we are adding unnecessary delay to our calculation. And I am glad to say that you are right. :) The reason for doing this is that, from the tool's point of view, analysis according to GBA is much faster than PBA, so the tool's runtime stays low. The only cost is that we add pessimism to our calculations.

GBA is faster than PBA
GBA is more pessimistic than PBA

Based on the above characteristics, GBA and PBA play different roles in static timing analysis. GBA gives a fast but rough analysis. Because GBA is pessimistic, if it reports no violations, PBA should report none either. If GBA does report violations, we can then run PBA, but there is no need to re-analyze every timing path; only the paths that violated in GBA mode need to be analyzed (although a global PBA run is also possible).
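The AND gate numbers used in this section can be checked with a short Python sketch. It assumes that Combination_1 and Combination_2 are the (min, max) delays of the arcs from input pins A and B respectively; that pin assignment, like the function names, is an assumption made for illustration.

```python
# (min delay, max delay) in ns for each input-to-output arc of the AND gate.
arc_delay = {"A": (0.5, 1.5), "B": (0.2, 1.2)}


def gba_bounds(arcs):
    """GBA: one bound per cell, the extremes over all of its arcs."""
    return (min(lo for lo, _ in arcs.values()),
            max(hi for _, hi in arcs.values()))


def pba_bounds(arcs, pin):
    """PBA: the bound of the arc that the analyzed path really goes through."""
    return arcs[pin]


gba_min, gba_max = gba_bounds(arc_delay)
for pin in arc_delay:
    pba_min, pba_max = pba_bounds(arc_delay, pin)
    # GBA always brackets PBA, so it can only be more pessimistic.
    print(pin, gba_min <= pba_min, gba_max >= pba_max)
```

For a path through pin B, GBA charges a max delay of 1.5ns where PBA charges 1.2ns, so the 0.3ns difference is pure pessimism on that path.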

GBA Delay Calculation#

Each cell arc regardless of rise, fall, or min, max, takes extreme values, making calculations simpler and faster, but also more pessimistic and less accurate.

PBA Delay Calculation#

PBA will exhaust all arc combinations on a timing path.

Principles of Graph-Based Static Timing Analysis#

Assume that all latches receive the clock rising edge at the same time (i.e., ignore the clock skew caused by layout). Under this series of simplifications, the STA problem reduces to finding, in a directed graph, how far each timing endpoint is from its farthest timing start point (that is, the Arrival Time: the delay of the signal from the source to a given node). This is the longest path problem (reference: How to find the longest path in a directed acyclic graph?; here we solve the multi-source multi-sink variant), as shown in the figure below. In short, the algorithm starts from all start points, traverses all nodes, and updates each node's farthest distance from the start points:

$ArrivalTime[i] = \max_j \{ArrivalTime[Predecessor[i, j]] + CellDelay[j] + NetDelay[i,j]\}$

Where i is the current node number, $Predecessor[i, j]$ refers to the number of the j-th predecessor node of node i, and $ArrivalTime[Predecessor[i, j]]$ is the time the source signal arrives at that predecessor node, $CellDelay[j]$ is the logic delay of the predecessor node, and $NetDelay[i,j]$ is the wire delay from the predecessor node to node i, which can be obtained based on the coordinates of the two nodes.
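A minimal Python sketch of this recurrence is given below. It processes nodes in topological order (Kahn's algorithm) so that every predecessor is final before it is used. The function name, the dictionary encoding of the graph, and the delay values are assumptions for this example.

```python
from collections import deque


def arrival_times(cell_delay, net_delay):
    """Longest-path arrival times on a DAG.

    cell_delay: {node: logic delay of that node}
    net_delay:  {(pred, node): wire delay from pred to node}
    """
    succs = {n: [] for n in cell_delay}
    indegree = {n: 0 for n in cell_delay}
    for pred, node in net_delay:
        succs[pred].append(node)
        indegree[node] += 1

    arrival = {n: 0.0 for n in cell_delay}  # start points stay at 0
    queue = deque(n for n, d in indegree.items() if d == 0)
    visited = 0
    while queue:
        pred = queue.popleft()
        visited += 1
        for node in succs[pred]:
            candidate = arrival[pred] + cell_delay[pred] + net_delay[(pred, node)]
            arrival[node] = max(arrival[node], candidate)
            indegree[node] -= 1
            if indegree[node] == 0:  # all predecessors are final
                queue.append(node)
    if visited != len(cell_delay):
        raise ValueError("graph contains a cycle")
    return arrival


# Hypothetical graph: start points a and b drive c, which drives endpoint d.
cell_delay = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 0.0}
net_delay = {("a", "c"): 1.0, ("b", "c"): 1.0, ("c", "d"): 2.0}
arrival = arrival_times(cell_delay, net_delay)
print(arrival)
```

Node c takes the later of its two candidates (through b), and d inherits that value plus the delay of c and the last net.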


Netlist Partitioning#

Since layout repeatedly re-evaluates timing, the multi-source multi-sink longest path algorithm is called frequently while the layout algorithm runs. From the recursive formula above, computing $ArrivalTime[i]$ for a node requires the ArrivalTime of all its predecessor nodes to be available already; otherwise the result is not guaranteed to be the longest path. Circuit partitioning aims to enable parallel computation: we color and partition the nodes so that the nodes within each subgraph do not depend on each other (i.e., there are no edges between them; the subgraph does not have to be connected). The ArrivalTime of all nodes in a subgraph can then be computed in parallel.

To achieve this property, the basic algorithm is: store all timing start points in a queue and mark them as level=0. Then run BFS, updating for each node its farthest distance (in levels) from any timing start point, so that every node in the directed graph ends up labeled with a level.
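The levelization step can be sketched in Python as follows. A node's level is the length of the longest chain of edges leading to it from any start point, so every edge goes from a lower level to a strictly higher one and nodes that share a level never depend on each other. The function name and the example graph are hypothetical.

```python
from collections import deque


def levelize(nodes, edges):
    """Group nodes by level, where level = longest edge count from a start point."""
    succs = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for src, dst in edges:
        succs[src].append(dst)
        indegree[dst] += 1

    level = {n: 0 for n in nodes}
    queue = deque(n for n in nodes if indegree[n] == 0)  # start points, level 0
    while queue:
        src = queue.popleft()
        for dst in succs[src]:
            level[dst] = max(level[dst], level[src] + 1)
            indegree[dst] -= 1
            if indegree[dst] == 0:
                queue.append(dst)

    by_level = {}
    for n in nodes:
        by_level.setdefault(level[n], []).append(n)
    return [by_level[lvl] for lvl in sorted(by_level)]


nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d"), ("b", "e")]
levels = levelize(nodes, edges)
print(levels)
```

Node d is reachable from a directly and through c; it is placed at level 2 because the longer chain decides the level.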

Timing Propagation Based on Layered Synchronous BFS#

Forward BFS to Calculate Arrival Time#

After levelization, all nodes at level=i have two properties: (1) apart from the special case of level=0 (start nodes at level=0 are forced to ArrivalTime=0 and are not computed), there are no edges between them, so their timing calculations do not depend on each other; (2) if ArrivalTime is computed in order from level=0 upward, then by the time level=i is reached, all nodes at levels 0 to i-1 already have their ArrivalTime. All nodes at level=i can therefore compute their arrival times simultaneously, and a parallelization framework such as OpenMP can be used for acceleration. We initialize the arrival time of all nodes to 0, then run the longest path algorithm described above to derive the timing information shown in the right diagram (the arrival time of the leftmost node in the third level should be 10).

Backward BFS to Calculate RAT#

After the forward delay analysis in the previous section, we know how far each endpoint is from the starting point. However, designers may have different constraints for each endpoint; they may want certain signals to arrive at specific timing endpoints earlier, thus introducing the concept of timing slack (Timing Slack):

$\text{Timing Slack} = \text{Required Arrival Time (specified by the user)} - \text{Arrival Time (actual arrival time)}$

If Timing Slack is less than 0, the signal arrives late, resulting in a timing violation. The Arrival Time is available once forward timing propagation has completed. Typically, for timing endpoints, the Required Arrival Time (RAT) is the clock period minus the setup time (Clock Period - Setup Time). For intermediate timing nodes, however, designers usually do not set a RAT. STA therefore performs backward timing propagation so that every node other than the endpoints also knows the latest time by which it must receive its signal. This is similar to project management, where each task needs a deadline so that it does not cause delays for the teams downstream.

The basic method of backward timing propagation is similar to forward propagation, but the previous forward formula was:

$ArrivalTime[i] = \max_j \{ArrivalTime[Predecessor[i, j]] + CellDelay[j] + NetDelay[i,j]\}$

While the backward formula is:

$RequiredArrivalTime[i] = \min_j \{RequiredArrivalTime[Successor[i, j]] - NetDelay[i,j]\} - CellDelay[i]$

Where $RequiredArrivalTime[i]$ is the RAT of node i, and $RequiredArrivalTime[Successor[i, j]]$ refers to the RAT of the j-th successor node of node i, and $NetDelay[i,j]$ is the wire delay between the two nodes, while $CellDelay[i]$ is the logic delay of node i.

Based on the transformation of the above formula, we also need to re-layer the netlist in reverse, as shown in the figure:

We initialize the RAT of all nodes to infinity, while the RAT of the endpoint nodes is set by the designer. We then run the backward counterpart of the propagation algorithm above (taking the minimum over successors instead of the maximum over predecessors) to derive the RAT information shown in the figure, assuming all endpoint RATs are 20:
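A minimal Python sketch of the backward pass and the resulting slack is shown below. The graph, the delays, and the arrival times are hypothetical; the arrival times are simply assumed to be the result of a forward pass on the same toy graph, and the single endpoint is given a RAT of 20.

```python
import math
from collections import deque


def required_times(cell_delay, net_delay, endpoint_rat):
    """Backward propagation: RAT[i] = min over successors (RAT - net) - cell[i]."""
    preds = {n: [] for n in cell_delay}
    outdegree = {n: 0 for n in cell_delay}
    for pred, node in net_delay:
        preds[node].append(pred)
        outdegree[pred] += 1

    rat = {n: math.inf for n in cell_delay}
    rat.update(endpoint_rat)  # endpoints are constrained by the designer
    queue = deque(n for n, d in outdegree.items() if d == 0)
    while queue:
        node = queue.popleft()
        for pred in preds[node]:
            candidate = rat[node] - net_delay[(pred, node)] - cell_delay[pred]
            rat[pred] = min(rat[pred], candidate)
            outdegree[pred] -= 1
            if outdegree[pred] == 0:  # all successors are final
                queue.append(pred)
    return rat


cell_delay = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 0.0}
net_delay = {("a", "c"): 1.0, ("b", "c"): 1.0, ("c", "d"): 2.0}
# Assumed forward-pass result for this toy graph.
arrival = {"a": 0.0, "b": 0.0, "c": 3.0, "d": 8.0}

rat = required_times(cell_delay, net_delay, {"d": 20.0})
slack = {n: rat[n] - arrival[n] for n in cell_delay}
print(rat)
print(slack)
```

The slack is the same (12) on every node of the critical path b, c, d, while a has more slack because its route into c is shorter.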

Incremental Timing Analysis#

If only local adjustments are made during layout, there is no need to rerun global STA: global analysis is slow, and the timing of many nodes may not change at all. In this case the forward and backward levels do not need to be rebuilt; we only need to reinsert the predecessor and successor nodes of the changed nodes into the BFS process above to obtain fast incremental timing analysis.

When the red node is moved, only the nodes covered by blue and orange need to be re-analyzed for STA.
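One way to picture the set of nodes to re-analyze is as two cones around the moved node. The sketch below assumes that arrival times can only change in the fan-out cone and required times only in the fan-in cone; this reading, the helper name `cone`, and the example graph are assumptions for illustration.

```python
def cone(start, adjacency):
    """Return every node reachable from `start` through `adjacency`."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


edges = [("a", "c"), ("b", "c"), ("c", "d"), ("b", "e")]
succs, preds = {}, {}
for src, dst in edges:
    succs.setdefault(src, []).append(dst)
    preds.setdefault(dst, []).append(src)

moved = "c"
redo_arrival = {moved} | cone(moved, succs)    # forward pass is rerun here
redo_required = {moved} | cone(moved, preds)   # backward pass is rerun here
untouched = {"a", "b", "c", "d", "e"} - redo_arrival - redo_required
print(sorted(redo_arrival), sorted(redo_required), sorted(untouched))
```

Node e lies in neither cone, so its timing is left as it was.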

Timing Analysis#

Timing analysis mainly focuses on setup and hold violations, which correspond to the worst case (latest arrival) and the best case (earliest arrival), respectively. Additionally, it is essential to master the usage of the following commands:

set_input_delay
set_output_delay
set_drive, set_driving_cell, and set_input_transition
set_load

Setup#

Setup time is the minimum time for which input data must remain stable before the active clock edge. Note: it is measured as the interval from the latest data signal crossing its threshold (usually 50% of Vdd) to the active clock edge crossing its threshold (usually 50% of Vdd).

Before the active edge of the clock reaches the flip-flop, the data should remain stable for a certain time, which is the setup time of the flip-flop, ensuring that the data is reliably captured by the flip-flop.

Note that setup time checks allow launch and capture to belong to different clock domains.

Setup Case 1#

Taking the following diagram as an example, the clock CLKM period is $T_{cycle}$.

For the launch path: the delay from clock CLKM to the clock pin of flip-flop UFF0 ($T_{launch}$) + the clock-to-Q propagation delay of flip-flop UFF0 ($T_{ck2q}$) + the data path delay ($T_{dp}$).
For the capture path: the delay from clock CLKM to the clock pin of flip-flop UFF1 ($T_{capture}$) + the clock period ($T_{cycle}$).


Since the capture setup time constraint requires the data signal to be stable at least one setup time in advance relative to the clock signal, the following formula must be satisfied:

$T_{launch} + T_{ck2q} + T_{dp} < T_{cycle} + T_{capture} - T_{setup}$

Arrival and Required in Setup#

It is known that the setup check must satisfy: (the above diagram is a specific case, and the following formula is a generic formula)

$T_{launch} + T_{ck2q} + T_{dp} < T_{cycle} + T_{capture} - T_{setup}$

Thus, the definitions of required and arrival time are as follows:

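Required time: capture path delay minus setup time, i.e., $T_{cycle} + T_{capture} - T_{setup}$
Arrival time: launch path delay, i.e., $T_{launch} + T_{ck2q} + T_{dp}$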

Since slack needs to be >= 0, the following formula holds:

$T_{cycle} + T_{capture} - T_{launch} - (T_{ck2q} + T_{dp}) - T_{setup} \ge 0$
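The setup slack formula translates directly into a few lines of Python. The function name and the numbers (in ns) are hypothetical; the clock latencies are set to 0 to mimic an ideal clock network.

```python
def setup_slack(t_cycle, t_launch, t_ck2q, t_dp, t_capture, t_setup):
    """Setup slack = required time - arrival time; negative means a violation."""
    arrival = t_launch + t_ck2q + t_dp
    required = t_cycle + t_capture - t_setup
    return required - arrival


# Hypothetical R2R path with an ideal clock (zero launch and capture latency).
slack_ok = setup_slack(t_cycle=10.0, t_launch=0.0, t_ck2q=0.5,
                       t_dp=6.0, t_capture=0.0, t_setup=0.25)

# Same path at a 6 ns clock period: the data now arrives too late.
slack_bad = setup_slack(t_cycle=6.0, t_launch=0.0, t_ck2q=0.5,
                        t_dp=6.0, t_capture=0.0, t_setup=0.25)
print(slack_ok, slack_bad)
```

Shortening the clock period reduces the required time one for one, which is how a setup check limits the maximum frequency.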

`R2R` Setup Check#

Analyzing the following timing report:

The start point & end point are both flip-flops, triggered by the rising edge of clock CLKM.
Path group: determined by capture ff.
Path type: max, i.e., setup time check.
Clock network delay is zero as it's an ideal clock network.
- i.e., $T_{launch}$ and $T_{capture}$ are zero.
Clock uncertainty:
- Jitter
- Setup time

Clock Network Delay#

What is the clock network delay in the timing report? Why is it marked as ideal? This line in the timing report indicates that the clock tree is considered ideal, and any buffers in the clock path are assumed to have zero delay. Once the clock tree is built, the clock network can be marked as "propagated," allowing the clock path to display actual delay values.

Clock network delay is used to model the delay through the clock path before the clock tree is established (i.e., before clock tree synthesis). Once the clock tree is built and marked as "propagated," this clock network delay constraint is ignored. The set_clock_latency command can also be used to model the delay from the main clock to its derived clocks.

Additionally, if the clock tree is explicit, i.e., clock buffers have been inserted:


The delay of the first cell needs to know its input transition, so it needs to be explicitly specified through set_drive, set_driving_cell, or set_input_transition; otherwise, it is assumed that its input transition is 0.

Additionally, the definition of clock source latency, i.e., insertion delay, is the delay from the clock source to the DUA clock definition point. This can be set using the command set_clock_latency -source.

Without the -source option, this command sets the clock network delay instead.

`I2R` Setup Check#

Set external input delay relative to virtual clock or actual clock: set_input_delay

The timing path from the input port to the register can be triggered by a virtual clock or an actual clock, as follows:

This clock can be considered as a virtual flip-flop driving the design input port INA, with the clock of this virtual flip-flop being VIRTUAL_CLKM. Additionally, the maximum delay from the clock pin of this virtual flip-flop to the input port INA is specified as 2.55ns, displayed in the report as input external delay.

Input delays can also be specified relative to the actual clock, and do not necessarily have to be specified relative to the virtual clock. The actual clock can be an internal pin in the design or a clock at the input port.

`R2O` Setup Check#

set_output_delay
set_load

Similar to the constraints for the input port mentioned above, output ports can also be constrained relative to a virtual clock or an internal clock in the design, or they can be constrained relative to an actual input clock port or output clock port.

To determine the delay of the last unit connected to the output port, the load at that port must be specified; the set_load command is used to specify the output load. Note that the port ROUT may have a load internally in DUA, while the set_load constraint specifies the additional load, i.e., the load from outside DUA.

Note that in the R2O path, the required time at the endpoint is calculated as $T_{period} - T_{output}$, where $T_{output}$ is the external delay given by set_output_delay.

`I2O` Setup Check#

The design can also have pure combinational logic paths from input ports to output ports.

Hold#

Hold time is the minimum time for which input data must remain stable after the active clock edge. It is measured as the interval from the active clock edge crossing its threshold to the earliest data signal crossing its threshold.

Hold time checks ensure that the changing output value of the launching flip-flop does not propagate to the capturing flip-flop and overwrite its output before the capturing flip-flop has had a chance to capture the original value.

Hold time violations are analyzed against the fastest launch path, requiring that the fastest signal arriving at the D pin must also remain stable for at least one hold time relative to the clock signal. Thus, the formula is:

$T_{launch} + T_{ck2q} + T_{dp} > T_{capture} + T_{hold}$

Required and Arrival in Hold Check#

The launch path delay is the arrival time and the capture side gives the required time. The hold check requires the arrival time to be later than the required time, so:

$T_{arrival} = T_{launch} + T_{ck2q} + T_{dp}$

$T_{require} = T_{capture} + T_{hold}$

$T_{arrival} > T_{require}$
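A minimal Python sketch of the hold check is given below. For hold, slack is arrival minus required, and the clock period does not appear because launch and capture use the same clock edge. The function name and the numbers (in ns) are hypothetical.

```python
def hold_slack(t_launch, t_ck2q, t_dp, t_capture, t_hold):
    """Hold slack = arrival time - required time; negative means a violation."""
    arrival = t_launch + t_ck2q + t_dp      # fastest (min-delay) launch path
    required = t_capture + t_hold
    return arrival - required


# Hypothetical short path with a small capture clock latency.
slack_ok = hold_slack(t_launch=0.0, t_ck2q=0.5, t_dp=0.25,
                      t_capture=0.5, t_hold=0.125)

# More capture latency (skew) on the same path turns it into a violation.
slack_bad = hold_slack(t_launch=0.0, t_ck2q=0.5, t_dp=0.25,
                       t_capture=1.0, t_hold=0.125)
print(slack_ok, slack_bad)
```

The second case shows why hold results depend strongly on the difference between launch and capture clock latency.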

Hold Time Check#

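Generally, hold time violations are analyzed after CTS, since before CTS the clock tree is ideal and the clock skew that hold checks are sensitive to is not yet modeled.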

Other Analysis Types#

Slew Analysis#

Two Types of Slew/Transition Analysis.

Data(max/min)
Clock(max/min)

Load Analysis#

Two Types of Load Analysis

Fanout(max/min)
Capacitance(max/min)

Clock Analysis#

Two Types of Clock Analysis

Skew: The difference between the latencies (L1, L2, L3, L4, etc.) is referred to as skew.
Pulse Width: Parasitic elements along the clock network path degrade the clock pulse, so this analysis checks how far the pulse width has degraded at each point.

Interconnect Parasitics#

Nets typically have a single driver and multiple loads. After physical implementation, a net can be routed across multiple metal layers on the chip, and different metal layers can have different resistance and capacitance values.

For an equivalent electrical representation, a net is typically divided into multiple segments, each represented by equivalent parasitic parameters. We also refer to segments as interconnect traces, which are the parts of the net on a specific metal layer.

Interconnect RLC#

Interconnect RC is caused by nets passing through different metal layers, including:

Interconnect resistance (R) comes from the interconnect traces in various metal layers and vias in the design implementation. We can view interconnect resistance as the resistance between the output pin of the unit and the input pin of the fan-out unit.
Interconnect capacitance (C) also comes from metal traces, including ground capacitance and capacitance between adjacent signal paths.
Interconnect inductance (L) is not considered.

Ideally, the resistance and capacitance (RC) of a portion of the interconnect trace are represented by a distributed RC tree.

Additionally, a simplified method can be used to model the RC tree.

T Model#

Π Model#

WLM#

Before physical implementation, wire load models (WLM) can be used to estimate the capacitance, resistance, and area overhead caused by interconnects. A wire load model estimates the length of a net from its fanout and then maps that estimated length to resistance, capacitance, and the routing area overhead. Wire load models depend on the area of the block, so designs with different areas can choose different wire load models.

Wire load models estimate wire length from fanout and derive the corresponding RC and area overhead.
Wire load models are selected by the area of the block. As the block area increases, the routing length also grows.

For different areas (chips or blocks), different wire load models are typically used to determine parasitic effects.
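The fanout-to-parasitics mapping can be sketched in Python. The structure below is loosely modeled on a fanout/length table with per-unit-length resistance, capacitance, and area, plus a slope for extrapolating beyond the table; the field names and every value are hypothetical, and the table is assumed to list every fanout from 1 up to its maximum.

```python
def wlm_estimate(model, fanout):
    """Estimate net length, R, C, and area from fanout using a wire load model."""
    table = model["fanout_length"]
    max_fanout = max(table)
    if fanout <= max_fanout:
        length = table[fanout]
    else:
        # Beyond the table, extend the last entry linearly with the slope.
        length = table[max_fanout] + model["slope"] * (fanout - max_fanout)
    return {
        "length": length,
        "resistance": length * model["resistance"],
        "capacitance": length * model["capacitance"],
        "area": length * model["area"],
    }


# Hypothetical model for a small block (per-unit-length R, C, and area).
small_block = {
    "resistance": 0.5,
    "capacitance": 0.25,
    "area": 0.125,
    "slope": 1.5,
    "fanout_length": {1: 1.0, 2: 2.0, 3: 3.5, 4: 5.0},
}

in_table = wlm_estimate(small_block, 3)
extrapolated = wlm_estimate(small_block, 6)
print(in_table, extrapolated)
```

A model for a larger block would typically carry longer lengths for the same fanout, which is how the block area enters the estimate.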

Specifying Wire Load Models (TBD)#

todo

Wire load models can be specified using the command set_wire_load_model, and the wire load mode can be specified using set_wire_load_mode.

Interconnect Trees (TBD)#

todo

What is the difference between T/Pi model and RC tree?

The interconnect delay from the driver pin to the load pin depends not only on the RC values but also on the structure of the interconnect.