Electronics Design AU
Digital · Solved

FPGA design passes timing analysis and simulation but fails intermittently in hardware — is this a clock domain crossing issue?

8 min read · 3 replies
Original Question

Asked by fresh_grad_fern

I'm on my first real FPGA project at work and I'm getting intermittent data corruption that I can't explain. The design has two clock domains: a 100 MHz acquisition clock (ADC sampling) and a 50 MHz processing clock where the main logic runs. I'm passing a 16-bit sample counter from the fast domain to the slow domain — the slow domain reads it whenever it needs to know how many samples have been collected.

Synthesis completes cleanly. Post-implementation timing analysis shows no violations — setup slack is positive on every path. Behavioural simulation passes, including a version with timing annotations.

But in hardware, every few minutes I get a counter value that's clearly wrong — like jumping from 2,391 to 49,152 in one read, which is impossible since it's a saturating counter that can't exceed 32,768. It's happened 23 times in 6 hours of testing. The pattern seems random.

I read through the synchronous vs asynchronous logic article which mentioned metastability and dual-flop synchronisers, but it says timing analysis should catch setup/hold violations. So if timing passes, how is this happening?

3 Replies

fpga_philosopher
Accepted Answer

What you're describing is a textbook clock domain crossing (CDC) data corruption. The timing report passing is not the reassurance it appears to be — and understanding why is one of those things that fundamentally changes how you think about digital design.

Why timing analysis doesn't catch CDC violations

Static timing analysis verifies setup and hold times within a single clock domain, and on paths where the tool has a defined relationship between the launch clock and the capture clock. When you register a signal in the 100 MHz domain and read it in the 50 MHz domain without a synchroniser, the two clock edges have no guaranteed phase relationship in hardware (assuming the clocks come from independent sources), and the tool cannot check what it cannot model. What the report shows then depends on the constraints. If an exception (set_clock_groups, set_false_path, or a multicycle path exception) covers the crossing, the tool drops or relaxes the check on that path. If nothing declares the clocks asynchronous, the tool may time the path against an assumed phase relationship that the hardware does not honour. Either way the timing report shows no violations, and the FPGA ships with a latent defect.

Simulation doesn't catch it either. Behavioural RTL simulation has no model of flip-flop metastability — it treats every clock edge as instantaneous and every signal as resolving cleanly. Even timing-annotated simulation with SDF back-annotation typically doesn't model metastability unless you specifically construct a testbench that injects hold violations on the crossing path.

What's actually happening in hardware

Each 50 MHz capture edge can arrive at any arbitrary phase relative to the 100 MHz counter's update. When a capture edge arrives while a counter bit is still transitioning, the capturing flip-flop's setup or hold time is violated. A flip-flop whose setup or hold time is violated can enter a metastable state: its output eventually settles to a valid logic level, but the time it takes to settle is probabilistic. Usually it settles well within one clock period, but occasionally (depending on supply voltage, temperature, and the specific FPGA cell) it takes longer, and the downstream logic sees a glitch or an intermediate voltage that different flip-flops may interpret as 0 or 1 unpredictably.

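Here is the critical part: for a multi-bit value like your 16-bit counter, several bits can change on the same increment, and each bit reaches its capturing flip-flop at a slightly different time. If you capture the counter without synchronisation during a transition, some bits capture the old value and some capture the new value. The result is a number that is neither the old count nor the new count; it's a spurious combination. How far off it can be depends on how many bits were changing. A tear between 2,391 (0000_1001_0101_0111) and 2,392 only touches the low four bits, so it can only produce values from 2,384 to 2,399. A read of 49,152 (1100_0000_0000_0000) is what a tear looks like when every bit flips at once, as in the step from 32,767 (0111_1111_1111_1111) to your saturation value 32,768 (1000_0000_0000_0000): bit 15 captured the new value, bit 14 the old one, and the remaining bits the new one. That is consistent with the counter having run on to saturation between two slow-domain reads, with the bad read landing on that final increment.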
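A torn read is easy to model in software. The sketch below is only an illustration, not a hardware model: it assumes each changing bit independently lands on either its old or its new state, and the function name torn_reads is hypothetical. It enumerates every value a capture could see during one counter step.

```python
def torn_reads(old, new, width=16):
    """Return every value a capture could see if each bit that differs
    between old and new independently lands on its old or new state."""
    diff = old ^ new
    changing = [1 << b for b in range(width) if diff & (1 << b)]
    seen = set()
    for mask in range(1 << len(changing)):
        value = old
        for i, bit in enumerate(changing):
            if (mask >> i) & 1:
                value ^= bit  # this bit was captured after it changed
        seen.add(value)
    return seen


# A step that only changes the low four bits stays close to the true count.
near = torn_reads(2391, 2392)

# The step into saturation flips all 16 bits, so any value is possible.
wide = torn_reads(0x7FFF, 0x8000)
```

Under this model, min(near) and max(near) bracket the true count tightly, while 49,152 is a member of wide. The more bits change in one step, the further a torn read can land from either real value.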

The fix: it depends on what you're crossing

For a single-bit control signal (a flag, an enable, a done signal): a two-flop synchroniser in the destination clock domain is the standard solution. The first flip-flop may go metastable, but its output is only sampled by the second flip-flop one slow-clock cycle later — by then, metastability has resolved with overwhelming probability. The second flip-flop presents a clean signal to the rest of the destination-domain logic.
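The "overwhelming probability" can be put into rough numbers with the usual first-order synchroniser model, MTBF = exp(t_resolve / tau) / (T_w * f_clk * f_data). The sketch below assumes that model; the values for tau, T_w, the data toggle rate, and the available settling times are made up for illustration and do not come from any datasheet.

```python
import math


def synchroniser_mtbf(t_resolve, tau, t_window, f_clk, f_data):
    """Mean time between synchronisation failures, in seconds.

    t_resolve: time the first flip-flop has to settle before it is sampled
    tau:       metastability resolution time constant of the flip-flop
    t_window:  width of the vulnerable window around the clock edge
    f_clk:     capture (destination) clock frequency in Hz
    f_data:    rate at which the crossing signal toggles, in Hz
    """
    return math.exp(t_resolve / tau) / (t_window * f_clk * f_data)


# Illustrative constants only (assumptions, not vendor figures).
TAU = 50e-12        # 50 ps
T_WINDOW = 100e-12  # 100 ps
F_CLK = 50e6        # 50 MHz destination clock
F_DATA = 1e6        # assumed toggle rate of the crossing signal

# Direct connection: assume only 0.5 ns of slack before downstream logic samples.
mtbf_direct = synchroniser_mtbf(0.5e-9, TAU, T_WINDOW, F_CLK, F_DATA)

# Two-flop synchroniser: assume about 18 ns of the 20 ns period is available.
mtbf_two_flop = synchroniser_mtbf(18e-9, TAU, T_WINDOW, F_CLK, F_DATA)
```

The point is the exponent: every extra tau of settling time multiplies the MTBF by e, so giving the first flip-flop most of a 50 MHz clock period changes the failure rate by many orders of magnitude.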

For a multi-bit value like your counter: do not synchronise each bit independently. The bits will resolve from metastability at different times and you will still get corrupted intermediate values. Two correct approaches:

  1. Gray code encoding — encode the counter in Gray code (adjacent values differ by exactly one bit) before crossing the domain. If a capture happens mid-transition, only one bit is ambiguous; the error is at most ±1 count. This works well for monotonically incrementing counters and is how async FIFO pointers are typically handled. See the FPGA fundamentals article for context on how FPGA routing affects setup/hold margins.

  2. Asynchronous FIFO — use a FIFO primitive with separate read and write clock ports (Xilinx FIFO18, Altera DCFIFO, or a hand-rolled async FIFO with Gray-coded read/write pointers). The FIFO encapsulates all the CDC complexity and presents a clean, synchronised output to the read-clock domain. For data transfer as opposed to just pointer/count synchronisation, this is the robust solution. Clifford Cummings' paper "Simulation and Synthesis Techniques for Asynchronous FIFO Design" (SNUG 2002) is the canonical reference if you want to understand how to implement it correctly from scratch.

For your specific case, a 16-bit sample counter, Gray code encoding is probably the simplest fix: encode the counter in Gray code and register the encoded value in the write domain (so that no combinational glitches cross the boundary), add a two-flop synchroniser per bit in the read domain, then decode from Gray back to binary. If precise count accuracy matters (not ±1), use a full async FIFO or a handshaking protocol with an acknowledge signal.
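The encode and decode steps are short enough to check in software before writing any RTL. The following is a sketch with hypothetical function names; it models only the encoding, not the synchroniser flip-flops, and it assumes the counter only ever steps by one.

```python
def bin_to_gray(n):
    """Binary to reflected Gray code: adjacent values differ in one bit."""
    return n ^ (n >> 1)


def gray_to_bin(g):
    """Gray code back to binary by XOR-ing together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def possible_captures(old, new, width=16):
    """Every value a capture could see if each differing bit independently
    lands on its old or its new state."""
    diff = old ^ new
    changing = [1 << b for b in range(width) if diff & (1 << b)]
    seen = set()
    for mask in range(1 << len(changing)):
        value = old
        for i, bit in enumerate(changing):
            if (mask >> i) & 1:
                value ^= bit
        seen.add(value)
    return seen


def max_decode_error(limit=32768):
    """Worst-case difference between the decoded capture and the old count,
    over every single-step increment from 0 up to limit."""
    worst = 0
    for count in range(limit):
        captures = possible_captures(bin_to_gray(count), bin_to_gray(count + 1))
        for value in captures:
            worst = max(worst, abs(gray_to_bin(value) - count))
    return worst
```

Because each increment changes exactly one Gray bit, possible_captures only ever returns the old or the new code word, so max_decode_error() stays at one count. The same check on the plain binary counter would show errors of thousands of counts at steps such as 32,767 to 32,768.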

timing_budget_tamara

One thing worth adding on the tool side: Vivado has a dedicated CDC analysis report that's separate from the timing report, and it's your fastest path to confirming what fpga_philosopher described.

In Vivado, after implementation, go to Reports → Timing → Report CDC (or run report_cdc -details in the Tcl console). This runs a structural analysis of inter-domain crossings and categorises them by severity:

  • Info — the crossing has a recognised synchronisation structure (two-flop synchroniser, FIFO, etc.) and is considered safe
  • Warning — the path crosses domains but the tool is uncertain about the synchronisation
  • Critical — the path crosses domains with no synchroniser detected at all

If your 16-bit counter is connecting directly from the 100 MHz domain to 50 MHz logic without a synchroniser, you'll see it as a critical CDC violation in this report even though the timing report shows no issues.

One subtlety: if someone earlier added set_false_path -from [get_clocks clk_100] -to [get_clocks clk_50] or set_clock_groups -asynchronous -group clk_100 -group clk_50 in the XDC constraints file, that correctly suppresses the timing violation (which you should suppress on a truly asynchronous crossing, since there's no valid setup/hold relationship to check) — but it does not insert a synchroniser in the design. The constraint is telling the tool "I promise I've handled this safely," but it doesn't verify that you actually have. Report CDC will still flag the unprotected path.

Check the XDC file for any false-path or clock-group constraints that might be masking the issue, and then run Report CDC to see what the tool actually thinks about the crossing.

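Quartus equivalent: in the Timing Analyzer (formerly TimeQuest), Report Metastability (report_metastability in the Tcl console) lists the synchroniser chains the tool has recognised, and Report Clock Transfers (report_clock_transfers) shows which clock pairs have paths between them. Menu locations vary between Quartus versions, so check the documentation for the one you are running.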

glitch_getter

Before you touch the RTL, you have one more option if you want hardware proof rather than inference: add an ILA.

Xilinx's Integrated Logic Analyser (ILA) IP lets you sample signals inside the FPGA fabric on a trigger condition. You can add one to your design, wire up both the fast-domain counter output and the slow-domain captured value, and set the trigger to fire when the slow-domain value exceeds your known maximum (32,768 in your case). When the corruption happens, the ILA captures the waveforms and uploads them to Vivado's waveform viewer — you'll be able to see exactly what happened in the cycles around the bad read: the fast counter was at value X, the slow-domain flip-flop captured some bits from X-1 and some from X, and the output jumped to a value that's neither.

It takes maybe 20 minutes to add an ILA in Vivado (instantiate the IP, connect the probe signals, set the trigger), and it gives you hardware-confirmed evidence of the crossing rather than inferred evidence from the symptom. Useful if you later need to convince someone else on the team that CDC is the actual root cause.

One thing to keep in mind: each ILA core has a single clk input and samples all of its probes on that clock. Probing the 100 MHz counter with an ILA clocked at 50 MHz would itself be an unsynchronised crossing, so if you're probing both clock domains, use a separate ILA core for each domain, clocked by that domain's own clock. Vivado supports multiple ILA cores in one design; you just need to connect each core's clk port to the correct domain clock.

The synchronous vs asynchronous logic article has a good diagram of the dual-flop synchroniser setup and the metastability window concept. It's worth reading alongside this thread before you implement the Gray code fix or async FIFO.
