8 MHz crystal starts fine on most boards, dead silent on a few — same BOM, same layout

Original Question

Asked by drc_nightmare

Small batch of 15 boards, same Gerbers, same BOM, same reflow run. 8 MHz crystal (ECS-80-20-4X-TR, 20 pF load, ±30 ppm) with two 22 pF load caps into a Cortex-M0+ MCU. 13 of the 15 boards clock up fine — I can see a clean 8 MHz on the scope at the OSC_OUT pin. 2 of them are completely silent. No oscillation at all, MCU sits stuck (falls back to internal RC where that's an option, otherwise just doesn't boot).

Swapped the crystal on one of the dead boards with a known-good one pulled from a working board — still dead. Swapped the MCU instead, on the same dead board — it started working. So it seems to be MCU-dependent, not crystal-dependent, but I don't have an explanation for why. Same reel of crystals, same cap reel, same layout. Scoped the OSC_IN/OSC_OUT traces on a dead board with the "known bad" MCU installed and there's nothing, not even a decaying transient — like the oscillator amplifier just isn't providing enough gain to get things going.

Only other thing I've noticed: both dead boards are also the two coldest-feeling ones on the bench, sitting closest to the aircon vent, if that's even relevant. Feels like it shouldn't be but I'm out of other ideas.

From the knowledge base: How Do You Select a Crystal Oscillator for an Embedded System?

3 Replies

clock_tree_chaos
Accepted Answer

That MCU-swap result is the key clue, and the temperature observation isn't a coincidence — you're describing a classic negative resistance / gain margin problem, and both of those symptoms point the same direction.

A Pierce oscillator (which is what nearly every MCU's internal crystal amplifier is) only starts reliably if the magnitude of the amplifier's negative resistance exceeds the crystal's ESR by a healthy margin. The commonly cited rule of thumb from crystal manufacturers (Abracon, ECS, and others publish this in their oscillator application notes) is a safety factor of at least 5×: |R_negative| ≥ 5 × ESR_max. Below roughly 3×, the margin gets thin and startup becomes probabilistic rather than guaranteed, which matches what you're seeing: most units start, some don't, and the outcome depends on which factors happen to be stacked against a given board.
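If you want to put rough numbers on that margin, one widely used paper check (it appears in ST's AN2867 oscillator design guide) compares the amplifier's gm against a critical value, gm_crit = 4 × ESR × (2πf)² × (C0 + C_L)², and asks for gm / gm_crit of at least 5. A minimal Python sketch, assuming placeholder values: the ESR, shunt capacitance C0 and MCU gm below are illustrative guesses, not figures from the ECS or MCU datasheets, so substitute the real ones.

```python
import math

def gm_crit(esr_ohm, freq_hz, c0_f, cl_f):
    """Critical transconductance (A/V): the minimum gm needed to sustain oscillation."""
    omega = 2 * math.pi * freq_hz
    return 4 * esr_ohm * omega**2 * (c0_f + cl_f) ** 2

def gain_margin(gm_a_per_v, esr_ohm, freq_hz, c0_f, cl_f):
    """Ratio of available gm to critical gm; the rule of thumb asks for >= 5."""
    return gm_a_per_v / gm_crit(esr_ohm, freq_hz, c0_f, cl_f)

# Placeholder values (assumptions, NOT datasheet figures)
ESR_MAX = 80.0      # ohms, assumed crystal ESR ceiling
C0 = 5e-12          # farads, assumed crystal shunt capacitance
CL = 20e-12         # farads, load capacitance from the original post
FREQ = 8e6          # hertz, from the original post
GM_MCU = 3.0e-3     # A/V, assumed MCU oscillator transconductance

margin = gain_margin(GM_MCU, ESR_MAX, FREQ, C0, CL)
print(f"gm_crit = {gm_crit(ESR_MAX, FREQ, C0, CL) * 1e3:.3f} mA/V")
print(f"gain margin = {margin:.1f}x ({'ok' if margin >= 5 else 'below 5x'})")
```

This is only a first-pass estimate. The bench equivalent is to add series resistance with the crystal until startup fails, then compare that resistance with the crystal's ESR.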

Three things stack in the same direction here, and any one of them being marginal on its own is usually fine — it's the combination that tips a specific board over the edge:

1. MCU-to-MCU variation in oscillator amplifier transconductance (gm). The internal Pierce amplifier's gm varies across process corners, same as any other analog block on the die. Datasheets specify a minimum gm (or an equivalent negative resistance spec) — most parts run comfortably above that minimum, but by definition some fraction of any production lot sits closer to it. Your MCU swap fixing the problem on the same PCB, same crystal, same caps is strong evidence you landed a low-gm part on those two boards.

2. Crystal ESR is a spec'd maximum, not a fixed value. The ECS-80-20-4X-TR series ESR spec is a ceiling (check the exact figure against the datasheet for your specific part), and individual units from the same reel can sit anywhere under that ceiling. Higher-ESR units need more negative resistance margin to start reliably — pulling a "known good" crystal from a working board doesn't control for this, because that crystal was never tested against the low-gm MCU's margin, only against whatever MCU was already on the working board.

3. Temperature genuinely affects oscillator gain margin. This is real, not a coincidence: the internal amplifier's gm shifts with temperature (the direction and size depend on the MCU's oscillator design and supply voltage, so check the datasheet's worst-case corner), and crystal ESR varies with temperature too. Both are separate from the crystal's frequency-vs-temperature curve, which doesn't set the gain margin. A board that's marginal at room temperature can fail to start at all a few degrees cooler. This is exactly why crystal oscillator startup should be tested across the product's full operating temperature range, not just on the bench at 22°C.

None of these alone would necessarily explain a 2-in-15 failure rate. Together (a slightly-low-gm MCU, a slightly-high-ESR crystal, sitting a few degrees cooler than the rest of the batch) they're a believable combination for the boundary-case behaviour you're describing. See the crystal oscillator selection guide for the load capacitance side of this: more capacitance on the crystal pins than the design assumed raises the gm the amplifier needs in order to start, so it compounds with everything above.
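To see how the first two factors stack, the common gain-margin estimate gm / (4 × ESR × (2πf)² × (C0 + C_L)²) can be evaluated at each corner. A sketch with hypothetical numbers: the typical and low gm values, the typical and maximum ESR values, and C0 are all made up for illustration, and temperature is left out because there is no datasheet figure to attach to it here.

```python
import math
from itertools import product

def gain_margin(gm, esr, freq=8e6, c0=5e-12, cl=20e-12):
    """Available gm divided by critical gm (c0 is an assumed shunt capacitance)."""
    omega = 2 * math.pi * freq
    return gm / (4 * esr * omega**2 * (c0 + cl) ** 2)

def classify(margin):
    """Map a margin onto the 5x / 3x rule-of-thumb bands."""
    if margin >= 5:
        return "ok"
    if margin >= 3:
        return "thin"
    return "unreliable"

# Hypothetical corners, for illustration only
GM = {"typ gm": 4.0e-3, "low gm": 2.0e-3}    # A/V
ESR = {"typ ESR": 40.0, "max ESR": 80.0}     # ohms

results = {}
for (g_name, gm), (e_name, esr) in product(GM.items(), ESR.items()):
    margin = gain_margin(gm, esr)
    results[(g_name, e_name)] = (margin, classify(margin))
    print(f"{g_name:7s} + {e_name:8s}: margin {margin:5.1f}x -> {classify(margin)}")
```

With these made-up inputs, each factor on its own stays above 5× and only the combined low-gm, max-ESR corner drops into the thin band, which is the shape of the argument above rather than a prediction for your parts.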

grumpy_otter7

Worth checking the boring layout stuff too before you conclude it's pure silicon lottery, because I've been burned by this exact "13 fine, 2 dead" pattern turning out to be a trace length difference I introduced without noticing. Are the OSC_IN/OSC_OUT trace lengths genuinely identical from board to board, or did the panelisation shift the crystal's position relative to the MCU on some board instances? (A multi-up panel with slightly different routing on repeated instances is more common than people expect.) Extra trace length adds parasitic capacitance in parallel with your load caps, which raises the effective load capacitance and with it the gm the amplifier needs in order to start. That's the same failure direction as everything above, just a manufacturing cause instead of a silicon one. Worth a five-minute look at the Gerbers for those two specific instances before you sink time chasing a die-level explanation.
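A rough way to see how much trace parasitics matter: treat the stray capacitance at each pin as sitting in parallel with that pin's load cap, then take the series combination of the two totals. A sketch under that simplified model (it ignores the crystal's own shunt capacitance and any pin-to-pin capacitance), with per-pin stray values that are assumptions rather than measurements:

```python
def effective_load(c1, c2, stray_in=0.0, stray_out=0.0):
    """Effective load capacitance (F) seen by the crystal in a Pierce circuit."""
    total_in = c1 + stray_in      # everything hanging on OSC_IN
    total_out = c2 + stray_out    # everything hanging on OSC_OUT
    return total_in * total_out / (total_in + total_out)

C1 = C2 = 22e-12  # load caps from the original post

for stray_pf in (2, 4, 6, 8):     # assumed per-pin stray capacitance, pF
    stray = stray_pf * 1e-12
    cl = effective_load(C1, C2, stray, stray)
    print(f"stray {stray_pf} pF per pin -> C_L = {cl * 1e12:.1f} pF")
```

Plugging in a stray estimate for a good board and for one of the dead instances shows whether the routing difference is big enough to matter.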

cap_derating_captain

Also check your load cap tolerance while you're at it — 22 pF C0G/NP0 at ±5% (common, cheap grade) gives you a wider spread than ±1% or ±2% parts, and unlike X7R/X5R power caps this isn't a DC bias story, it's plain manufacturing tolerance stacking with whatever margin you've already got. It's a smaller contributor than the gm/ESR/temperature combination above, but if you're already at the edge, tightening the load cap tolerance is the cheapest lever you have — a few cents difference per part, no layout change, no requalifying the crystal. Won't fix a genuinely low-gm part on its own, but it removes one more variable from the stack.
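To size that spread, the corner values of the two load caps can be run through the series-combination formula. A sketch assuming 4 pF of stray capacitance per pin in parallel with each cap (a guess, not a measured value):

```python
from itertools import product

def effective_load(total_in, total_out):
    """Series combination of the total capacitance on each oscillator pin (F)."""
    return total_in * total_out / (total_in + total_out)

def cl_spread(nominal, tol, stray):
    """Min and max effective load over all tolerance corners of the two caps."""
    corners = [nominal * (1 - tol) + stray, nominal * (1 + tol) + stray]
    values = [effective_load(a, b) for a, b in product(corners, repeat=2)]
    return min(values), max(values)

NOMINAL = 22e-12   # load cap value from the original post
STRAY = 4e-12      # assumed per-pin stray capacitance

for tol in (0.05, 0.02, 0.01):
    lo, hi = cl_spread(NOMINAL, tol, STRAY)
    print(f"±{tol * 100:.0f}%: C_L from {lo * 1e12:.2f} to {hi * 1e12:.2f} pF")
```

Because the required gm scales with (C0 + C_L)², a small shift in C_L has a somewhat larger relative effect on the margin than the raw tolerance suggests.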
