Electronics Design AU

I2C bus completely dead after unplugging a sensor live — SDA stuck low, no NACK, nothing

Original Question

Asked by fresh_grad_fern

We've got a shared I2C bus with four sensors on it (temperature, humidity, an IMU, and a current sense IC), all on the same bus off one MCU. A technician on the test bench unplugged one of the sensor connectors while the board was still powered — not something that's supposed to happen in the field, but it happened during testing and now I want to understand what actually went wrong.

After the unplug, the entire bus is dead. Not just the sensor that got unplugged — every device on the bus stops responding, including the three still connected. I2C transactions to any address just time out. I put a scope on it and SDA is sitting flat low, all the time, not toggling at all. SCL still looks fine when the MCU tries to clock out a transaction, so the master side seems to be behaving.

Only fix I've found so far is power-cycling the whole board, which obviously isn't something we can do in a deployed product every time this happens. What's actually going on, and is there a way to recover this without a full power cycle?


3 Replies

i2c_inspector
Accepted Answer

SDA stuck permanently low with SCL toggling normally is the textbook signature of a slave that got interrupted mid-transaction while it was driving SDA low — and hot-unplugging a device while it's mid-ACK or mid-data-bit is exactly the kind of event that causes it. This is a different failure from what you'd see with a wrong address or missing pull-ups — see our other I2C thread for that one, where you get a clean NACK on every address. What you're describing is the bus itself being held hostage, not a communication failure with a specific device.

What happened, mechanically: I2C is open-drain. Any device on the bus can pull SDA low, and the line only goes high again when every device has released it. Pulling a connector mid-transfer interrupts the transaction at an arbitrary point: contacts break in an unpredictable order and bounce, so devices can see truncated or spurious clock edges, or lose their supply while the signal pins are still mated. A slave that was holding SDA low at that moment (acknowledging a byte, or driving a 0 data bit) never sees the clock edges it needs to finish the transaction and release the line. Note that whatever is holding SDA low must still be electrically on the bus; a sensor that is fully disconnected cannot pull anything down. So the culprit is either one of the devices still attached, or the unplugged connector is only partly unmated. Connector removal is just one of several triggers: a brownout on a sensor's supply rail mid-transfer, or an MCU reset in the middle of a read, produces an identical symptom. Every other device on the bus is now stuck too, because none of them can pull SDA high themselves; they can only release it and let the pull-up do that, and the wedged device is actively holding it down.
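The wired-AND behaviour is easy to model. The following is only an illustrative sketch (the function name and the array-of-flags representation are made up for this example, not taken from any driver): the line reads high only when no device is pulling it low.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of an open-drain line. Each entry says whether that device
 * is currently pulling the line low. Names here are illustrative only. */
bool open_drain_level(const bool pulls_low[], size_t n_devices)
{
    for (size_t i = 0; i < n_devices; i++) {
        if (pulls_low[i]) {
            return false;   /* a single device holding low wins */
        }
    }
    return true;            /* everyone released: the pull-up takes the line high */
}
```

With four devices and one of them wedged, the result is low no matter what the other three do, which is exactly the "whole bus dead" symptom.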

Why a power cycle "fixes" it: removing power resets every device's internal state machine, including the wedged one, so it releases SDA. But you're right that this isn't viable in the field, and it's not necessary — the I2C specification has a defined bus recovery procedure for exactly this situation (NXP UM10204, section 3.1.16, "Bus clear").

The recovery sequence:

  1. Configure SCL as an open-drain GPIO output (temporarily bypassing the I2C peripheral).
  2. Configure SDA as a GPIO input (just read it, don't drive it).
  3. Toggle SCL up to 9 times. A wedged slave is typically partway through clocking out a byte to the master, or holding an ACK, which is why it's stuck holding SDA. Toggling SCL gives it the clock edges it's still waiting for; nine pulses cover the eight data bits plus the acknowledge slot, where the master leaves SDA high (a NACK) and the slave lets go of the bus.
  4. After each clock pulse, read SDA. As soon as SDA reads high, stop — the slave has released the bus.
  5. Issue a STOP condition manually (SDA low-to-high transition while SCL is high) to leave the bus in a clean idle state.
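  6. Re-configure both pins back to the I2C peripheral's alternate function and resume normal operation.

#include <stdbool.h>

/* Returns true if SDA was released, false if it is still stuck low. */
bool i2c_bus_recover(void) {
    gpio_set_mode(SCL_PIN, GPIO_OUTPUT_OPEN_DRAIN);
    gpio_write(SCL_PIN, 1);              /* start with SCL released (high) */
    gpio_set_mode(SDA_PIN, GPIO_INPUT);
    delay_us(5);

    /* Up to 9 clock pulses; stop as soon as the slave releases SDA */
    for (int i = 0; i < 9 && !gpio_read(SDA_PIN); i++) {
        gpio_write(SCL_PIN, 0);
        delay_us(5);
        gpio_write(SCL_PIN, 1);
        delay_us(5);
    }

    bool released = gpio_read(SDA_PIN);
    if (released) {
        /* SDA and SCL are both high here. Pulling SDA low is a START,
           which should reset slave bus logic; releasing it is the STOP. */
        gpio_set_mode(SDA_PIN, GPIO_OUTPUT_OPEN_DRAIN);
        gpio_write(SDA_PIN, 0);
        delay_us(5);
        gpio_write(SDA_PIN, 1);          /* STOP: SDA low->high while SCL is high */
        delay_us(5);
    }

    /* Hand both pins back to the I2C peripheral either way */
    gpio_set_mode(SCL_PIN, GPIO_ALTERNATE_I2C);
    gpio_set_mode(SDA_PIN, GPIO_ALTERNATE_I2C);
    return released;
}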
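To see why nine pulses is enough in step 3, here is a small host-side model of a slave stuck partway through sending a byte. It is a deliberately simplified sketch: `sim_slave_t` and the `sim_*` functions are invented for illustration, and a real device's state machine may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model: a slave interrupted while shifting a byte out, MSB first. */
typedef struct {
    uint8_t byte;       /* byte being sent to the master */
    int     bits_left;  /* data bits not yet clocked out (0..8) */
} sim_slave_t;

/* SDA as the master sees it while only listening (never driving). */
bool sim_sda(const sim_slave_t *s)
{
    if (s->bits_left == 0) {
        return true;    /* acknowledge slot: slave releases SDA, pull-up wins */
    }
    return ((s->byte >> (s->bits_left - 1)) & 1u) != 0;
}

/* One full SCL pulse moves the slave on to its next bit. */
void sim_clock(sim_slave_t *s)
{
    if (s->bits_left > 0) {
        s->bits_left--;
    }
}

/* Pulses needed until SDA reads high, or -1 if nine were not enough. */
int sim_recover(sim_slave_t *s)
{
    for (int pulses = 0; pulses <= 9; pulses++) {
        if (sim_sda(s)) {
            return pulses;
        }
        sim_clock(s);
    }
    return -1;
}
```

In this model the worst case is a slave that has all eight bits of 0x00 still to send: SDA only comes back up once the eighth pulse moves it into the acknowledge slot. A slave whose next bit is a 1 releases SDA immediately, which is why the routine checks SDA before every pulse instead of blindly sending nine.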

Run this whenever you detect the bus is stuck (a transaction timeout is the usual trigger) rather than only at boot; that way it recovers live, without a reset. Before rolling your own bit-bang version, check what your vendor's driver or RTOS already provides: some I2C stacks ship a bus-recovery routine (Zephyr's i2c_recover_bus() and the Linux I2C core's generic SCL recovery are two examples), which saves you the pin-mode juggling. Don't confuse that with a plain peripheral reset, though. Resetting the master's I2C block only clears the master's own state machine; it does nothing for a slave that is still holding SDA low.
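One way to wire the timeout trigger in is a thin wrapper around every transfer. This is a sketch under assumptions: `i2c_status_t`, `i2c_ops_t` and the `sim_*` stand-ins are hypothetical names so the logic can be exercised on a PC; on target, the `transfer` hook would be your driver's timeout-bounded call and the `bus_recover` hook would call i2c_bus_recover().

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical status codes and driver hooks; substitute your HAL's own. */
typedef enum { I2C_OK = 0, I2C_TIMEOUT, I2C_NACK } i2c_status_t;

typedef struct {
    i2c_status_t (*transfer)(uint8_t addr, uint8_t *buf, size_t len);
    void (*bus_recover)(void);
} i2c_ops_t;

/* Try a transfer; on timeout, run bus recovery once and retry once. */
i2c_status_t i2c_transfer_with_recovery(const i2c_ops_t *ops, uint8_t addr,
                                        uint8_t *buf, size_t len)
{
    i2c_status_t st = ops->transfer(addr, buf, len);
    if (st != I2C_TIMEOUT) {
        return st;      /* success, or an error recovery won't fix (e.g. NACK) */
    }
    ops->bus_recover();
    return ops->transfer(addr, buf, len);
}

/* --- Host-side stand-ins, for trying the logic off-target only --- */
bool sim_sda_stuck = false;
int  sim_recover_calls = 0;

i2c_status_t sim_transfer(uint8_t addr, uint8_t *buf, size_t len)
{
    (void)addr;
    if (sim_sda_stuck) {
        return I2C_TIMEOUT;
    }
    for (size_t i = 0; i < len; i++) {
        buf[i] = 0xA5;  /* dummy payload */
    }
    return I2C_OK;
}

void sim_bus_recover(void)
{
    sim_recover_calls++;
    sim_sda_stuck = false;
}

const i2c_ops_t sim_ops = { sim_transfer, sim_bus_recover };
```

Limiting it to a single retry is a deliberate choice in this sketch: if the bus is still stuck after one recovery attempt, looping is unlikely to help and the caller should see the timeout.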

register_jockey

Worth being precise about the pin mode switch, because getting it wrong bricks the recovery routine itself. SCL needs to be a genuine open-drain output during the bit-bang, not push-pull. If you drive it push-pull while a slave is holding SCL low at the same time (clock stretching, for example), the two outputs fight each other electrically instead of cooperating with the bus. The same reasoning is why SDA stays an input for the clock-toggle phase: you're reading whether the wedged device has released it, not driving it. Only switch SDA to output for the manual STOP at the very end.

Also — confirm your pull-up resistors are still doing their job during this. If you're running with weak or marginal pull-ups (borderline for your bus capacitance already), the recovery sequence still needs SDA to actually rise on its own once every device releases it. A logic analyser capture of the recovery sequence the first time you implement it is worth the ten minutes — you want to see SDA's rising edge land where you expect it, not lag behind SCL toggling for several microseconds.
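For a quick sanity check before you reach for the analyser, the I2C specification (UM10204) relates rise time to pull-up and bus capacitance as t_r = 0.8473 x Rp x Cb, measured between 30% and 70% of VDD. A minimal sketch, with a made-up function name and example values that are assumptions, not measurements from this board:

```c
/* 30%-to-70% rise time of an RC-limited open-drain line.
 * 0.8473 is ln(0.7 / 0.3); ohms * picofarads gives units of 1e-12 s,
 * so multiply by 1e-3 to express the result in nanoseconds. */
double i2c_rise_time_ns(double pullup_ohms, double bus_capacitance_pf)
{
    return 0.8473 * pullup_ohms * bus_capacitance_pf * 1e-3;
}
```

As an example, 4.7 kOhm pull-ups on a 200 pF bus give roughly 796 ns. That fits inside the 1000 ns standard-mode limit but is well over the 300 ns allowed in fast mode, so the same resistor that works at 100 kHz is marginal or worse at 400 kHz.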

watchdog_wendy

Since you've already had this happen once during bench testing, I'd build the recovery routine in as standard practice rather than treating it as an exception handler you hope never fires. Two things worth adding on top of the recovery sequence itself: a transaction timeout on every I2C call (don't let a read/write block indefinitely waiting on a bus that's never coming back on its own), and a periodic bus-health check if this is a long-running unattended product — a cheap read of a known-good register on one device, on a timer, that triggers the recovery sequence automatically if it ever times out. That turns "the bus randomly stops working and needs a power cycle" into "the bus self-heals within one polling interval," which is a very different support conversation to have with a customer. For a hot-swap connector specifically, I'd also look at whether the connector genuinely needs to be live-pluggable at all — if it doesn't, adding an interlock or requiring power-off before disconnection removes the failure mode at the source instead of just recovering from it gracefully.
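The periodic health check could look something like the sketch below. Everything here is hypothetical naming: `probe` stands for a short, timeout-bounded read of a known-good register, `recover` for the bus-clear routine, and the `demo_*` stand-ins exist only so the logic can be run on a PC.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hooks plus counters worth exposing in diagnostics. */
typedef struct {
    bool (*probe)(void);      /* true if the known-good register read worked */
    void (*recover)(void);    /* runs the bus recovery sequence */
    uint32_t recoveries;      /* times recovery brought the bus back */
    uint32_t hard_failures;   /* times recovery ran and the probe still failed */
} bus_health_t;

typedef enum { BUS_HEALTHY, BUS_RECOVERED, BUS_FAILED } bus_health_result_t;

/* Call from a periodic timer or the main loop, once per polling interval. */
bus_health_result_t bus_health_tick(bus_health_t *h)
{
    if (h->probe()) {
        return BUS_HEALTHY;
    }
    h->recover();
    if (h->probe()) {
        h->recoveries++;
        return BUS_RECOVERED;
    }
    h->hard_failures++;
    return BUS_FAILED;        /* escalate: log it, or let a watchdog take over */
}

/* --- Host-side stand-ins, for trying the logic off-target only --- */
bool demo_bus_stuck = false;
bool demo_recover_works = true;

bool demo_probe(void) { return !demo_bus_stuck; }

void demo_recover(void)
{
    if (demo_recover_works) {
        demo_bus_stuck = false;
    }
}
```

Keeping the two counters separate is useful in the field: a slowly climbing `recoveries` count tells you the bus is self-healing from something that still deserves a root cause, while any `hard_failures` mean the recovery sequence alone is not enough.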
