The Dual Write Problem: A Comprehensive Guide

The dual write problem is one of the most fundamental and tricky challenges in distributed systems engineering. It arises whenever a system needs to write data to two or more independent data stores (or systems) atomically, and the inherent impossibility of doing so reliably without careful design.

Introduction

As applications evolve from monolithic architectures to distributed microservices, maintaining data consistency becomes increasingly challenging.

One of the most common and dangerous consistency issues is the Dual Write Problem.

The Dual Write Problem occurs when an application needs to update two separate systems as part of a single business operation, but those updates cannot be executed atomically. If one write succeeds and the other fails, the system enters an inconsistent state.

What Is the Dual Write Problem?

Dual write occurs when an application writes the same piece of data to two separate systems — for example, a relational database and a message broker (Kafka), a search index (Elasticsearch), a cache (Redis), or another microservice’s database — as part of a single logical operation.

The core problem: there is no single transaction boundary spanning both systems, so a failure between the two writes leaves the systems in an inconsistent state.

For example, if the database update succeeds but Kafka write fails, the system ends up in an inconsistent state.

1. Save order to SQL  ✅
2. Publish "order.created" event to Kafka  ❌ (broker down)

1. Save order to SQL  ✅
2. Publish "order.created" event to Kafka  ❌ (broker down)

Python

Now the database has the order, but downstream services (inventory, email, and billing) have never received the event. The systems are permanently out of sync unless you build explicit compensation logic.

Why Does the Dual Write Problem Exist?

The root cause is the absence of distributed atomicity. Traditional ACID transactions work within a single database. Once you cross the boundary of two independent systems, you lose the “all-or-nothing” guarantee.

MySQL      <-- Transaction #1
Kafka      <-- Transaction #2

MySQL      <-- Transaction #1
Kafka      <-- Transaction #2

Python

There is no single transaction covering both.

Why does this happen?

Distributed systems experience failures regularly:

Network outages
Service crashes
Kafka unavailability
Database failures
Timeouts
Process termination

Whenever one write succeeds and the other fails, data becomes inconsistent.

Example

Let’s see this in detail with a classic e-commerce order system

Scenario

A customer places an order on an online shopping site. Your application needs to do two writes:

Save the order in the database.
Send an event/message to a queue (Kafka, RabbitMQ, etc.) so other services can:
- Reduce inventory
- Send confirmation email
- Update analytics

Normal Case

save_order_to_db(order)
publish_order_created_event(order)

save_order_to_db(order)
publish_order_created_event(order)

Python

Everything works.

Problem Case #1

Database write succeeds, but Kafka fails

Result:

Order exists in DB.
The inventory service never gets the event.
Stock is not reduced.
The email is not sent.

The system becomes inconsistent.

Problem Case #2

Kafka writes successfully, but the DB failed

Result:

The inventory service receives an event.
Stock gets reduced.
But order does not exist in the DB.

Now, other services believe an order exists when it doesn’t.

Why Database Transactions Don’t Solve It

Many developers assume a database transaction can solve the problem.

For example:

BEGIN; 
INSERT INTO orders (...); 
SEND EVENT TO KAFKA; COMMIT;

BEGIN; 
INSERT INTO orders (...); 
SEND EVENT TO KAFKA; COMMIT;

Python

Unfortunately, Kafka is not part of the database transaction.

The database can roll back its own changes.
It cannot roll back Kafka.
Similarly, Kafka cannot roll back database writes.

Because the two systems are independent, atomicity is lost.

Common but Incorrect Solutions

Retry After Failure

save_order() 
retry_event()

save_order() 
retry_event()

Python

Problems:

The application may crash before retry.
Retries can create duplicate events.
Long outages may exhaust retry limits.

Two-Phase Commit (2PC)

Two-Phase Commit attempts to coordinate multiple systems.

Advantages:

Strong consistency

Disadvantages:

Complex.
Slow.
Poor scalability.
Rarely used in modern microservices.

Most modern architectures avoid 2PC.

Reversing the order, first Kafka, then the database

Reversing the order just shifts the failure window. If the broker write succeeds but the DB write fails, you now have an event with no corresponding record — equally bad, or worse, since consumers may start acting on data that doesn’t exist.

Proven Solutions

Outbox Pattern (Transactional Outbox)

This is the gold-standard solution for the dual-write problem in event-driven architectures. Instead of writing directly to Kafka, the application writes both the business data and event information into the same database transaction.

How It Works

Instead of writing to two external systems, you write to two tables in the same database within a single transaction:

BEGIN;
  INSERT INTO orders (id, user_id, amount, status)
  VALUES (123, 42, 99.99, 'CREATED');

  INSERT INTO outbox (aggregate_id, event_type, payload, created_at)
  VALUES (123, 'order.created', '{"id":123,...}', NOW());
COMMIT;

BEGIN;
  INSERT INTO orders (id, user_id, amount, status)
  VALUES (123, 42, 99.99, 'CREATED');

  INSERT INTO outbox (aggregate_id, event_type, payload, created_at)
  VALUES (123, 'order.created', '{"id":123,...}', NOW());
COMMIT;

Python

A separate Outbox Relay / Message Relay process reads from the outbox table and publishes events to the message broker. Once published, the outbox record is marked as processed (or deleted).

┌─────────────────────────────┐
│        Application          │
│  ┌──────────┐  ┌─────────┐  │
│  │  orders  │  │ outbox  │  │ ← single DB transaction
│  └──────────┘  └────┬────┘  │
└──────────────────────┼──────┘
                       │
               ┌───────▼────────┐
               │  Message Relay  │ ← reads outbox, publishes
               └───────┬────────┘
                       │
               ┌───────▼────────┐
               │     Kafka       │
               └────────────────┘

┌─────────────────────────────┐
│        Application          │
│  ┌──────────┐  ┌─────────┐  │
│  │  orders  │  │ outbox  │  │ ← single DB transaction
│  └──────────┘  └────┬────┘  │
└──────────────────────┼──────┘
                       │
               ┌───────▼────────┐
               │  Message Relay  │ ← reads outbox, publishes
               └───────┬────────┘
                       │
               ┌───────▼────────┐
               │     Kafka       │
               └────────────────┘

Python

Guarantees

At-least-once delivery (the relay may retry, consumers must be idempotent).
No data loss — if the relay crashes, it re-reads unprocessed outbox rows on restart.
Ordering — events can be published in the order they were inserted.

Variants of the Relay

Polling-based relay: Periodically queries WHERE processed = false. Simple but adds DB load and latency.
CDC-based relay (preferred): Uses Change Data Capture (e.g., Debezium) to stream the outbox table’s WAL (Write-Ahead Log) directly to Kafka. Near-real-time, low DB overhead.

Drawbacks

Adds complexity (outbox table, relay service).
Relay is a new component that can fail and needs its own reliability engineering.
Slight latency increase between write and event publication.

Change data capture(CDC)

CDC tools (like Debezium, Maxwell, AWS DMS) tap directly into the database’s binary/write-ahead log and stream changes to a message broker.

Every committed change to the database (INSERT, UPDATE, DELETE) is captured from the replication log and published as a change event. No application-level outbox table is needed.

Advantages

Zero application code changes required.
Captures all changes, including those made by migrations or admin tools.
Extremely low latency.
The source of truth is always the database.

Drawbacks

Schema changes in the DB can break CDC pipelines.
Database must support replication logs (PostgreSQL, MySQL, and MongoDB all do; some older systems don’t).
Vendor/tool dependency (Debezium requires Kafka Connect infrastructure).
Schema evolution and deserialization must be carefully managed.

Event souring

Instead of storing the current state and emitting events as a side effect, Event Sourcing stores events as the primary source of truth. The current state is derived by replaying events.

Example: All states have been stored

OrderCreated
PaymentReceived
OrderShipped
OrderDelivered

OrderCreated
PaymentReceived
OrderShipped
OrderDelivered

Python

In a traditional system, there are two separate, independent writes to two different systems.

While in event-based systems

Instead of saving the current state and then sending an event as a side effect. You only save the event. That’s it. Nothing else. The event IS your data. The event IS your source of truth

Think of a bank passbook where every transaction is written

So there is only one write.

Everything else (updating the read database, publishing to Kafka, notifying services) are just listeners that react to the event. They are not part of the original write. They happen later, automatically.

What If a Listener Fails?

Say the Kafka publisher crashes after the event was saved. No problem.

The event is permanently stored in the Event Store. When the Kafka publisher restarts, it says “where did I stop?” and replays from that point. Nothing is lost.

Saga Pattern

A Saga is a sequence of small, local transactions — one per service — where each step publishes an event or sends a message to trigger the next step.
If any step fails, the Saga runs compensating transactions — basically, undo operations — to reverse the previous steps.

Example

Step 1: Flight Service    → books flight    ✅ → triggers Step 2
Step 2: Hotel Service     → books hotel     ✅ → triggers Step 3
Step 3: Payment Service   → charges card    ✅ → DONE 🎉

Step 1: Flight Service    → books flight    ✅ → triggers Step 2
Step 2: Hotel Service     → books hotel     ✅ → triggers Step 3
Step 3: Payment Service   → charges card    ✅ → DONE 🎉

Python

Additional Considerations

Idempotency

Any solution involving retries (outbox relay, CDC, sagas) produces at-least-once delivery. Consumers must be idempotent — processing the same message twice must produce the same result as processing it once.

Consistency Models

There is no free lunch — every solution trades strong consistency for availability and partition tolerance (CAP theorem) or vice versa.

Solution	Consistency	Complexity	Throughput
2PC (XA)	Strong	Very High	Low
Outbox + Polling	Eventual	Medium	Medium
Outbox + CDC	Eventual	High	High
Pure CDC	Eventual	Medium	High
Event Sourcing	Eventual	Very High	High
Saga	Eventual	High	High

Conclusion

The Dual Write Problem occurs when an application updates two independent systems as part of a single business operation.

If one write succeeds and the other fails, the system becomes inconsistent.

This problem is extremely common in event-driven architectures, microservices, and distributed systems.

The most widely adopted solution is the Transactional Outbox Pattern, where business data and event data are stored together in a single database transaction, and a background worker later publishes the event.

The Dual Write Problem: A Comprehensive Guide

Table of Contents

Introduction

What Is the Dual Write Problem?

Why Does the Dual Write Problem Exist?

Why Database Transactions Don’t Solve It

Common but Incorrect Solutions

Retry After Failure

Two-Phase Commit (2PC)

Reversing the order, first Kafka, then the database

Proven Solutions

Outbox Pattern (Transactional Outbox)

Change data capture(CDC)

Event souring

Saga Pattern

Additional Considerations

Idempotency

Consistency Models

Conclusion

Resources

About Puneet Verma

Leave a Comment Cancel reply