Object Storage WAL Replication in UnisonDB

UnisonDB adds object-storage WAL replication for edge and multi-region replicas. Writers publish immutable WAL segments once, while replicas catch up from cataloged LSN ranges.

The Hidden Cost of Live WAL Streams

UnisonDB started with a simple replication model.

A writable node owns the WAL. A replica connects over gRPC. The writer streams log records. The replica applies them in order.

That is the right model for many systems. It is direct, easy to understand, and works well when the writer and replica are online at the same time.

But there is a hard wall hidden inside this design.

The writer is not only writing the database. It is also responsible for keeping every downstream replica fed.

That coupling is fine at small scale. It becomes painful when replicas are slow, offline, far away, or simply too numerous.

Figure: UnisonDB object-store WAL replication, with one leader publishing to object storage and multiple replicas polling independently.

The Writer Should Not Be the Replay Server

Live replication works best when the reader is present while the writer is producing data.

In UnisonDB, a replica reconnects with the LSN it has already applied. The writer does not need to remember that replica. But the missing WAL range still has to be served by the writer or by a relay.

If many replicas are behind, that catch-up traffic still lands on live database nodes.

This creates a practical limit.

The primary should spend most of its time accepting writes and keeping the local database healthy. It should not become a replay server for every slow or disconnected replica.

This is especially visible in edge and multi-region deployments.

Why Edge Replication Changes the Stakes

Edge replicas do not always behave like stable servers in the same rack.

Some replicas may run on weaker hardware. Some may sit behind unreliable networks. Some may be offline for minutes and then come back. Some deployments may have many replicas reading the same history at different speeds.

In that environment, a live stream from the writer becomes a fragile place to serve history from.

If one replica is slow, its catch-up traffic should not compete with the writer’s main job.

If a replica disappears and comes back later, its last applied LSN should be enough to resume.

If a new replica joins, it should be able to read old WAL ranges without asking the primary to become its history server.

But we still need the same database property: records must be replayed in WAL order from a known LSN.

So the question becomes:

Can we keep the WAL as the source of truth, but remove the writer from the historical read path?

Why “Just Reconnect Over gRPC” Is Not Enough

The obvious answer is to keep gRPC and make replicas retry harder.

That helps with short failures, but it does not change the shape of the system.

The writer or relay is still the place replicas come back to for history. The replica brings its last applied LSN, but the catch-up bytes are still served from live database infrastructure.

For a small cluster, that is acceptable.

For edge replication, pull-based recovery is often a better shape. The writer should publish durable history once. Any replica should be able to catch up from that history later.

Most teams already have a durable shared layer for this: object storage.

S3, MinIO, GCS, and Azure Blob are very good at storing immutable files and serving many independent readers. They are not a database log by themselves, but they are a good place to put finalized WAL history.

Why Plain Object Uploads Are Not Enough

The first instinct is to upload WAL files somewhere and let replicas list the bucket.

That is not enough.

A replica needs more than files. It needs to know which LSN ranges are complete, which ones are visible, and where to resume after a crash. Listing a bucket does not give a safe commit point. It only tells you that some objects exist.

This creates the same kind of danger as partial writes inside a database.

If the writer uploads half a range and then dies, should the replica read it?

If two leaders write after a leadership change, which range is authoritative?

If a replica sees a new object before the writer has finished publishing metadata, is that object safe to apply?

The answer has to be explicit. For UnisonDB, the answer is the catalog.

The segment file stores the WAL bytes. The catalog decides visibility.
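
To make the difference between "exists" and "visible" concrete, here is a minimal Python sketch. It is not UnisonDB's actual catalog format: the in-memory `bucket` dict, the `catalog` list, the object keys, and the `visible_segments` helper are all hypothetical names chosen for illustration.

```python
# Hypothetical in-memory stand-ins for an object store and a catalog.
# Neither structure reflects UnisonDB's real bucket layout.

# Everything that exists in the bucket, including an orphaned upload.
bucket = {
    "seg-0001-1000": b"...wal bytes...",
    "seg-1001-2000": b"...wal bytes...",
    "seg-2001-3000": b"...partial or uncommitted upload...",
}

# Only ranges whose metadata was committed appear here.
catalog = [
    {"first_lsn": 1, "last_lsn": 1000, "key": "seg-0001-1000"},
    {"first_lsn": 1001, "last_lsn": 2000, "key": "seg-1001-2000"},
]


def visible_segments(catalog, bucket):
    """Return keys a replica may apply: cataloged and present in the bucket."""
    return [entry["key"] for entry in catalog if entry["key"] in bucket]


listed = sorted(bucket)                      # what a bucket listing shows
visible = visible_segments(catalog, bucket)  # what the catalog allows

print("listed :", listed)
print("visible:", visible)
```

The third object shows up in the listing but never in `visible`, which is the behavior a replica needs after a half-finished publish.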

UnisonDB Solution: WAL History as Segment Files

UnisonDB now has a second replication path.

The writer can publish committed WAL records into object storage as immutable segment files. A small catalog records which LSN ranges are visible and where those files live.

The idea is simple:

writer WAL
    -> seal a range of records
    -> upload immutable segment file
    -> commit catalog metadata
    -> replicas replay from object storage

Once the catalog update is committed, that WAL range is visible to replicas.

The writer publishes the range once. Replicas read it whenever they need it.

This is not a replacement for gRPC replication. gRPC is still the live stream path. Object-store replication is the durable pull path.

The Flow: APPEND, SEAL, PUBLISH, REPLAY

The lifecycle looks like this.

  1. APPEND

    The writable UnisonDB node appends records to its local WAL as usual.

  2. SEAL

    A contiguous LSN range is packed into an immutable segment file. The file is shaped for object storage: one segment holds many records, instead of one tiny mutable object per WAL record.

  3. PUBLISH

    The writer uploads the segment file and then commits catalog metadata for that LSN range.

  4. REPLAY

    A replica reads the catalog, finds the missing range after its local LSN, fetches the segment bytes, and applies the WAL records.
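
The SEAL step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not UnisonDB's segment format: the `Segment` class, the `seal` and `_pack` helpers, the `max_records` cutoff, and the length-prefixed framing are all hypothetical. A real writer would likely also seal on a timer rather than only on a record count.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a sealed segment is never modified
class Segment:
    first_lsn: int
    last_lsn: int
    payload: bytes


def _pack(batch):
    # Each record is stored as a 4-byte big-endian length followed by its bytes.
    payload = b"".join(struct.pack(">I", len(data)) + data for _, data in batch)
    return Segment(batch[0][0], batch[-1][0], payload)


def seal(records, max_records=3):
    """Pack (lsn, data) pairs into segments of at most max_records each.

    Assumes LSNs increase by exactly 1 and raises ValueError on a gap,
    so every segment covers a contiguous range.
    """
    segments, batch = [], []
    prev_lsn = None
    for lsn, data in records:
        if prev_lsn is not None and lsn != prev_lsn + 1:
            raise ValueError(f"LSN gap: {prev_lsn} -> {lsn}")
        prev_lsn = lsn
        batch.append((lsn, data))
        if len(batch) == max_records:
            segments.append(_pack(batch))
            batch = []
    if batch:  # seal whatever is left as a final, shorter segment
        segments.append(_pack(batch))
    return segments


records = [(lsn, f"rec-{lsn}".encode()) for lsn in range(1, 8)]
segments = seal(records)
for seg in segments:
    print(seg.first_lsn, seg.last_lsn, len(seg.payload))
```

Seven records become three objects instead of seven, and each object carries the LSN range it covers.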

The important rule is visibility.

A segment is not visible just because the file exists. It becomes visible only after the catalog metadata is committed.

This gives us a clean failure boundary.

If upload succeeds but catalog commit fails, replicas ignore the file.

If catalog commit succeeds, replicas can replay it.
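
The failure boundary can be shown with a small Python sketch of "upload first, commit second". The `Catalog` class, `CatalogCommitError`, the `publish` function, and the object key layout are hypothetical, and the commit failure is simulated with a flag. The source does not describe how UnisonDB names objects or retries.

```python
class CatalogCommitError(Exception):
    """Raised by the hypothetical catalog when a commit does not go through."""


class Catalog:
    def __init__(self, fail_next=False):
        self.entries = []          # committed, replica-visible ranges
        self.fail_next = fail_next

    def commit(self, first_lsn, last_lsn, key):
        if self.fail_next:
            self.fail_next = False
            raise CatalogCommitError("simulated commit failure")
        self.entries.append((first_lsn, last_lsn, key))


def publish(bucket, catalog, first_lsn, last_lsn, payload):
    """Upload first, commit second. Returns True if the range became visible."""
    key = f"segments/{first_lsn:020d}-{last_lsn:020d}.seg"
    bucket[key] = payload                          # step 1: upload segment bytes
    try:
        catalog.commit(first_lsn, last_lsn, key)   # step 2: make it visible
    except CatalogCommitError:
        return False                               # file exists, stays invisible
    return True


bucket = {}
catalog = Catalog(fail_next=True)

first_try = publish(bucket, catalog, 1, 100, b"wal-bytes")
visible_after_first = list(catalog.entries)
second_try = publish(bucket, catalog, 1, 100, b"wal-bytes")

print(first_try, second_try, len(bucket), catalog.entries)
```

In this sketch the object key is derived from the LSN range, so a retry overwrites the same object with the same bytes. That is an assumption made here to keep the retry harmless, not a documented UnisonDB guarantee.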

Seeing It in the Configuration

On the writer side, object-store publishing is enabled per namespace:

[blob_store_streaming]
enabled = true
flush_interval = "2s"

[blob_store_streaming.namespaces.default]
bucket_url = "s3://my-bucket?region=us-east-1"
base_prefix = "unisondb/prod"

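The same shape works with MinIO, GCS, and Azure Blob. Only the bucket URL changes, and each namespace sets exactly one of these:

# S3 (MinIO is S3-compatible and uses the same s3:// form)
bucket_url = "s3://my-bucket?region=us-east-1"
# GCS
bucket_url = "gcs://my-bucket"
# Azure Blob
bucket_url = "azblob://my-container"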

On the replica side, the relayer can use object storage as the upstream:

[relayer_config.edge_blob]
namespaces = ["default"]
streamer_type = "blobstore"
lsn_lag_threshold = 100

[relayer_config.edge_blob.blobstore]
bucket_url = "s3://my-bucket?region=us-east-1"
prefix = "unisondb/prod"
refresh_interval = "1s"

The config still uses blobstore as the streamer type because that is the public configuration name. The behavior is object-store WAL replay.

The Recovery Walk

Recovery is where this design becomes useful.

Suppose a replica has applied up to LSN 5000 and then goes offline.

The writer keeps running. It publishes segment files and advances the catalog.

Later, the replica comes back. It does not ask the writer to replay history. It reads the catalog and asks:

What is the first visible segment after LSN 5000?

Then it walks forward:

local replica LSN = 5000
catalog says next visible ranges:
    5001  -> 7000
    7001  -> 9000
    9001  -> 12000

replica reads those segment files
replica applies records in LSN order
local replica LSN = 12000
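
The walk above can be sketched as a small planning function in Python. This is a sketch under assumptions, not UnisonDB's replay code: `plan_catch_up` is a hypothetical helper, and the catalog is modeled as a plain list of `(first_lsn, last_lsn)` tuples. It picks the ranges after the replica's applied LSN and refuses to continue if the visible history has a gap.

```python
def plan_catch_up(catalog, applied_lsn):
    """Return (ranges_to_replay, lsn_after_replay) for a replica.

    catalog: list of (first_lsn, last_lsn) tuples for committed segments.
    Raises ValueError if the visible history has a gap after applied_lsn.
    A range that straddles applied_lsn is included; the replica would skip
    the records inside it that it has already applied.
    """
    plan = []
    next_lsn = applied_lsn + 1
    for first, last in sorted(catalog):
        if last < next_lsn:
            continue                  # fully applied already, skip
        if first > next_lsn:
            raise ValueError(f"gap: need LSN {next_lsn}, catalog has {first}")
        plan.append((first, last))
        next_lsn = last + 1
    return plan, next_lsn - 1


catalog = [(1, 2500), (2501, 5000), (5001, 7000), (7001, 9000), (9001, 12000)]
plan, new_lsn = plan_catch_up(catalog, applied_lsn=5000)
print(plan)     # ranges to fetch and apply, in LSN order
print(new_lsn)  # replica LSN once the plan has been applied
```

The replica's only input is its own applied LSN, so the same function serves a replica that was offline for minutes and a brand new replica starting from LSN 0.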

The same idea helps on writer restart.

The writer reads the catalog to know the last LSN already published to object storage. It does not have to guess from local process state.
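
A minimal Python sketch of that restart decision follows. The function name `next_range_to_publish`, the tuple-based catalog, and the assumption that LSNs start at 1 are all hypothetical. The point is only that the resume position comes from committed catalog state, not from anything the process remembered.

```python
def next_range_to_publish(catalog, local_wal_last_lsn):
    """Return (first_lsn, last_lsn) still to publish, or None if caught up.

    catalog: list of (first_lsn, last_lsn) tuples already committed.
    local_wal_last_lsn: highest LSN in the writer's local WAL.
    Assumes LSNs start at 1, so an empty catalog resumes from LSN 1.
    """
    last_published = max((last for _, last in catalog), default=0)
    if local_wal_last_lsn <= last_published:
        return None
    return (last_published + 1, local_wal_last_lsn)


# Writer restarts with a local WAL up to 12500 and a catalog up to 9000.
pending = next_range_to_publish([(1, 7000), (7001, 9000)], 12500)
print(pending)
```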

What Happens in Raft Mode

In Raft mode, only the current leader publishes object-store WAL history.

If the node loses leadership, publishing stops.

When a new leader starts publishing, it reads the catalog and resumes from the last committed object-store LSN.

That matters because leadership changes should not create duplicate or missing published ranges. The catalog is the shared progress record.
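
The leadership rule can be sketched with two simulated nodes in Python. This is a toy model, not UnisonDB's Raft integration: the `Node` class, its `is_leader` flag, and `publish_tick` are hypothetical, and leadership is flipped by hand instead of coming from a Raft library.

```python
class Node:
    def __init__(self, name, wal_last_lsn):
        self.name = name
        self.wal_last_lsn = wal_last_lsn
        self.is_leader = False

    def publish_tick(self, catalog):
        """Publish at most one new range; do nothing unless leader."""
        if not self.is_leader:
            return None
        # Resume from the shared catalog, not from local memory.
        last_published = catalog[-1][1] if catalog else 0
        if self.wal_last_lsn <= last_published:
            return None
        entry = (last_published + 1, self.wal_last_lsn, self.name)
        catalog.append(entry)
        return entry


catalog = []  # shared progress record: (first_lsn, last_lsn, publisher)
a = Node("a", wal_last_lsn=100)
b = Node("b", wal_last_lsn=100)

a.is_leader = True
a.publish_tick(catalog)            # a publishes 1..100

a.is_leader = False                # leadership moves from a to b
b.is_leader = True
b.wal_last_lsn = 250               # new writes arrive under the new leader

skipped = a.publish_tick(catalog)  # former leader publishes nothing
resumed = b.publish_tick(catalog)  # new leader resumes at LSN 101
print(catalog)
```

The sketch does not model a former leader that has not yet noticed it lost leadership. The source does not say how that window is handled, so a real deployment should not assume this toy check is sufficient.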

Run It Locally With MinIO

There is a local example in the UnisonDB repo, under cmd/examples/blobstore-minio.

Start the standalone example:

./cmd/examples/blobstore-minio/run-live.sh

Write more keys:

./cmd/examples/blobstore-minio/write-10-kv.sh

There is also a Raft-aware example:

./cmd/examples/blobstore-minio/run-live-raft.sh

That example shows the intended leadership behavior:

  • publish only while leader,
  • stop when leadership is lost,
  • resume from committed object-store metadata when leadership changes.

Conclusion

Replication is a game of trade-offs.

A live gRPC stream is simple and fast when replicas are online and close to the writer.

Object-store replication is better when history needs to be replayed later by many independent replicas.

By publishing WAL as immutable segment files, UnisonDB keeps the local WAL as the source of truth while moving historical fan-out away from the writer.

The writer writes once. Object storage keeps the shared history. Replicas replay the LSN ranges they are missing.