Viewstamped Replication is one of the earliest consensus algorithms for distributed systems. It is designed around the log replication concept of state machines and it can be efficiently implemented for modern systems. The revisited version of the paper offers a number of improvements to the algorithm from the original paper which both: simplifies it and makes it more suitable for high volume systems. The original paper was published in 1988 which is ten years before the Paxos algorithm 1 was published.
I will explain how the protocol works in detail covering as well a number of optimizations which are described in the papers. The “revisited” version offers a simplified version of the protocol with improvements which were made by the authors in later works and published after the original paper. Therefore most of the content here will be driven from the more 2012 paper.
- Paper links:
- Viewstamped Replication: A New Primary Copy method to Support Highly-Available Distributed Systems, B. Oki, B. Liskov, (1988) 2
- Viewstamped Replication Revisited, B. Liskov, J.Cowling (2012) 3
- Viewstamped Replication presentation at PapersWeLove London 4
- Viewstamped Replication: A New Primary Copy method to Support Highly-Available Distributed Systems, B. Oki, B. Liskov, (1988) 2
What is “Viewstamped Replication”?
It is a replication protocol. It aims to guarantee a consistent view over replicated data. And it is a “consensus algorithm”. To provide a consistent view over replicated data, replicas must agree on the replicated state.
It is designed to be a pluggable protocol which works on the communication layer between the clients and the servers and between the servers themselves. The idea is that you can take a non-distributed system and using this replication protocol you can turn a single node system into a high-available, fault-tolerant distributed system.
To achieve fault-tolerance, the system must introduce redundancy in time and space. Redundancy in time is to combat the unreliability of the network. Messages can be dropped, reordered or arbitrarily delayed, therefore protocol must account for it and allow requests to be replayed if necessary without causing duplication in the system. Redundancy in space is generally achieved via adding redundant copies in different machines with different fault-domain isolation. the aim is to be able to stay in operation in face of a node failure, whether is a system crash or a hardware failure, the system must be able to continue normal operation within certain limits.
Together, the redundancy in time and space provide the ability to tolerate and recover from temporary network partitions, systems crashes, hardware failure and a wide range of network related issues.
However, the system can only tolerate a number of failures depending on
the cluster (or ensemble size). In particular to tolerate 𝑓
failures it requires an ensemble of
2𝑓 + 1 replicas. The type of
failures that the protocol can tolerate are non-Byzantine failures 5
which means that all the nodes in the ensemble will be either in a
working state, or in a failed state or simply isolated. However, every
node in the system will not deviate from the protocol and it will not
lie about its state. Therefore, if the system must be able to tolerate
one node to be unresponsive, then the minimum ensemble size is 3
nodes. If the system must be able to tolerate 2 concurrent failures,
then the minimum size of the cluster required is 5 nodes, and so
𝑓 + 1 nodes is called quorum. It is not possible to create a
quorum with less than 3 nodes, therefore the minimum ensemble size is
Replicated State Machines
Viewstamped Replication is based on the State Machine Replication concept 6. A State Machine has an initial state and an operation log. The idea is that if you apply the ordered set of operation in the operation log to the initial state you will end up always with the same final state, given that all the operations in the operation log are deterministic.
Therefore, if we replicate the initial state and the operation log into other machines and repeat the same operations the final state on all the machines will be the same.
It is crucial that the order of operations is preserved as operations are not required to be commutative. Additionally all sources of indeterminism must be eliminated before the operation is added to the operation log. For example, if you have an operation which generate a unique random id, the primary replica will need to generate the random id and then add an operation in the log which already contains the generated number such that when the replicas apply the operation won’t need to generate the random unique id themselves which will cause the replica state to diverge.
The objective of Viewstamped Replication is to ensure that there is a strictly consistent view on the operation log. In other words, is ensures that all replicas agree on which operations are in the log and their exact order.
Anatomy of a Replica
The ensemble or cluster is made of replica nodes. Each replica is composed of the following parts.
The operation log (
op-log) is a (mostly) append-only sequence of
operations. Operations are applied against the current state which
could be external to the replica itself. Operations must be
deterministic which means that every application of the same
operation with the same arguments must produce the same
result. Additionally, if the operations produce side effect or write
to an external system you have to ensure the operation is applied only
once by the use of transactional support or by making operation
idempotent so that multiple application of the same operation won’t
produce duplication in the target system. Each operation in the
operation-log has a positional identifier which is a monotonically
The last inserted operation number is the high water mark for the operation
log and it is recorded as the
op-num which is a monotonically
increasing number as well. It identifies which operation has been
already received and it is used in parts of the protocol.
Operations in the
op-log are appended first, then shared with other
replicas and once there is a confirmation that enough replicas have
received the operation then it is actually executed. We will see how
this process works in more details later. The
the number of the last operation which was executed in the replica. It
also implies that all the previous operations have been executed as
commit-num is a monotonically increasing number.
The view-number (
view-num) is a monotonically increasing number
which changes every time there is a change of primary replica in
Each replica must also know who the current primary replica is.
This is stored in the
status field shows the current replica operation mode.
As we will see later, the
status can assume different values
depending on whether the replica is ready to process client requests,
or it is getting ready and doing internal preparation.
Every replica node will also have a list of all the replica nodes in the ensemble with their IP addresses and their unique identifiers. Some parts of the protocol require the replicas to communicate with other replicas therefore they must know how to contact the other nodes.
client-table is used to keep track of client’s requests.
Clients are identified by a unique id, and each client can only make
one request at the time. Since communication is assumed to be
unreliable, clients can re-issue the last request without the risk of
duplication in the system. Every client request has a monotonically
increasing number which identifies the request of a particular
client. Each time the primary receives a client requests it add the
request to the client table. If the client re-sends the same requests
because didn’t receive the response the primary can verify that the
request was already processed and send the cached response.
For brevity, the pictures which will follow will omit some details.
Next, we are going to analyse the protocol in details. The protocol is presented in the paper in its simplest form first. Then the paper goes on describing a number of optimisations which do not change the basic structure of the protocol but make it efficient and practical to implement for real-world application. Efficiency is a major concern throughout both papers as the authors were building real-world systems.
Client requests handling
In this section, we are going to see how a client request is processed and the data replicated. We are going to see, in details, how the protocol ensures that at least a quorum of replicas acknowledges the request before executing the request.
Clients have a unique identifier (
client-id), and they communicate
only with the primary. If a client contacts a replica which is not the
primary, the replica drops the requests and returns an error message
to the client advising it to connect to the primary. Each client can
only send one request at the time, and each request has a request
request#) which is a monotonically increasing number. The
<REQUEST> message which contains the
request# and the operation (
op) to perform.
The primary only processes the requests if its
otherwise, it drops the request and sends an error message to the client
advising to try later. When the primary receives the request from the
client, it checks whether the request is already present in the client
table. If the request’s number is greater of the one in the client
table then it is a new request, otherwise, it means that the client
might not have received last response and it is re-sending the
previous request. In this case, the request is dropped and the primary
re-send last response present in the client table.
If it is a new request, then the primary increases the operation
op-num), it appends the requested operation to its operation
op-log) and it updates the
client-table with the new request.
Then it needs to notify all the replicas about the new request, so it
<PREPARE> message which contains: the current view number
view-num), the operation’s
op-num, the last commit number
commit-num), and the
message which is the client request itself,
and it sends the message to all the replicas.
When a replica receives a
<PREPARE> request, it first checks whether
the view number is the same
view-num, if its view number is
different than the message
view-num it means that a new primary was
nominated and, depending on who is behind, it needs to get up to date.
If its view-number is smaller than the message
view-num, then it
means that this particular node is behind, so it needs to change its
recovery and initiate a state transfer from the new
primary (we will see the primary change process called view change
later). If its view-number is greater than the message
then it means that the other replica needs to get up-to-date, so it
drops the message. Finally, if the
view-num is the same, then it
looks at the
op-num in the message. The
op-num needs to be
strictly consecutive. If there are gaps it means that this replica
missed one or more
<PREPARE> messages, so it drops the message and
it initiates a recovery with a state transfer. If the
strictly consecutive then it increments its
op-num, appends the
operation to the
op-log and updates the client table.
Now the replica sends an acknowledgment to the primary that the
operation, and all previous operations, were successfully prepared. It
<PREPARE-OK> message with its
its identity. Sending a
<PREPARE-OK> for a given
that all the previous operations in the log have been prepared as well
The primary waits for
𝑓 + 1 including itself
messages, at which point it knows that a quorum of nodes knows
about the operation to perform therefore it is considered safe to
proceed as it is guaranteed that the operation will survive the loss
𝑓 nodes. When it receives enough
<PREPARE-OK> then it performs
the operation, possibly with side effect, it increases its commit
commit-num, and update the client table with the operation
result. Again, advancing the
commit-num must be done in strict order,
and it also means that all previous operations have been committed
Finally, it prepares a
<REPLY> message with the current
the client request number
request# and the operation result
response) and it sends it to the client.
At this point, the primary is the only node that has performed the
operation (and advanced its
commit-num) while the replicas have only
prepared the operation but not applied. Before they can safely apply
they need confirmation from the primary that it is safe to do so.
This can happen in two ways.
Let’s assume that the same client or a new client comes along and
makes a new request. The client will create a
<REQUEST> message and
send it to the primary as well. The primary will process the request
as usual, check the client table, increase the
op-num and append to
op-log, then it creates a
<PREPARE> message with the
op-num, the client’s
message but it also includes
commit-num which now shows that the primary advanced its
When the replicas receives the
<PREPARE> it follows the normal
process; It checks the
view-num, checks the
op-num, increment its
operation number append to the
op-log and update the client table,
then it sends a
<PREPARE-OK> to the primary for the operation it
<PREPARE> message from the primary also contained the
commit-num which showed that the primary has now executed the
operation against its state and it is notifying the replicas that it
is safe to do so as well. So the replicas will perform all requests in
op-log between the last
commit-num and the
<PREPARE> message strictly following the order of operations
and advance its
commit-num as well. At this point both, primary and
replicas have performed the committed operations in their state.
So we seen how the primary uses the
<PREPARE> message to inform
replicas about new operations as well as operations which are now safe
to perform. But what if there is not a new request coming in? what if
the system is experiencing a low request rate, maybe because it is the
middle of the night and all users are asleep? If the primary doesn’t
receive a new client request within a certain time then it will create
<COMMIT> message to inform replicas about which operations can now
<COMMIT> message will only contain the current view number
view-num) and the last committed operation (
commit-num). When a
replica receives a
<COMMIT> it will execute all operation in their
op-log between the last
commit-num and the
commit-num in the
<COMMIT> message strictly following the order of operations and
commit-num as well.
<PREPARE> message together with the
<COMMIT> message act as a
sort of heartbeat for the primary. Replicas use this to identify when
a primary might be dead and elect a new primary. This process is
called view change and it is what we are going to discuss next.
View change protocol
The view change protocol is used when one of the replicas detects that the current primary is unavailable and it proceeds to inform the rest of the ensemble. If a quorum of nodes agrees then the ensemble will transition to a new primary. This part of the protocol ensures that the new primary has all the latest information to act as the new leader.
In my view, one of the most interesting parts of the ViewStamped Replication protocol is the way the new primary node is selected. Other consensus protocols (such as Paxos and Raft) have a rather sophisticated way to elect a new primary (or leader) with candidates having to step up, ask for votes, ballots, turns etc. ViewStamped Replication takes a completely different approach. It uses a deterministic function over a fixed property of the replica nodes, something like a unique identifier or IP address.
For example, if the replica nodes IP addresses don’t change you could
sort function which given a list of nodes IPs will return a
deterministic sequence of nodes. The ViewStamped Replication
algorithm simply uses the sorted list of IPs to determine who is the
next primary node, and in round-robin fashion all nodes will, in
turn, become the new primary when the previous one is unavailable. I
find this strategy very simple and effective. There is no vote to
cast, candidates stepping up, election turns, ballots, just a
predefined sequence of who’s turn is.
For example, if you have a three nodes cluster with the following IP addresses
10.5.3.12 10.5.1.20 (1) 10.5.1.20 ~sort~> 10.5.2.28 (2) 10.5.2.28 10.5.3.12 (3)
10.5.1.20 will be the first primary, and
everybody knows that it will be the first. When that node dies or is
partitioned away from the rest of the cluster, the next node to become
the primary will be the
10.5.2.28 followed by
10.5.3.12 and then
starting from the beginning again.
The protocol ensures that if the node who is set to be the next primary, somehow, has fallen behind and it is not the most up-to-date, it will be able to gather the latest information from the other nodes and get ready for the job. We will see this in more details in this section.
The view change
Let’s start with a working cluster. Replicas
R5 are all up
R1 is the first primary node, followed by
R2 and so
on. Since ensembles are formed by
2𝑓 + 1 nodes, this cluster can
survive and continue processing requests in the face of two failed
R1 is currently accepting client’s requests and those are handled as
seen earlier. Therefore for each client request, a
is sent to all replicas and they reply with a
<COMMIT> messages are also used to signal which
operations are committed by the primary.
Let’s assume that the primary
R1 crashes or is isolated from the
rest of the cluster. No more
<COMMIT> messages will
be received by the rest of the cluster. As said previously these messages act as a heartbeat for the primary health.
At some point, in one of the nodes (
R4 for example) a timeout will
expire, and it will detect that it didn’t hear from the primary since
a predefined amount of time. At this point the replica will change
view-change, increase its view number (
and create a
<START-VIEW-CHANGE> message with the new
its identity. It will then send this message to all the replicas.
When the other replicas receive a
<START-VIEW-CHANGE> message with a
view-num bigger than the one they have, they set their
view-change and set the
view-num to the view number in the message
and reply with
<START-VIEW-CHANGE> to all replicas.
When a replica receives
𝑓 + 1 (including itself)
<START-VIEW-CHANGE> message with its
view-num then it means that
the majority of the nodes agrees on the view change so it sends a
<DO-VIEW-CHANGE> to the new primary (selected as described above).
<DO-VIEW-CHANGE> message contains the new
view-num the last
view number in which the state was normal
op-num and its operation log (
op-log) and its commit number
When the new primary, in this case, node
R2, it receives
𝑓 + 1
<DO-VIEW-CHANGE> with the same
compares the messages against its own information and pick the most
up-to-date. It will set the
view-num the new
view-num, it will
take the operation log (
op-log) from the replicas with the highest
old-view-number. If many replicas have the same
will pick the one with the largest
op-log, it will take the
from the chosen
op-log and the highest
commit-num and execute all
committed operations in the operation log between its old
value and the new
commit-num value. At this point, the new primary
is ready to accept requests from the client so it sets its
normal. Finally, it sends a
<START-VIEW> message to all replicas
with the new
view-num, the most up to date
op-num and the highest
When the other replicas receive the
<START-VIEW> message they
replace their own local information with the data in the
<START-VIEW> message, specifically they take: the
op-num and the
view-num. They change their
and they execute all the operation from their old
commit-num to the
commit-num and they send a
<PREPARE-OK> for all operation in
op-log which haven’t been committed yet.
Some time later the failed node (
R1) might come back alive either
because the partition is now terminated or because the crashed nodes
it has been restarted. At this point, the
R1 might still think it is
the primary and be waiting for requests from clients. However, since
the ensemble transitioned to a new primary in the meantime, it is
likely that clients will be sending requests to the new primary and it
is likely that the new primary is sending
messages to all the replicas, including the one which previously
failed. At this point, the failed replica will notice that
<COMMIT> messages have a
view-num greater than
view-num and it will understand that the cluster transitioned to
a new primary and that it needs to get up to date.
In this case, it will set its
recovery and issue a
<GET-STATE> request to any of the other replicas. The
will contains its current values of the
commit-num together with its identity. The
<GET-STATE> message is
the same message used by the replicas to get up to date when the fall
behind. The replica who receives the
<GET-STATE> message will only
respond if its
view-num in the
<GET-STATE> message is the same as its
view-num it means that the requesting node just fell behind, so it
will prepare a
<NEW-STATE> message with its
commit-num and its
op-num and the portion of the
op-num in the
<GET-STATE> and its
view-num in the
<GET-STATE> message is different, then it
means that the node was in a partition during a view change. In this
case it will prepare a
<NEW-STATE> message with new
commit-num and its
op-num and the portion of the
commit-num in the
<GET-STATE> this time and its
is because some operations in the minority group might not have
survived the view-change.
This is concludes the discussion about the ViewStamped Replication protocol. We seen how requests are handled in the normal case, and we have seen how the ensemble safely transition to a new primary in case it detects a failure or partition in the previous one. We also seen how the protocol design accounts for unreliable network and allows for any message to be dropped, delayed or reordered and still work correctly.
Some of the limitations described earlier were imposed to allow to focus the discussion over the protocol correctness and clarity. In this section we’ll see how some of the optimisations proposed by the authors make this protocol not only realistic to implement for a real world system, but also efficient and practical.
While the protocol as described above doesn’t require the internal
replica state to be persisted to work properly you can store the state
in a durable storage to speed up crash recovery. In fact in a large
system the operation log (
op-log) could become, over time, quite
large. It would be unreasonable and ineffective to keep the entire
operation log only in memory. Upon a crash, or when a replica is
restarted it will require to get a copy of the
op-log from another
<GET-STATE>) and if the replica has to transfer
terabytes of data the operation could take a long time.
Storing the internal replica state into a durable storage can speed up
the recovery in the event the process is crashed or restarted. In
fact in this case, new replica process will only need to fetch the
tail of the operation log since its last committed operation
Since the persistent state is not required to run the protocol
correctly, then the implementer can make design decisions based on the
efficient use of the storage. For example it could decide to
only when there are enough changes to fill a buffer, as supposed to
fsync-ing on every request which it would be slow and inefficient.
Additionally the durability of the state could be managed completely
asynchronously from the protocol execution.
Running the protocol has a cost. As we seen the primary has to wait
for a quorum of nodes to respond on every
<PREPARE> request before
executing the client request. In a large cluster, as the number of
nodes increases, it also increases the number of nodes that the
primary has to wait for a response and it increases the likelihood the
at one being slower and having to wait for longer. In such cases we
can divide the ensemble into active replicas and witness replicas.
Active replicas account for
𝑓 + 1 nodes which run the full protocol
and keep the state. The primary is always an active replica. The
remainder of the nodes are called witnesses and only participate to
the view changes and recovery.
In a high volume system there will be a huge number of
messages flying around. To reduce the latency cost of running the
protocol the primary replica could batch a number of operation
before sending a
<PREPARE> messages to the other replicas. In this
<PREPARE> messages will include all the operations in the
batch and the
<PREPARE-OK> message will confirm that all
operations in the batch have been prepared. Several batching
strategies can be applied for example the primary could wait for at
most 20 milliseconds or when 50 operations have been batched or
whichever comes first, and then send the batch in a single
The same strategy could be applied on the client side for batching client’s requests, given that the client table is adjusted to accommodate a batch of requests.
Fast read-only requests on Primary
In the general definition of the protocol, read-only requests, which do not alter the state are served via the normal request processing. However since read requests do not need to survive a partition, they could be served unilaterally by the primary without contacting the other replicas.
The advantage of this optimisation is that read requests could benefit of shorter latency as there is no additional cost of running the protocol. However, in some case it is possible for the primary to serve stale responses. Let’s consider the following case:
An ensemble with three replica nodes (
the current primary. A bunch of clients are connected to the
primary and issuing requests. Requests manipulate a set of named
registers. The current value of the register
1. At a certain
point a partition occur which isolates the primary (
R2) from the
rest of the ensemble and from most of the clients. Once the view
change protocol kicks in replica
R3 will be elected as a new
primary, and clients will reconnect to the new primary.
At this point one of the clients could request a change of the
a = 2 and the change would succeed because a quorum of
replica nodes is available (
In the meanwhile the old primary (
R2) is unaware of the change to
a because the partition is still isolating
communications. Therefore is not even aware that a new primary
replica has been selected and it is no more the current primary.
An unfortunate client which is in the same side of the partition as
the old primary (
R2) might still be able to talk to the old primary.
In this case if the client is requesting the current value of register
a, it will get a stale value.
To avoid this problem and make read-only requests possible to be served unilaterally by the primary we can make the use of leases. A primary can only serve read-only requests without running the protocol and consulting with the other replicas if and only if, its lease is not expired. When the lease expires, the primary must run the protocol and get the grant for a new lease which will ensure the view hasn’t changed in the meantime. In case of a partition, the ensemble will have to wait for the lease to expire before triggering a view-change and select a new primary.
Read-only requests on other replicas
In high request volume systems, where the number of read requests are much higher than the write requests, and if clients systems can accept stale reads (eventually consistent reads), then read-only requests can be made directly to the follower replicas. It is a effective way to reduce the load on the Primary and scale out the reads across all replicas. The primary is always handling the writes.
In this situations it might be useful to track causality of client
requests, for example a client makes a write followed by a read of the
same value (read your own writes). Causality can be achieved by
tracking the operation number
op-num for a client write request and
propagating back this information to the client. Clients interested in
causal ordering can issue a read request to a follower replica stating
op-num. In this case the replica knows that it must respond to
that request only after it advanced its
commit-num past the
For example the client is connecting to the primary and ask for a
write operation such as a registry increment (
write a = a + 1), the
primary will process the request as usual, but along with the response
it will communicate the operation number (
op-num) for this request.
The client records the
op-num and it uses to make subsequent
requests to the follower replicas for a read-only request. The replica,
if it hasn’t processed the operation in the client request
will have to wait until it the primary communicate that it is safe to
do so and then reply.
As seen so far, each node needs to be known to the others for the protocol to work correctly. However, modern systems, especially in a cloud environment, tend to change frequently their ensemble configuration. Maybe as a response to the increasing or decreasing of load (elastically scalable systems), or just as a consequence of operational necessities (hardware replacement, network reconfiguration etc).
To face this increasingly common requirement the authors proposed a protocol extension to allow a reconfiguration of the ensemble. With this extension is possible to add, remove, or replace some or all nodes. The only limit is that the minimum allowed number of nodes is three, below this limit is not possible to achieve a quorum.
In order to support reconfiguration the replicas have to track
additional properties. A monotonically increasing number called
epoch-num tracks every reconfiguration, a property called
old-config holds the previous configuration (list of nodes IPs,
names, and IDs). Finally a new status called
transitioning is used
to mark when a reconfiguration is in progress.
A special client, typically with admin rights, will issue a
<RECONFIGURATION> request. The request includes the current
request# and the
the list of all ensemble nodes which is expected to replace the
The request will be sent to the primary and processed as a normal
request. However, as soon as a
<RECONFIGURATION> request is
received by the primary it will stop accepting new client’s requests.
The other replicas will process the request as usual and send a
<PREPARE-OK> message back to the primary. Once the primary receives
a quorum of
<PREPARE-OK> responses it will increase the
and send a
<COMMIT> message to all replicas. At this point it sends
<START-EPOCH> to all new replicas (all the replicas which are part
of the new configuration but not of the old configuration) and it sets
status = transitioning. The
<START-EPOCH> contains the new
epoch-num the primary
op-num and both: the old and new
When the new replicas receive the
<START-EPOCH> message they update
the configuration taking
new-config from the
message itself, they reset the
view-num to zero and set their
status = transitioning. If a new replica is missing data in
op-log it will issue a
<GET-STATE> request to the old
replicas. Once the new replica is up to date, then it sets its
= normal, send a
<EPOCH-STARTED> with the new
epoch-num and its
identity to the old group, and finally, the new primary will start to
accept client requests.
This protocol extension allows to control the size of the ensemble for both: sizing up and sizing down. Additionally it can be used to replace a defective machine or also to migrate the entire cluster to a new set of more powerful machines.
As we seen the Viewstamped Replication algorithm is a fairly simple consensus algorithm and quite interesting replication protocol. It is quite simple to understand and, together with the many optimisations proposed, it can be implemented efficiently for real world application. Since it operates only on the interactions between replicas and clients it can be used as a “library” wrapper atop an existing system rendering the system distributed, high available and strictly consistent. For example it possible to take a single node file system and turn it into a distributed high available file system, which it was the initial motivation of the viewstamped replication protocol in first place. Moreover it could be used to wrap communications of systems such as Redis and Memcache or any other non-distributed system and turn it into a distributed, fault tolerant, high available and strictly consistent system.
Links and resources: