Monday, 28 September 2015

CouchDB, MVCC and Conflicts while Replicating

Setup

In my last post regarding Multiversion Concurrency Control, we saw what it takes to enter conflicting versions of a document into a single couch instance. You have to be somewhat resourceful.

But the real fun with the couch comes from its distributed nature.
We will see that the rules change a bit when we talk about more than one instance and use replication to synchronize them.

Here's the setup for playing around with MVCC on two couches:

First Pi:

  • hostname: frodo
  • Model: Pi B+ (ARMv6h)
  • OS: Arch Linux ARM
  • CouchDB 1.6.1_4 (taken from the Arch Linux ARM repository)

Second Pi:

  • hostname: arwen

Again, everything will be done via curl. Data will not be passed directly on the command line but will always be read from a file (due to strange behaviour of my curl on Windows).

We assume two users entering data into their respective Pis. Arwen uses the couch on her arwen-Pi while Frodo uses his frodo-Pi. Eventually they will exchange their work via replication.

Preparation

Let's start from scratch by creating the database on each Pi respectively:
#
# create the database on arwen
curl -X PUT http://arwen:5984/mvcc
#
# response
{"ok":true}
#
# create the database on frodo
curl -X PUT http://frodo:5984/mvcc
#
# response
{"ok":true}
#

...And Go

Arwen inserts her document first:
#
# Arwen inserts her Doc rep_mydoc_u1_1.json
{
  "content": "U1_1"
}
#
curl -H "Content-Type: application/json" -d @rep_mydoc_u1_1.json -X PUT http://arwen:5984/mvcc/mydoc 
#response
{
  "ok":true,
  "id":"mydoc",
  "rev":"1-3557461c60a30b0d156f8b36a1bdcf9f"
}
#

Arwen wants to share her document with Frodo. She submits a Push Replication Request into the _replicator database of her Pi to trigger the replication:
#
# Arwen shares her doc with frodo via replication
# She initiates a push replication from arwen to frodo
# push_a2f_01.json:
{
  "source": "mvcc", 
  "target": "http://frodo:5984/mvcc"
}
#
curl -H "Content-Type: application/json" -d @push_a2f_01.json -X PUT http://arwen:5984/_replicator/a2s01
#
#response
{
  "ok":true,
  "id":"a2s01",
  "rev":"1-0088a4a381404b513bf0586d08d6ce80"
}
#
Taking a look into Arwen's couch.log tells us that the replication took place:
#
Document `a2s01` triggered replication `6018cd9109568fed438add0722e9bccb`
starting new replication `6018cd9109568fed438add0722e9bccb` at <0.31751.2> (`mvcc` -> `http://frodo:5984/mvcc/`)
recording a checkpoint for `mvcc` -> `http://frodo:5984/mvcc/` at source update_seq 1
Replication `6018cd9109568fed438add0722e9bccb` finished (triggered by document `a2s01`)
#
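If you would rather not dig through couch.log, the replication document itself can tell you how things went: as far as I know, CouchDB 1.x writes a `_replication_state` field (with values such as `triggered`, `completed` or `error`) into the document in the _replicator database. Here is a minimal sketch for reading that field. The helper name `replication_state` is hypothetical, and the sed pattern assumes the compact, single-line JSON that the couch normally returns.

```shell
# Sketch only: replication_state is a hypothetical helper, not a CouchDB command.
# It reads a _replicator document on stdin and prints the value of its
# "_replication_state" field, if present.
# Assumption: the JSON has no spaces around the colon, as in CouchDB's own output.
replication_state() {
  sed -n 's/.*"_replication_state":"\([^"]*\)".*/\1/p'
}

# Intended use against Arwen's Pi (not run here, it needs the live couch):
#   curl -s http://arwen:5984/_replicator/a2s01 | replication_state
```

A regular expression is a crude way to read JSON; it is used here only to keep the example free of extra tools.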

Please note that these Pis do not know about each other. The replication request is the only point of contact. This request requires Arwen to know about a frodo-Pi.

OK, Frodo should have Arwen's document on his Pi now:
#
# Frodo should now have the document too:
curl http://frodo:5984/mvcc/mydoc
#response
{
  "_id":"mydoc",
  "_rev":"1-3557461c60a30b0d156f8b36a1bdcf9f",
  "content":"U1_1"
}
#

Now Arwen and Frodo both continue to work on their respective copies of the document and eventually save their work:
#
# Arwen edits her document on arwen
# rep_mydoc_u1_2.json:
{
  "_rev": "1-3557461c60a30b0d156f8b36a1bdcf9f",
  "content": "U1_2"
}
#
curl -H "Content-Type: application/json" -d @rep_mydoc_u1_2.json -X PUT http://arwen:5984/mvcc/mydoc
# response
{
  "ok":true,
  "id":"mydoc",
  "rev":"2-2686fb85c0681a3d8c411617f048f94f"
}
#
# Frodo does the same on frodo
# rep_mydoc_u2_2.json:
{
  "_rev":"1-3557461c60a30b0d156f8b36a1bdcf9f",
  "content": "U2_2"
} 
#
curl -H "Content-Type: application/json" -d @rep_mydoc_u2_2.json -X PUT http://frodo:5984/mvcc/mydoc
# response
{
  "ok":true,
  "id":"mydoc",
  "rev":"2-03b64efa2cd6619f46bcbe618fa791f9"
}
#
Each Pi now holds a different version of the document.
Frodo initiates a full sync by first triggering a push replication to Arwen, followed by a pull replication from Arwen. As Frodo now takes the lead, both requests will be submitted into the _replicator DB of his Pi:
# Frodo pushes his stuff to Arwen
# push request push_f2a_01.json:
{
  "source": "mvcc", 
  "target": "http://arwen:5984/mvcc"
}
#
curl -H "Content-Type: application/json" -d @push_f2a_01.json -X PUT http://frodo:5984/_replicator/f2a01
#response
{
  "ok":true,
  "id":"f2a01",
  "rev":"1-d20099b5d5b65eb05271be0204d8100a"
}
#
# Next Frodo pulls from Arwen
# pull request pull_a2f_01.json:
{
  "source": "http://arwen:5984/mvcc", 
  "target": "mvcc"
}
#
curl -H "Content-Type: application/json" -d @pull_a2f_01.json -X PUT http://frodo:5984/_replicator/a2f01
#response
{
  "ok":true,
  "id":"a2f01",
  "rev":"1-26926753f759498b86ece4e48fdb0e5f"
}
#
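Push and pull use the same kind of document; the only difference is which side is given as a plain database name (local to the couch that runs the replication) and which side is a full URL. The following sketch makes that visible. `replication_doc` is a hypothetical helper, and it assumes that database names and URLs contain no quotes or backslashes, since it does no JSON escaping.

```shell
# Sketch only: replication_doc is a hypothetical helper, not a CouchDB command.
# It prints the JSON body of a replication document for the given source and target.
replication_doc() {
  printf '{"source": "%s", "target": "%s"}\n' "$1" "$2"
}

# Push, seen from frodo: local source, remote target.
replication_doc "mvcc" "http://arwen:5984/mvcc"
# Pull, seen from frodo: remote source, local target.
replication_doc "http://arwen:5984/mvcc" "mvcc"

# To follow the workflow of this post, redirect the output into a file
# (for example push_f2a_01.json) and submit it with curl -d @file as shown above.
```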
What would be our expectation after syncing both Pis?
The same document was edited on different hosts. After the new versions had been submitted, each host then held the old and a new version of the document. Both hosts may claim to hold the current version of the document with equal rights.
After a full sync, we expect this:

  • there is identical data on both hosts
  • each host holds the old version and both "new" versions of the document
So, let's see.
We're going to check by requesting the current version together with any conflicting versions (conflicts=true).
Let's check on Arwen first:
#
# there should be a conflict on arwen now... 
curl "http://arwen:5984/mvcc/mydoc?conflicts=true"
#response
{
  "_id":"mydoc",
  "_rev":"2-2686fb85c0681a3d8c411617f048f94f",
  "content":"U1_2",
  "_conflicts":["2-03b64efa2cd6619f46bcbe618fa791f9"]
}
#
The current version is the one that Arwen herself submitted.
As expected, there is a conflict.
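When you check more than a handful of documents, reading the `_conflicts` array by eye gets tedious. Here is a minimal sketch that prints the conflicting revisions one per line. `list_conflicts` is a hypothetical helper, and it assumes that the `_conflicts` array sits on a single line without spaces, as in the response above.

```shell
# Sketch only: list_conflicts is a hypothetical helper, not part of CouchDB or curl.
# It reads a document (requested with ?conflicts=true) on stdin and prints
# each conflicting revision on its own line. Nothing is printed if the
# document has no "_conflicts" field.
list_conflicts() {
  sed -n 's/.*"_conflicts":\[\([^]]*\)].*/\1/p' | tr -d '"' | tr ',' '\n'
}

# Intended use against a live node (not run here):
#   curl -s "http://arwen:5984/mvcc/mydoc?conflicts=true" | list_conflicts
```

Each revision printed this way can be fetched on its own by adding `?rev=<revision>` to the document URL.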

What does it look like on Frodo's Pi?
#
# there should be a conflict on frodo too... 
curl "http://frodo:5984/mvcc/mydoc?conflicts=true"
#response
{
  "_id":"mydoc",
  "_rev":"2-2686fb85c0681a3d8c411617f048f94f",
  "content":"U1_2",
  "_conflicts":["2-03b64efa2cd6619f46bcbe618fa791f9"]
}
#
On Frodo's Pi we find the same situation. Arwen's document is delivered as current. Frodo's version constitutes the conflict.
The couch keeps its promise to deliver the same "winning" version on both nodes.
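Why Arwen's version and not Frodo's? As I understand the CouchDB documentation, the winner is picked deterministically: deleted revisions lose against live ones, then the revision with the longest history wins, and a tie is broken by comparing the `_rev` strings in ASCII order, with the highest one winning. Both candidates here are second-generation revisions, and `2-2686…` sorts after `2-03b6…`. The sketch below mimics that rule under simplifying assumptions: `pick_winner` is a hypothetical helper, it ignores deleted revisions, and it uses the number before the dash as a stand-in for the history length.

```shell
# Sketch only: pick_winner is a hypothetical helper, not CouchDB's actual code.
# Arguments: the non-deleted leaf revisions of one document.
# Sorts by generation (numeric part before the dash), then by the hash in
# byte order, and prints the last entry, which is the "winning" revision.
pick_winner() {
  printf '%s\n' "$@" | LC_ALL=C sort -t- -k1,1n -k2,2 | tail -n 1
}

pick_winner "2-03b64efa2cd6619f46bcbe618fa791f9" "2-2686fb85c0681a3d8c411617f048f94f"
# prints 2-2686fb85c0681a3d8c411617f048f94f
```

Because every node applies the same rule to the same set of revisions, both Pis arrive at the same winner without talking to each other.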

Summary

As far as conflicts are concerned, working in a distributed setup changes the rules completely.
On a single node, the couch is quite strict about avoiding conflicts. You need a bulk update with a special mode switched on to create one.
Once you decide to go distributed, the priorities change. When replicating between nodes, pushing or pulling your data successfully becomes the main objective. The goal is to save data over a network. As the nodes operate completely independently of one another, conflicts cannot be avoided.

Well, if you need to go distributed and want your nodes to be independent, this is the price you have to pay. As economics teaches us: there is no such thing as a free lunch. This seems to hold true for the computer scientist's menu too.



