Distributed Datastores
Distributed NoSQL key-value stores
(e.g. MongoDB, DynamoDB by Amazon, Cassandra by Facebook)
CAP Theorem
In a distributed system you can only guarantee at most 2 out of the following 3 properties:
- Consistency: reads return latest written value by any client (all nodes see same data at any time)
- Availability: every request received by a non-failing node in the system must result in a response (quickly)
- Partition Tolerance: the system continues to work in spite of network partitions
We usually want to keep Partition Tolerance, so the real trade-off is between Consistency and Availability
Cassandra
A distributed key-value store intended to run in a datacenter, first designed at Facebook and now an Apache project
Data Partitioning
Cassandra uses a ring-based Distributed Hash Table (DHT) but without finger or routing tables.

Partitioner
The Partitioner is the component that maps keys to servers and determines the primary replica for a key. There are two types:
- Chord-like hash partitioning: uses a hash function to place keys on the ring
- Byte-Ordered Partitioner: assigns contiguous ranges of keys to servers
  - Easier for range queries
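A minimal sketch of hash partitioning onto the ring, assuming a hypothetical three-server ring and MD5 hashing; Cassandra's actual partitioners (e.g., Murmur3Partitioner) differ in detail:

```python
import hashlib
from bisect import bisect_left

# Hypothetical ring: each server owns a token (a position on the ring), sorted by token.
RING = [(10, "server-A"), (40, "server-B"), (70, "server-C")]

def key_token(key: str, ring_size: int = 100) -> int:
    """Hash a key onto the ring [0, ring_size)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % ring_size

def primary_replica(key: str) -> str:
    """Primary replica = first server at or after the key's token, wrapping around."""
    t = key_token(key)
    tokens = [tok for tok, _ in RING]
    idx = bisect_left(tokens, t) % len(RING)
    return RING[idx][1]

print(primary_replica("user:42"))  # which server depends on the hash value
```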
Replication Policies
Two options for replication strategy:
- Simple Strategy:
  - First replica placed based on the partitioner
  - Remaining replicas placed clockwise from the primary replica
- Network Topology Strategy:
  - Typically two or three replicas per data center
  - Within each data center:
    - First replica placed according to the Partitioner
    - Then go clockwise around the ring until a server on a different rack is hit
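A rough sketch of both placement policies on a single ring, using hypothetical server and rack names; the real Network Topology Strategy handles multiple DCs and more replicas per DC:

```python
# Hypothetical ring of (token, server, rack) entries, sorted by token.
RING = [(10, "s1", "rack1"), (40, "s2", "rack1"), (70, "s3", "rack2"), (90, "s4", "rack2")]

def simple_strategy(primary_idx: int, rf: int) -> list:
    """SimpleStrategy-like placement: primary plus the next (rf - 1) servers clockwise."""
    n = len(RING)
    return [RING[(primary_idx + i) % n][1] for i in range(rf)]

def rack_aware(primary_idx: int) -> list:
    """NetworkTopologyStrategy-like idea (single DC): primary, then walk
    clockwise until a server on a *different* rack is found."""
    n = len(RING)
    _, primary_srv, primary_rack = RING[primary_idx]
    replicas = [primary_srv]
    for i in range(1, n):
        _, srv, rack = RING[(primary_idx + i) % n]
        if rack != primary_rack:
            replicas.append(srv)
            break
    return replicas

print(simple_strategy(0, rf=3))  # ['s1', 's2', 's3']
print(rack_aware(0))             # ['s1', 's3']  (s3 is on a different rack)
```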
Operations
Writes
- Need to be lock-free and fast
- Client sends write to one coordinator node in Cassandra cluster
- Coordinator may be per-key, or per-client, or per-query
- Coordinator uses Partitioner to send query to all replica nodes responsible for key
- When X replicas respond, the coordinator returns an acknowledgement to the client; X = any one, majority, or all (the consistency spectrum)
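A simplified, synchronous sketch of the coordinator's write path; `send_write` and the replica dictionaries are hypothetical stand-ins, and a real coordinator would send the writes asynchronously:

```python
def send_write(replica, key, value):
    """Pretend to send a write; returns True if the replica acked."""
    return replica.get("up", True)

def coordinator_write(replicas, key, value, required_acks):
    """Send the write to all replicas; acknowledge the client once
    `required_acks` replicas have responded (ONE, QUORUM, ALL, ...)."""
    acks = 0
    for r in replicas:
        if send_write(r, key, value):
            acks += 1
        if acks >= required_acks:
            return "ack to client"   # remaining replicas finish in the background
    return "write failed: consistency level not met"

replicas = [{"name": "s1", "up": True}, {"name": "s2", "up": True}, {"name": "s3", "up": False}]
print(coordinator_write(replicas, "k", "v", required_acks=2))   # quorum of 3 -> "ack to client"
```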
Hinted Handoff
To ensure that writes are eventually written to all replicas
- If any replica is down, the coordinator writes to all other replicas and keeps the write locally until the down replica comes back up
- When all replicas are down, the Coordinator (front end) buffers writes (for up to a few hours)
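A toy sketch of hinted handoff, with hypothetical in-memory structures standing in for the coordinator's hint store:

```python
hints = []   # writes buffered at the coordinator: (replica_name, key, value)

def write_with_hints(replicas, key, value):
    """Write to the live replicas; store a hint for each replica that is down."""
    for r in replicas:
        if r["up"]:
            r.setdefault("data", {})[key] = value
        else:
            hints.append((r["name"], key, value))   # keep the write locally

def on_replica_recovered(replica):
    """Replay any buffered hints for a replica that just came back up."""
    global hints
    remaining = []
    for name, key, value in hints:
        if name == replica["name"]:
            replica.setdefault("data", {})[key] = value
        else:
            remaining.append((name, key, value))
    hints = remaining
```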
Writes at Replica Node
When a replica node receives a write request:
- Log the write in the on-disk commit log (for failure recovery)
- Make changes to the appropriate memtables
  - Memtable = in-memory representation of multiple key-value pairs
  - Cache that can be searched by key
  - Write-back cache, as opposed to write-through
- Later, when a memtable is full or old, flush it to disk. The disk stores:
  - Data file: an SSTable (Sorted String Table) - a list of key-value pairs sorted by key
  - Index file: an SSTable of (key, position in data SSTable) pairs
  - and a #Bloom Filter (for efficient search)
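A sketch of the replica-side write path under these assumptions (in-memory lists stand in for the commit log and on-disk SSTables; the real on-disk format is more involved):

```python
commit_log = []          # append-only log on disk (simulated)
memtable = {}            # in-memory key -> value
sstables = []            # each flush produces one immutable SSTable

MEMTABLE_LIMIT = 4       # hypothetical flush threshold

def replica_write(key, value):
    commit_log.append((key, value))      # 1. log for failure recovery
    memtable[key] = value                # 2. write-back cache in memory
    if len(memtable) >= MEMTABLE_LIMIT:  # 3. flush when the memtable is full
        flush_memtable()

def flush_memtable():
    data = sorted(memtable.items())                      # SSTable: sorted by key
    index = {k: pos for pos, (k, _) in enumerate(data)}  # index file: key -> position
    sstables.append({"data": data, "index": index})
    memtable.clear()
```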
Bloom Filter
A Bloom Filter is a probabilistic data structure that is based on hashing.
- Compact way of representing a set of items
- Checking for existence in set is cheap
- Some probability of false positives: an item not in the set may be reported as being in the set
- No false negatives
How it works (k hash functions map each item to k positions in a bit array):
- On insert, set all k hashed bits to 1
- On check-if-present, return true only if all k hashed bits are 1
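A minimal Bloom filter sketch, deriving k bit positions by salting a single hash function (one common construction; Cassandra's implementation differs):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k bit positions by salting one hash function.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1             # on insert, set all hashed bits to 1

    def might_contain(self, item):
        # True if all hashed bits are 1: maybe present (possible false positive).
        # False guarantees the item was never added (no false negatives).
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))   # True
print(bf.might_contain("bob"))     # almost certainly False
```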
Compaction
- Data updates accumulate over time and over multiple SSTables
- Need to be compacted
- Compaction merges SSTables, consolidating the updates for each key and keeping only the latest value (see the sketch after the Deletes list below)
- Run periodically and locally at each server
Deletes
Deletes don't delete the data right away
- Writes a tombstone for the key
- Eventually, when #Compaction encounters tombstone it will delete item
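A sketch of compaction that merges two hypothetical SSTables (lists of (key, (timestamp, value)) pairs sorted by key), keeps the newest version of each key, and drops keys whose newest version is a tombstone:

```python
TOMBSTONE = object()   # stand-in for a delete marker

def compact(sstable_a, sstable_b):
    merged = {}
    for key, (ts, value) in list(sstable_a) + list(sstable_b):
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)               # keep the latest update per key
    # Drop entries whose latest update is a tombstone.
    live = {k: v for k, v in merged.items() if v[1] is not TOMBSTONE}
    return sorted(live.items())                     # output is again sorted by key

old = [("a", (1, "x")), ("b", (2, "y"))]
new = [("a", (5, "x2")), ("b", (6, TOMBSTONE))]
print(compact(old, new))   # [('a', (5, 'x2'))]  -- 'b' was deleted
```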
Reads
Coordinator contacts X replicas (X ≤ replication factor)
- Coordinator sends the read to the replicas that have responded quickest in the past
- When X replicas respond, the coordinator returns the latest-timestamped value from among those X
Coordinator also fetches the value from the remaining replicas
- Checks consistency in the background, initiating a Read Repair if any two values are different
- The mechanism seeks to eventually bring all replicas up to date
At a replica
- Read looks at Memtables first, and then SSTables
- A row may be split across multiple SSTables
- So reads may need to touch multiple SSTables, making reads slower than writes
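A sketch of the coordinator's read path under the assumptions above; the replica dictionaries, latency field, and synchronous read repair are simplifications of what a real coordinator does:

```python
def coordinator_read(replicas, key, x):
    # Prefer replicas that responded quickest in the past (lowest latency first).
    fastest = sorted(replicas, key=lambda r: r["avg_latency_ms"])[:x]
    responses = [(r, r["data"].get(key)) for r in fastest]   # values are (timestamp, value)
    latest = max((v for _, v in responses if v is not None),
                 key=lambda tv: tv[0], default=None)
    # Read repair: push the latest value to any stale replica (done in background in reality).
    if latest is not None:
        for r in replicas:
            if r["data"].get(key) != latest:
                r["data"][key] = latest
    return latest[1] if latest else None

replicas = [
    {"name": "s1", "avg_latency_ms": 2, "data": {"k": (5, "new")}},
    {"name": "s2", "avg_latency_ms": 9, "data": {"k": (3, "old")}},
]
print(coordinator_read(replicas, "k", x=1))   # "new"; s2 gets read-repaired
```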
Cross-DC Coordination
- Replicas may span multiple Data Centers.
- Per-DC coordinator elected to coordinate with other DCs.
- Election done via ZooKeeper, which runs a Paxos-like consensus protocol (Zab)
Membership
- Any server in the cluster could act as the coordinator for a request
- So every server needs to maintain a list of all the other servers that are currently in the cluster
- List needs to be updated automatically as servers join, leave, and fail
Cluster Membership
- Cassandra uses gossip-based cluster membership
- Nodes periodically gossip their membership list to a few other nodes
- On receipt, the local membership list is merged with the incoming one, keeping the fresher entry (e.g., higher heartbeat counter) for each member
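A toy sketch of the gossip exchange, assuming each membership entry is just a heartbeat counter; Cassandra's gossiper tracks more state per node:

```python
import random

def gossip_round(local_list, peers, fanout=2):
    """Send the local membership list to `fanout` random peers."""
    for peer in random.sample(peers, min(fanout, len(peers))):
        merge(peer["membership"], local_list)

def merge(local, incoming):
    """Keep, for each member, the entry with the larger heartbeat counter."""
    for member, heartbeat in incoming.items():
        if heartbeat > local.get(member, -1):
            local[member] = heartbeat

a = {"n1": 10, "n2": 3}
b = {"n1": 8, "n2": 7, "n3": 1}
merge(a, b)
print(a)   # {'n1': 10, 'n2': 7, 'n3': 1}
```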

Consistency
Consistency Spectrum
Cassandra offers Eventual Consistency: if writes to a key stop, all replicas of key will converge.

Consistency Level
Client is allowed to choose a consistency level for each operation (read/write)
- ANY: any server (may not be a replica of the key)
- Fastest: coordinator caches write and replies quickly to client
- ALL: all replicas
- Ensures strong consistency, but slowest
- ONE: at least one replica
- Faster than ALL
- QUORUM: quorum across all replicas in all data centers (DC)
- Faster than ALL, but still ensures strong consistency
- Any two quorums intersect
  - e.g., client 1 does a write to one quorum; client 2 then does a read from another quorum
  - At least one server lies in both quorums and returns the latest write

- EACH_QUORUM: quorum in every DC
- Lets each DC do its own quorum (not supported for reads)
- LOCAL_QUORUM: quorum in coordinator’s DC
- Faster: only waits for quorum in first DC client contacts
Quorum Details
Read Quorums
- Client specifies R (≤ N = total number of replicas of that key) = read consistency level
- Coordinator waits for R replicas to respond before sending the result to the client
- In the background, the coordinator checks the consistency of the remaining (N - R) replicas, and initiates read repair if needed
Write Quorums
- Client specifies W (≤ N) = write consistency level
- Client writes the new value to W replicas and returns when it hears back from all W
Necessary conditions for strong consistency
- W + R > N: a write quorum and a read quorum intersect at a replica, so a read returns the latest write
- 2W > N: any two write quorums intersect, so two conflicting writes on a data item cannot both succeed at the same time
Values of (W, R) selected based on application:
- (W=N, R=1): great for read-heavy workloads
- (W=1, R=N): great for write-heavy workloads with no conflicting writes
- (W=N/2+1, R=N/2+1): great for write-heavy workloads with potential for write conflicts
- (W=1, R=1): very few writes and reads / high availability requirement
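A tiny helper capturing the two conditions; N, W, R follow the definitions above:

```python
def strongly_consistent(n, w, r):
    """Check the two quorum conditions for strong consistency."""
    intersect_read_write = w + r > n   # every read quorum overlaps every write quorum
    no_conflicting_writes = 2 * w > n  # any two write quorums overlap
    return intersect_read_write and no_conflicting_writes

print(strongly_consistent(n=3, w=2, r=2))   # True  (majority quorums)
print(strongly_consistent(n=3, w=1, r=1))   # False (fast, but only eventually consistent)
```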
Eventual Consistency
Sources of inconsistency:
- Quorum condition not satisfied: W and R are chosen such that W + R ≤ N, or the write returns before W replicas respond
- Sloppy quorum: when value stored elsewhere if intended replica is down, and later moved to the intended replica when it is up again
- When local quorum is chosen instead of global quorum
Hinted-handoff and read repair help in achieving eventual consistency
- If all writes (to a key) stop, then all its values (replicas) will converge eventually
- May still return stale values to clients (e.g., if many back-to-back writes)
- But works well when there are periods of low writes – the system converges quickly