Distributed Datastores

Distributed NoSQL key-value stores
(e.g. MongoDB, Amazon's DynamoDB, Apache Cassandra, which was originally developed at Facebook)

CAP Theorem

In a distributed system you can guarantee at most 2 of the following 3 properties:

  1. Consistency: every read sees the most recent write
  2. Availability: every request receives a (non-error) response
  3. Partition tolerance: the system keeps operating even when the network splits nodes into groups that cannot communicate

Since network partitions cannot be ruled out in practice, we usually keep Partition Tolerance and trade off Consistency against Availability.

Cassandra

A distributed key-value store intended to run in (and across) datacenters. It was first designed at Facebook and is now an Apache project.

Data Partitioning

Cassandra uses a ring-based Distributed Hash Table (DHT), but without finger tables or multi-hop routing: every node knows the full key-to-server mapping, so any node can route a request in one hop.

![[Pasted image 20230424144308.png|400]]

Partitioner

The partitioner is the component responsible for the key-to-server mapping, i.e. it determines the primary replica for each key. There are two types:

  1. Chord-like hash partitioning: hashes the key onto the ring (e.g. Cassandra's RandomPartitioner or Murmur3Partitioner); see the sketch after this list
  2. Byte-ordered partitioner: assigns contiguous ranges of keys to servers
    • Keeps keys in sorted order, which makes range queries easier (but can create hot spots)
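
To make the hash-partitioning idea concrete, here is a minimal Python sketch (not Cassandra's implementation; the HashRing class and node names are made up): each server owns a token on the ring, a key is hashed onto the same ring, and the primary replica is the first server token at or after the key's hash, wrapping around.

```python
import hashlib
from bisect import bisect_left

class HashRing:
    """Toy ring partitioner: a key maps to the first server token >= hash(key)."""

    def __init__(self, servers, ring_size=2**32):
        self.ring_size = ring_size
        # Each server owns one token (its position on the ring), derived from its name.
        self.tokens = sorted((self._hash(name), name) for name in servers)
        self.positions = [token for token, _ in self.tokens]

    def _hash(self, value):
        digest = hashlib.md5(value.encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.ring_size

    def primary_replica(self, key):
        # Walk clockwise from hash(key) to the next server token, wrapping around.
        idx = bisect_left(self.positions, self._hash(key)) % len(self.tokens)
        return self.tokens[idx][1]

ring = HashRing(["nodeA", "nodeB", "nodeC", "nodeD"])
print(ring.primary_replica("user:42"))   # deterministic for this key, e.g. one of the four nodes
```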

Replication Policies

Two options for replication strategy:

  1. SimpleStrategy:
    • First replica placed on the node chosen by the partitioner
    • Remaining replicas placed on the next nodes going clockwise around the ring from the primary replica (see the sketch after this list)
  2. NetworkTopologyStrategy (for multi-datacenter deployments):
    • Typically two or three replicas per data center
    • Per data center:
      • First replica placed according to the partitioner
      • Each further replica placed by going clockwise around the ring until a node on a different rack is reached
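
A minimal sketch of SimpleStrategy-style placement (illustrative only; tokens and node names are made up): start at the position chosen by the partitioner and take the next distinct nodes clockwise.

```python
from bisect import bisect_left

def place_replicas(key_token, ring_tokens, replication_factor):
    """SimpleStrategy-style placement (sketch): first replica on the node owning the
    key's position, remaining replicas on the next distinct nodes clockwise."""
    ring_tokens = sorted(ring_tokens)                 # list of (token, node) pairs
    positions = [token for token, _ in ring_tokens]
    start = bisect_left(positions, key_token) % len(ring_tokens)
    replicas = []
    for step in range(len(ring_tokens)):              # at most one full trip around the ring
        if len(replicas) == replication_factor:
            break
        node = ring_tokens[(start + step) % len(ring_tokens)][1]
        if node not in replicas:                      # skip extra tokens of nodes already chosen
            replicas.append(node)
    return replicas

ring = [(10, "nodeA"), (40, "nodeB"), (70, "nodeC"), (95, "nodeD")]
print(place_replicas(55, ring, replication_factor=3))   # ['nodeC', 'nodeD', 'nodeA']
```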

Operations

Writes

Hinted Handoff
If a replica is down when a write arrives, the coordinator stores the write locally as a hint and replays it to that replica once it comes back up. This ensures that writes are eventually applied at all replicas.
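
A rough sketch of the idea (the Coordinator class and its fields are invented for illustration): writes to a down replica are buffered as hints and replayed when the replica is reported up again.

```python
from collections import defaultdict

class Coordinator:
    """Toy coordinator that buffers 'hints' for replicas that are down at write time."""

    def __init__(self, replicas):
        self.replicas = replicas                 # {node_name: dict acting as its local store}
        self.up = {name: True for name in replicas}
        self.hints = defaultdict(list)           # node_name -> [(key, value), ...]

    def write(self, key, value):
        for node, store in self.replicas.items():
            if self.up[node]:
                store[key] = value               # normal replica write
            else:
                self.hints[node].append((key, value))   # buffer a hint instead

    def node_recovered(self, node):
        # Replay buffered hints when the replica comes back, so it catches up.
        self.up[node] = True
        for key, value in self.hints.pop(node, []):
            self.replicas[node][key] = value

coord = Coordinator({"r1": {}, "r2": {}, "r3": {}})
coord.up["r3"] = False              # simulate a failed replica
coord.write("k", "v1")              # r3 misses the write; a hint is stored
coord.node_recovered("r3")          # hint replayed; all replicas now have k=v1
print(coord.replicas["r3"])         # {'k': 'v1'}
```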

Writes at Replica Node
When a replica node receives a write request:

  1. Append it to the on-disk commit log (for failure recovery)
  2. Make changes to appropriate memtables
    • Memtable = in-memory representation of multiple key-value pairs
    • Cache that can be searched by key
    • Write-back cache as opposed to write-through
  3. Later, when memtable is full or old, flush to disk. The disk stores:
    • Data file: an SSTable (Sorted String Table) - list of key-value pairs sorted by key
    • Index file: An SSTable of (key - position in data sstable) pairs
    • A #Bloom Filter over the SSTable's keys, so reads can quickly skip SSTables that cannot contain a key
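
A heavily simplified sketch of this write path (the in-memory lists stand in for real on-disk files; names are made up): writes are appended to the commit log, applied to the memtable, and flushed as a sorted table with an index when the memtable fills up.

```python
class Replica:
    """Toy write path: commit log -> memtable -> flush to an in-memory 'SSTable'."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []          # stands in for the on-disk commit log
        self.memtable = {}            # write-back cache of recent key-value pairs
        self.sstables = []            # each flush produces one immutable sorted table
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))    # 1. log for failure recovery
        self.memtable[key] = value              # 2. update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                        # 3. flush when the memtable is full

    def flush(self):
        data = sorted(self.memtable.items())                      # SSTable: sorted by key
        index = {key: pos for pos, (key, _) in enumerate(data)}   # key -> position in data
        self.sstables.append({"data": data, "index": index})
        self.memtable.clear()

r = Replica()
for i in range(4):
    r.write(f"key{i}", i)
print(r.sstables[0]["data"])   # [('key0', 0), ('key1', 1), ('key2', 2)]
print(r.memtable)              # {'key3': 3} still in memory
```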

Bloom Filter

A Bloom filter is a compact, probabilistic set-membership structure based on hashing. It can return false positives (claiming a key might be present when it is not) but never false negatives, and it uses far less space than storing the keys themselves.

How it works:

  1. Keep a bit array of m bits (initially all 0) and k independent hash functions
  2. Insert(key): set the bits at positions h1(key), ..., hk(key)
  3. Lookup(key): if all k of those bits are set, the key is possibly present; if any bit is 0, the key is definitely not present
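
A minimal Python sketch of the above (m and k are chosen arbitrarily here; real implementations size them from the expected number of keys and the target false-positive rate):

```python
import hashlib

class BloomFilter:
    """m-bit array with k hash functions; false positives possible, false negatives not."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, key):
        # Derive k positions by hashing the key with k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # If any bit is 0, the key was definitely never added.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))   # True
print(bf.might_contain("bob"))     # almost certainly False (small false-positive chance)
```

In Cassandra, the per-SSTable Bloom filter lets a read skip SSTables that definitely do not contain the requested key, saving disk seeks.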

Compaction

Over time a node accumulates many SSTables covering the same keys. Compaction periodically merge-sorts SSTables together: for each key only the newest value (by timestamp) is kept, data shadowed by tombstones is dropped, and a new SSTable (with its index and Bloom filter) is written out.
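
A sketch of the merge step (the (key, timestamp, value) tuple format and the TOMBSTONE sentinel are illustrative; tombstones are explained under Deletes below):

```python
TOMBSTONE = object()   # sentinel marking a deleted key (see Deletes below)

def compact(sstables):
    """Merge SSTables (each a list of (key, timestamp, value) tuples sorted by key),
    keeping only the newest version of every key and dropping tombstoned keys."""
    newest = {}
    for table in sstables:
        for key, ts, value in table:
            if key not in newest or ts > newest[key][0]:
                newest[key] = (ts, value)
    return [(key, ts, value)
            for key, (ts, value) in sorted(newest.items())
            if value is not TOMBSTONE]

old = [("a", 1, "x"), ("b", 1, "y")]
new = [("a", 2, "x2"), ("b", 3, TOMBSTONE), ("c", 2, "z")]
print(compact([old, new]))   # [('a', 2, 'x2'), ('c', 2, 'z')]
```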

Deletes

Deletes don't remove the data right away. Instead, the delete is written like a normal update whose value is a special marker called a tombstone; the tombstoned data is only physically removed later, during compaction.
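
A tiny illustration of the idea (timestamps and the memtable format are simplified):

```python
import time

TOMBSTONE = object()   # sentinel marking a deleted key

def delete(memtable, key):
    # A delete is just a timestamped write of the tombstone marker;
    # the old data is physically removed only when compaction runs later.
    memtable[key] = (time.time(), TOMBSTONE)

memtable = {"user:42": (1.0, {"name": "Ada"})}
delete(memtable, "user:42")
ts, value = memtable["user:42"]
print("deleted" if value is TOMBSTONE else value)   # deleted
```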

Reads

Coordinator contacts X replicas (X is determined by the read's consistency level); once X replies arrive, it returns the value with the latest timestamp to the client.

Coordinator also fetches the value from the remaining replicas in the background; if any of them turns out to be stale, it issues a read repair to bring them up to date.
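
A sketch of this coordinator-side logic (replica stores are modelled as plain dicts of key -> (timestamp, value); waiting for the first X replies is simulated by simply taking X of them):

```python
def quorum_read(replicas, key, x):
    """Take the first x answers, return the latest-timestamped value,
    then read-repair any replica that is behind."""
    contacted = list(replicas.items())[:x]
    answers = [store[key] for _, store in contacted if key in store]
    latest_ts, latest_value = max(answers)        # latest timestamp wins
    for node, store in replicas.items():          # background read repair
        current = store.get(key)
        if current is None or current[0] < latest_ts:
            store[key] = (latest_ts, latest_value)
    return latest_value

replicas = {
    "r1": {"k": (2, "new")},
    "r2": {"k": (1, "old")},    # stale replica
    "r3": {"k": (2, "new")},
}
print(quorum_read(replicas, "k", x=2))   # 'new'
print(replicas["r2"]["k"])               # (2, 'new') after read repair
```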

At a replica
A read checks the memtable first, then the on-disk SSTables (consulting each SSTable's Bloom filter and index file to avoid needless disk reads) and merges what it finds by timestamp. Because a row may be spread over several SSTables, reads tend to be slower than writes.

Cross-DC Coordination

Membership

Cluster Membership

![[Pasted image 20230508083326.png|380]]

Consistency

Consistency Spectrum

Cassandra offers Eventual Consistency: if writes to a key stop, all replicas of that key eventually converge to the same (latest) value.

![[Pasted image 20230508084122.png|350]]

Consistency Level

The client can choose a consistency level per operation (read or write), e.g. ONE, QUORUM, or ALL (plus ANY for writes), trading latency against consistency.
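
For example, with the DataStax Python driver (cassandra-driver) the level can be set per statement; the contact point, keyspace, and table below are made-up placeholders:

```python
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # illustrative contact point
session = cluster.connect("demo_ks")      # hypothetical keyspace

# Write that waits for only one replica: fast, but weaker consistency.
insert = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(insert, (42, "Ada"))

# Read that waits for a quorum of replicas: slower, stronger consistency.
select = SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
print(session.execute(select, (42,)).one().name)
```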

Quorum Details

Read Quorums
A read is acknowledged to the client once R of the N replicas have replied; the coordinator returns the latest-timestamped value among those R.

Write Quorums
A write is acknowledged to the client once W of the N replicas have confirmed it.

Necessary Conditions for Consistency

  1. W + R > N
    • Every read quorum intersects every write quorum at some replica, so a read always sees the latest acknowledged write.
  2. W > N/2
    • Any two write quorums intersect at some replica, so two conflicting writes on the same data item cannot both complete without at least one replica seeing both (they can be detected and ordered).

The values of N, W, and R are selected based on the application: a smaller R gives faster reads, a smaller W gives faster writes, and larger values give stronger guarantees, subject to the conditions above.
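
A quick check of the two conditions for a few typical configurations with N = 3 replicas:

```python
def consistent(n, w, r):
    # Both conditions from above: reads see the latest acknowledged write,
    # and any two write quorums overlap.
    return (w + r > n) and (w > n / 2)

print(consistent(n=3, w=2, r=2))   # True  -> balanced read/write quorums
print(consistent(n=3, w=3, r=1))   # True  -> fast reads, writes must reach all replicas
print(consistent(n=3, w=1, r=1))   # False -> W+R <= N, a read can miss the latest write
```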

Eventual Consistency

Sources of inconsistency: a write may reach only a subset of the replicas, e.g. when some replica nodes are down or unreachable and the chosen write consistency level is below ALL.

Hinted handoff and read repair help close these gaps and achieve eventual consistency.