Hello, welcome back to another edition of Source Network Concept Explainers. This time, we’ll be going deep into another critical element of the Source Network — and DefraDB stack — which increases the potential of content-identifying IPLD data models through the added benefit of immutability.
Meet CRDTs, which stand for Conflict-Free Replicated Data Types, and offer developers the advantages of safety, predictability, collaboration, and the maintenance of historical versions of data structures.
By the end of this article, you should understand the basics of CRDTs, how they work with two variations of Merkle trees (a structure we covered in our last explainer on content-addressable data), namely Merkle Clocks and Merkle CRDTs, and how we use CRDTs in DefraDB, our decentralized NoSQL database.
What is Immutable, Conflict-Free Data?
First, what is immutable data, and what does it have to do with avoiding conflicts in data types? To answer that, let's start with traditional data conflicts. These are not cases of data disagreeing with itself, but rather the errors or inconsistencies that can arise when multiple users, processes, or systems attempt to modify the same data simultaneously. To help envision this, let's look at Google Workspace, a collection of applications that save your work, communications, and content in a cloud-based storage system (Google Drive).
Google Docs is a collaborative editing application that enables real-time, synchronous editing of documents between multiple accounts. Google Docs users participate in a distributed system, collaborating across different locations and computers via an internet connection. Users can also edit offline and have their changes reflected once they're back online, regardless of who else is in the doc editing. This process is conflict resolution in a (simplified) nutshell.
Google Docs enables immediate responses or changes from connected clients via their local document version. This feature allows for real-time collaboration and editing of a Google doc down to a single word. Google does this by storing documents as a sequence of operations via something called Operational Transformation, an algorithm that, among other functions, enables conflict resolution, but depends on the existence of a centralized coordination server run by Google.
However, collaboration is only one half of the system, the other being historical versions. This feature pertains to immutable data structures, which, once created, cannot be modified but are still made accessible. For Google Doc users, immutability allows you to return to a previous edit version or undo-redo a change, where all the historical data (i.e., your content) is maintained.
In web3 applications, immutable data structures and conflict resolution are key components of managing data efficiently: existing data is never changed, and each change creates a new version instead. This makes it possible for data to be managed across multiple locations by interconnected users, nodes, and devices that can interact with the same data consistently and predictably, all while maintaining the integrity of that data.
While centralized entities like Google leverage Operational Transformation for conflict resolution, we've chosen to leverage Merkle CRDTs. They allow developers or users querying data to work on, share, and collaborate with our database whether they're offline, working in real time, or syncing after a delay. All of this is possible without overwriting anyone else's data, in service of the ultimate goal: being able to merge states with zero conflicts.
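To make "creating new versions with each change" concrete, here is a minimal Python sketch. The `Version` class and the `edit` and `history` helpers are hypothetical names invented for this illustration; they are not part of DefraDB or any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    """One immutable snapshot of a document (hypothetical illustration type)."""
    content: str
    parent: Optional["Version"] = None  # the version this one was derived from


def edit(current: Version, new_content: str) -> Version:
    """Return a new version rather than modifying the existing one."""
    return Version(content=new_content, parent=current)


def history(head: Version) -> list:
    """Walk back from the newest version to the oldest."""
    contents = []
    node = head
    while node is not None:
        contents.append(node.content)
        node = node.parent
    return contents


v1 = Version("Hello")
v2 = edit(v1, "Hello, world")
v3 = edit(v2, "Hello, world!")

# Every earlier version is still reachable and unchanged.
print(history(v3))
```

Because `Version` is frozen, an attempt to assign to `v1.content` raises an error, so the only way to "change" the document is to add a new version on top of the old one.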
What are CRDTs?
Conflict-Free Replicated Data Types resolve conflicts between multiple concurrent updates. Using a deterministic algorithm, CRDTs allow multiple replicas of the same data to be updated independently and merged back together later. As with Google Docs, this means the same data can be replicated across a global network of nodes or computers. Like Merkle DAGs, CRDTs let peers in a global peer-to-peer network freely make changes to that data locally and then synchronize it at a later date.
The key aspect of CRDTs is their deterministic nature. In computer science, a deterministic algorithm is one that, given a particular input, always produces the same output. For CRDTs, this means that two separate, potentially conflicting changes will always merge to the same state, regardless of the order in which they arrive. That is a key characteristic for a globally replicated database like DefraDB.
A graphical representation of a CRDT producing a deterministic state from out of order edits.
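As a rough illustration of that determinism, the sketch below implements a grow-only counter, one of the simplest textbook CRDTs. It is not taken from DefraDB; the function names and the replica IDs `"alice"` and `"bob"` are assumptions made for the example.

```python
def increment(state: dict, replica_id: str, amount: int = 1) -> dict:
    """Return a new counter state with this replica's own slot increased."""
    new_state = dict(state)
    new_state[replica_id] = new_state.get(replica_id, 0) + amount
    return new_state


def merge(a: dict, b: dict) -> dict:
    """Merge two states by taking the per-replica maximum."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def value(state: dict) -> int:
    """The counter's value is the sum of every replica's slot."""
    return sum(state.values())


# Two replicas update independently, for example while offline.
alice = increment(increment({}, "alice"), "alice")  # alice counted twice
bob = increment({}, "bob", 3)                       # bob counted three times

# Merging in either order yields the same state.
merged_ab = merge(alice, bob)
merged_ba = merge(bob, alice)
assert merged_ab == merged_ba
assert value(merged_ab) == 5

# Re-applying an update that was already seen changes nothing.
assert merge(merged_ab, alice) == merged_ab
```

The merge is commutative, associative, and idempotent, which is what lets replicas exchange updates in any order, any number of times, and still converge.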
Why are CRDTs crucial to DefraDB, and how is this different from conflict resolution in traditional, centralized databases?
CRDTs stand in stark contrast to Operational Transformation, which requires strict central control over users and how they update data, and does not return ownership to the user.
At Source, we believe users should always be free to replicate their data to their own machines and make whatever updates they wish while collaborating with others, without the need for centralization and therefore free from the shackles of imprisoned data.
Merkle Clocks & CRDTs
CRDTs can be used to build the data types found in traditional databases, each with its own use case and semantics, which can be freely edited, updated, or deleted like any other data type while preserving history in a globally replicated network. Thus, DefraDB’s core data storage engine uses CRDTs to store every data type inside a database.
However, we also need to use our content addressability and Merkle DAG system, which is why we use a unique class of CRDTs called Merkle CRDTs that are designed to work within the environment of a Merkle DAG, and that differ from traditional CRDTs by how data is changed, tracked, and versioned.
Merkle DAGs solve some of the limitations of event ordering and trust found in traditional CRDTs, where you must track when a change occurs so you know how it needs to be merged later. The most common tracking methods are Vector Clocks and Logical Clocks. While these data structures allow events to be tracked and ordered inside a distributed system, they can be error-prone in certain environments, such as high-churn, globally available networks. However, these kinds of environments are exactly what's needed for user-centric peer-to-peer collaboration.
Merkle DAGs (and DAGs overall, where edges between nodes run in one direction only, which enforces the acyclic property) provide an excellent mechanism for ordering events: the DAG is extended element by element with each change or update made to a CRDT.
In this system, each change references the current known head of the chain of events that preceded it and, as a result, is encoded as a node in a Merkle DAG. Since each node is content-addressed, its identifier depends on the events it references, so the flow of events can only go in one direction: a new event can never precede an older event. These event-ordering data structures are called Merkle Clocks.
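For readers unfamiliar with vector clocks, here is a minimal textbook-style sketch of how they order events. The function names and replica IDs are assumptions for illustration only. Note that each clock carries one entry per replica that has ever written, which hints at why networks with many short-lived peers are awkward for this approach.

```python
def tick(clock: dict, replica: str) -> dict:
    """Record a local event by advancing this replica's own entry."""
    new_clock = dict(clock)
    new_clock[replica] = new_clock.get(replica, 0) + 1
    return new_clock


def merge_clocks(a: dict, b: dict) -> dict:
    """Combine knowledge from two clocks (per-replica maximum)."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def happened_before(a: dict, b: dict) -> bool:
    """True if the event stamped `a` causally precedes the event stamped `b`."""
    keys = a.keys() | b.keys()
    no_greater = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    some_smaller = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    return no_greater and some_smaller


def concurrent(a: dict, b: dict) -> bool:
    """True if neither event could have known about the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


alice_edit = tick({}, "alice")  # {"alice": 1}
bob_edit = tick({}, "bob")      # {"bob": 1}

# Alice receives Bob's edit, then makes another edit of her own.
alice_after_sync = tick(merge_clocks(alice_edit, bob_edit), "alice")

assert concurrent(alice_edit, bob_edit)
assert happened_before(bob_edit, alice_after_sync)
```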
A graphical representation of a Merkle Clock head and its previous versions.
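The idea can be sketched in a few lines of Python. This is a simplified illustration, not DefraDB's implementation: a plain SHA-256 hex digest stands in for a real content identifier, and `make_node` and `MerkleClock` are hypothetical names.

```python
import hashlib
import json


def make_node(payload: str, parents: list) -> dict:
    """Build an event node whose ID is the hash of its payload and parent IDs."""
    body = {"payload": payload, "parents": sorted(parents)}
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return {"id": hashlib.sha256(encoded).hexdigest(), **body}


class MerkleClock:
    """Hypothetical minimal clock: a DAG of events plus its current heads."""

    def __init__(self):
        self.nodes = {}     # node ID -> node
        self.heads = set()  # IDs of events that nothing references yet

    def add_event(self, payload: str) -> dict:
        """Append an event that references the currently known heads."""
        node = make_node(payload, list(self.heads))
        self.nodes[node["id"]] = node
        self.heads = {node["id"]}
        return node

    def ancestors(self, node_id: str) -> set:
        """Every event that causally precedes the given event."""
        seen = set()
        stack = list(self.nodes[node_id]["parents"])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.nodes[current]["parents"])
        return seen


clock = MerkleClock()
first = clock.add_event("create document")
second = clock.add_event("update title")
third = clock.add_event("update body")

# Older events are ancestors of newer ones, never the other way around.
assert first["id"] in clock.ancestors(third["id"])
assert third["id"] not in clock.ancestors(first["id"])
```

Because a node's ID is derived from its parents' IDs, an event cannot reference something that did not exist when it was created, and altering an old event would change the ID of every event after it.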
While they don’t necessarily help you tell time, Merkle Clocks are highly useful as a causal event-ordering data structure, much like a sequential linked list. By embedding traditional CRDTs within Merkle Clock nodes as payloads, we get Merkle CRDTs, which fulfill the CRDT requirements of tracked changes and event ordering.
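To show how a payload and a clock fit together, the sketch below stores register writes inside content-addressed nodes and resolves the current value from the DAG. It is an illustrative assumption, not DefraDB's actual algorithm: the `Replica` class, the use of DAG height as a priority, and the tie-break on node ID are all choices made for this example.

```python
import hashlib
import json


def make_node(value: str, parents: set, nodes: dict) -> dict:
    """Create a content-addressed node carrying a register write as payload."""
    # Height is one more than the tallest parent, giving a causal priority.
    height = 1 + max((nodes[p]["height"] for p in parents), default=0)
    body = {"value": value, "parents": sorted(parents), "height": height}
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return {"id": hashlib.sha256(encoded).hexdigest(), **body}


class Replica:
    """Hypothetical replica holding one register as a Merkle DAG of writes."""

    def __init__(self):
        self.nodes = {}
        self.heads = set()

    def write(self, value: str) -> None:
        node = make_node(value, self.heads, self.nodes)
        self.nodes[node["id"]] = node
        self.heads = {node["id"]}

    def merge(self, other: "Replica") -> None:
        """Union both DAGs; heads are the nodes no other node references."""
        self.nodes.update(other.nodes)
        referenced = {p for n in self.nodes.values() for p in n["parents"]}
        self.heads = set(self.nodes) - referenced

    def value(self):
        """Pick the write with the greatest height, breaking ties by node ID."""
        if not self.nodes:
            return None
        winner = max(self.nodes.values(), key=lambda n: (n["height"], n["id"]))
        return winner["value"]


a, b = Replica(), Replica()
a.write("draft")
b.merge(a)                  # both replicas now share the first write

a.write("alice's edit")     # concurrent branches from here on
b.write("bob's edit")
b.write("bob's second edit")

a.merge(b)
b.merge(a)
assert a.value() == b.value()  # both replicas converge on the same value
assert a.heads == b.heads      # and agree on the two open branches
```

After the merge, the next write on either replica references both heads, which joins the branches back into a single chain of events.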
How DefraDB Leverages Merkle CRDTs
So, how does DefraDB use Merkle CRDTs to provide the most efficient and secure data management for applications? CRDTs allow us to stay true to our technology's most important values and benefits: data ownership, decentralization, interoperability, consistency, trustless data management, seamless flow between devices, and the ability to deploy databases in any environment.
For DefraDB specifically, CRDTs enable seamless conflict resolution and data replication, ensuring countless users can independently update and merge data consistently. DefraDB does this while benefiting from the power of a decentralized database, where users have full ownership and control of their data, where it resides, and how it evolves.
If you’re ready to connect DefraDB to your application to manage your application’s data — and scale your product — contact us on our website.
Explore our GitHub and developer portal for additional documentation on our technology.
Thanks for reading! — Source Network