
How DefraDB Uses Content-Addressable Data to Make Data Retrieval More Efficient, Secure, and Dynamic

// November 06, 2023

Hello, welcome back to another edition of Source Network Concept Explainers. This time, we’ll be going deep into a critical element of the Source Network — and DefraDB stack — which further reimagines and improves developers' relationships to managing data: Content Addressable Data, with a nod to Merkle DAGs and IPLD Data Models.

What is Content-Addressable Data?

Content-addressable data, sometimes referred to as content-addressable storage (CAS) or content-addressable memory (CAM), is a data storage and retrieval technique that takes a different approach from traditional (and, more specifically, centralized) data storage and retrieval methods, which are primarily based on location addresses (i.e., file paths or memory addresses).

In the realm of content-addressable data, however, data is stored and retrieved based on its content, metadata, or other data-defining attributes rather than on where it is located on a server.

Alongside the growth of web3 technology, Content Addressable Data (CAD) and Merkle DAGs, which are used to compose Content Addressable Data into larger, well-defined objects, have been popularized by the advent of distributed file storage protocols that allow a global, peer-to-peer network of computers to store and serve files.

A graphical depiction of how a message is transformed into content-addressable data.

This is a drastically different approach from the fragmented, centralized way in which the traditional internet (i.e., web2) uses the location-based protocols HTTP and HTTPS, along with specific IP addresses, domain names, and URLs, to handle data storage, retrieval, and distribution. In that model, users access files and webpages by connecting directly to the (centralized) servers that host them.

A typical representation of how data is distributed and received in web2.

Decentralized storage protocols (often modular) instead use a network of open, permissionless, and interconnected nodes for organizing and transferring content-addressed data. In these decentralized environments, files become content-addressable by being broken down into smaller blocks, having a hash calculated for each block (a hash is the output of a cryptographic function that maps any input to a fixed-size fingerprint, such that changing the input changes the fingerprint), and being assembled into a Merkle DAG that composes CAD into larger, well-defined objects.
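
The block-and-hash step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny block size, the helper names, and the use of a plain hex SHA-256 digest as an identifier are all hypothetical stand-ins, not the chunking scheme or CID format that any particular protocol or DefraDB uses.

```python
import hashlib

# Illustrative block size only; real systems use far larger blocks.
BLOCK_SIZE = 4


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Break a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_id(block: bytes) -> str:
    """Identify a block by the SHA-256 digest of its content (a stand-in for a CID)."""
    return hashlib.sha256(block).hexdigest()


def file_root_id(data: bytes, block_size: int = BLOCK_SIZE) -> tuple[str, list[str]]:
    """Hash every block, then hash the ordered list of block ids to get a root id."""
    ids = [block_id(b) for b in split_into_blocks(data, block_size)]
    root = hashlib.sha256("\n".join(ids).encode("ascii")).hexdigest()
    return root, ids


# Changing one byte changes only the affected block's id, and with it the root id.
root_a, ids_a = file_root_id(b"hello world!")
root_b, ids_b = file_root_id(b"hello World!")
changed = [i for i, (a, b) in enumerate(zip(ids_a, ids_b)) if a != b]
```

Because unchanged blocks keep their identifiers, two versions of a file can share most of their blocks while still having distinct root identifiers.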

What are the benefits of using Content-Addressable Data?

CAD has numerous benefits that improve on the limitations of web2, most notably the inefficiency and insecurity of centralized servers and data storage and retrieval environments. Due to its ability to use content for addressing and identification, one of CAD’s many design advantages is its ability to circumvent dead links, a major drawback of web2 which is mainly due to the slightly archaic nature of URLs.

Since the inception of the internet, location-addressable data has been used through the function of domain names and URLs, which are designed to point to data accessible on the web by defining where exactly it is: its location. Location-based data gets its name because a URL points to the data you're looking for but, unfortunately, lacks any additional information or content on the data itself.

CAD flips the approach entirely, and its intended goal is to define a system in which you use what you are looking for (the content) as an addressable format, and the network tells you where it is.

Content Addressing has two major benefits:

  1. Self-verification: CAD is self-verifying because the identifier of what you are looking for is usually the hash function output of the data itself.
  2. Storage is separated from infrastructure: CAD decouples infrastructure from storage. This means the owner of a Medium.com server, for example, no longer has to be the only hosting provider for the data that lives on its many millions of blog pages (a bedrock of centralized servers and data storage).

A graphical representation of how data is retrieved and verified in web3.

How DefraDB Leverages Content-Addressable Data

The two major benefits mentioned above are essential to DefraDB because they allow any node on the peer-to-peer network to provide the data someone is looking for. CAD is what enables DefraDB to become a globally replicated data storage system instead of individual, siloed data storage systems for each application (bad for scale, security, speed, and much more).

Content-addressable data networks use Content Identifiers (CIDs; lots of acronyms, bear with us!) to specifically address data. As we said above re: self-verification, CIDs are defined by applying a hash function to the input data to create a single, collision-resistant identifier for any data being addressed. To locate the original data, the developer (or user querying data) requests it by CID, and the network returns the matching data. This data is fully self-verifiable because, upon receipt, the same hash function can be re-applied to the received data to regenerate the identifying hash. You can then compare the identifier you requested against the one you computed to make sure you received what was intended.
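
To make the request-then-verify flow concrete, here is a minimal in-memory sketch in Python. The `ContentStore` class and `content_id` helper are hypothetical names invented for this example, and a hex SHA-256 digest stands in for a real CID (actual CIDs carry extra information such as the hash algorithm and encoding). It is not DefraDB's API.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Derive an identifier from the content itself (simplified stand-in for a CID)."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """A toy content-addressed store: keys are derived from values, never chosen."""

    def __init__(self) -> None:
        self._blocks: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store data under its own hash and return that identifier."""
        cid = content_id(data)
        self._blocks[cid] = data
        return cid

    def get(self, cid: str) -> bytes:
        """Fetch data by identifier and re-hash it to verify what was received."""
        data = self._blocks[cid]
        if content_id(data) != cid:
            raise ValueError("content does not match the requested identifier")
        return data


store = ContentStore()
cid = store.put(b"a message")
assert store.get(cid) == b"a message"  # the received data hashes back to the CID
```

The check in `get` is what lets a reader accept data from any node: whoever serves the bytes, a mismatch between the recomputed hash and the requested identifier exposes corruption or tampering.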

So, now that we have CAD covered, and understand the value of being able to identify and access data via its content (rather than its location), let’s learn how we can take this data and get even more efficient with its definition.

Merkle DAGs

Let’s quickly run through the journey of content-addressable data. Step one, a file is transformed into its content-addressable representation: it is broken down into smaller blocks, a hash is calculated for each block, and each block is identified by a CID like the ones we mentioned in the previous section. From there, the blocks are assembled into a Merkle DAG, which stands for Merkle Directed Acyclic Graph. A Merkle DAG is a graph in which each node has an identifier that is the result of hashing the node’s contents, including its links to other nodes. Merkle DAG nodes are immutable, meaning any change in a node would alter its identifier, resulting in the construction of an entirely different DAG.

Merkle DAGs are used by DefraDB to compose content-addressable data into larger well-defined objects, which, like the original addressable data, are themselves addressable and verifiable. This addressable data is akin to JSON objects, but with the added benefit of newly defined CID properties. This CAD has keys and values, and can reference other Merkle DAG objects, creating a composed graph of objects that can be defined using any structure. The only rule enforced in this instance is that the data can’t be circular in its definition — if you update any data within a Merkle DAG, you would have created a new Merkle DAG, which must be identified by a new CID. This is what makes the data content addressable by nature.
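
As a rough sketch of the idea that objects reference each other by identifier, the Python below models nodes as JSON-like dictionaries. Everything here is an assumption made for illustration: the `"link"` key, the helper names, the canonical JSON encoding, and the hex SHA-256 identifier are hypothetical and do not reflect DefraDB's or IPLD's actual encoding.

```python
import hashlib
import json


def encode(node: dict) -> bytes:
    """Serialize a node deterministically so equal content always hashes equally."""
    return json.dumps(node, sort_keys=True, separators=(",", ":")).encode("utf-8")


def node_id(node: dict) -> str:
    """Identify a node by the hash of its encoded content (stand-in for a CID)."""
    return hashlib.sha256(encode(node)).hexdigest()


def link(node: dict) -> dict:
    """Reference another node by its identifier rather than by embedding it."""
    return {"link": node_id(node)}


# A parent object with keys and values, one of which points at a child object.
author = {"name": "Alice"}
post = {"title": "Hello", "author": link(author)}

# "Updating" the child produces a new child id, so the parent that links to it
# also gets a new id: the result is a new DAG rather than a modified one.
updated_author = {"name": "Alice B."}
updated_post = {"title": "Hello", "author": link(updated_author)}

assert node_id(author) != node_id(updated_author)
assert node_id(post) != node_id(updated_post)
```

This also shows why the structure cannot be circular: a node's identifier depends on the identifiers of the nodes it links to, so a node could never link back to something whose identifier depends on its own.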

Merkle DAGs are a general form of Merkle Trees and inherit all their properties. A Merkle Tree is just a balanced binary tree version of a Merkle DAG. Like the elementary content-addressable data it is built from, the entire DAG is self-verifying as well.
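
One way to picture whole-DAG verification is to walk from the root and re-hash every node along the way. The Python below is a minimal sketch assuming a hypothetical node layout (`{"data": ..., "links": [...]}`), a plain dictionary as the block store, and hex SHA-256 digests as identifiers; none of these are taken from DefraDB.

```python
import hashlib
import json


def encode(node: dict) -> bytes:
    """Deterministic encoding so a node's bytes, and thus its hash, are stable."""
    return json.dumps(node, sort_keys=True, separators=(",", ":")).encode("utf-8")


def add_node(store: dict[str, bytes], data: str, links: list[str]) -> str:
    """Store a node under the hash of its bytes and return that identifier."""
    raw = encode({"data": data, "links": links})
    cid = hashlib.sha256(raw).hexdigest()
    store[cid] = raw
    return cid


def verify_dag(store: dict[str, bytes], root_id: str) -> bool:
    """Check that every node reachable from the root hashes to its own identifier."""
    pending, seen = [root_id], set()
    while pending:
        cid = pending.pop()
        if cid in seen:
            continue
        raw = store.get(cid)
        if raw is None or hashlib.sha256(raw).hexdigest() != cid:
            return False  # missing block or content that does not match its id
        seen.add(cid)
        pending.extend(json.loads(raw)["links"])
    return True


store: dict[str, bytes] = {}
leaf_a = add_node(store, "block A", [])
leaf_b = add_node(store, "block B", [])
root = add_node(store, "root", [leaf_a, leaf_b])
assert verify_dag(store, root)
```

Since the root's identifier covers the identifiers of its children, trusting a single root identifier is enough to check everything beneath it.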

A visual representation of a Merkle DAG.

IPLD

Alright folks, are you still with us? Lastly, we’ll be talking about IPLD (InterPlanetary Linked Data), which is another data object format that extends the capabilities of content-addressable data and also uses Merkle DAGs as well-defined objects.

DefraDB leverages IPLD to take several typically siloed technologies and compose them into a single unified interface, reaping the benefits of each part individually to create a powerful, semantically-linked system that’s greater than the sum of its parts.

To do this, DefraDB creates a NoSQL Document storage model that uses semantically-linked, content-addressable data in the form of Merkle DAGs. This allows the data residing within a Merkle DAG to be well-defined, using schemas that exist within a semantic web, one that is self-describing and composed using content-identifying IPLD objects. And this is where things get really interesting, because it is where many of the technologies we chose for our stack come together to deliver the full potential of a peer-to-peer, decentralized NoSQL database (note: data can be queried and interfaced with like any other NoSQL Document store), where any node can host any data without fear of corruption, failures, or the need for trust.

The main drawback to the system is mutability. Since IPLD relies on CIDs, any change to the underlying data creates a new CID, which over time can be difficult to maintain and work with. Additionally, if any node can host any data, it is difficult to collaborate, i.e., for two users to update the same data in different ways. Both of these mutability issues can be resolved with Conflict-Free Replicated Data Types (CRDTs).

What Content-Addressable Data Can Do For You & Your Application

All of the technologies we’ve explained in this article enable developers or even users querying data to experience the benefits of data that’s defined in a way far more efficient, secure, and dynamic than typical centralized structures native to web2. The ability to retrieve data from its content address (i.e., its hash value in web3), rather than having to know its physical location and thus, rely on centralized intermediaries, helps make data more easily available and reduce redundancy — all due to the power of a globally-replicated, distributed system.

From improved reliability to scalability made possible by the ability to easily add additional nodes or storage locations to a network, data consistency, and cost efficiency, these tools reimagine our entire relationship to data and provide developers with far more options than are typically available in web2 environments. By using DefraDB, and leveraging the power of content-addressable data, all of these web3-native capabilities are possible without having to sacrifice usability or a familiar developer experience.

If you’re ready to connect DefraDB to your application to manage your application’s data — and scale your product — contact us on our website.

Explore our GitHub and developer portal for additional documentation on our technology.

Thanks for reading!

— Source Network
