Understanding Sharding and Data Availability Sampling in Ethereum


Ethereum's upgrade to Eth2 introduces significant changes, with sharding being a core component alongside Proof of Stake. This proposal focuses on a specific form called "data sharding," which stores blobs of roughly 250 kB each and proves that they are available. This data availability layer is crucial for Layer 2 solutions like rollups, giving them a secure, high-throughput foundation.

To avoid requiring every node to download all data, two techniques are combined: randomly sampled committees and Data Availability Sampling (DAS). This allows verification of large data sets efficiently and securely.

Simplifying Randomly Sampled Committees

Imagine handling 16 MB of data per slot in Eth2, represented as 64 blobs of 256 kB each. With around 6400 validators, how can we verify this data without everyone downloading everything and without giving attackers an advantage?

The solution is division of labor. Validators are grouped into committees, each responsible for one blob. For example, validators 1-100 verify the first blob, 101-200 the second, and so on. Each committee member signs to confirm that the committee's blob is valid and available, and the network accepts the blob once a supermajority of that committee has signed (2/3 in the beacon chain design described below).

However, if an attacker controls a contiguous set of validators, they might dominate a single committee. Random sampling solves this by shuffling the validator list using a hash-based random number generator. The first 100 shuffled validators form the first committee, the next 100 the second, and so on. Because the attacker cannot choose where their validators land, they would need to control about one-third of all validators to have a significant chance of dominating any one committee.
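The shuffling step can be sketched in a few lines of Python. This is a minimal illustration, not the shuffle Eth2 actually specifies: it simply orders validators by a SHA-256 hash of a seed and the validator index. The seed value, the function names, and the committee size of 100 (taken from the example above) are assumptions made for illustration.

```python
import hashlib

def shuffle_validators(validators, seed):
    """Order validators by a hash of (seed, validator index).

    Illustrative stand-in for a hash-based shuffle; not the Eth2 algorithm.
    """
    def sort_key(v):
        return hashlib.sha256(seed + v.to_bytes(8, "big")).digest()
    return sorted(validators, key=sort_key)

def make_committees(validators, seed, committee_size=100):
    """Split the shuffled validator list into consecutive committees."""
    shuffled = shuffle_validators(validators, seed)
    return [shuffled[i:i + committee_size]
            for i in range(0, len(shuffled), committee_size)]

validators = list(range(6400))
committees = make_committees(validators, seed=b"slot-randomness")
print(len(committees), len(committees[0]))  # 64 100
```

Because the ordering depends on the seed, an attacker who controls validators 1-100 can no longer count on them sitting in the same committee.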

Explaining Data Availability Sampling

Data Availability Sampling (DAS) is like the mirror image of committee sampling. Instead of committees checking entire blobs, each client node samples small parts of each blob privately. For each blob, a client selects N random indices and attempts to download data at those points.

The goal is to verify that at least half of the blob's data is available. If less than half is available, every sample has a better-than-even chance of landing on a missing piece, so a client taking enough samples will almost certainly see a failure and reject the blob. This method is efficient because clients download only tiny portions of each blob, and it's secure because even a 51% attacker can't trick clients into accepting unavailable data.
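To see why a handful of samples is enough, here is a small Python sketch. It assumes samples are drawn independently with replacement, so a blob with fraction f of its data available passes all N samples with probability f^N. The blob size of 512 positions and N = 30 are hypothetical values chosen for illustration, as are the function names.

```python
import random

def das_accepts(available, blob_size, num_samples, rng):
    """Sample num_samples random indices; accept only if all are available."""
    for _ in range(num_samples):
        index = rng.randrange(blob_size)
        if index not in available:
            return False  # one missing sample is enough to reject
    return True

def max_acceptance_probability(available_fraction, num_samples):
    """Chance that num_samples independent samples all hit available data."""
    return available_fraction ** num_samples

BLOB_SIZE = 512    # hypothetical number of sample positions per blob
NUM_SAMPLES = 30   # hypothetical N

rng = random.Random(0)
full_blob = set(range(BLOB_SIZE))   # everything published
withheld_blob = set(range(200))     # only about 39% published

print(das_accepts(full_blob, BLOB_SIZE, NUM_SAMPLES, rng))   # True
print(max_acceptance_probability(0.5, NUM_SAMPLES) < 1e-9)   # True
```

With 30 samples, a blob that is less than half available slips past a given client less than once in a billion tries, while the client downloads only 30 small pieces.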

Role of Erasure Coding

An attacker could publish between 50% and 99% of a blob, so that some clients see all of their samples succeed while others do not, and the network ends up disagreeing about whether to accept it. Erasure coding handles this case. It encodes each blob so that anyone holding at least half of the encoded data can reconstruct the missing parts. Once the blob has been reconstructed and rebroadcast, the clients that initially rejected it will accept it.

Think of it as "two points define a line." If a file is represented as points on a line, having any two points lets you reconstruct the others. With higher-degree polynomials, we can scale this: for 2n pieces, having any n pieces allows reconstruction of the whole. Polynomial commitments, such as Kate (KZG) commitments, are used instead of Merkle roots so that the correctness of the encoding can be proven efficiently, without complex fraud proofs.
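The "n of 2n pieces" idea can be sketched with Lagrange interpolation over a prime field. This is a toy illustration under stated assumptions: the modulus, the choice of evaluation points 0, 1, 2, ..., and the function names are all made up for the example, and no commitment scheme is involved.

```python
P = 2**31 - 1  # a prime modulus, chosen only for illustration

def interpolate_at(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (xi, yi) points, working modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def extend(data):
    """Treat n values as evaluations at x = 0..n-1 and return 2n evaluations."""
    n = len(data)
    points = list(enumerate(data))
    return [interpolate_at(points, x) for x in range(2 * n)]

def recover(known_points, n):
    """Rebuild the original n values from any n known (index, value) pairs."""
    return [interpolate_at(known_points, x) for x in range(n)]

data = [3, 1, 4, 1]
coded = extend(data)                          # 8 pieces, first 4 are the data
survivors = [(i, coded[i]) for i in (1, 4, 6, 7)]
print(recover(survivors, len(data)) == data)  # True
```

Any four of the eight pieces would work equally well, which is why publishing at least half of the extended blob is enough for the network to recover all of it.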

Why Combine Committees and DAS?

Using only committees has drawbacks:

Using only DAS has challenges:

Importance of Data Availability

Data availability ensures all network participants agree on data existence, critical for Layer 2 solutions. Systems like BitTorrent or IPFS don't solve this, as they lack consensus mechanisms. Without robust techniques, nodes might disagree on data availability, breaking Layer 2 protocols.

Key resources explore this further, including initial introductions and papers on erasure coding and game theory in data availability.

How Sharding Works in the P2P Layer

To achieve scalability, Eth2 uses a P2P system where nodes don't download all data. This involves "horizontal subnets" for each shard-slot pair and "vertical subnets" for each index in a blob. Each validator joins:

Blob proposers broadcast headers globally and bodies to horizontal subnets. Samples are then distributed to vertical subnets via direct and indirect sharing, ensuring wide dissemination. If a blob is partially available, a self-healing process reconstructs and redistributes it.

Beacon Chain Integration

Each slot, a proposer is randomly selected per shard to create a blob. They broadcast it, and the shard header is included in the beacon chain. Validators attest to headers, confirming availability. If 2/3 of a committee supports a header, it's confirmed immediately or by the end of the epoch.

The fork choice rule also changes: a chain that references an invalid or unavailable blob is treated as entirely invalid, which keeps the main chain and the shard data tightly coupled.

Low Validator Counts

If there are fewer than 262,144 validators, proposers and committees are assigned to only a subset of shards in each slot, cycling through the shards over time, so that committees stay at full size. This preserves security even with lower participation.
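One way to picture this is the sketch below. It assumes the 262,144 threshold decomposes as 64 shards × 32 slots per epoch × 128 validators per committee; that breakdown, the rotation rule, and the function names are illustrative assumptions rather than the specified algorithm.

```python
SHARDS = 64                  # number of shards (from the 64-blob example)
SLOTS_PER_EPOCH = 32         # assumed
TARGET_COMMITTEE_SIZE = 128  # assumed; 64 * 32 * 128 == 262,144

def active_shards_per_slot(validator_count):
    """How many shards can get a full-size committee in each slot."""
    per_shard = SLOTS_PER_EPOCH * TARGET_COMMITTEE_SIZE
    return max(1, min(SHARDS, validator_count // per_shard))

def shards_for_slot(slot, validator_count):
    """Hypothetical rotation: each slot serves the next group of shards."""
    count = active_shards_per_slot(validator_count)
    start = (slot * count) % SHARDS
    return [(start + i) % SHARDS for i in range(count)]

print(active_shards_per_slot(262_144))   # 64
print(active_shards_per_slot(131_072))   # 32
```

With half the validators, each slot covers 32 shards, so every shard still gets a blob, only every other slot.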

Economic Model

An EIP-1559-like mechanism adjusts the fee per byte based on usage: if blobs are on average more than 50% full, the fee rises; otherwise, it falls. Over time this pushes average usage toward 50% of capacity.
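A minimal sketch of such an update rule follows. The source only says that fees rise above 50% usage and fall below it; the adjustment quotient of 8 is borrowed from EIP-1559's base fee rule, and the fee floor, capacity, and function name are assumptions for illustration.

```python
MIN_FEE = 1.0  # hypothetical floor so the fee never reaches zero

def next_fee_per_byte(fee, bytes_used, capacity, adjustment_quotient=8):
    """Move the fee up or down in proportion to the gap from the 50% target."""
    target = capacity / 2
    change = fee * (bytes_used - target) / target / adjustment_quotient
    return max(fee + change, MIN_FEE)

CAPACITY = 16 * 2**20  # 16 MB per slot, as in the earlier example

print(next_fee_per_byte(1000.0, CAPACITY, CAPACITY))       # 1125.0
print(next_fee_per_byte(1000.0, CAPACITY // 2, CAPACITY))  # 1000.0
print(next_fee_per_byte(1000.0, 0, CAPACITY))              # 875.0
```

Under these assumptions a completely full slot raises the fee by 12.5%, an empty one lowers it by 12.5%, and usage exactly at the target leaves it unchanged.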

Security Assumptions

Data sharding relies only on weak assumptions. No honest majority is needed, because DAS lets each client detect unavailable data for itself, and no timing assumptions are needed, because Kate commitments take the place of fraud proofs. A 51% coalition can censor but cannot compromise data integrity. What remains is the "honest minority DAS assumption": enough clients must be sampling that an attacker has to publish over half of the data in order to answer them.

Future Compatibility

This design is forward-compatible with adding execution to shards, using fraud proofs or ZK-SNARKs for state transitions without honest majority assumptions.

Frequently Asked Questions

What is data sharding?
Data sharding is a method to scale Ethereum by splitting data into manageable pieces called blobs. Each blob's availability is verified, providing a high-throughput data layer for Layer 2 solutions without requiring every node to store all data.

How does Data Availability Sampling work?
Clients download a few randomly chosen pieces of each blob. If every sample comes back, the client concludes that at least half of the blob is available and accepts it; if a sample is missing, it rejects the blob. This gives strong security with very little data downloaded.

Why use both committees and DAS?
Committees provide a tested fallback and handle edge cases, while DAS offers stronger security against majority attacks and better scalability. Together, they balance reliability and innovation.

What is erasure coding?
Erasure coding encodes data so that if at least half of the encoded pieces are available, the rest can be reconstructed. This prevents partial data availability from causing inconsistencies across the network.

Can attackers compromise data availability?
A 51% attacker can censor data but cannot trick nodes into accepting unavailable data due to DAS. The system requires attackers to publish over half the data to deceive any client.

How does sharding affect Layer 2 solutions?
By ensuring data availability, sharding enables Layer 2 protocols like rollups to operate securely at high throughput, reducing costs and increasing transaction speed on Ethereum.
