Commit Graph

1564 Commits

kevaundray
613ce3c011 chore!: remove pub visibility on OVERFLOW_LRU_CAPACITY and STATE_LRU_CAPACITY_NON_ZERO (#8234)
- Renames `OVERFLOW_LRU_CAPACITY` to `OVERFLOW_LRU_CAPACITY_NON_ZERO` to follow naming convention of `STATE_LRU_CAPACITY_NON_ZERO`
- Makes `OVERFLOW_LRU_CAPACITY_NON_ZERO` and `STATE_LRU_CAPACITY_NON_ZERO` private since they are only used in this module
- Moves `STATE_LRU_CAPACITY` into test module since it is only used for tests


Co-Authored-By: Kevaundray Wedderburn <kevtheappdev@gmail.com>
2025-10-27 11:23:45 +00:00
Pawan Dhananjay
c668cb7d9a Only publish reconstructed columns that we need to sample (#8269)
N/A


  We were publishing all columns that we didn't already have in the da cache when reconstructing. This is unnecessary outbound bandwidth for a node that is supposed to sample fewer columns.
This PR changes the behaviour to publish only columns that we are supposed to sample in the topics that we are subscribed to.
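
A minimal sketch of the new behaviour, with `sampling_columns` and the sidecar type as stand-ins for the real Lighthouse structures:

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Stand-in for the real DataColumnSidecar type.
struct DataColumnSidecar {
    index: u64,
}

// After reconstruction, keep only the columns this node samples (and is
// therefore subscribed to), rather than every column missing from the cache.
fn columns_to_publish(
    reconstructed: Vec<Arc<DataColumnSidecar>>,
    sampling_columns: &HashSet<u64>,
) -> Vec<Arc<DataColumnSidecar>> {
    reconstructed
        .into_iter()
        .filter(|column| sampling_columns.contains(&column.index))
        .collect()
}
```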


Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
2025-10-23 05:05:08 +00:00
Jimmy Chen
d8c6c57029 Trigger backfill on startup if user switches to a supernode or semi-supernode (#8265)
This PR adds backfill functionality to nodes switching to become a supernode or semi-supernode. Please note that we currently only support a CGC increase, i.e. if the node is already custodying 67 columns, switching to semi-supernode (64) will have no effect.


  From @eserilev
> if a node's cgc increases on start up, we just need two things for custody backfill to do its thing
>
> - data column custody info needs to be updated to reflect the cgc change
> - `CustodyContext::validator_registrations::epoch_validator_custody_requirements` needs to be updated to reflect the cgc change
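
A loose sketch of that startup path, under the assumption (stated above) that only a cgc increase takes effect; all names here are hypothetical:

```rust
// Hypothetical startup hook: if the configured cgc exceeds the persisted
// one, apply the two updates quoted above and trigger custody backfill.
fn on_startup_cgc_change(persisted_cgc: u64, configured_cgc: u64) -> Option<u64> {
    if configured_cgc <= persisted_cgc {
        // Only cgc increases are supported; decreases have no effect.
        return None;
    }
    // 1. Update the data column custody info to reflect the new cgc.
    // 2. Update epoch_validator_custody_requirements to reflect the new cgc.
    // 3. Signal custody backfill sync to fetch the newly custodied columns.
    Some(configured_cgc)
}
```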

- [x] Add tests
- [x] Test on devnet-3
- [x] switch to supernode
- [x] switch to semisupernode
- [x] Test on live testnets
- [x] Update docs (functions)


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-10-23 02:56:09 +00:00
Jimmy Chen
43c5e924d7 Add --semi-supernode support (#8254)
Addresses #8218

A simplified version of #8241 for the initial release.

I've tried to minimise the logic change in this PR; introducing the `NodeCustodyType` enum still results in quite a bit of diff, but the actual logic change in `CustodyContext` is quite small.

The main changes are in the `CustodyContext` struct
* ~~combining `validator_custody_count` and `current_is_supernode` fields into a single `custody_group_count_at_head` field. We persist the cgc of the initial cli values into the `custody_group_count_at_head` field and only allow for increase (same behaviour as before).~~
* I noticed the above approach caused a backward compatibility issue, so I've [made a fix](15569bc085) and changed the approach slightly (which was actually what I had originally in mind):
* when initialising, only override the `validator_custody_count` value if either flag `--supernode` or `--semi-supernode` is used; otherwise leave it as the existing default `0`. Most other logic remains unchanged.

All existing validator custody unit tests still pass, and I've added additional tests to cover semi-supernode and restoring `CustodyContext` from disk.

Note: I've added a `WARN` if the user attempts to switch to `--semi-supernode` or `--supernode` - switching currently has no effect, but once @eserilev's column backfill is merged, we should be able to support this quite easily.

Things to test
- [x] cgc in metadata / enr
- [x] cgc in metrics
- [x] subscribed subnets
- [x] getBlobs endpoint
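
For illustration, the enum's shape might look roughly like this, using the column counts mentioned in these PRs (4 by default, 64 for `--semi-supernode`, 128 for `--supernode`); the real definition and values live in the PR:

```rust
// Sketch only: the actual cgc values are spec parameters, not hard-coded.
enum NodeCustodyType {
    FullNode,      // default custody requirement
    SemiSupernode, // --semi-supernode: half the columns
    Supernode,     // --supernode: all columns
}

impl NodeCustodyType {
    fn custody_group_count(&self) -> u64 {
        match self {
            NodeCustodyType::FullNode => 4,
            NodeCustodyType::SemiSupernode => 64,
            NodeCustodyType::Supernode => 128,
        }
    }
}
```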


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-10-22 05:23:17 +00:00
Eitan Seri-Levi
33e21634cb Custody backfill sync (#7907)
#7603


  #### Custody backfill sync service
Similar in many ways to the current backfill service. There may be ways to unify the two services. The difficulty there is that the current backfill service tightly couples blocks and their associated blobs/data columns. Any attempts to unify the two services should be left to a separate PR in my opinion.

#### `SyncNetworkContext`
`SyncNetworkContext` manages custody sync data columns by range requests separately from other sync RPC requests. I think this is a nice separation considering that custody backfill is its own service.

#### Data column import logic
The import logic verifies KZG commitments and checks that the data column's block root matches the block root in the node's store before importing columns.

#### New channel to send messages to `SyncManager`
Now external services can communicate with the `SyncManager`. In this PR this channel is used to trigger a custody sync. Alternatively we may be able to use the existing `mpsc` channel that the `SyncNetworkContext` uses to communicate with the `SyncManager`. I will spend some time reviewing this.
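
A minimal sketch of such a channel using tokio's `mpsc`; the `SyncMessage` variant and function names are hypothetical:

```rust
use tokio::sync::mpsc;

// Hypothetical message type for externally triggered sync work.
enum SyncMessage {
    TriggerCustodyBackfill,
}

async fn run_sync_manager(mut rx: mpsc::UnboundedReceiver<SyncMessage>) {
    while let Some(msg) = rx.recv().await {
        match msg {
            // Kick off custody backfill when an external service asks for it.
            SyncMessage::TriggerCustodyBackfill => { /* start custody sync */ }
        }
    }
}

fn trigger_custody_sync(tx: &mpsc::UnboundedSender<SyncMessage>) {
    // External services hold the sender half and can nudge the manager.
    let _ = tx.send(SyncMessage::TriggerCustodyBackfill);
}
```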


Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>

Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2025-10-22 03:51:34 +00:00
Eitan Seri-Levi
46dde9afee Fix data column rpc request (#8247)
Fixes an issue mentioned in this comment regarding data column rpc requests:
https://github.com/sigp/lighthouse/issues/6572#issuecomment-3400076236


Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>

Co-Authored-By: Michael Sproul <micsproul@gmail.com>
2025-10-21 23:54:35 +00:00
Michael Sproul
21bab0899a Improve block header signature handling (#8253)
Closes:

- https://github.com/sigp/lighthouse/issues/7650


  Reject blob and data column sidecars from RPC with invalid signatures.


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-10-21 13:58:12 +00:00
Michael Sproul
2f8587301d More proposer shuffling cleanup (#8130)
Addressing more review comments from:

- https://github.com/sigp/lighthouse/pull/8101

I've also tweaked a few more things that I think are minor bugs.


  - Instrument `ensure_state_can_determine_proposers_for_epoch`
- Fix `block_root` usage in `compute_proposer_duties_from_head`. This was a regression introduced in #8101 😬.
- Update the `state_advance_timer` to prime the next-epoch proposer cache post-Fulu.


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-10-20 03:14:14 +00:00
Jimmy Chen
76a37a0aef Revert incorrect fix made in #8179 (#8215)
This PR reverts #8179.

It turns out that the fix was invalid because an unknown root is never considered a finalized descendant:

522bd9e9c6/consensus/proto_array/src/proto_array.rs (L976-L979)

so for any data column with an unknown parent, we will always penalise the gossip peer and disconnect it pretty quickly. On a small network, the node may lose all of its peers.

The impact is pretty obvious when the peer count is small and sync speed is slow, and is therefore easily reproducible by running a fresh supernode on devnet-3.

This isn't as obvious on live testnets like Holesky / Sepolia, where we haven't noticed it, probably due to their high peer counts and sync speeds - the nodes might be able to reach head quickly before losing too many peers.


  The previous behaviour isn't ideal but it is safe: trigger an unknown parent lookup, and penalise the bad peer if it happens to be malicious or faulty. So for now it's safer to revert the change and plan a proper fix after the v8 release.


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-10-16 23:25:30 +00:00
SunnysidedJ
d1e06dc40d #6853 Adding store tests for data column pruning (#7228)
#6853 Update store tests to cover data column pruning


  Created a helper function `check_data_column_existence` which is a copy of `check_blob_existence` but checking data columns instead.
The helper function is then used to check whether data columns are also pruned when blobs are pruned if PeerDAS is enabled.
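
A hedged sketch of the helper's shape; the real version runs against the store test harness, so everything below is illustrative:

```rust
use std::collections::HashMap;

type Hash256 = [u8; 32];

// Stand-in store: block_root -> data column sidecars.
struct Store {
    data_columns: HashMap<Hash256, Vec<Vec<u8>>>,
}

// Mirror of check_blob_existence, but for data columns: assert that columns
// for the given blocks exist (pre-pruning) or are gone (post-pruning).
fn check_data_column_existence(store: &Store, block_roots: &[Hash256], should_exist: bool) {
    for root in block_roots {
        assert_eq!(
            store.data_columns.contains_key(root),
            should_exist,
            "unexpected data column state for block {root:?}"
        );
    }
}
```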


Co-Authored-By: SunnysidedJ <j@testinprod.io>

Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>

Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-10-16 15:20:26 +00:00
Pawan Dhananjay
73e75e3e69 Ignore extra columns in da cache (#8201)
N/A


  Found this issue in sepolia. Note: the custody requirement for this node is 100.
```
Oct 14 11:25:40.053 DEBUG Reconstructed columns                         count: 28, block_root: 0x4d7946dec0ab59f2afd46610d7c54af555cb4c2851d9eea7d83dd17cf6e96aae, slot: 8725628
Oct 14 11:25:45.568 WARN  Internal availability check failure           block_root: 0x4d7946dec0ab59f2afd46610d7c54af555cb4c2851d9eea7d83dd17cf6e96aae, error: Unexpected("too many columns got 128 expected 100")
```

So if any of the block components arrive late, we reconstruct all 128 columns, try to add them to the da cache, and end up with more columns than needed for availability in the cache.

There are 2 ways I can think of fixing this:
1. pass only the required columns to the da cache after reconstruction here 60df5f4ab6/beacon_node/beacon_chain/src/data_availability_checker.rs (L647-L648)
2. Ensure that we add only columns that we need to sample in the da cache. I think this is safer since we can add columns to the cache from multiple code paths and this fixes it at the source.

~~This PR implements (2).~~ Having thought more about it, I think (1) is cleaner since we also filter gossip and rpc columns before calling `put_kzg_verified_data_columns`.
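
A sketch of option (1), with column indices standing in for sidecars: filter the reconstructed columns down to the required set before the da cache insert, mirroring what the gossip and rpc paths already do:

```rust
use std::collections::HashSet;

// Keep only the columns the node needs for availability (e.g. 100 of the
// 128 reconstructed columns in the log above) before handing them to
// put_kzg_verified_data_columns.
fn filter_reconstructed_columns(
    reconstructed: Vec<u64>, // column indices, standing in for sidecars
    required_columns: &HashSet<u64>,
) -> Vec<u64> {
    reconstructed
        .into_iter()
        .filter(|index| required_columns.contains(index))
        .collect()
}
```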


Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
2025-10-16 09:25:44 +00:00
Jimmy Chen
5886a48d96 Add max_blobs_per_block check to data column gossip validation (#8198)
Addresses this spec change
https://github.com/ethereum/consensus-specs/pull/4650

Add a `max_blobs_per_block` check to data column gossip validation so we reject oversized columns before processing (we currently only do this check during processing).
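
Roughly, the check amounts to the following (hypothetical shapes): a column sidecar carries one cell per blob in the block, so its cell count is bounded by `max_blobs_per_block`:

```rust
// Stand-in for the real sidecar: one cell per blob in the block.
struct DataColumnSidecar {
    cells: Vec<Vec<u8>>,
}

fn validate_cell_count(
    sidecar: &DataColumnSidecar,
    max_blobs_per_block: usize,
) -> Result<(), String> {
    if sidecar.cells.len() > max_blobs_per_block {
        // REJECT at gossip time, before any expensive processing.
        return Err(format!(
            "too many cells: got {}, max {}",
            sidecar.cells.len(),
            max_blobs_per_block
        ));
    }
    Ok(())
}
```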


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-10-15 01:52:35 +00:00
Pawan Dhananjay
2c328e32a6 Persist only custody columns in db (#8188)
* Only persist custody columns

* Get claude to write tests

* lint

* Address review comments and fix tests.

* Use supernode only when building chain segments

* Clean up

* Rewrite tests.

* Fix tests

* Clippy

---------

Co-authored-by: Jimmy Chen <jchen.tc@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2025-10-13 20:32:13 +11:00
Jimmy Chen
538b70495c Reject data columns that does not descend from finalize root instead of ignoring it (#8179)
This issue was identified during the fusaka audit competition.

The [`verify_parent_block_and_finalized_descendant`](62d9302e0f/beacon_node/beacon_chain/src/data_column_verification.rs (L606-L627)) function in data column gossip verification currently loads the parent before checking whether the column descends from the finalized root.

However, the `fork_choice.get_block(&block_parent_root)` function also makes the same check internally:

8a4f6cf0d5/consensus/fork_choice/src/fork_choice.rs (L1242-L1249)

Therefore, if the column does not descend from the finalized root, we return an `UnknownParent` error, before hitting the `is_finalized_checkpoint_or_descendant` check just below.

This means we `IGNORE` the gossip message instead of `REJECT`ing it, and the gossip peer is not _immediately_ penalised. This deviates from the spec.

However, it's worth noting that Lighthouse will currently attempt to request the parent from this peer, and if the peer is not able to serve the parent, it gets penalised with a `LowToleranceError` and banned after ~5 occurrences.

ffa7b2b2b9/beacon_node/network/src/sync/network_context.rs (L1530-L1532)

This PR will penalise the bad peer immediately instead of performing block lookups before penalising it.
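
In effect the PR reorders the checks so the finality check comes first; a sketch with hypothetical names (note this ordering was later reverted in #8215 above, because an unknown root also fails the finality check):

```rust
enum GossipVerdict {
    // Penalise the peer immediately.
    Reject(&'static str),
    // Drop the message without penalty; may trigger a parent lookup.
    Ignore(&'static str),
    Accept,
}

fn verify_parent(descends_from_finalized: bool, parent_known: bool) -> GossipVerdict {
    if !descends_from_finalized {
        // NB: an unknown parent also fails this check, which is what #8215
        // later reverted.
        return GossipVerdict::Reject("NotFinalizedDescendant");
    }
    if !parent_known {
        return GossipVerdict::Ignore("UnknownParent");
    }
    GossipVerdict::Accept
}
```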


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-10-09 07:32:43 +00:00
chonghe
3110ca325b Implement /eth/v1/beacon/blobs endpoint (#8103)
* #8085


Co-Authored-By: Tan Chee Keong <tanck@sigmaprime.io>

Co-Authored-By: chonghe <44791194+chong-he@users.noreply.github.com>
2025-10-09 05:01:30 +00:00
Michael Sproul
13dfa9200f Block proposal optimisations (#8156)
Closes:

- https://github.com/sigp/lighthouse/issues/4412

This should reduce Lighthouse's block proposal times on Holesky and prevent us getting reorged.


  - [x] Allow the head state to be advanced further than 1 slot. This lets us avoid epoch processing on hot paths including block production, by having new epoch boundaries pre-computed and available in the state cache.
- [x] Use the finalized state to prune the op pool. We were previously using the head state and trying to infer slashing/exit relevance based on `exit_epoch`. However, some exit epochs are far in the future, despite the exits occurring recently.


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-10-08 06:09:12 +00:00
Michael Sproul
38fdaf791c Fix proposer shuffling decision slot at boundary (#8128)
Follow-up to the bug fixed in:

- https://github.com/sigp/lighthouse/pull/8121

This fixes the root cause of that bug, which was introduced by me in:

- https://github.com/sigp/lighthouse/pull/8101

Lion identified the issue here:

- https://github.com/sigp/lighthouse/pull/8101#discussion_r2382710356


  In the methods that compute the proposer shuffling decision root, ensure we don't use lookahead for the Fulu fork epoch itself. This is accomplished by checking if Fulu is enabled at `epoch - 1`, i.e. if `epoch > fulu_fork_epoch`.

I haven't updated the methods that _compute_ shufflings to use these new corrected bounds (e.g. `BeaconState::compute_proposer_indices`), although we could make this change in future. The `get_beacon_proposer_indices` method already gracefully handles the Fulu boundary case by using the `proposer_lookahead` field (if initialised).
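
The boundary rule itself reduces to a one-line check, sketched here:

```rust
// Lookahead is only usable if Fulu was already enabled in the previous
// epoch, i.e. epoch - 1 >= fulu_fork_epoch, equivalently epoch > fulu_fork_epoch.
fn proposer_lookahead_available(epoch: u64, fulu_fork_epoch: Option<u64>) -> bool {
    match fulu_fork_epoch {
        Some(fork_epoch) => epoch > fork_epoch,
        None => false,
    }
}
```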


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-09-29 01:13:33 +00:00
Pawan Dhananjay
edcfee636c Fix bug in fork calculation at fork boundaries (#8121)
N/A


  In #8101, when we modified the logic to get the proposer index post-Fulu, we seem to have missed advancing the state at the fork boundaries to get the right `Fork` for signature verification.
This led to Lighthouse failing all gossip verification right after transitioning to Fulu, as observed on the Holesky shadow fork:
```
Sep 26 14:24:00.088 DEBUG Rejected gossip block                         error: "InvalidSignature(ProposerSignature)", graffiti: "grandine-geth-super-1", slot: 640
Sep 26 14:24:00.099 WARN  Could not verify block for gossip. Rejecting the block  error: InvalidSignature(ProposerSignature)
```

I'm not completely sure this is the correct fix, but this fixes the issue with `InvalidProposerSignature` on the holesky shadow fork.

Thanks to @eserilev for helping debug this


Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
2025-09-28 04:03:25 +00:00
Michael Sproul
c754234b2c Fix bugs in proposer calculation post-Fulu (#8101)
As identified by a researcher during the Fusaka security competition, we were computing the proposer index incorrectly in some places by computing it without lookahead.


  - [x] Add "low level" checks to computation functions in `consensus/types` to ensure they error cleanly
- [x] Re-work the determination of proposer shuffling decision roots, which are now fork aware.
- [x] Re-work and simplify the beacon proposer cache to be fork-aware.
- [x] Optimise `with_proposer_cache` to use `OnceCell`.
- [x] All tests passing.
- [x] Resolve all remaining `FIXME(sproul)`s.
- [x] Unit tests for `ProtoBlock::proposer_shuffling_root_for_child_block`.
- [x] End-to-end regression test.
- [x] Test on pre-Fulu network.
- [x] Test on post-Fulu network.


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-09-26 14:44:50 +00:00
Lion - dapplion
ffa7b2b2b9 Only mark block lookups as pending if block is importing from gossip (#8112)
- PR https://github.com/sigp/lighthouse/pull/8045 introduced a regression of how lookup sync interacts with the da_checker.

Now in `unstable`, block import from the HTTP API also inserts the block into the da_checker while the block is being execution verified. If lookup sync finds the block in the da_checker in `NotValidated` state, it expects a `GossipBlockProcessResult` message sometime later. That message is only sent after block import in gossip.

I confirmed in our node's logs that 4/4 cases of stuck lookups were caused by this sequence of events:
- Receive block through the API; insert it into the da_checker in `fn process_block`, in `put_pre_execution_block`
- Create lookup and leave it in the `AwaitingDownload` (block in processing cache) state
- Block from the HTTP API finishes importing
- Lookup is left stuck

Closes https://github.com/sigp/lighthouse/issues/8104


  - https://github.com/sigp/lighthouse/pull/8110 was my initial solution attempt but we can't send the `GossipBlockProcessResult` event from the `http_api` crate without adding new channels, which seems messy.

For a given node it's rare that a lookup is created at the same time that a block is being published. This PR solves https://github.com/sigp/lighthouse/issues/8104 by allowing lookup sync to import the block twice in that case.


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2025-09-25 03:52:27 +00:00
Eitan Seri-Levi
af274029e8 Run reconstruction inside a scoped rayon pool (#8075)
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
2025-09-24 06:37:34 +00:00
Jimmy Chen
78d330e4b7 Consolidate reqresp_pre_import_cache into data_availability_checker (#8045)
This PR consolidates the `reqresp_pre_import_cache` into the `data_availability_checker` for the following reasons:
- the `reqresp_pre_import_cache` suffers from the same TOCTOU bug we had with the `data_availability_checker` earlier, and leads to an unbounded memory leak, which we have observed over the last 6 months on some nodes.
- the `reqresp_pre_import_cache` is no longer necessary, because we now hold blocks in the `data_availability_checker` for longer (since #7961), and recent blocks can be served from the DA checker.

This PR also maintains the following functionality:
- Serving pre-executed blocks over RPC; they're now served from the `data_availability_checker` instead.
- Using the cache for de-duplicating lookup requests.


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-09-19 07:01:13 +00:00
Michael Sproul
3543a20192 Add experimental complete-blob-backfill flag (#7751)
A different (and complementary) approach for:

- https://github.com/sigp/lighthouse/issues/5391


  This PR adds a flag to set the DA boundary to the Deneb fork. The effect of this change is that Lighthouse will try to backfill _all_ blobs.

Most peers do not have this data, but I'm thinking that combined with `trusted-peers` this could be quite effective.
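
The flag's effect reduces to something like this (hypothetical names; the real boundary logic also has to handle genesis and the standard retention window):

```rust
// With --complete-blob-backfill, pin the DA boundary to the Deneb fork epoch
// so backfill attempts to fetch every blob since blobs were introduced.
fn data_availability_boundary(
    complete_blob_backfill: bool,
    deneb_fork_epoch: u64,
    standard_boundary: u64,
) -> u64 {
    if complete_blob_backfill {
        deneb_fork_epoch
    } else {
        standard_boundary
    }
}
```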


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-09-18 05:17:03 +00:00
Eitan Seri-Levi
521be2b757 Prevent silently dropping cell proof chunks (#8023)
Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>
2025-09-18 01:33:42 +00:00
Jimmy Chen
3de646c8b3 Enable reconstruction for nodes custodying more than 50% of columns and instrument tracing (#8052)
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-09-16 08:17:43 +00:00
Eitan Seri-Levi
242bdfcf12 Add instrumentation to recompute_head_at_slot (#8049)
Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>
2025-09-16 05:18:31 +00:00
Michael Sproul
f04d5ecddd Another check to prevent duplicate block imports (#8050)
Attempt to address performance issues caused by importing the same block multiple times.


  - Check fork choice "after" obtaining the fork choice write lock in `BeaconChain::import_block`. We actually use an upgradable read lock, but this is semantically equivalent (the upgradable read has the advantage of not excluding regular reads).

The hope is that this change has several benefits:

1. By preventing duplicate block imports we save time by not repeating unnecessary work inside `import_block`, e.g. writing the state to disk. Although the store itself now takes some measures to avoid re-writing diffs, it is even better if we avoid a disk write entirely.
2. By returning `DuplicateFullyImported`, we reduce some duplicated work downstream. E.g. if multiple threads importing columns trigger `import_block`, now only _one_ of them will get a notification of the block import completing successfully, and only this one will run `recompute_head`. This should help avoid a situation where multiple beacon processor workers are consumed by threads blocking on the `recompute_head_lock`. However, a similar block-fest is still possible with the upgradable fork choice lock (a large number of threads can be blocked waiting for the first thread to complete block import).
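
The locking pattern in play, sketched with parking_lot; the real check runs against fork choice rather than a `Vec`:

```rust
use parking_lot::{RwLock, RwLockUpgradableReadGuard};

// Upgradable read: concurrent readers are still allowed while we check for
// a duplicate, and we only take the exclusive write lock if the block is new.
fn import_if_new(imported: &RwLock<Vec<[u8; 32]>>, block_root: [u8; 32]) -> bool {
    let guard = imported.upgradable_read();
    if guard.contains(&block_root) {
        // Duplicate import: bail out, analogous to DuplicateFullyImported.
        return false;
    }
    let mut write = RwLockUpgradableReadGuard::upgrade(guard);
    write.push(block_root);
    true
}
```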


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-09-16 04:10:42 +00:00
Jimmy Chen
fb77ce9e19 Add missing event in PendingComponent span and clean up sync logs (#8033)
I was looking into some long `PendingComponents` spans and noticed the block event wasn't added to the span, so it wasn't possible to see when the block was added from the trace view; this PR fixes that.

<img width="637" height="430" alt="image" src="https://github.com/user-attachments/assets/65040b1c-11e7-43ac-951b-bdfb34b665fb" />

Additionally, I've noticed a lot of noise and confusion in sync logs due to the initial `peer_id` being included as part of the syncing chain span, causing all logs under the span to carry that `peer_id`, which may not be accurate for some sync logs. I've removed `peer_id` from the `SyncingChain` span, and also cleaned up a bunch of spans to use `%` (display) for slots and epochs to make logs easier to read.


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-09-12 05:11:30 +00:00
kevaundray
f71d69755d chore: add comment to PendingComponents (#7979)
Adds doc comment


Co-Authored-By: Kevaundray Wedderburn <kevtheappdev@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-09-10 13:48:11 +00:00
Eitan Seri-Levi
caa1df6fc3 Skip column gossip verification logic during block production (#7973)
#7950


  Skip column gossip verification logic during block production, as it's redundant and potentially computationally expensive.


Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-09-10 12:29:56 +00:00
hopinheimer
38205192ca Fix http api tests ci (#7943)
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Michael Sproul <micsproul@gmail.com>

Co-Authored-By: Michael Sproul <michael@sigmaprime.io>

Co-Authored-By: hopinheimer <knmanas6@gmail.com>
2025-09-10 06:46:48 +00:00
Jimmy Chen
8a4f6cf0d5 Instrument tracing on block production code path (#8017)
Partially addresses #7814. Instruments the block production code path.

New root spans:
* `produce_block_v3`
* `produce_block_v2`

Example traces:

<img width="518" height="432" alt="image" src="https://github.com/user-attachments/assets/a9413d25-501c-49dc-95cc-623db5988981" />


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-09-10 03:30:51 +00:00
Eitan Seri-Levi
8ec2640e04 Don't penalize peers if locally constructed light client data is stale (#7996)
#7994


  We seem to be penalizing peers in situations where locally constructed light client data is stale. This PR ignores incoming light client data if our locally constructed light client data isn't up to date.


Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
2025-09-05 03:23:34 +00:00
Jimmy Chen
9d2f55a399 Fix data column reconstruction error (#7998)
Addresses #7991
2025-09-04 20:17:52 +00:00
Jimmy Chen
eef02afc93 Fix data availability checker race condition causing partial data columns to be served over RPC (#7961)
Partially resolves #6439; a simpler alternative to #7931.

Race condition occurs when RPC data columns arrives after a block has been imported and removed from the DA checker:
1. Block becomes available via gossip
2. RPC columns arrive and pass fork choice check (block hasn't been imported)
3. Block import completes (removing block from DA checker)
4. RPC data columns finish verification and get imported into DA checker

This causes two issues:
1. **Partial data serving**: Already imported components get re-inserted, potentially causing LH to serve incomplete data
2. **State cache misses**: Leads to state reconstruction, holding the availability cache write lock longer and increasing race likelihood

### Proposed Changes

1. Never manually remove pending components from the DA checker. Components are only removed via LRU eviction as finality advances. This makes sure we don't run into the issue described above.
2. Use `get` instead of `pop` when recovering the executed block, preventing cache misses in the race condition and reducing its likelihood.
3. Refactor the DA checker to drop the write lock as soon as components are added, which should also reduce the likelihood of the race condition.
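
A sketch of change (2), with a stand-in cache: recovering the block with a non-destructive `get` means a concurrent importer can't empty the entry out from under us:

```rust
use std::collections::HashMap;

struct PendingComponents {
    executed_block: Option<Vec<u8>>, // stand-in for the executed block
}

fn recover_block(
    cache: &HashMap<[u8; 32], PendingComponents>,
    block_root: &[u8; 32],
) -> Option<Vec<u8>> {
    // `get` leaves the entry in place; the previous `pop`-style removal
    // caused cache misses when RPC columns raced with block import.
    cache.get(block_root).and_then(|c| c.executed_block.clone())
}
```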

**Trade-offs:**

This solution eliminates a few nasty race conditions while staying simple, at the cost of allowing block re-import (which was already possible).

The increase in DA checker memory can be partially offset by a reduction in block cache size if this really becomes an issue (as we now serve recent blocks from the DA checker).
2025-09-02 07:18:23 +00:00
Jimmy Chen
979ed2557c Remove expect usage in kzg_utils (#7957)
Remove `expect` usage in `kzg_utils` to handle the case where the EL sends us an invalid proof size, instead of crashing.
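
The shape of the fix, sketched on a stand-in proof type (a KZG proof is 48 bytes):

```rust
// Previously, something like bytes.try_into().expect("valid proof size")
// would crash the node if the EL handed back a malformed proof; returning
// an error lets the caller reject the payload instead.
fn parse_kzg_proof(bytes: &[u8]) -> Result<[u8; 48], String> {
    bytes
        .try_into()
        .map_err(|_| format!("invalid KZG proof size: {} bytes", bytes.len()))
}
```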
2025-09-01 09:21:26 +00:00
kevaundray
9cc3c0553b chore: small refactor of epoch method (#7902)
Stylistic; mostly using early returns to avoid the nested logic

2025-09-01 09:21:23 +00:00
Jimmy Chen
a134d43446 Use rayon to speed up batch KZG verification (#7921)
Addresses #7866.


  Use Rayon to speed up batch KZG verification during range / backfill sync.

While I was analysing the traces, I also discovered a bug that resulted in only the first 128 columns in a chain segment batch being verified. This PR fixes it, so we might actually observe slower range sync due to more cells being KZG verified.

I've also updated the handling of batch KZG failure to search for the first invalid KZG column only when batch verification fails, as this search gets very expensive during range/backfill sync.
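
The failure handling reduces to a cheap batched happy path with a per-column scan only on failure; a sketch with stand-in verifiers:

```rust
use rayon::prelude::*;

// Stand-ins for the real KZG calls.
fn batch_verify(_columns: &[Vec<u8>]) -> bool {
    true
}
fn verify_one(_column: &[u8]) -> bool {
    true
}

// Returns Ok on success, or the index of the first invalid column.
fn verify_segment_columns(columns: &[Vec<u8>]) -> Result<(), usize> {
    // Happy path: one batched verification across the whole segment.
    if batch_verify(columns) {
        return Ok(());
    }
    // Only on failure: scan in parallel for the first invalid column.
    match columns
        .par_iter()
        .enumerate()
        .find_first(|(_, column)| !verify_one(column))
    {
        Some((index, _)) => Err(index),
        None => Ok(()),
    }
}
```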
2025-08-29 00:59:40 +00:00
Jimmy Chen
c13fb2fb46 Instrument publish_block code path (#7945)
Instrument `publish_block` code path and log dropped data columns when publishing.

Example spans (running the devnet from my laptop, so the numbers aren't great)

<img width="734" height="296" alt="image" src="https://github.com/user-attachments/assets/20620bf7-2b38-4392-aa75-9ba96d3a7f0d" />

<img width="718" height="625" alt="image" src="https://github.com/user-attachments/assets/61e1ff1c-65b5-4ad4-981a-d0fadc9829e1" />
2025-08-28 03:31:29 +00:00
Mac L
e438691683 Add Gloas boilerplate (#7728)
Adds the required boilerplate code for the Gloas (Glamsterdam) hard fork. This allows PRs testing Gloas-candidate features to test fork transition.

This also includes de-duplication of post-Bellatrix readiness notifiers from #6797 (credit to @dapplion)
2025-08-26 02:49:48 +00:00
Jimmy Chen
daf1c7c3af Fix RPC blocks not getting fully KZG verified (#7927)
Fix RPC blocks not getting fully KZG verified due to incorrect list truncation.
2025-08-25 16:46:16 +00:00
Jimmy Chen
f19d4f6af1 Implement tracing spans for data columm RPC requests and responses (#7831)
#7830
2025-08-20 23:35:51 +00:00
Jimmy Chen
2d223575d6 Avoid unnecessary database lookups in data column RPC requests (#7897)
This PR is an optimisation to avoid unnecessary database lookups when a peer requests data columns that the node doesn't custody (advertised via `cgc`).

e.g. an extreme but realistic example: a full node only stores 4 custody columns by default, but it may receive a range request of 32 slots for all 128 columns; this would result in 4096 database lookups even though the node can only return 128 (4 * 32) of them.


  - Filter data column RPC requests (`DataColumnsByRoot`, `DataColumnsByRange`) to only lookup columns the node custodies
- Prevents unnecessary database queries that would always fail for non-custody columns
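
The filter itself is simple; a sketch (the real code threads this through the `DataColumnsByRoot` / `DataColumnsByRange` handlers):

```rust
use std::collections::HashSet;

// Drop requested column indices the node doesn't custody before any
// database lookup; in the 32-slot example above this turns 4096 lookups
// into at most 128.
fn filter_requested_columns(requested: &[u64], custody_columns: &HashSet<u64>) -> Vec<u64> {
    requested
        .iter()
        .copied()
        .filter(|index| custody_columns.contains(index))
        .collect()
}
```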
2025-08-20 05:08:53 +00:00
Jimmy Chen
b4704eab4a Fulu update to spec v1.6.0-alpha.4 (#7890)
Fulu update to spec [v1.6.0-alpha.4](https://github.com/ethereum/consensus-specs/releases/tag/v1.6.0-alpha.4).
- Make `number_of_columns` a preset
- Optimise `get_custody_groups` to avoid computing if cgc = 128
- Add support for additional typenum values in type_dispatch macro
2025-08-20 02:05:04 +00:00
Jimmy Chen
34dd1b27ae Revise data column rpc limits and queue sizes (#7887)
Revise data column rpc limits and queue sizes. Also removed some outdated TODOs for Fulu / das.
2025-08-19 03:48:08 +00:00
Michael Sproul
836c39efaa Shrink persisted fork choice data (#7805)
Closes:

- https://github.com/sigp/lighthouse/issues/7760


  - [x] Remove `balances_cache` from `PersistedForkChoiceStore` (~65 MB saving on mainnet)
- [x] Remove `justified_balances` from `PersistedForkChoiceStore` (~16 MB saving on mainnet)
- [x] Remove `balances` from `ProtoArray`/`SszContainer`.
- [x] Implement zstd compression for votes
- [x] Fix bug in justified state usage
- [x] Bump schema version to V28 and implement migration.
2025-08-18 06:03:28 +00:00
Jimmy Chen
aa8cba3741 Upgrade rust-eth-kzg to 0.8.0 (#7870)
#7864

The main breaking change in v0.8.0 is the `TrustedSetup` initialisation - it now requires a json string via `PeerDASTrustedSetup::from_json`.
2025-08-18 02:52:39 +00:00
chonghe
522bd9e9c6 Update Rust Edition to 2024 (#7766)
* #7749

Thanks @dknopik and @michaelsproul for your help!
2025-08-13 03:04:31 +00:00
Pawan Dhananjay
4262ad3e01 Add a flag to disable getBlobs (#7853)
N/A


  Add a flag to disable getBlobs. I configured the flag to disable it regardless of version because it's most likely something we use for testing anyway.
2025-08-11 23:17:00 +00:00
Jimmy Chen
40c2fd5ff4 Instrument tracing spans for block processing and import (#7816)
#7815

- removes all existing spans, so some span fields that appear in logs like `service_name` may be lost.
- instruments a few key code paths in the beacon node, starting from **root spans** named below:

* Gossip block and blobs
  * `process_gossip_data_column_sidecar`
  * `process_gossip_blob`
  * `process_gossip_block`
* Rpc block and blobs
  * `process_rpc_block`
  * `process_rpc_blobs`
  * `process_rpc_custody_columns`
* Rpc blocks (range and backfill)
  * `process_chain_segment`
* `PendingComponents` lifecycle
  * `pending_components`

To test locally:
* Run Grafana and Tempo with https://github.com/sigp/lighthouse-metrics/pull/57
* Run Lighthouse BN with `--telemetry-collector-url http://localhost:4317`

Some captured traces can be found here: https://hackmd.io/@jimmygchen/r1sLOxPPeg

Removing the old spans seems to have reduced the memory usage quite a lot - I think we were using them on long-running tasks and too liberally:
<img width="910" height="495" alt="image" src="https://github.com/user-attachments/assets/5208bbe4-53b2-4ead-bc71-0b782c788669" />
2025-08-08 05:32:22 +00:00