Commit Graph

3570 Commits

Author SHA1 Message Date
chonghe
3110ca325b Implement /eth/v1/beacon/blobs endpoint (#8103)
* #8085

Co-Authored-By: Tan Chee Keong <tanck@sigmaprime.io>

Co-Authored-By: chonghe <44791194+chong-he@users.noreply.github.com>
2025-10-09 05:01:30 +00:00
Michael Sproul
13dfa9200f Block proposal optimisations (#8156)
Closes:

- https://github.com/sigp/lighthouse/issues/4412

This should reduce Lighthouse's block proposal times on Holesky and prevent us getting reorged.


  - [x] Allow the head state to be advanced further than 1 slot. This lets us avoid epoch processing on hot paths including block production, by having new epoch boundaries pre-computed and available in the state cache.
- [x] Use the finalized state to prune the op pool. We were previously using the head state and trying to infer slashing/exit relevance based on `exit_epoch`. However some exit epochs are far in the future, despite occurring recently.


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-10-08 06:09:12 +00:00
Jimmy Chen
2a433bc406 Remove deprecated CLI flags and references for v8.0.0 (#8142)
Closes #8131

- [x] Remove deprecated flags from beacon_node/src/cli.rs:
- [x] eth1-purge-cache
- [x] eth1-blocks-per-log-query
- [x] eth1-cache-follow-distance
- [x] disable-deposit-contract-sync
- [x] light-client-server
- [x] Remove deprecated flags from lighthouse/src/main.rs:
- [x] logfile
- [x] terminal-total-difficulty-override
- [x] terminal-block-hash-override
- [x] terminal-block-hash-epoch-override
- [x] safe-slots-to-import-optimistically
- [x] Remove references to deprecated flags in config.rs files
- [x] Remove warning messages for deprecated flags in main.rs
- [x] Update/remove related tests in beacon_node.rs

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-10-08 01:52:41 +00:00
Pawan Dhananjay
a4ad3e492f Fallback to getPayload v1 if v2 fails (#8163)
N/A


Post-Fulu, we should be calling the v2 API on relays, which doesn't return the blobs/data columns.

However, we decided to start hitting the v2 API as soon as Fulu is scheduled, to avoid unexpected surprises at the fork.
In the ACDT call, it seems like most clients are calling v2 only after the Fulu fork.
This PR aims to be the best of both worlds: we fall back to hitting the v1 API if v2 fails. This way, we know beforehand if relays don't support it and can potentially alert them.
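A minimal Rust sketch of the fallback flow, assuming hypothetical `get_payload_v1`/`get_payload_v2` stubs rather than Lighthouse's actual builder client:

```rust
#[derive(Debug)]
enum BuilderError {
    UnsupportedVersion,
}

struct Payload;

// Stand-in for an HTTP call to the relay's v2 endpoint.
fn get_payload_v2() -> Result<Payload, BuilderError> {
    Err(BuilderError::UnsupportedVersion)
}

// Stand-in for the legacy v1 endpoint.
fn get_payload_v1() -> Result<Payload, BuilderError> {
    Ok(Payload)
}

/// Try v2 first; if the relay rejects or fails it, log which relay lacks v2
/// support (so it can be alerted) and fall back to v1.
fn get_payload_with_fallback() -> Result<Payload, BuilderError> {
    match get_payload_v2() {
        Ok(payload) => Ok(payload),
        Err(e) => {
            eprintln!("relay failed getPayload v2 ({e:?}), falling back to v1");
            get_payload_v1()
        }
    }
}

fn main() {
    // With the stubs above, v2 "fails" and the fallback succeeds via v1.
    assert!(get_payload_with_fallback().is_ok());
}
```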


Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
2025-10-07 14:32:41 +00:00
Eitan Seri-Levi
4eb89604f8 Fulu ASCII art (#8151)
Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>
2025-10-07 14:32:35 +00:00
Jimmy Chen
ff8b514b3f Remove unnecessary warning logs and update logging levels (#8145)
@michaelsproul noticed this warning on a devnet-3 node

```
Oct 01 16:37:29.896 WARN  Error when importing rpc custody columns      error: ParentUnknown { parent_root: 0xe4cc85a2137b76eb083d7076255094a90f10caaec0afc8fd36807db742f6ff13 }, block_hash: 0x43ce63b2344990f5f4d8911b8f14e3d3b6b006edc35bbc833360e667df0edef7
```

We're also seeing similar `WARN` logs for blobs on our live nodes.

It's normal to get `ParentUnknown` in lookups, and it's handled here:
a134d43446/beacon_node/network/src/sync/block_lookups/mod.rs (L611-L619)

These shouldn't be a `WARN`, and we also log the same error in block lookups at `DEBUG` level here:
a134d43446/beacon_node/network/src/sync/block_lookups/mod.rs (L643-L648)

So I've removed these extra WARN logs.

I've also lowered the level of an `ERROR` log when unable to serve data column root requests - it's unexpected, but unlikely to impact the node's performance, so I think we can downgrade this.

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-10-06 19:26:37 +00:00
Jimmy Chen
e5b4983d6b Release v8.0.0 rc.0 (#8127)
Testnet release for the upcoming Fusaka fork.

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-09-29 02:17:30 +00:00
Michael Sproul
38fdaf791c Fix proposer shuffling decision slot at boundary (#8128)
Follow-up to the bug fixed in:

- https://github.com/sigp/lighthouse/pull/8121

This fixes the root cause of that bug, which was introduced by me in:

- https://github.com/sigp/lighthouse/pull/8101

Lion identified the issue here:

- https://github.com/sigp/lighthouse/pull/8101#discussion_r2382710356


  In the methods that compute the proposer shuffling decision root, ensure we don't use lookahead for the Fulu fork epoch itself. This is accomplished by checking if Fulu is enabled at `epoch - 1`, i.e. if `epoch > fulu_fork_epoch`.

I haven't updated the methods that _compute_ shufflings to use these new corrected bounds (e.g. `BeaconState::compute_proposer_indices`), although we could make this change in future. The `get_beacon_proposer_indices` method already gracefully handles the Fulu boundary case by using the `proposer_lookahead` field (if initialised).
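A tiny sketch of the boundary condition, assuming a simplified `Epoch` type and fork schedule rather than Lighthouse's real ones:

```rust
type Epoch = u64;

/// Only use lookahead-based proposer shuffling decision roots if Fulu was
/// already enabled at `epoch - 1`, i.e. strictly after the Fulu fork epoch.
fn fulu_lookahead_enabled(epoch: Epoch, fulu_fork_epoch: Option<Epoch>) -> bool {
    match fulu_fork_epoch {
        Some(fork_epoch) => epoch > fork_epoch,
        None => false,
    }
}

fn main() {
    let fulu_fork_epoch = Some(10);
    // The fork epoch itself must not use lookahead...
    assert!(!fulu_lookahead_enabled(10, fulu_fork_epoch));
    // ...but the first epoch after it does.
    assert!(fulu_lookahead_enabled(11, fulu_fork_epoch));
}
```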


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-09-29 01:13:33 +00:00
Pawan Dhananjay
edcfee636c Fix bug in fork calculation at fork boundaries (#8121)
N/A


In #8101, when we modified the logic to get the proposer index post-Fulu, we seem to have missed advancing the state at the fork boundaries to get the right `Fork` for signature verification.
This led to Lighthouse failing all gossip verification right after transitioning to Fulu, which was observed on the Holesky shadow fork:
```
Sep 26 14:24:00.088 DEBUG Rejected gossip block                         error: "InvalidSignature(ProposerSignature)", graffiti: "grandine-geth-super-1", slot: 640
Sep 26 14:24:00.099 WARN  Could not verify block for gossip. Rejecting the block  error: InvalidSignature(ProposerSignature)
```

I'm not completely sure this is the correct fix, but it fixes the issue with `InvalidProposerSignature` on the Holesky shadow fork.

Thanks to @eserilev for helping debug this


Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
2025-09-28 04:03:25 +00:00
Michael Sproul
c754234b2c Fix bugs in proposer calculation post-Fulu (#8101)
As identified by a researcher during the Fusaka security competition, we were computing the proposer index incorrectly in some places by computing without lookahead.


  - [x] Add "low level" checks to computation functions in `consensus/types` to ensure they error cleanly
- [x] Re-work the determination of proposer shuffling decision roots, which are now fork aware.
- [x] Re-work and simplify the beacon proposer cache to be fork-aware.
- [x] Optimise `with_proposer_cache` to use `OnceCell`.
- [x] All tests passing.
- [x] Resolve all remaining `FIXME(sproul)`s.
- [x] Unit tests for `ProtoBlock::proposer_shuffling_root_for_child_block`.
- [x] End-to-end regression test.
- [x] Test on pre-Fulu network.
- [x] Test on post-Fulu network.


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-09-26 14:44:50 +00:00
Lion - dapplion
ffa7b2b2b9 Only mark block lookups as pending if block is importing from gossip (#8112)
- PR https://github.com/sigp/lighthouse/pull/8045 introduced a regression of how lookup sync interacts with the da_checker.

Now on `unstable`, block import from the HTTP API also inserts the block into the da_checker while the block is being execution-verified. If lookup sync finds the block in the da_checker in the `NotValidated` state, it expects a `GossipBlockProcessResult` message sometime later. That message is only sent after block import via gossip.

I confirmed in our node's logs that 4/4 cases of stuck lookups were caused by this sequence of events:
- Receive block through API, insert into da_checker in fn process_block in put_pre_execution_block
- Create lookup and leave in AwaitingDownload(block in processing cache) state
- Block from HTTP API finishes importing
- Lookup is left stuck

Closes https://github.com/sigp/lighthouse/issues/8104


  - https://github.com/sigp/lighthouse/pull/8110 was my initial solution attempt but we can't send the `GossipBlockProcessResult` event from the `http_api` crate without adding new channels, which seems messy.

For a given node it's rare that a lookup is created at the same time that a block is being published. This PR solves https://github.com/sigp/lighthouse/issues/8104 by allowing lookup sync to import the block twice in that case.


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2025-09-25 03:52:27 +00:00
Jimmy Chen
79b33214ea Only send data column subnet discovery requests after peerdas is scheduled (#8109)
#8105 (to be confirmed)

I noticed a large number of failed discovery requests after deploying the latest `unstable` to some of our testnet and mainnet nodes. This is because of a recent PeerDAS change to attempt to maintain sufficient peers across data column subnets - this shouldn't be enabled on networks without PeerDAS scheduled, otherwise it will keep retrying discovery on these subnets and never succeed.

Also removed some unused files.

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-09-25 02:52:07 +00:00
Eitan Seri-Levi
af274029e8 Run reconstruction inside a scoped rayon pool (#8075)
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
2025-09-24 06:37:34 +00:00
Michael Sproul
1dbc4f861b Refine HTTP status logs (#8098)
Ensure that we don't log a warning for HTTP 202s, which are expected on the blinded block endpoints after Fulu.


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-09-22 05:03:47 +00:00
Jimmy Chen
4efe47b3c3 Rename --subscribe-all-data-column-subnets to --supernode and make it visible in help (#8083)
Rename `--subscribe-all-data-column-subnets` to `--supernode`, as it has now been officially accepted in the spec. Also make it visible in help in preparation for the Fusaka release.

https://github.com/ethereum/consensus-specs/blob/dev/specs/fulu/p2p-interface.md#supernodes

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-09-19 07:01:16 +00:00
Jimmy Chen
78d330e4b7 Consolidate reqresp_pre_import_cache into data_availability_checker (#8045)
This PR consolidates the `reqresp_pre_import_cache` into the `data_availability_checker` for the following reasons:
- the `reqresp_pre_import_cache` suffers from the same TOCTOU bug we had with `data_availability_checker` earlier, and leads to an unbounded memory leak, which we have observed over the last 6 months on some nodes.
- the `reqresp_pre_import_cache` is no longer necessary, because we now hold blocks in the `data_availability_checker` for longer (since #7961), and recent blocks can be served from the DA checker.

This PR also maintains the following functionality:
- Serving pre-executed blocks over RPC - they're now served from the `data_availability_checker` instead.
- Using the cache for de-duplicating lookup requests.

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-09-19 07:01:13 +00:00
Jimmy Chen
4111bcb39b Use scoped rayon pool for backfill chain segment processing (#7924)
Part of #7866

- Continuation of #7921

In the above PR, we enabled rayon for batch KZG verification in chain segment processing. However, using the global rayon thread pool for backfill is likely to create resource contention with higher-priority beacon processor work.


  This PR introduces a dedicated low-priority rayon thread pool `LOW_PRIORITY_RAYON_POOL` and uses it for processing backfill chain segments.

This prevents backfill KZG verification from using the global rayon thread pool and competing with high-priority beacon processor tasks for CPU resources.
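A hedged sketch of the dedicated pool - the name `LOW_PRIORITY_RAYON_POOL` comes from the description above, while the sizing and the stand-in verification closure are illustrative:

```rust
use rayon::prelude::*;
use std::sync::LazyLock;

static LOW_PRIORITY_RAYON_POOL: LazyLock<rayon::ThreadPool> = LazyLock::new(|| {
    rayon::ThreadPoolBuilder::new()
        // Leave most cores free for higher-priority beacon processor work.
        .num_threads(std::thread::available_parallelism().map_or(1, |n| n.get() / 2).max(1))
        .thread_name(|i| format!("low-priority-rayon-{i}"))
        .build()
        .expect("failed to build low-priority rayon pool")
});

fn verify_backfill_batch(columns: &[Vec<u8>]) -> bool {
    // `install` runs the parallel iterator on the dedicated pool instead of
    // rayon's global pool, so backfill work cannot starve other rayon users.
    LOW_PRIORITY_RAYON_POOL.install(|| {
        columns.par_iter().all(|column| !column.is_empty()) // stand-in for KZG verification
    })
}

fn main() {
    let columns = vec![vec![1u8], vec![2], vec![3]];
    assert!(verify_backfill_batch(&columns));
}
```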

However, this PR by itself doesn't prevent CPU oversubscription, because other tasks could still fill up the global rayon thread pool, and having an extra thread pool could make things worse. To address this we need the beacon processor to coordinate total CPU allocation across all tasks, which is covered in:
- #7789


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
2025-09-18 07:10:23 +00:00
Michael Sproul
51321daabb Make the block cache optional (#8066)
Address contention on the store's `block_cache` by allowing it to be disabled when `--block-cache-size 0` is provided, and also making this the default.


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-09-18 07:10:18 +00:00
Michael Sproul
3543a20192 Add experimental complete-blob-backfill flag (#7751)
A different (and complementary) approach for:

- https://github.com/sigp/lighthouse/issues/5391


  This PR adds a flag to set the DA boundary to the Deneb fork. The effect of this change is that Lighthouse will try to backfill _all_ blobs.

Most peers do not have this data, but I'm thinking that combined with `trusted-peers` this could be quite effective.


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-09-18 05:17:03 +00:00
Michael Sproul
684632df73 Fix reprocess queue memory leak (#8065)
Fix a memory leak in the reprocess queue.


If the vec of attestation IDs for a block is never evicted from the reprocess queue by a `BlockImported` event, then it stays in the map forever, consuming memory. The fix is to remove the entry when its last attestation times out. We do the same for light client updates.

In practice this will only occur if there is a race between adding an attestation to the queue and processing the `BlockImported` event, or if there are attestations for block roots that we never import (e.g. random block roots, block roots of invalid blocks).
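A minimal sketch of that eviction rule, using a plain `HashMap` in place of the real reprocess queue types:

```rust
use std::collections::HashMap;

type BlockRoot = [u8; 32];
type AttestationId = u64;

#[derive(Default)]
struct ReprocessQueue {
    awaiting_attestations: HashMap<BlockRoot, Vec<AttestationId>>,
}

impl ReprocessQueue {
    /// Called when a queued attestation times out without its block arriving.
    /// Dropping the map entry once its last attestation expires is what plugs the leak.
    fn on_attestation_timeout(&mut self, root: BlockRoot, id: AttestationId) {
        if let Some(ids) = self.awaiting_attestations.get_mut(&root) {
            ids.retain(|queued| *queued != id);
            if ids.is_empty() {
                self.awaiting_attestations.remove(&root);
            }
        }
    }
}

fn main() {
    let mut queue = ReprocessQueue::default();
    queue.awaiting_attestations.insert([0u8; 32], vec![1, 2]);
    queue.on_attestation_timeout([0u8; 32], 1);
    queue.on_attestation_timeout([0u8; 32], 2);
    assert!(queue.awaiting_attestations.is_empty());
}
```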


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-09-18 05:16:59 +00:00
Eitan Seri-Levi
521be2b757 Prevent silently dropping cell proof chunks (#8023)
Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>
2025-09-18 01:33:42 +00:00
Toki
5928407ce4 fix(rate_limiter): add missing prune calls for light client protocols (#8058)
Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>

Co-Authored-By: gitToki <tokipro@proton.me>
2025-09-17 04:51:43 +00:00
Lion - dapplion
b7d78a91e0 Don't penalize peers for extending ignored chains (#8042)
Lookup sync has a cache of block roots, `failed_chains`. If a peer triggers a lookup for a block or descendant of a root in `failed_chains`, the lookup is dropped and the peer penalized. However, blocks are inserted into `failed_chains` for a single reason:
- If a chain is longer than 32 blocks, the lookup is dropped to prevent OOM risks. However, the peer is not at fault, since discovering an unknown chain longer than 32 blocks is not malicious. We just drop the lookup and sync the blocks via range forward sync.

This discrepancy is probably an oversight from changing old code. Previously we added blocks that failed too many times to process to that cache, but we don't do that anymore. Adding a block that fails too many times to process is an optimization to save resources in rare cases where peers keep sending us invalid blocks. When that happens today, we keep trying to process the block, downscoring the peers and eventually disconnecting them. _If_ we found that optimization to be necessary, we should merge this PR (_Stricter match of BlockError in lookup sync_) first. IMO we are fine without the failed_chains cache, and the ignored_chains cache will be obsolete with [tree sync](https://github.com/sigp/lighthouse/issues/7678), as the OOM risk of long lookup chains no longer exists.

Closes https://github.com/sigp/lighthouse/issues/7577


Rename `failed_chains` to `ignored_chains` and don't penalize peers that trigger lookups for those blocks.


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2025-09-17 01:02:29 +00:00
Jimmy Chen
3de646c8b3 Enable reconstruction for nodes custodying more than 50% of columns and instrument tracing (#8052)
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-09-16 08:17:43 +00:00
Eitan Seri-Levi
242bdfcf12 Add instrumentation to recompute_head_at_slot (#8049)
Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>
2025-09-16 05:18:31 +00:00
Eitan Seri-Levi
aba3627099 Reduce reconstruction queue capacity (#8053)
Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>
2025-09-16 05:18:28 +00:00
Eitan Seri-Levi
4409500f63 Remove column reconstruction when processing rpc requests (#8051)
Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>
2025-09-16 05:18:25 +00:00
Michael Sproul
f04d5ecddd Another check to prevent duplicate block imports (#8050)
Attempt to address performance issues caused by importing the same block multiple times.


- Check fork choice "after" obtaining the fork choice write lock in `BeaconChain::import_block`. We actually use an upgradable read lock, but this is semantically equivalent (the upgradable read has the advantage of not excluding regular reads); see the sketch after the list below.

The hope is that this change has several benefits:

1. By preventing duplicate block imports we save time repeating work inside `import_block` that is unnecessary, e.g. writing the state to disk. Although the store itself now takes some measures to avoid re-writing diffs, it is even better if we avoid a disk write entirely.
2. By returning `DuplicateFullyImported`, we reduce some duplicated work downstream. E.g. if multiple threads importing columns trigger `import_block`, now only _one_ of them will get a notification of the block import completing successfully, and only this one will run `recompute_head`. This should help avoid a situation where multiple beacon processor workers are consumed by threads blocking on the `recompute_head_lock`. However, a similar pile-up is still possible with the upgradable fork choice lock (a large number of threads can be blocked waiting for the first thread to complete block import).
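A simplified sketch of the "check again under the fork choice lock" idea using `parking_lot`'s upgradable read lock; the `HashSet` stands in for fork choice and the real import does far more work:

```rust
use parking_lot::{RwLock, RwLockUpgradableReadGuard};
use std::collections::HashSet;

type BlockRoot = [u8; 32];

#[derive(Debug, PartialEq)]
enum ImportOutcome {
    Imported,
    DuplicateFullyImported,
}

fn import_block(fork_choice: &RwLock<HashSet<BlockRoot>>, root: BlockRoot) -> ImportOutcome {
    // The upgradable read doesn't exclude ordinary readers, but only one thread
    // can hold it, so the duplicate check and the insert are atomic with
    // respect to other importers.
    let fc = fork_choice.upgradable_read();
    if fc.contains(&root) {
        return ImportOutcome::DuplicateFullyImported;
    }
    let mut fc = RwLockUpgradableReadGuard::upgrade(fc);
    fc.insert(root); // stand-in for the actual block import work
    ImportOutcome::Imported
}

fn main() {
    let fork_choice = RwLock::new(HashSet::new());
    assert_eq!(import_block(&fork_choice, [1; 32]), ImportOutcome::Imported);
    assert_eq!(import_block(&fork_choice, [1; 32]), ImportOutcome::DuplicateFullyImported);
}
```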


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-09-16 04:10:42 +00:00
Jimmy Chen
b8178515cd Update engine methods in notifier (#8038)
Fulu uses `getPayloadV5`; this PR updates the notifier logging prior to the fork.

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-09-14 23:41:12 +00:00
Eitan Seri-Levi
aef8291f94 Add max delay to reconstruction (#7976)
#7697


If we're three seconds into the current slot, just trigger reconstruction. I don't know what the correct reconstruction deadline is, but it should probably be at least half a second before the attestation deadline.
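An illustrative timing check only - the 3-second figure is from the description above, while the "enough columns" trigger and the slot-clock plumbing are simplified assumptions:

```rust
use std::time::Duration;

const RECONSTRUCTION_MAX_DELAY: Duration = Duration::from_secs(3);

/// Trigger reconstruction once more than half the columns have arrived, or
/// unconditionally once we are `RECONSTRUCTION_MAX_DELAY` into the slot.
fn should_trigger_reconstruction(
    columns_received: usize,
    total_columns: usize,
    time_into_slot: Duration,
) -> bool {
    columns_received * 2 > total_columns || time_into_slot >= RECONSTRUCTION_MAX_DELAY
}

fn main() {
    assert!(should_trigger_reconstruction(65, 128, Duration::from_secs(1)));
    assert!(should_trigger_reconstruction(40, 128, Duration::from_secs(3)));
    assert!(!should_trigger_reconstruction(40, 128, Duration::from_secs(1)));
}
```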


Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
2025-09-12 06:05:42 +00:00
Jimmy Chen
fb77ce9e19 Add missing event in PendingComponent span and clean up sync logs (#8033)
I was looking into some long `PendingComponents` spans and noticed the block event wasn't added to the span, so it wasn't possible to see from the trace view when the block was added; this PR fixes that.

![PendingComponents span trace view](https://github.com/user-attachments/assets/65040b1c-11e7-43ac-951b-bdfb34b665fb)

Additionally, I've noticed a lot of noise and confusion in sync logs due to the initial `peer_id` being included as part of the syncing chain span, causing all logs under the span to carry that `peer_id`, which may not be accurate for some sync logs. I've removed `peer_id` from the `SyncingChain` span, and also cleaned up a bunch of spans to use `%` (display) for slots and epochs to make logs easier to read.

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-09-12 05:11:30 +00:00
Michael Sproul
a080bb5cee Increase HTTP timeouts on CI (#8031)
Since we re-enabled HTTP API tests on CI (https://github.com/sigp/lighthouse/pull/7943) there have been a few spurious failures:

- https://github.com/sigp/lighthouse/actions/runs/17608432465/job/50024519938?pr=7783

That error is awkward, but running locally with a short timeout confirms it to be a timeout.


  Change the request timeout to 5s everywhere. We had kept it shorter to try to detect performance regressions, but I think this is better suited to being done with metrics & traces. On CI we really just want things to pass reliably without flakiness, so I think a longer timeout to handle slower test code (like mock-builder) and overworked CI boxes makes sense.


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-09-11 00:47:39 +00:00
kevaundray
f71d69755d chore: add comment to PendingComponents (#7979)
Adds a doc comment.

Co-Authored-By: Kevaundray Wedderburn <kevtheappdev@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-09-10 13:48:11 +00:00
Daniel Knopik
ee1b6bc81b Create network_utils crate (#7761)
Anchor currently depends on `lighthouse_network` for a few types and utilities that live within. As we use our own libp2p behaviours, we actually do not use the core logic in that crate. This makes us transitively depend on a bunch of unneeded crates (even a whole separate libp2p if the versions mismatch!)


Move the things we require into their own lightweight crate.


Co-Authored-By: Daniel Knopik <daniel@dknopik.de>
2025-09-10 12:59:24 +00:00
Eitan Seri-Levi
caa1df6fc3 Skip column gossip verification logic during block production (#7973)
#7950


Skip column gossip verification logic during block production, as it's redundant and potentially computationally expensive.


Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-09-10 12:29:56 +00:00
hopinheimer
38205192ca Fix http api tests ci (#7943)
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Michael Sproul <micsproul@gmail.com>

Co-Authored-By: Michael Sproul <michael@sigmaprime.io>

Co-Authored-By: hopinheimer <knmanas6@gmail.com>
2025-09-10 06:46:48 +00:00
Jimmy Chen
8a4f6cf0d5 Instrument tracing on block production code path (#8017)
Partially addresses #7814. Instruments the block production code path.

New root spans:
* `produce_block_v3`
* `produce_block_v2`

Example traces:

![Example block production trace](https://github.com/user-attachments/assets/a9413d25-501c-49dc-95cc-623db5988981)

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-09-10 03:30:51 +00:00
Jimmy Chen
ee734d1456 Fix stuck data column lookups by improving peer selection and retry logic (#8005)
Fixes the issue described in #7980 where Lighthouse repeatedly sends `DataColumnsByRoot` requests to the same peers that return empty responses, causing sync to get stuck.

The root cause was that we don't count empty responses as failures, leading to excessive retries to unresponsive peers.


- Track per-peer attempts to limit retries per peer (`MAX_CUSTODY_PEER_ATTEMPTS = 3`)
- Replaced random peer selection with hashing within each lookup, to prevent splitting the lookup into too many small requests and improve request batching efficiency (see the sketch below)
- Added `single_block_lookup` root span to track all lookups created and added more debug logs:

![single_block_lookup trace screenshot](https://github.com/user-attachments/assets/983629ba-b6d0-41cf-8e93-88a5b96c2f31)
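A sketch of the two ideas above - capping attempts per peer and picking peers deterministically by hashing; `PeerId` and the selection details are simplified stand-ins:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

const MAX_CUSTODY_PEER_ATTEMPTS: u8 = 3;

type PeerId = u64;
type BlockRoot = [u8; 32];

fn select_custody_peer(
    block_root: &BlockRoot,
    candidates: &[PeerId],
    attempts: &HashMap<PeerId, u8>,
) -> Option<PeerId> {
    // Skip peers that have already failed too many times for this lookup.
    let usable: Vec<PeerId> = candidates
        .iter()
        .copied()
        .filter(|peer| attempts.get(peer).copied().unwrap_or(0) < MAX_CUSTODY_PEER_ATTEMPTS)
        .collect();
    if usable.is_empty() {
        return None;
    }
    // Hash the block root so the same lookup keeps hitting the same peer,
    // keeping requests batched rather than fanned out into many small ones.
    let mut hasher = DefaultHasher::new();
    block_root.hash(&mut hasher);
    Some(usable[(hasher.finish() as usize) % usable.len()])
}

fn main() {
    // Peer 7 has exhausted its attempts, so it is never selected.
    let attempts = HashMap::from([(7u64, MAX_CUSTODY_PEER_ATTEMPTS)]);
    let peer = select_custody_peer(&[0; 32], &[7, 8, 9], &attempts);
    assert!(matches!(peer, Some(p) if p != 7));
}
```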


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-09-09 06:18:05 +00:00
Eitan Seri-Levi
8ec2640e04 Don't penalize peers if locally constructed light client data is stale (#7996)
#7994


  We seem to be penalizing peers in situations where locally constructed light client data is stale. This PR ignores incoming light client data if our locally constructed light client data isn't up to date.


Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
2025-09-05 03:23:34 +00:00
Jimmy Chen
9d2f55a399 Fix data column reconstruction error (#7998)
Addresses #7991
2025-09-04 20:17:52 +00:00
Jimmy Chen
677de70025 Fix incorrect prune test logic (#7999)
I just noticed that one of the tests I added in #7915 is incorrect, after it had been running flaky for a bit.
This PR fixes the scenario and ensures the outcome will always be the same.
2025-09-04 19:53:38 +00:00
Pawan Dhananjay
84ec209eba Allow AwaitingDownload to be a valid in-between state (#7984)
N/A


  Extracts (3) from https://github.com/sigp/lighthouse/pull/7946.

Prior to PeerDAS, a batch should never have been in the `AwaitingDownload` state, because we immediately try to move from `AwaitingDownload` to `Downloading` by sending batches. This was always possible as long as we had peers in the `SyncingChain` in the pre-PeerDAS world.

However, this is no longer the case, as a batch can be stuck waiting in `AwaitingDownload` if we have no peers to request the columns from. This PR makes `AwaitingDownload` an allowable in-between state: if a batch is found in this state, we attempt to send it instead of erroring like before.
Note to reviewer: we need to make sure that this doesn't lead to a bunch of batches stuck in the `AwaitingDownload` state if the chain can be progressed.

Backfill already retries all batches in the `AwaitingDownload` state, so we just need to make `AwaitingDownload` a valid state during processing and validation.

This PR explicitly adds the same logic for forward sync, to download batches stuck in `AwaitingDownload`.
Apart from that, we also force a download of the `processing_target` when sync stops progressing. This is required in cases where `self.batches` has more than `BATCH_BUFFER_SIZE` batches waiting to be processed but the `processing_batch` has repeatedly failed at the download/processing stage, which leads to sync getting stuck and never recovering.
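A simplified state-machine sketch of the change described above; the real `BatchState` carries per-variant data and the retry path goes through the syncing chain logic:

```rust
enum BatchState {
    AwaitingDownload,
    Downloading,
    AwaitingProcessing,
}

enum Action {
    SendBatch,
    Wait,
}

/// Previously, finding a batch in `AwaitingDownload` here was treated as an
/// invalid state; now it just means "no peer was available earlier, retry".
fn on_batch_tick(state: &BatchState, have_custody_peers: bool) -> Action {
    match state {
        BatchState::AwaitingDownload if have_custody_peers => Action::SendBatch,
        BatchState::AwaitingDownload => Action::Wait, // still no peers for these columns
        BatchState::Downloading | BatchState::AwaitingProcessing => Action::Wait,
    }
}

fn main() {
    assert!(matches!(
        on_batch_tick(&BatchState::AwaitingDownload, true),
        Action::SendBatch
    ));
}
```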
2025-09-04 07:39:16 +00:00
Jimmy Chen
c2a92f1a8c Maintain peers across all data column subnets (#7915)
Closes:
- #7865
- #7855

Changes extracted from earlier PR #7876

This PR fixes two main things with a few other improvements mentioned below:
- Prevent Lighthouse from repeatedly sending `DataColumnByRoot` requests to an unsynced peer, causing lookup sync to get stuck
- Allows Lighthouse to send discovery requests if there aren't enough **synced** peers in the required sampling subnets - this fixes the stuck-sync scenario where there aren't enough usable peers in a sampling subnet but no discovery is attempted.


- Make peer discovery queries if the custody subnet peer count drops below the minimum threshold (see the sketch below)
- Update peer pruning logic to prioritise uniform distribution across all data column subnets and avoid pruning sampling peers if the count is below the target threshold (2)
- Check sync status when making discovery requests, to make sure we don't ignore requests if there aren't enough synced peers in the required sampling subnets
- Optimise some of the `PeerDB` functions checking custody peers
- Only send lookup requests to peers that are synced or advanced
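A small sketch of the discovery trigger from the first bullet - the target of 2 peers per subnet is taken from the list above, everything else is illustrative:

```rust
const TARGET_CUSTODY_PEERS_PER_SUBNET: usize = 2;

/// Return the data column subnets whose synced-peer count is below target,
/// i.e. the subnets we should issue discovery queries for.
fn subnets_needing_discovery(synced_peers_per_subnet: &[usize]) -> Vec<usize> {
    synced_peers_per_subnet
        .iter()
        .enumerate()
        .filter(|(_, count)| **count < TARGET_CUSTODY_PEERS_PER_SUBNET)
        .map(|(subnet_id, _)| subnet_id)
        .collect()
}

fn main() {
    // Subnets 1 and 3 are below target, so discovery targets them.
    assert_eq!(subnets_needing_discovery(&[2, 1, 3, 0]), vec![1, 3]);
}
```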
2025-09-04 05:36:20 +00:00
Akihito Nakano
7b5be8b1e7 Remove ttfb_timeout and resp_timeout (#7925)
`TTFB_TIMEOUT` was deprecated in https://github.com/ethereum/consensus-specs/pull/3767.


  Remove `ttfb_timeout` from `InboundUpgrade` and other related structs.

(Update) Also removed `resp_timeout`, and removed the `NetworkParams` struct since its fields are no longer used. https://github.com/sigp/lighthouse/pull/7925#issuecomment-3226886352
2025-09-03 02:00:15 +00:00
Jimmy Chen
eef02afc93 Fix data availability checker race condition causing partial data columns to be served over RPC (#7961)
Partially resolves #6439; a simpler alternative to #7931.

The race condition occurs when RPC data columns arrive after a block has been imported and removed from the DA checker:
1. Block becomes available via gossip
2. RPC columns arrive and pass fork choice check (block hasn't been imported)
3. Block import completes (removing block from DA checker)
4. RPC data columns finish verification and get imported into DA checker

This causes two issues:
1. **Partial data serving**: Already imported components get re-inserted, potentially causing LH to serve incomplete data
2. **State cache misses**: Leads to state reconstruction, holding the availability cache write lock longer and increasing race likelihood

### Proposed Changes

1. Never manually remove pending components from the DA checker. Components are only removed via LRU eviction as finality advances. This makes sure we don't run into the issue described above.
2. Use `get` instead of `pop` when recovering the executed block, which prevents cache misses and should reduce the likelihood of the race condition.
3. Refactor the DA checker to drop the write lock as soon as components are added, which should also reduce the likelihood of the race condition (points 2 and 3 are sketched below).
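A sketch of points 2 and 3, using a plain `RwLock`'d map as a stand-in for the availability cache:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

type BlockRoot = [u8; 32];

#[derive(Clone)]
struct PendingComponents {
    columns_received: usize,
}

struct DataAvailabilityChecker {
    cache: RwLock<HashMap<BlockRoot, PendingComponents>>,
}

impl DataAvailabilityChecker {
    fn add_column(&self, root: BlockRoot) -> PendingComponents {
        // Take the write lock only for the insertion/update and release it
        // immediately, rather than holding it across verification work.
        let mut cache = self.cache.write().unwrap();
        let entry = cache
            .entry(root)
            .or_insert(PendingComponents { columns_received: 0 });
        entry.columns_received += 1;
        entry.clone()
    }

    fn recover_block(&self, root: &BlockRoot) -> Option<PendingComponents> {
        // `get` (clone) rather than `pop`: the entry stays cached until LRU
        // eviction at finality, so late RPC columns still find it.
        self.cache.read().unwrap().get(root).cloned()
    }
}

fn main() {
    let da = DataAvailabilityChecker { cache: RwLock::new(HashMap::new()) };
    da.add_column([1; 32]);
    assert!(da.recover_block(&[1; 32]).is_some());
}
```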

**Trade-offs:**

This solution eliminates a few nasty race conditions while staying simple, at the cost of allowing block re-import (which could already happen).

The increase in DA checker memory can be partially offset by a reduction in block cache size if this really becomes an issue (as we now serve recent blocks from the DA checker).
2025-09-02 07:18:23 +00:00
Jimmy Chen
979ed2557c Remove expect usage in kzg_utils (#7957)
Remove `expect` usage in `kzg_utils` to handle the case where the EL sends us an invalid proof size, instead of crashing.
2025-09-01 09:21:26 +00:00
kevaundray
9cc3c0553b chore: small refactor of epoch method (#7902)
Stylistic; mostly using early returns to avoid the nested logic
2025-09-01 09:21:23 +00:00
Jimmy Chen
438fb65d45 Avoid serving validator endpoints while the node is far behind syncing head (#7962)
A performance issue was discovered when devnet-3 was under non-finality - some of the Lighthouse nodes were "stuck" syncing because they were busy handling proposer duties HTTP requests.

These validator requests are higher priority than `Status` processing, and if they take a long time to process, the node won't be able to progress. What's worse, under a long period of non-finality the proposer duties calculation tries to do a state advance over a large number of [slots](d545ddcbc7/beacon_node/beacon_chain/src/beacon_proposer_cache.rs (L183)), causing the node to spend all its CPU time on a task that doesn't really help - the computed duties aren't useful if the node is 20000 slots behind.

To solve this issue, we use the `not_while_syncing` filter to avoid serving these requests until the node is synced. This should allow the node to focus on sync in non-finality situations.
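An illustrative guard only; Lighthouse's real `not_while_syncing` filter is applied in the HTTP API layer, while this sketch is a bare function with stand-in types:

```rust
#[derive(Debug, PartialEq)]
enum SyncState {
    Synced,
    Syncing { slots_behind: u64 },
}

#[derive(Debug, PartialEq)]
enum ApiError {
    NotSynced,
}

/// Reject validator-duty requests while far behind the head, so the node can
/// spend its CPU on sync rather than on expensive, useless state advances.
fn check_synced(state: &SyncState) -> Result<(), ApiError> {
    match state {
        SyncState::Synced => Ok(()),
        SyncState::Syncing { .. } => Err(ApiError::NotSynced),
    }
}

fn main() {
    assert_eq!(check_synced(&SyncState::Synced), Ok(()));
    assert!(check_synced(&SyncState::Syncing { slots_behind: 20000 }).is_err());
}
```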
2025-08-29 03:01:40 +00:00
Jimmy Chen
a134d43446 Use rayon to speed up batch KZG verification (#7921)
Addresses #7866.


  Use Rayon to speed up batch KZG verification during range / backfill sync.

While I was analysing the traces, I also discovered a bug that resulted in only the first 128 columns in a chain segment batch being verified. This PR fixes it, so we might actually observe slower range sync due to more cells being KZG verified.

I've also updated the handling of batch KZG failure to only find the first invalid KZG column when verification fails, as this search gets very expensive during range/backfill sync.
2025-08-29 00:59:40 +00:00
Pawan Dhananjay
b6792d85d2 Reduce backfill batch buffer size (#7958)
N/A


Currently, backfill is allowed to create up to 20 pending batches, which is unnecessarily high IMO. Forward sync, by comparison, only allows a max of 5 batches to be buffered at a time. This PR reduces the backfill batch buffer size to match forward sync.

Having a high number of batches is a little annoying with PeerDAS because we try to create and send 20 requests (even though we process them in a rate-limited manner). Requests with PeerDAS are a lot heavier, as we distribute requests across multiple peers, leading to lots of requests that may keep getting retried. This could take resources away from processing at the head.
2025-08-28 03:31:31 +00:00