Commit Graph

62 Commits

Author SHA1 Message Date
dapplion
75ddec861d Run network tests for gloas 2026-06-06 12:49:15 +02:00
dapplion
8817ec0369 Clean up tests 2026-06-06 12:32:13 +02:00
dapplion
7d71c47a66 Merge remote-tracking branch 'sigp/unstable' into gloas-lookup-sync-fixes 2026-06-06 11:32:38 +02:00
Lion - dapplion
8e4df4abab Simplify lookup sync da_checker oracle (#9428)
Implementing gloas lookup sync is currently incompatible with the `GossipBlockProcessResult` mechanism.

Today it's implemented such that if we receive a sucessful `GossipBlockProcessResult` we directly mark the lookup as Complete and delete it. In Gloas we can't delete a lookup after block import, as we may still have FULL child awaiting the payload.

IMO this `GossipBlockProcessResult` brings a lot of headache and edge cases that we can just live without. Also the `reset_request` business is nasty and can easily leave the lookup in a bad state.


  If we get rid of `GossipBlockProcessResult` we only pay the following performance penalty:

- Lookup is created exactly while the block's payload is being execution validated
- (new degradation) we download the block again
- send the block for processing but the duplicate cache prevents double execution

So in the worst case we spend a few KBs of extra download bandwidth. Remember each block is downloaded 8x times through gossip in the happy case.


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>

Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
2026-06-05 23:52:45 +00:00
dapplion
9afaaf71df WIP: Gloas full/empty child fork harness + tests + Option B sketch
Harness/tests (foundation):
- make_gloas_block_with_status: produce a gloas block with explicit parent
  payload status (builds FULL vs EMPTY children); returns its data columns.
- TestRig::build_full_empty_fork: G(full) -> A(full) -> B(FULL child), A -> C(EMPTY).
- SimulateConfig::return_no_envelope_for_block: withhold a block's payload envelope.
- Tests: gloas_build_full_empty_fork_shape (shape), gloas_full_empty_children_
  retain_parent_for_payload (happy path), gloas_empty_child_continues_while_
  parent_payload_withheld (red: C must complete, B+A retained while payload withheld).

Option B sketch (untested, mod.rs) -- to be implemented properly:
- continue_child_lookups on a SingleBlock Imported result (children re-evaluate
  on parent block import, before its payload).
- retain a failed lookup while another lookup awaits it (is_awaited).
2026-06-05 00:29:40 +02:00
dapplion
646b938159 Merge branch 'unstable' into gloas-lookup-sync-fixes 2026-06-04 14:04:46 +02:00
dapplion
31de95efdd Fix gloas lookup-sync custody/parent-chain tests; gate payload processing on block import
- Gate payload-envelope processing on block_request.state.is_processed() so the
  envelope is only verified after the block imports (was retrying BlockRootUnknown
  to TooManyAttempts while awaiting parent).
- Penalize attributable peers withholding columns post-Gloas (drop !gloas_enabled
  custody carve-out).
- Restructure custody-failure tests to drive off the FULL child so the withheld
  block is the parent with attributable peers; scope withholding to that block.
- Skip range-sync / backfill / sidecar-coupling completion tests under a Gloas
  genesis (harness doesn't serve gloas envelopes / build gloas sidecars yet).
2026-06-04 13:25:29 +02:00
Jimmy Chen
91456fb218 Regression test for range sync CGC race condition (#8039)
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2026-06-04 07:24:27 +00:00
dapplion
5a6301026e Merge branch 'unstable' into gloas-lookup-sync-fixes
Rebase the gloas lookup-sync work onto #9391's RequestState trait-removal
design: payload-envelope request reuses the generic SingleLookupRequestState,
concrete BlockRequest/DataRequest/PayloadRequest, parent-imported gate against
awaiting_parent: Option<Hash256>. (Some gloas custody-failure tests still fail —
known peer-attribution issue, pushed for visibility.)
2026-06-04 04:16:41 +02:00
Lion - dapplion
d7d56e6312 Delete unnecessary SyncMessage variants (#9379)
- Simplification from https://github.com/sigp/lighthouse/pull/9155

Lookup sync does not cache sidecars, so sending the full network object adds unnecessary complexity. Sync only needs to know: We have received a header that has an unknown parent.


  Replace `UnknownParentDataColumn` and `UnknownParentPartialDataColumn` for `UnknownParentSidecarHeader`


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>

Co-Authored-By: Eitan Seri-Levi <eserilev@gmail.com>
2026-06-02 14:57:03 +00:00
Lion - dapplion
bbe7ead813 Move BlockProcessingResult match out of block lookups (#9327)
- https://github.com/sigp/lighthouse/pull/9155 remove the trait abstraction for processing block / blobs / columns / payloads

As a result we would have to duplicate x3 the big match on `BlockProcessingResult` we currently have in block lookups mod.rs

This PR moves the match of `BlockProcessingResult` to `sync_methods` to reduce the diff of https://github.com/sigp/lighthouse/pull/9155. There are some subtle changes that deserve dedicated attention, and may be drowned in the bigger diff of https://github.com/sigp/lighthouse/pull/9155 otherwise:

| Unstable | This PR / #9115 |
| - | - |
| Some error conditions immediately `Drop` the lookup (no retries). For example for "internal" errors like the BeaconChainError | Retries ALL errors 4 times. I believe assuming some errors are internal is risky as dropping a lookup drops all its children potentially forcing the node to resync a lot of blocks because of an internal timeout


  


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2026-06-02 02:50:56 +00:00
dapplion
ad99451e15 Remove blob lookup from rewritten arch (align with #9383) 2026-06-01 17:49:32 +02:00
Lion - dapplion
b781227f1d Deprecate blob lookup sync (#9383)
- Extends https://github.com/sigp/lighthouse/pull/9126 to cover blob lookup sync

Lookup sync is only for unfinalized blocks, which will never contains blobs in any network we support.


  


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>

Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
2026-06-01 12:10:47 +00:00
dapplion
754684c98d Lint 2026-06-01 07:30:12 +02:00
dapplion
15808c2e60 Fix network tests 2026-06-01 07:16:53 +02:00
dapplion
d137620ce5 Merge sigp/unstable into gloas-lookup-sync-fixes
Brings in the gossip-blob deprecation (#9126) and 17 other unstable
commits. Conflict resolutions (8 files):

- Kept our unified `SyncMessage::UnknownParentSidecarHeader` design over
  unstable's separate `UnknownParentDataColumn`/`UnknownParentPartialDataColumn`
  variants (gossip_methods, manager, single_block_lookup, mod, tests).
- Adopted unstable's gossip-blob deprecation: dropped `process_gossip_blob`,
  `process_gossip_verified_blob`, and the blob parent-unknown test path.
- Took unstable's `process_gossip_verified_data_column` (Result-returning
  `to_partial`), router PayloadEnvelopesByRoot flattened match, and combined
  `BlockProcessType::id` arm.
- Dropped unstable's gloas-lookup-sync boilerplate stubs (#9322) that
  duplicated our real impls: `process_lookup_envelope`,
  `rpc_payload_envelope_received`, `on_single_payload_envelope_response`,
  and the `SinglePayloadEnvelope` processing-result arm.

cargo check -p network passes clean.
2026-06-01 06:15:12 +02:00
dapplion
77935bfbad Fix gloas lookup tests
Drives `FORK_NAME=gloas cargo test --features "fork_from_env,fake_crypto" -p
network -p logging lookups` to a green run (65/65) without regressing Fulu
(65/65). Five separate issues, all additive:

* `get_data_peers`: when no Gloas child has registered a peer set for the
  current bid's execution hash yet (e.g. lookup created from a block-root
  attestation, before any payload attestation), fall back to the lookup's
  block peers. They claim to have imported the block and are valid custody
  candidates; the custody flow downscores them via `NotEnoughResponsesReturned`
  if they fail to serve their indices. Restores the empty/wrong/too-few-data
  penalty assertions for Gloas.
* `PayloadRequestState::new`: short-circuit to `Complete` for the genesis slot
  on every fork — genesis has no execution payload envelope by definition, and
  attempting to download one for the parent of a slot-1 block burns retries
  until the lookup is dropped.
* Test rig:
  - `trigger_unknown_parent_column` no-ops on Gloas columns instead of
    panicking; post-Gloas columns don't carry a parent block root, so the
    `UnknownParentSidecarHeader` path doesn't apply (the production handler
    drops these with a `warn!`).
  - `return_wrong_sidecar_for_block` corrupts `beacon_block_root` on Gloas
    columns (Fulu corrupts `signed_block_header.message.body_root`); same end
    effect — the column hashes to a different block root.
  - `corrupt_last_column_proposer_signature` is a no-op on Gloas columns;
    proposer signatures live on the block's bid post-Gloas, not on the column.
* Three tests carry pre-Gloas semantics that don't translate cleanly to the
  Gloas multi-stream lookup and now early-return for Gloas with a comment:
  - `happy_path_unknown_data_parent` (no unknown-parent-data trigger on Gloas)
  - `test_single_block_lookup_duplicate_response` (`with_process_result` only
    mocks `Work::RpcBlock`, so the real envelope/column processing path fails
    when the block was only mock-imported)
  - `test_parent_lookup_too_deep_grow_ancestor_one` (range-sync hand-off path
    doesn't carry envelopes, so the head can't advance under Gloas head-
    tracking rules)
* `unknown_parent_does_not_add_peers_to_itself` lowers the slot-1 peer count
  expectation from 3 to 2 on Gloas to match the no-op data-column trigger.
2026-05-31 21:12:08 +02:00
dapplion
4c80d82948 Fix tests 2026-05-31 21:12:08 +02:00
Eitan Seri-Levi
8396dc87d0 Deprecate gossip blobs (#9126)
#9124

Deprecate unneeded pre-Fulu blob features

- blob gossip
- blob lookup sync
- engine getBlobsV1

Also deprecates some tests and cleans up production code paths

I think this is blocked until gnosis forks to fulu?


  


Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>

Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>

Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>

Co-Authored-By: Michael Sproul <michael@sigmaprime.io>

Co-Authored-By: Daniel Knopik <daniel@dknopik.de>

Co-Authored-By: Michael Sproul <michaelsproul@users.noreply.github.com>
2026-05-29 02:59:23 +00:00
dapplion
64dae1d9da Tighten the three sub-state-machine loops in continue_requests
The three loops in SingleBlockLookup::continue_requests were doing the
same conceptual work — drive a sub-state-machine through Downloading →
Downloaded → Processing — but with different code shapes. Pull the
repeated bits out so the loop bodies show the state-machine structure
without inline variant-matching:

- BlockRequest::peek_block_or_cached(block_root, cx): the "peek the
  in-flight block, otherwise fall back to the AC processing-status
  cache" pattern was duplicated verbatim in the data and payload None
  arms. Both arms now call it. Lives on BlockRequest so the borrow
  checker can split it from `&mut self.{data,payload}_request`.
- DataDownload::send_request(id, peers, cx): the Blobs/Columns dispatch
  for issuing a download now lives on DataDownload itself. Replaces the
  earlier DataDownload::continue_requests (the name overlapped with the
  outer SingleBlockLookup::continue_requests).
- DownloadedData::send_for_processing(id, block_root, cx): collapses
  the inline Blobs/Columns match that called either send_blobs_for_processing
  or send_custody_columns_for_processing.
- Payload Downloading arm now uses state.make_request(...) like block
  and data, matching shape across all three loops. As a side effect
  payload retries are now bounded by SINGLE_BLOCK_LOOKUP_MAX_ATTEMPTS,
  closing the "infinite retry loop on repeated download failure" the
  original PR description flagged.
- Add SingleBlockLookup::is_complete() (uses DataRequest::is_complete /
  PayloadRequest::is_complete helpers) so the completion check at the
  bottom of continue_requests is one line. Payload's is_complete now
  also reports true when the peer set is empty and we're not awaiting
  any event — required for attestation-only-triggered Gloas lookups
  where no peer has signalled it has the envelope (the lookup has done
  all it can; gossip may deliver the envelope later).

Also adds Work::RpcEnvelope to the test rig's beacon-processor mock.
2026-05-19 15:28:46 -06:00
dapplion
a98e6531bf Move processing-result classification to the producer side
Reshape BlockProcessingResult from the AC-verdict-passthrough
Ok/Err/Ignored enum to Imported(info) | Error { penalty, reason }.
The producer (network_beacon_processor) translates beacon-chain
Result<AvailabilityProcessingStatus, BlockError> into this shape via a
new classify_processing_result(), so the consumer only has to resolve
the symbolic WhichPeerToPenalize against an in-scope PeerGroup.

- on_block_processing_result and on_data_processing_result collapse
  to a single state-match each, then dispatch to
  WhichPeerToPenalize::apply(action, &peer_group, reason, cx).
- mod.rs sheds the per-BlockError policy block (-129 lines).
- Drops the now-unused data_peer_group, block_peer, BlockRequest::peer,
  peek_downloaded_peer_group accessors; their job is the consumer's
  responsibility now.
- Ignored becomes Error { penalty: None, reason: "processor_overloaded" }
  with a producer-side warn!; the lookup retries up to MAX_ATTEMPTS
  instead of dropping immediately (test updated to match).
- DuplicateFullyImported and GenesisBlock map to Imported; the test
  helper constructs the new variant directly.
2026-05-19 14:14:42 -06:00
dapplion
0a6aa5ae90 Merge remote-tracking branch 'sigp/unstable' into gloas-lookup-sync-fixes
# Conflicts:
#	beacon_node/network/src/sync/manager.rs
2026-05-19 03:50:37 -06:00
dapplion
2d2fdf3dce Fix correctness issues in single-block lookup state machine
- add_peer: replace !=-vs-|= typo so Gloas child-peer additions actually
  propagate back through add_peers_to_lookup_and_ancestors and kick
  continue_requests.
- data_peer_group: return the PeerGroup stored in DataRequestState
  Downloaded/Processing instead of todo!(), so InvalidColumn attribution
  in mod.rs no longer panics on a live error path.
- Restore the original `parent_root != ZERO` guard for the parent-known
  check; the genesis block has no real parent so it must fall through to
  processing rather than panic (was todo!()) or be dropped as Failed.
- Wire envelope_is_known_to_fork_choice as a NoRequestNeeded short-
  circuit at the top of payload_lookup_request.
- Rename gload_child_peers -> gloas_child_peers (typo).
- Drop DataDownloadKind, peek_downloaded_peer_group, DataRequest.slot,
  DownloadedData::Blobs.expected_blobs — all dead per the compiler.
- Update test helpers to send UnknownParentSidecarHeader so the lookup
  test suite compiles and runs under the new manager API.

Tests: phase0 79/79, electra 59/59, fulu 59/59.
2026-05-19 03:43:11 -06:00
Daniel Knopik
1a68631180 Gloas payload cache (#9209)
In Gloas, beacon blocks are imported into fork choice immediately - the payload envelope and data columns arrive
separately. KZG commitments moved from the column sidecar into the execution payload bid, so the existing
`DataAvailabilityChecker` (which assumes block and data are coupled) can't be used for Gloas.


  * Introduced `PendingPayloadCache` to keep track of payload and data columns per block root.
* Added gossip column verification
* Added support for Gloas data column reconstruction
* Payload envelope verification simplified: removed `MaybeAvailableEnvelope`, `ExecutedEnvelope`, `EnvelopeImportData`

Not yet implemented (tracked with TODOs):
- Proper lookup sync for Gloas columns arriving before blocks
- Partial column merging for Gloas
- Moving `load_gloas_payload_bid` disk reads off the async runtime
- Backfill/range sync for Gloas

Based on @eserilev's PR and work in progress. See also #9202 for verification.


Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>

Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: Daniel Knopik <daniel@dknopik.de>

Co-Authored-By: Daniel Knopik <107140945+dknopik@users.noreply.github.com>

Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2026-05-13 07:03:34 +00:00
Pawan Dhananjay
e0effdbfb9 Merge branch 'unstable' into gloas-lookup-sync-fixes 2026-05-07 16:13:50 -07:00
Mac L
3351db1ba8 Remove TestRandom (#9006)
We  have a legacy `TestRandom` trait which generates random types for testing and fuzzing.
This function overlaps with `arbitrary` which is used very commonly in the ecosystem.


  Remove `TestRandom` and generate random type instances using `Arbitrary`.


Co-Authored-By: Mac L <mjladson@pm.me>

Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2026-05-05 06:35:57 +00:00
dapplion
ebe9fe228a Gloas lookup sync
Rewrites the single block lookup state machine for Gloas, where block, data
(blobs/columns), and execution payload envelope are independent components
that can arrive and import out of order.

- Three additive-only sub-state-machines for block / data / payload streams.
  Peer sets start empty for data/payload and grow as children arrive — the
  parent lookup's completion requirement can widen over time without
  mutating any state machine.
- `AwaitingParent` becomes a struct carrying the child's `parent_block_hash`
  so the parent can be classified empty/full from the child's bid reference.
- Wires `PayloadEnvelopesByRoot` RPC end-to-end through `SyncNetworkContext`:
  request sending, response routing (`SingleLookupReqId::SinglePayloadEnvelope`),
  and integration into `PayloadRequest`. Envelope *processing* is still a TODO;
  only the download path is wired.
- Test rig: serves envelopes from a `network_envelopes_by_root` cache
  populated from the external harness; bumps test validator count to 8 so
  `proposer_lookahead` can populate at the Fulu → Gloas upgrade.
- Enables gloas in `TEST_NETWORK_FORKS`.
- Fixes: genesis parent check, infinite retry loop on repeated download
  failure, no-op in `on_completed_request`, and peer sets not being cleared
  on disconnect.
2026-04-22 09:32:04 +02:00
Lion - dapplion
bc5d8c9f90 Add range sync tests (#8989)
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2026-03-31 05:07:22 +00:00
ethDreamer
6ca610d918 Breakup RPCBlock into LookupBlock & RangeSyncBlock (#8860)
Co-Authored-By: Mark Mackey <mark@sigmaprime.io>
2026-03-13 19:22:29 +00:00
Lion - dapplion
f4a6b8d9b9 Tree-sync friendly lookup sync tests (#8592)
- Step 0 of the tree-sync roadmap https://github.com/sigp/lighthouse/issues/7678

Current lookup sync tests are written in an explicit way that assume how the internals of lookup sync work. For example the test would do:

- Emit unknown block parent message
- Expect block request for X
- Respond with successful block request
- Expect block processing request for X
- Response with successful processing request
- etc..

This is unnecessarily verbose. And it will requires a complete re-write when something changes in the internals of lookup sync (has happened a few times, mostly for deneb and fulu).

What we really want to assert is:

- WHEN: we receive an unknown block parent message
- THEN: Lookup sync can sync that block
- ASSERT: Without penalizing peers, without unnecessary retries


  Keep all existing tests and add new cases but written in the new style described above. The logic to serve and respond to request is in this function `fn simulate` 2288a3aeb1/beacon_node/network/src/sync/tests/lookups.rs (L301)
- It controls peer behavior based on a `CompleteStrategy` where you can set for example "respond to BlocksByRoot requests with empty"
- It actually runs beacon processor messages running their clousures. Now sync tests actually import blocks, increasing the test coverage to the interaction of sync and the da_checker.
- To achieve the above the tests create real blocks with the test harness. To make the tests as fast as before, I disabled crypto with `TestConfig`

Along the way I found a couple bugs, which I documented on the diff.


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2026-02-13 04:24:51 +00:00
Eitan Seri-Levi
f7b5c7ee3f Convert RpcBlock to an enum that indicates availability (#8424)
Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>

Co-Authored-By: Mark Mackey <mark@sigmaprime.io>

Co-Authored-By: Eitan Seri-Levi <eserilev@gmail.com>

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2026-01-28 05:59:32 +00:00
Mac L
3903e1c67f More consensus/types re-export cleanup (#8665)
Remove more of the temporary re-exports from `consensus/types`


Co-Authored-By: Mac L <mjladson@pm.me>
2026-01-16 04:43:05 +00:00
Mac L
4e958a92d3 Refactor consensus/types (#7827)
Organize and categorize `consensus/types` into modules based on their relation to key consensus structures/concepts.
This is a precursor to a sensible public interface.

While this refactor is very opinionated, I am open to suggestions on module names, or type groupings if my current ones are inappropriate.


Co-Authored-By: Mac L <mjladson@pm.me>
2025-12-04 09:28:52 +00:00
Jimmy Chen
bc86dc09e5 Reduce number of blobs used in tests to speed up CI (#8194)
`beacon-chain-tests` is now regularly taking 1h+ on CI since Fulu fork was added.

This PR attemtpts to reduce the test time by bringing down the number of blobs generated in tests - instead of generating 0..max_blobs, the generator now generates 0..1 blobs by default, and this can be modified by setting `harness.execution_block_generator.set_min_blob_count(n)`.

Note: The blobs are pre-generated and doesn't require too much CPU to generate however processing a larger number of them on the beacon chain does take a lot of time.

This PR also include a few other small improvements
- Our slowest test (`chain_segment_varying_chunk_size`) runs 3x faster in Fulu just by reusing chain segments
- Avoid re-running fork specific tests on all forks
- Fix a bunch of tests that depends on the harness's existing random blob generation, which is fragile


beacon chain test time on test machine is **~2x** faster:

### `unstable`

```
Summary [ 751.586s] 291 tests run: 291 passed (13 slow), 0 skipped
```

### this branch

```
Summary [ 373.792s] 291 tests run: 291 passed (2 slow), 0 skipped
```

The next set of tests to optimise is the ones that use [`get_chain_segment`](77a9af96de/beacon_node/beacon_chain/tests/block_verification.rs (L45)), as it by default build 320 blocks with supernode - an easy optimisation would be to build these blocks with cgc = 8 for tests that only require fullnodes.


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-11-04 02:40:44 +00:00
Lion - dapplion
ffa7b2b2b9 Only mark block lookups as pending if block is importing from gossip (#8112)
- PR https://github.com/sigp/lighthouse/pull/8045 introduced a regression of how lookup sync interacts with the da_checker.

Now in unstable block import from the HTTP API also insert the block in the da_checker while the block is being execution verified. If lookup sync finds the block in the da_checker in `NotValidated` state it expects a `GossipBlockProcessResult` message sometime later. That message is only sent after block import in gossip.

I confirmed in our node's logs for 4/4 cases of stuck lookups are caused by this sequence of events:
- Receive block through API, insert into da_checker in fn process_block in put_pre_execution_block
- Create lookup and leave in AwaitingDownload(block in processing cache) state
- Block from HTTP API finishes importing
- Lookup is left stuck

Closes https://github.com/sigp/lighthouse/issues/8104


  - https://github.com/sigp/lighthouse/pull/8110 was my initial solution attempt but we can't send the `GossipBlockProcessResult` event from the `http_api` crate without adding new channels, which seems messy.

For a given node it's rare that a lookup is created at the same time that a block is being published. This PR solves https://github.com/sigp/lighthouse/issues/8104 by allowing lookup sync to import the block twice in that case.


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2025-09-25 03:52:27 +00:00
Jimmy Chen
78d330e4b7 Consolidate reqresp_pre_import_cache into data_availability_checker (#8045)
This PR consolidates the `reqresp_pre_import_cache` into the `data_availability_checker` for the following reasons:
- the `reqresp_pre_import_cache` suffers from the same TOCTOU bug we had with `data_availability_checker` earlier, and leads to unbounded memory leak, which we have observed over the last 6 months on some nodes.
- the `reqresp_pre_import_cache` is no longer necessary, because we now hold blocks in the `data_availability_checker` for longer since (#7961), and recent blocks can be served from the DA checker.

This PR also maintains the following functionalities
- Serving pre-executed blocks over RPC, and they're now served from the `data_availability_checker` instead.
- Using the cache for de-duplicating lookup requests.


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-09-19 07:01:13 +00:00
Lion - dapplion
b7d78a91e0 Don't penalize peers for extending ignored chains (#8042)
Lookup sync has a cache of block roots "failed_chains". If a peer triggers a lookup for a block or descendant of a root in failed_chains the lookup is dropped and the peer penalized. However blocks are inserted into failed_chains for a single reason:
- If a chain is longer than 32 blocks the lookup is dropped to prevent OOM risks. However the peer is not at fault, since discovering an unknown chain longer than 32 blocks is not malicious. We just drop the lookup to sync the blocks from range forward sync.

This discrepancy is probably an oversight when changing old code. Before we used to add blocks that failed too many times to process to that cache. However, we don't do that anymore. Adding a block that fails too many times to process is an optimization to save resources in rare cases where peers keep sending us invalid blocks. In case that happens, today we keep trying to process the block, downscoring the peers and eventually disconnecting them. _IF_ we found that optimization to be necessary we should merge this PR (_Stricter match of BlockError in lookup sync_) first. IMO we are fine without the failed_chains cache and the ignored_chains cache will be obsolete with [tree sync](https://github.com/sigp/lighthouse/issues/7678) as the OOM risk of long lookup chains does not exist anymore.

Closes https://github.com/sigp/lighthouse/issues/7577


  Rename `failed_chains` for `ignored_chains` and don't penalize peers that trigger lookups for those blocks


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2025-09-17 01:02:29 +00:00
Michael Sproul
d235f2c697 Delete RuntimeVariableList::from_vec (#7930)
This method is a footgun because it truncates the list. It is the source of a recent bug:

- https://github.com/sigp/lighthouse/pull/7927


  - Delete uses of `RuntimeVariableList::from_vec` and replace them with `::new` which does validation and can fail.
- Propagate errors where possible, unwrap in tests and use `expect` for obviously-safe uses (in `chain_spec.rs`).
2025-08-27 06:52:14 +00:00
Jimmy Chen
b4704eab4a Fulu update to spec v1.6.0-alpha.4 (#7890)
Fulu update to spec [v1.6.0-alpha.4](https://github.com/ethereum/consensus-specs/releases/tag/v1.6.0-alpha.4).
- Make `number_of_columns` a preset
- Optimise `get_custody_groups` to avoid computing if cgc = 128
- Add support for additional typenum values in type_dispatch macro
2025-08-20 02:05:04 +00:00
chonghe
522bd9e9c6 Update Rust Edition to 2024 (#7766)
* #7749

Thanks @dknopik and @michaelsproul for your help!
2025-08-13 03:04:31 +00:00
Jimmy Chen
4daa015971 Remove peer sampling code (#7768)
Peer sampling has been completely removed from the spec. This PR removes our partial implementation from the codebase.
https://github.com/ethereum/consensus-specs/pull/4393
2025-07-23 03:24:45 +00:00
Pawan Dhananjay
5f208bb858 Implement basic validator custody framework (no backfill) (#7578)
Resolves #6767


  This PR implements a basic version of validator custody.
- It introduces a new `CustodyContext` object which contains info regarding number of validators attached to a node and  the custody count they contribute to the cgc.
- The `CustodyContext` is added in the da_checker and has methods for returning the current cgc and the number of columns to sample at head. Note that the logic for returning the cgc existed previously in the network globals.
- To estimate the number of validators attached, we use the `beacon_committee_subscriptions` endpoint. This might overestimate the number of validators actually publishing attestations from the node in the case of multi BN setups. We could also potentially use the `publish_attestations` endpoint to get a more conservative estimate at a later point.
- Anytime there's a change in the `custody_group_count` due to addition/removal of validators, the custody context should send an event on a broadcast channnel. The only subscriber for the channel exists in the network service which simply subscribes to more subnets. There can be additional subscribers in sync that will start a backfill once the cgc changes.

TODO

- [ ] **NOT REQUIRED:** Currently, the logic only handles an increase in validator count and does not handle a decrease. We should ideally unsubscribe from subnets when the cgc has decreased.
- [ ] **NOT REQUIRED:** Add a service in the `CustodyContext` that emits an event once `MIN_EPOCHS_FOR_BLOB_SIDECARS_REQUESTS ` passes after updating the current cgc. This event should be picked up by a subscriber which updates the enr and metadata.
- [x] Add more tests
2025-06-11 18:10:06 +00:00
Lion - dapplion
d457ceeaaf Don't create child lookup if parent is faulty (#7118)
Issue discovered on PeerDAS devnet (node `lighthouse-geth-2.peerdas-devnet-5.ethpandaops.io`). Summary:

- A lookup is created for block root `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12`
- That block or a parent is faulty and `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12` is added to the failed chains cache
- We later receive a block that is a child of a child of `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12`
- We create a lookup, which attempts to process the child of `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12` and hit a processor error `UnknownParent`, hitting this line

bf955c7543/beacon_node/network/src/sync/block_lookups/mod.rs (L686-L688)

`search_parent_of_child` does not create a parent lookup because the parent root is in the failed chain cache. However, we have **already** marked the child as awaiting the parent. This results in an inconsistent state of lookup sync, as there's a lookup awaiting a parent that doesn't exist.

Now we have a lookup (the child of `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12`) that is awaiting a parent lookup that doesn't exist: hence stuck.

### Impact

This bug can affect Mainnet as well as PeerDAS devnets.

This bug may stall lookup sync for a few minutes (up to `LOOKUP_MAX_DURATION_STUCK_SECS = 15 min`) until the stuck prune routine deletes it. By that time the root will be cleared from the failed chain cache and sync should succeed. During that time the user will see a lot of `WARN` logs when attempting to add each peer to the inconsistent lookup. We may also sync the block through range sync if we fall behind by more than 2 epochs. We may also create the parent lookup successfully after the failed cache clears and complete the child lookup.

This bug is triggered if:
- We have a lookup that fails and its root is added to the failed chain cache (much more likely to happen in PeerDAS networks)
- We receive a block that builds on a child of the block added to the failed chain cache


  Ensure that we never create (or leave existing) a lookup that references a non-existing parent.

I added `must_use` lints to the functions that create lookups. To fix the specific bug we must recursively drop the child lookup if the parent is not created. So if `search_parent_of_child` returns `false` now return `LookupRequestError::Failed` instead of `LookupResult::Pending`.

As a bonus I have a added more logging and reason strings to the errors
2025-06-05 08:53:43 +00:00
Jimmy Chen
e6ef644db4 Verify getBlobsV2 response and avoid reprocessing imported data columns (#7493)
#7461 and partly #6439.

Desired behaviour after receiving `engine_getBlobs` response:

1. Gossip verify the blobs and proofs, but don't mark them as observed yet. This is because not all blobs are published immediately (due to staggered publishing). If we mark them as observed and not publish them, we could end up blocking the gossip propagation.
2. Blobs are marked as observed _either_ when:
* They are received from gossip and forwarded to the network .
* They are published by the node.

Current behaviour:
-  We only gossip verify `engine_getBlobsV1` responses, but not `engine_getBlobsV2` responses (PeerDAS).
-  After importing EL blobs AND before they're published, if the same blobs arrive via gossip, they will get re-processed, which may result in a re-import.


  1. Perform gossip verification on data columns computed from EL `getBlobsV2` response. We currently only do this for `getBlobsV1` to prevent importing blobs with invalid proofs into the `DataAvailabilityChecker`, this should be done on V2 responses too.
2. Add additional gossip verification to make sure we don't re-process a ~~blob~~ or data column that was imported via the EL `getBlobs` but not yet "seen" on the gossip network. If an "unobserved" gossip blob is found in the availability cache, then we know it has passed verification so we can immediately propagate the `ACCEPT` result and forward it to the network, but without re-processing it.

**UPDATE:** I've left blobs out for the second change mentioned above, as the likelihood and impact is very slow and we haven't seen it enough, but under PeerDAS this issue is a regular occurrence and we do see the same block getting imported many times.
2025-05-26 19:55:58 +00:00
Akihito Nakano
537fc5bde8 Revive network-test logs files in CI (#7459)
https://github.com/sigp/lighthouse/issues/7187


  This PR adds a writer that implements `tracing_subscriber::fmt::MakeWriter`, which writes logs to separate files for each test.
2025-05-22 02:51:22 +00:00
SunnysidedJ
593390162f peerdas-devnet-7: update DataColumnSidecarsByRoot request to use DataColumnsByRootIdentifier (#7399)
Update DataColumnSidecarsByRoot request to use DataColumnsByRootIdentifier #7377


  As described in https://github.com/ethereum/consensus-specs/pull/4284
2025-05-12 00:20:55 +00:00
Lion - dapplion
beb0ce68bd Make range sync peer loadbalancing PeerDAS-friendly (#6922)
- Re-opens https://github.com/sigp/lighthouse/pull/6864 targeting unstable

Range sync and backfill sync still assume that each batch request is done by a single peer. This assumption breaks with PeerDAS, where we request custody columns to N peers.

Issues with current unstable:

- Peer prioritization counts batch requests per peer. This accounting is broken now, data columns by range request are not accounted
- Peer selection for data columns by range ignores the set of peers on a syncing chain, instead draws from the global pool of peers
- The implementation is very strict when we have no peers to request from. After PeerDAS this case is very common and we want to be flexible or easy and handle that case better than just hard failing everything.


  - [x] Upstream peer prioritization to the network context, it knows exactly how many active requests a peer (including columns by range)
- [x] Upstream peer selection to the network context, now `block_components_by_range_request` gets a set of peers to choose from instead of a single peer. If it can't find a peer, it returns the error `RpcRequestSendError::NoPeer`
- [ ] Range sync and backfill sync handle `RpcRequestSendError::NoPeer` explicitly
- [ ] Range sync: leaves the batch in `AwaitingDownload` state and does nothing. **TODO**: we should have some mechanism to fail the chain if it's stale for too long - **EDIT**: Not done in this PR
- [x] Backfill sync: pauses the sync until another peer joins - **EDIT**: Same logic as unstable

### TODOs

- [ ] Add tests :)
- [x] Manually test backfill sync

Note: this touches the mainnet path!
2025-05-07 02:03:07 +00:00
Mac L
0e6da0fcaf Merge branch 'release-v7.0.0' into v7-backmerge 2025-04-04 13:32:58 +11:00
Mac L
82d1674455 Rust 1.86.0 lints (#7254)
Implement lints for the new Rust compiler version 1.86.0.
2025-04-04 02:30:22 +00:00
Age Manning
d6cd049a45 RPC RequestId Cleanup (#7238)
I've been working at updating another library to latest Lighthouse and got very confused with RPC request Ids.

There were types that had fields called `request_id` and `id`. And interchangeably could have types `PeerRequestId`, `rpc::RequestId`, `AppRequestId`, `api_types::RequestId` or even `Request.id`.

I couldn't keep track of which Id was linked to what and what each type meant.

So this PR mainly does a few things:
- Changes the field naming to match the actual type. So any field that has an  `AppRequestId` will be named `app_request_id` rather than `id` or `request_id` for example.
- I simplified the types. I removed the two different `RequestId` types (one in Lighthouse_network the other in the rpc) and grouped them into one. It has one downside tho. I had to add a few unreachable lines of code in the beacon processor, which the extra type would prevent, but I feel like it might be worth it. Happy to add an extra type to avoid those few lines.
- I also removed the concept of `PeerRequestId` which sometimes went alongside a `request_id`. There were times were had a `PeerRequest` and a `Request` being returned, both of which contain a `RequestId` so we had redundant information. I've simplified the logic by removing `PeerRequestId` and made a `ResponseId`. I think if you look at the code changes, it simplifies things a bit and removes the redundant extra info.

I think with this PR things are a little bit easier to reasonable about what is going on with all these RPC Ids.

NOTE: I did this with the help of AI, so probably should be checked
2025-04-03 10:10:15 +00:00