Commit Graph

558 Commits

Author SHA1 Message Date
Eitan Seri-Levi
74938661d9 Merge branch 'unstable' of https://github.com/sigp/lighthouse into glamsterdam-devnet-5 2026-06-03 22:26:52 +03:00
Lion - dapplion
eab5163d68 Remove RequestState trait from lookup sync (#9391)
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2026-06-03 15:29:04 +00:00
Eitan Seri-Levi
c4fc6fb0c5 Disable peer penalties 2026-06-03 10:13:15 +03:00
Lion - dapplion
d7d56e6312 Delete unnecessary SyncMessage variants (#9379)
- Simplification from https://github.com/sigp/lighthouse/pull/9155

Lookup sync does not cache sidecars, so sending the full network object adds unnecessary complexity. Sync only needs to know: We have received a header that has an unknown parent.


  Replace `UnknownParentDataColumn` and `UnknownParentPartialDataColumn` for `UnknownParentSidecarHeader`


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>

Co-Authored-By: Eitan Seri-Levi <eserilev@gmail.com>
2026-06-02 14:57:03 +00:00
Lion - dapplion
bbe7ead813 Move BlockProcessingResult match out of block lookups (#9327)
- https://github.com/sigp/lighthouse/pull/9155 remove the trait abstraction for processing block / blobs / columns / payloads

As a result we would have to duplicate x3 the big match on `BlockProcessingResult` we currently have in block lookups mod.rs

This PR moves the match of `BlockProcessingResult` to `sync_methods` to reduce the diff of https://github.com/sigp/lighthouse/pull/9155. There are some subtle changes that deserve dedicated attention, and may be drowned in the bigger diff of https://github.com/sigp/lighthouse/pull/9155 otherwise:

| Unstable | This PR / #9115 |
| - | - |
| Some error conditions immediately `Drop` the lookup (no retries). For example for "internal" errors like the BeaconChainError | Retries ALL errors 4 times. I believe assuming some errors are internal is risky as dropping a lookup drops all its children potentially forcing the node to resync a lot of blocks because of an internal timeout


  


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2026-06-02 02:50:56 +00:00
Lion - dapplion
b781227f1d Deprecate blob lookup sync (#9383)
- Extends https://github.com/sigp/lighthouse/pull/9126 to cover blob lookup sync

Lookup sync is only for unfinalized blocks, which will never contains blobs in any network we support.


  


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>

Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
2026-06-01 12:10:47 +00:00
Lion - dapplion
f0aaf65553 Use correct slot in custody request (#9380)
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2026-06-01 06:17:00 +00:00
Eitan Seri-Levi
8396dc87d0 Deprecate gossip blobs (#9126)
#9124

Deprecate unneeded pre-Fulu blob features

- blob gossip
- blob lookup sync
- engine getBlobsV1

Also deprecates some tests and cleans up production code paths

I think this is blocked until gnosis forks to fulu?


  


Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>

Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>

Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>

Co-Authored-By: Michael Sproul <michael@sigmaprime.io>

Co-Authored-By: Daniel Knopik <daniel@dknopik.de>

Co-Authored-By: Michael Sproul <michaelsproul@users.noreply.github.com>
2026-05-29 02:59:23 +00:00
Eitan Seri-Levi
0565a01633 Gloas dont enforce peer column custody on block import (#9341)
Peers that advertise that they have imported a block may not have the columns for that slot available post-Gloas. Ensure that we dont penalize them.


  


Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
2026-05-25 00:21:17 +00:00
Eitan Seri-Levi
5045e8dd85 Custody backfill sync only penalize peers once per batch (#9340)
During custody backfill sync if a peer fails to serve columns for a batch don't penalize them more than once per batch


  


Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
2026-05-22 17:50:50 +00:00
Lion - dapplion
1caaa10fa8 Drop unused EthSpec generic from Stores (#9281)
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2026-05-21 08:35:35 +00:00
Lion - dapplion
2c76ee5b6b Gloas lookup sync boilerplate (#9322)
Implements the boring boilerplate to send envelopes by root requests and process them. Pre-step to

- https://github.com/sigp/lighthouse/pull/9155


  


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2026-05-20 12:56:49 +00:00
Daniel Knopik
1a68631180 Gloas payload cache (#9209)
In Gloas, beacon blocks are imported into fork choice immediately - the payload envelope and data columns arrive
separately. KZG commitments moved from the column sidecar into the execution payload bid, so the existing
`DataAvailabilityChecker` (which assumes block and data are coupled) can't be used for Gloas.


  * Introduced `PendingPayloadCache` to keep track of payload and data columns per block root.
* Added gossip column verification
* Added support for Gloas data column reconstruction
* Payload envelope verification simplified: removed `MaybeAvailableEnvelope`, `ExecutedEnvelope`, `EnvelopeImportData`

Not yet implemented (tracked with TODOs):
- Proper lookup sync for Gloas columns arriving before blocks
- Partial column merging for Gloas
- Moving `load_gloas_payload_bid` disk reads off the async runtime
- Backfill/range sync for Gloas

Based on @eserilev's PR and work in progress. See also #9202 for verification.


Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>

Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: Daniel Knopik <daniel@dknopik.de>

Co-Authored-By: Daniel Knopik <107140945+dknopik@users.noreply.github.com>

Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2026-05-13 07:03:34 +00:00
Mac L
3351db1ba8 Remove TestRandom (#9006)
We  have a legacy `TestRandom` trait which generates random types for testing and fuzzing.
This function overlaps with `arbitrary` which is used very commonly in the ecosystem.


  Remove `TestRandom` and generate random type instances using `Arbitrary`.


Co-Authored-By: Mac L <mjladson@pm.me>

Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2026-05-05 06:35:57 +00:00
chonghe
8d77b1c08d Remove test_logger feature (#9125)
- #9107


  Remove all instances of `test_logger` in the code


Co-Authored-By: Tan Chee Keong <tanck@sigmaprime.io>
2026-04-30 04:57:47 +00:00
Daniel Knopik
8a384ff445 Cell Dissemination (Partial messages) (#8314)
- https://github.com/ethereum/consensus-specs/pull/4558
- https://eips.ethereum.org/EIPS/eip-8136


  


Co-Authored-By: Daniel Knopik <daniel@dknopik.de>

Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2026-04-23 18:52:28 +00:00
Pawan Dhananjay
4cb3ffed8d Rust 1.95 lints (#9142)
N/A


  Adds lints for rust 1.95. Mostly cosmetic.
1. .zip(a.into_iter()) -> .zip(a) . Also a few more places where into_iter is not required
2. replace sort_by with sort_by_key
3. move if statements inside match block.
4. use checked_div instead of if statements. I think this is debatable in terms of being better, happy to remove it if others also feel its unnecessary


Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
2026-04-16 23:50:20 +00:00
Mac L
7c2dcfc0d6 Refactor timestamp_now (#9094)
#9077


  Where possible replaces all instances of `validator_monitor::timestamp_now` with `chain.slot_clock.now_duration().unwrap_or_default()`.
Where chain/slot_clock is not available, instead replace it with a convenience function `slot_clock::timestamp_now`.
Remove the `validator_monitor::timestamp_now` function.


Co-Authored-By: Mac L <mjladson@pm.me>
2026-04-09 08:41:02 +00:00
Lion - dapplion
bc5d8c9f90 Add range sync tests (#8989)
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2026-03-31 05:07:22 +00:00
ethDreamer
6ca610d918 Breakup RPCBlock into LookupBlock & RangeSyncBlock (#8860)
Co-Authored-By: Mark Mackey <mark@sigmaprime.io>
2026-03-13 19:22:29 +00:00
Michael Sproul
7e69f6ca1c Merge remote-tracking branch 'origin/release-v8.1' into back-merge-v8.1.1 2026-03-02 09:20:25 +11:00
Jimmy Chen
341682e719 Add unit tests for BatchInfo and fix doc comments (#8873)
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2026-02-24 00:15:39 +00:00
0xMushow
2b214175d5 Enforce stricter checks on certain constants (#8500)
Which issue # does this PR address?
None


  All of these are performing a check, and adding a batch, or creating a new lookup, or a new query, etc..
Hence all of these limits would be off by one.

Example:

```rust
// BACKFILL_BATCH_BUFFER_SIZE = 5
if self.batches.iter().filter(...).count() >= BACKFILL_BATCH_BUFFER_SIZE {
return None;  // ← REJECT
}
// ... later adds batch via Entry::Vacant(entry).insert(...)
```

Without the `>` being changed to a `>=` , we would allow 6. The same idea applies to all changes proposed.


Co-Authored-By: Antoine James <antoine@ethereum.org>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2026-02-23 02:02:56 +00:00
Jimmy Chen
4588971085 Add sync batch state metrics (#8847)
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2026-02-19 01:57:53 +00:00
Pawan Dhananjay
561898fc1c Process head_chains in descending order of number of peers (#8859)
N/A


  Another find by @gitToki. Sort the preferred_ids in descending order as originally intended from the comment in the function.


Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
2026-02-19 00:38:56 +00:00
Lion - dapplion
f4a6b8d9b9 Tree-sync friendly lookup sync tests (#8592)
- Step 0 of the tree-sync roadmap https://github.com/sigp/lighthouse/issues/7678

Current lookup sync tests are written in an explicit way that assume how the internals of lookup sync work. For example the test would do:

- Emit unknown block parent message
- Expect block request for X
- Respond with successful block request
- Expect block processing request for X
- Response with successful processing request
- etc..

This is unnecessarily verbose. And it will requires a complete re-write when something changes in the internals of lookup sync (has happened a few times, mostly for deneb and fulu).

What we really want to assert is:

- WHEN: we receive an unknown block parent message
- THEN: Lookup sync can sync that block
- ASSERT: Without penalizing peers, without unnecessary retries


  Keep all existing tests and add new cases but written in the new style described above. The logic to serve and respond to request is in this function `fn simulate` 2288a3aeb1/beacon_node/network/src/sync/tests/lookups.rs (L301)
- It controls peer behavior based on a `CompleteStrategy` where you can set for example "respond to BlocksByRoot requests with empty"
- It actually runs beacon processor messages running their clousures. Now sync tests actually import blocks, increasing the test coverage to the interaction of sync and the da_checker.
- To achieve the above the tests create real blocks with the test harness. To make the tests as fast as before, I disabled crypto with `TestConfig`

Along the way I found a couple bugs, which I documented on the diff.


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2026-02-13 04:24:51 +00:00
Lion - dapplion
8d72cc34eb Add sync request metrics (#7790)
Add error rates metrics on unstable to benchmark against tree-sync. In my branch there are frequent errors but mostly connections errors as the node is still finding it set of stable peers.

These metrics are very useful and unstable can benefit from them ahead of tree-sync


  Add three new metrics:
- sync_rpc_requests_success_total: Total count of sync RPC requests successes
- sync_rpc_requests_error_total: Total count of sync RPC requests errors
- sync_rpc_request_duration_sec: Time to complete a successful sync RPC requesst


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2026-02-10 23:40:01 +00:00
Eitan Seri-Levi
f7b5c7ee3f Convert RpcBlock to an enum that indicates availability (#8424)
Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>

Co-Authored-By: Mark Mackey <mark@sigmaprime.io>

Co-Authored-By: Eitan Seri-Levi <eserilev@gmail.com>

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2026-01-28 05:59:32 +00:00
Eitan Seri-Levi
9bec8df37a Add Gloas data column support (#8682)
Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>

Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>
2026-01-28 04:52:12 +00:00
Jimmy Chen
7f065009a7 Implement custom OpenTelemetry sampler to filter uninstrumented traces (#8647)
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2026-01-22 05:11:26 +00:00
Mac L
3903e1c67f More consensus/types re-export cleanup (#8665)
Remove more of the temporary re-exports from `consensus/types`


Co-Authored-By: Mac L <mjladson@pm.me>
2026-01-16 04:43:05 +00:00
Lion - dapplion
6166ad2eb2 Replace tracing::debug! with debug! same for other levels (#8300)
Just visual clean-up, making logging statements look uniform. There's no reason to use `tracing::debug` instead of `debug`. If we ever need to migrate our logging lib in the future it would make things easier too.


  


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Michael Sproul <michaelsproul@users.noreply.github.com>
2026-01-08 07:04:44 +00:00
0xMushow
afa6457acf fix visual bug on visualize_batch_state leading to a non-wanted comma (#8499)
Which issue # does this PR address?
None


  The `visualize_batch_state`  functions uses the following loop `for mut batch_index in 0..BATCH_BUFFER_SIZE`, making it from `0` to `BATCH_BUFFER_SIZE - 1` (behind the scenes).

Hence we would never hit the following condition:

```rust
if batch_index != BATCH_BUFFER_SIZE {
visualization_string.push(',');
}
```

Replacing `!=` with `<` & `BATCH_BUFFER_SIZE -1` allows for the following change:

`[A,B,C,D,E,]` to become: `[A,B,C,D,E]`


Co-Authored-By: Antoine James <antoine@ethereum.org>
2025-12-15 05:14:15 +00:00
Mac L
f3fd1f210b Remove consensus/types re-exports (#8540)
There are certain crates which we re-export within `types` which creates a fragmented DevEx, where there are various ways to import the same crates.

```rust
// consensus/types/src/lib.rs
pub use bls::{
AggregatePublicKey, AggregateSignature, Error as BlsError, Keypair, PUBLIC_KEY_BYTES_LEN,
PublicKey, PublicKeyBytes, SIGNATURE_BYTES_LEN, SecretKey, Signature, SignatureBytes,
get_withdrawal_credentials,
};
pub use context_deserialize::{ContextDeserialize, context_deserialize};
pub use fixed_bytes::FixedBytesExtended;
pub use milhouse::{self, List, Vector};
pub use ssz_types::{BitList, BitVector, FixedVector, VariableList, typenum, typenum::Unsigned};
pub use superstruct::superstruct;
```

This PR removes these re-exports and makes it explicit that these types are imported from a non-`consensus/types` crate.


Co-Authored-By: Mac L <mjladson@pm.me>
2025-12-09 07:13:41 +00:00
Mac L
4e958a92d3 Refactor consensus/types (#7827)
Organize and categorize `consensus/types` into modules based on their relation to key consensus structures/concepts.
This is a precursor to a sensible public interface.

While this refactor is very opinionated, I am open to suggestions on module names, or type groupings if my current ones are inappropriate.


Co-Authored-By: Mac L <mjladson@pm.me>
2025-12-04 09:28:52 +00:00
Michael Sproul
261322c3e3 Merge remote-tracking branch 'origin/stable' into unstable 2025-11-20 13:04:32 +11:00
Pawan Dhananjay
02d0c6a8ce Compute missing_columns correctly (#8425)
N/A


  The difference is computed by taking the difference of expected with received. We were doing the inverse.

Thanks to Yassine for finding the issue.


Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
2025-11-18 07:43:28 +00:00
Javier Chávarri
2c1f1c1605 Migrate derivative to educe (#8125)
Fixes #7001.


  Mostly mechanical replacement of `derivative` attributes with `educe` ones.

### **Attribute Syntax Changes**

```rust
// Bounds: = "..." → (...)
#[derivative(Hash(bound = "E: EthSpec"))]
#[educe(Hash(bound(E: EthSpec)))]

// Ignore: = "ignore" → (ignore)
#[derivative(PartialEq = "ignore")]
#[educe(PartialEq(ignore))]

// Default values: value = "..." → expression = ...
#[derivative(Default(value = "ForkName::Base"))]
#[educe(Default(expression = ForkName::Base))]

// Methods: format_with/compare_with = "..." → method(...)
#[derivative(Debug(format_with = "fmt_peer_set_as_len"))]
#[educe(Debug(method(fmt_peer_set_as_len)))]

// Empty bounds: removed entirely, educe can infer appropriate bounds
#[derivative(Default(bound = ""))]
#[educe(Default)]

// Transparent debug: manual implementation (educe doesn't support it)
#[derivative(Debug = "transparent")]
// Replaced with manual Debug impl that delegates to inner field
```

**Note**: Some bounds use strings (`bound("E: EthSpec")`) for superstruct compatibility (`expected ','` errors).


Co-Authored-By: Javier Chávarri <javier.chavarri@gmail.com>

Co-Authored-By: Mac L <mjladson@pm.me>
2025-11-06 14:13:57 +00:00
Michael Sproul
0507eca7b4 Merge remote-tracking branch 'origin/stable' into unstable-merge-v8 2025-11-04 16:08:34 +11:00
Jimmy Chen
bc86dc09e5 Reduce number of blobs used in tests to speed up CI (#8194)
`beacon-chain-tests` is now regularly taking 1h+ on CI since Fulu fork was added.

This PR attemtpts to reduce the test time by bringing down the number of blobs generated in tests - instead of generating 0..max_blobs, the generator now generates 0..1 blobs by default, and this can be modified by setting `harness.execution_block_generator.set_min_blob_count(n)`.

Note: The blobs are pre-generated and doesn't require too much CPU to generate however processing a larger number of them on the beacon chain does take a lot of time.

This PR also include a few other small improvements
- Our slowest test (`chain_segment_varying_chunk_size`) runs 3x faster in Fulu just by reusing chain segments
- Avoid re-running fork specific tests on all forks
- Fix a bunch of tests that depends on the harness's existing random blob generation, which is fragile


beacon chain test time on test machine is **~2x** faster:

### `unstable`

```
Summary [ 751.586s] 291 tests run: 291 passed (13 slow), 0 skipped
```

### this branch

```
Summary [ 373.792s] 291 tests run: 291 passed (2 slow), 0 skipped
```

The next set of tests to optimise is the ones that use [`get_chain_segment`](77a9af96de/beacon_node/beacon_chain/tests/block_verification.rs (L45)), as it by default build 320 blocks with supernode - an easy optimisation would be to build these blocks with cgc = 8 for tests that only require fullnodes.


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-11-04 02:40:44 +00:00
Eitan Seri-Levi
5d0f8a083a Ensure custody backfill sync couples all responses before importing (#8339)
Custody backfill sync has a bug when we request columns from more than one peer per batch. The fix here ensures we wait for all requests to be completed before performing verification and importing the responses.

I've also added an endpoint `lighthouse/custody/backfill` that resets a nodes earliest available data column to the current epoch so that custody backfill can be triggered. This endpoint is needed to rescue any nodes that may have missing columns due to the custody backfill sync bug without requiring a full re-sync.


  


Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Michael Sproul <michaelsproul@users.noreply.github.com>
2025-11-03 08:06:06 +00:00
Eitan Seri-Levi
b57d046c4a Fix CGC backfill race condition (#8267)
During custody backfill sync there could be an edge case where we update CGC at the same time where we are importing a batch of columns which may cause us to incorrectly overwrite values when calling `backfill_validator_custody_requirements`. To prevent this race condition, the expected cgc is now passed into this function and is used to check if the expected cgc == the current validator cgc. If the values arent equal, this probably indicates that a very recent CGC occurred so we do not prune/update values in the `epoch_validator_custody_requirements` map.


  


Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
2025-11-03 00:51:42 +00:00
Lion - dapplion
da5b231720 Prevent dropping large binary data to logs (#8290)
Testing non finalized checkpoint sync noticed this log that dumps blob data in Debug format to the logs.


  Log only block root and commitment of each blob


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2025-10-26 23:47:25 +00:00
Eitan Seri-Levi
33e21634cb Custody backfill sync (#7907)
#7603


  #### Custody backfill sync service
Similar in many ways to the current backfill service. There may be ways to unify the two services. The difficulty there is that the current backfill service tightly couples blocks and their associated blobs/data columns. Any attempts to unify the two services should be left to a separate PR in my opinion.

#### `SyncNeworkContext`
`SyncNetworkContext` manages custody sync data columns by range requests separetly from other sync RPC requests. I think this is a nice separation considering that custody backfill is its own service.

#### Data column import logic
The import logic verifies KZG committments and that the data columns block root matches the block root in the nodes store before importing columns

#### New channel to send messages to `SyncManager`
Now external services can communicate with the `SyncManager`. In this PR this channel is used to trigger a custody sync. Alternatively we may be able to use the existing `mpsc` channel that the `SyncNetworkContext` uses to communicate with the `SyncManager`. I will spend some time reviewing this.


Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>

Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2025-10-22 03:51:34 +00:00
Pawan Dhananjay
092aaae961 Sync cleanups (#8230)
N/A


  1. In the batch retry logic, we were failing to set the batch state to `AwaitingDownload` before attempting a retry. This PR sets it to `AwaitingDownload` before the retry and sets it back to `Downloading` if the retry suceeded in sending out a request
2. Remove all peer scoring logic from retrying and rely on just de priorotizing the failed peer. I finally concede the point to @dapplion 😄
3. Changes `block_components_by_range_request` to accept `block_peers` and `column_peers`. This is to ensure that we use the full synced peerset for requesting columns in order to avoid splitting the column peers among multiple head chains. During forward sync, we want the block peers to be the peers from the syncing chain and column peers to be all synced peers from the peerdb.
Also, fixes a typo and calls `attempt_send_awaiting_download_batches` from more places


Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
2025-10-20 11:50:00 +00:00
Lion - dapplion
ffa7b2b2b9 Only mark block lookups as pending if block is importing from gossip (#8112)
- PR https://github.com/sigp/lighthouse/pull/8045 introduced a regression of how lookup sync interacts with the da_checker.

Now in unstable block import from the HTTP API also insert the block in the da_checker while the block is being execution verified. If lookup sync finds the block in the da_checker in `NotValidated` state it expects a `GossipBlockProcessResult` message sometime later. That message is only sent after block import in gossip.

I confirmed in our node's logs for 4/4 cases of stuck lookups are caused by this sequence of events:
- Receive block through API, insert into da_checker in fn process_block in put_pre_execution_block
- Create lookup and leave in AwaitingDownload(block in processing cache) state
- Block from HTTP API finishes importing
- Lookup is left stuck

Closes https://github.com/sigp/lighthouse/issues/8104


  - https://github.com/sigp/lighthouse/pull/8110 was my initial solution attempt but we can't send the `GossipBlockProcessResult` event from the `http_api` crate without adding new channels, which seems messy.

For a given node it's rare that a lookup is created at the same time that a block is being published. This PR solves https://github.com/sigp/lighthouse/issues/8104 by allowing lookup sync to import the block twice in that case.


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2025-09-25 03:52:27 +00:00
Jimmy Chen
78d330e4b7 Consolidate reqresp_pre_import_cache into data_availability_checker (#8045)
This PR consolidates the `reqresp_pre_import_cache` into the `data_availability_checker` for the following reasons:
- the `reqresp_pre_import_cache` suffers from the same TOCTOU bug we had with `data_availability_checker` earlier, and leads to unbounded memory leak, which we have observed over the last 6 months on some nodes.
- the `reqresp_pre_import_cache` is no longer necessary, because we now hold blocks in the `data_availability_checker` for longer since (#7961), and recent blocks can be served from the DA checker.

This PR also maintains the following functionalities
- Serving pre-executed blocks over RPC, and they're now served from the `data_availability_checker` instead.
- Using the cache for de-duplicating lookup requests.


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-09-19 07:01:13 +00:00
Lion - dapplion
b7d78a91e0 Don't penalize peers for extending ignored chains (#8042)
Lookup sync has a cache of block roots "failed_chains". If a peer triggers a lookup for a block or descendant of a root in failed_chains the lookup is dropped and the peer penalized. However blocks are inserted into failed_chains for a single reason:
- If a chain is longer than 32 blocks the lookup is dropped to prevent OOM risks. However the peer is not at fault, since discovering an unknown chain longer than 32 blocks is not malicious. We just drop the lookup to sync the blocks from range forward sync.

This discrepancy is probably an oversight when changing old code. Before we used to add blocks that failed too many times to process to that cache. However, we don't do that anymore. Adding a block that fails too many times to process is an optimization to save resources in rare cases where peers keep sending us invalid blocks. In case that happens, today we keep trying to process the block, downscoring the peers and eventually disconnecting them. _IF_ we found that optimization to be necessary we should merge this PR (_Stricter match of BlockError in lookup sync_) first. IMO we are fine without the failed_chains cache and the ignored_chains cache will be obsolete with [tree sync](https://github.com/sigp/lighthouse/issues/7678) as the OOM risk of long lookup chains does not exist anymore.

Closes https://github.com/sigp/lighthouse/issues/7577


  Rename `failed_chains` for `ignored_chains` and don't penalize peers that trigger lookups for those blocks


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2025-09-17 01:02:29 +00:00
Jimmy Chen
fb77ce9e19 Add missing event in PendingComponent span and clean up sync logs (#8033)
I was looking into some long `PendingComponents` span and noticed the block event wasn't added to the span, so it wasn't possible to see when the block was added from the trace view, this PR fixes this.

<img width="637" height="430" alt="image" src="https://github.com/user-attachments/assets/65040b1c-11e7-43ac-951b-bdfb34b665fb" />

Additionally I've noticed a lot of noises and confusion in sync logs due to the initial`peer_id` being included as part of the syncing chain span, causing all logs under the span to have that `peer_id`, which may not be accurate for some sync logs, I've removed `peer_id` from the `SyncingChain` span, and also cleaned up a bunch of spans to use `%` (display) for slots and epochs to make logs easier to read.


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-09-12 05:11:30 +00:00
Jimmy Chen
ee734d1456 Fix stuck data column lookups by improving peer selection and retry logic (#8005)
Fixes the issue described in #7980 where Lighthouse repeatedly sends `DataColumnsByRoot` requests to the same peers that return empty responses, causing sync to get stuck.

The root cause was we don't count empty responses as failures, leading to excessive retries to unresponsive peers.


  - Track per peer attempts to limit retry attempts per peer (`MAX_CUSTODY_PEER_ATTEMPTS = 3`)
- Replaced random peer selection with hashing within each lookup to prevent splitting lookup into too many small requests and improve request batching efficiency.
- Added `single_block_lookup` root span to track all lookups created and added more debug logs:

<img width="1264" height="501" alt="image" src="https://github.com/user-attachments/assets/983629ba-b6d0-41cf-8e93-88a5b96c2f31" />


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-09-09 06:18:05 +00:00