Commit Graph

1112 Commits

Author SHA1 Message Date
Mac L
3903e1c67f More consensus/types re-export cleanup (#8665)
Remove more of the temporary re-exports from `consensus/types`


Co-Authored-By: Mac L <mjladson@pm.me>
2026-01-16 04:43:05 +00:00
Lion - dapplion
6166ad2eb2 Replace tracing::debug! with debug! same for other levels (#8300)
Just visual clean-up, making logging statements look uniform. There's no reason to use `tracing::debug` instead of `debug`. If we ever need to migrate our logging lib in the future it would make things easier too.


  


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Michael Sproul <michaelsproul@users.noreply.github.com>
2026-01-08 07:04:44 +00:00
Abhivansh
9b3d7e3a54 refactor: remove service_name (#8606)
Which issue # does this PR address?
#8586


  Please list or describe the changes introduced by this PR.
Remove `service_name` from `TaskExecutor`


Co-Authored-By: Abhivansh <31abhivanshj@gmail.com>
2026-01-02 00:07:40 +00:00
ethDreamer
a39e991557 Gloas(EIP-7732): Containers / Constants (#7923)
* #7850

This is the first round of the conga line! 🎉

Just spec constants and container changes so far.


  


Co-Authored-By: shane-moore <skm1790@gmail.com>

Co-Authored-By: Mark Mackey <mark@sigmaprime.io>

Co-Authored-By: Shane K Moore <41407272+shane-moore@users.noreply.github.com>

Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: ethDreamer <37123614+ethDreamer@users.noreply.github.com>

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>

Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-12-16 06:45:45 +00:00
0xMushow
afa6457acf fix visual bug on visualize_batch_state leading to a non-wanted comma (#8499)
Which issue # does this PR address?
None


  The `visualize_batch_state`  functions uses the following loop `for mut batch_index in 0..BATCH_BUFFER_SIZE`, making it from `0` to `BATCH_BUFFER_SIZE - 1` (behind the scenes).

Hence we would never hit the following condition:

```rust
if batch_index != BATCH_BUFFER_SIZE {
visualization_string.push(',');
}
```

Replacing `!=` with `<` & `BATCH_BUFFER_SIZE -1` allows for the following change:

`[A,B,C,D,E,]` to become: `[A,B,C,D,E]`


Co-Authored-By: Antoine James <antoine@ethereum.org>
2025-12-15 05:14:15 +00:00
Mac L
f3fd1f210b Remove consensus/types re-exports (#8540)
There are certain crates which we re-export within `types` which creates a fragmented DevEx, where there are various ways to import the same crates.

```rust
// consensus/types/src/lib.rs
pub use bls::{
AggregatePublicKey, AggregateSignature, Error as BlsError, Keypair, PUBLIC_KEY_BYTES_LEN,
PublicKey, PublicKeyBytes, SIGNATURE_BYTES_LEN, SecretKey, Signature, SignatureBytes,
get_withdrawal_credentials,
};
pub use context_deserialize::{ContextDeserialize, context_deserialize};
pub use fixed_bytes::FixedBytesExtended;
pub use milhouse::{self, List, Vector};
pub use ssz_types::{BitList, BitVector, FixedVector, VariableList, typenum, typenum::Unsigned};
pub use superstruct::superstruct;
```

This PR removes these re-exports and makes it explicit that these types are imported from a non-`consensus/types` crate.


Co-Authored-By: Mac L <mjladson@pm.me>
2025-12-09 07:13:41 +00:00
Mac L
4e958a92d3 Refactor consensus/types (#7827)
Organize and categorize `consensus/types` into modules based on their relation to key consensus structures/concepts.
This is a precursor to a sensible public interface.

While this refactor is very opinionated, I am open to suggestions on module names, or type groupings if my current ones are inappropriate.


Co-Authored-By: Mac L <mjladson@pm.me>
2025-12-04 09:28:52 +00:00
Michael Sproul
261322c3e3 Merge remote-tracking branch 'origin/stable' into unstable 2025-11-20 13:04:32 +11:00
Lion - dapplion
74b8c02630 Reimport the checkpoint sync block (#8417)
We want to not require checkpoint sync starts to include the required custody data columns, and instead fetch them from p2p.


Closes https://github.com/sigp/lighthouse/issues/6837


  The checkpoint sync slot can:
1. Be the first slot in the epoch, such that the epoch of the block == the start checkpoint epoch
2. Be in an epoch prior to the start checkpoint epoch

In both cases backfill sync already fetches that epoch worth of blocks with current code. This PR modifies the backfill import filter function to allow to re-importing the oldest block slot in the DB.

I feel this solution is sufficient unless I'm missing something. ~~I have not tested this yet!~~ Michael has tested this and it works.


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>

Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-11-19 11:00:38 +00:00
Pawan Dhananjay
02d0c6a8ce Compute missing_columns correctly (#8425)
N/A


  The difference is computed by taking the difference of expected with received. We were doing the inverse.

Thanks to Yassine for finding the issue.


Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
2025-11-18 07:43:28 +00:00
Jimmy Chen
af1d9b9991 Fix custody context initialization race condition that caused panic (#8391)
Take 2 of #8390.

Fixes the race condition properly instead of propagating the error. I think this is a better alternative, and doesn't seem to look that bad.


  * Lift node id loading or generation from `NetworkService ` startup to the `ClientBuilder`, so that it can be used to compute custody columns for the beacon chain without waiting for Network bootstrap.

I've considered and implemented a few alternatives:
1. passing `node_id` to beacon chain builder and compute columns when creating `CustodyContext`. This approach isn't good for separation of concerns and isn't great for testability
2. passing `ordered_custody_groups` to beacon chain. `CustodyContext` only uses this to compute ordered custody columns, so we might as well lift this logic out, so we don't have to do error handling in `CustodyContext` construction. Less tests to update;.


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-11-17 05:23:12 +00:00
Jimmy Chen
1bd4ac2113 Fix flaky reconstruction test (#8321)
FIx flaky tests that depends on timing. Previously the test processes all 128 columns and expect reconstruction to happen after all columns are processed. There is a race here, and reconstruction could be triggered before all columns are processed.

I've updated the tests to process 64 columns, just enough for reconstruction and wait for 50ms for reconstruction to be triggered.

This PR requires the change made in https://github.com/sigp/lighthouse/pull/8194 for the test to pass consistently (blob count set to 1 for all blocks instead of random blob count between 0..max)


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-11-10 04:06:42 +00:00
Javier Chávarri
2c1f1c1605 Migrate derivative to educe (#8125)
Fixes #7001.


  Mostly mechanical replacement of `derivative` attributes with `educe` ones.

### **Attribute Syntax Changes**

```rust
// Bounds: = "..." → (...)
#[derivative(Hash(bound = "E: EthSpec"))]
#[educe(Hash(bound(E: EthSpec)))]

// Ignore: = "ignore" → (ignore)
#[derivative(PartialEq = "ignore")]
#[educe(PartialEq(ignore))]

// Default values: value = "..." → expression = ...
#[derivative(Default(value = "ForkName::Base"))]
#[educe(Default(expression = ForkName::Base))]

// Methods: format_with/compare_with = "..." → method(...)
#[derivative(Debug(format_with = "fmt_peer_set_as_len"))]
#[educe(Debug(method(fmt_peer_set_as_len)))]

// Empty bounds: removed entirely, educe can infer appropriate bounds
#[derivative(Default(bound = ""))]
#[educe(Default)]

// Transparent debug: manual implementation (educe doesn't support it)
#[derivative(Debug = "transparent")]
// Replaced with manual Debug impl that delegates to inner field
```

**Note**: Some bounds use strings (`bound("E: EthSpec")`) for superstruct compatibility (`expected ','` errors).


Co-Authored-By: Javier Chávarri <javier.chavarri@gmail.com>

Co-Authored-By: Mac L <mjladson@pm.me>
2025-11-06 14:13:57 +00:00
Jimmy Chen
7b1cbca264 Downgrade and remove unnecessary logs (#8367)
### Downgrade a non error to `Debug`

I noticed this error on one of our hoodi nodes:

```
Nov 04 05:13:38.892 ERROR Error during data column reconstruction       block_root: 0x4271b9efae7deccec3989bd2418e998b83ce8144210c2b17200abb62b7951190, error: DuplicateFullyImported(0x4271b9efae7deccec3989bd2418e998b83ce8144210c2b17200abb62b7951190)
```

This shouldn't be logged as an error and it's due to a normal race condition, and it doesn't impact the node negatively.

### Remove spammy logs

This logs is filling up the log files quite quickly and it is also something we'd expect during normal operation - getting columns via EL before gossip. We haven't found this debug log to be useful, so I propose we remove it to avoid spamming debug logs.

```
Received already available column sidecar. Ignoring the column sidecar
```

In the process of removing this, I noticed we aren't propagating the validation result, which I think we should so I've added this. The impact should be quite minimal - the message will stay in the gossip memcache for a bit longer but should be evicted in the next heartbeat.


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-11-05 07:46:30 +00:00
Michael Sproul
0507eca7b4 Merge remote-tracking branch 'origin/stable' into unstable-merge-v8 2025-11-04 16:08:34 +11:00
Jimmy Chen
bc86dc09e5 Reduce number of blobs used in tests to speed up CI (#8194)
`beacon-chain-tests` is now regularly taking 1h+ on CI since Fulu fork was added.

This PR attemtpts to reduce the test time by bringing down the number of blobs generated in tests - instead of generating 0..max_blobs, the generator now generates 0..1 blobs by default, and this can be modified by setting `harness.execution_block_generator.set_min_blob_count(n)`.

Note: The blobs are pre-generated and doesn't require too much CPU to generate however processing a larger number of them on the beacon chain does take a lot of time.

This PR also include a few other small improvements
- Our slowest test (`chain_segment_varying_chunk_size`) runs 3x faster in Fulu just by reusing chain segments
- Avoid re-running fork specific tests on all forks
- Fix a bunch of tests that depends on the harness's existing random blob generation, which is fragile


beacon chain test time on test machine is **~2x** faster:

### `unstable`

```
Summary [ 751.586s] 291 tests run: 291 passed (13 slow), 0 skipped
```

### this branch

```
Summary [ 373.792s] 291 tests run: 291 passed (2 slow), 0 skipped
```

The next set of tests to optimise is the ones that use [`get_chain_segment`](77a9af96de/beacon_node/beacon_chain/tests/block_verification.rs (L45)), as it by default build 320 blocks with supernode - an easy optimisation would be to build these blocks with cgc = 8 for tests that only require fullnodes.


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-11-04 02:40:44 +00:00
Eitan Seri-Levi
5d0f8a083a Ensure custody backfill sync couples all responses before importing (#8339)
Custody backfill sync has a bug when we request columns from more than one peer per batch. The fix here ensures we wait for all requests to be completed before performing verification and importing the responses.

I've also added an endpoint `lighthouse/custody/backfill` that resets a nodes earliest available data column to the current epoch so that custody backfill can be triggered. This endpoint is needed to rescue any nodes that may have missing columns due to the custody backfill sync bug without requiring a full re-sync.


  


Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Michael Sproul <michaelsproul@users.noreply.github.com>
2025-11-03 08:06:06 +00:00
Eitan Seri-Levi
b57d046c4a Fix CGC backfill race condition (#8267)
During custody backfill sync there could be an edge case where we update CGC at the same time where we are importing a batch of columns which may cause us to incorrectly overwrite values when calling `backfill_validator_custody_requirements`. To prevent this race condition, the expected cgc is now passed into this function and is used to check if the expected cgc == the current validator cgc. If the values arent equal, this probably indicates that a very recent CGC occurred so we do not prune/update values in the `epoch_validator_custody_requirements` map.


  


Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
2025-11-03 00:51:42 +00:00
Mac L
f5809aff87 Bump ssz_types to v0.12.2 (#8032)
https://github.com/sigp/lighthouse/issues/8012


  Replace all instances of `VariableList::from` and `FixedVector::from` to their `try_from` variants.

While I tried to use proper error handling in most cases, there were certain situations where adding an `expect` for situations where `try_from` can trivially never fail avoided adding a lot of extra complexity.


Co-Authored-By: Mac L <mjladson@pm.me>

Co-Authored-By: Michael Sproul <michaelsproul@users.noreply.github.com>

Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-10-28 04:01:09 +00:00
Lion - dapplion
5db1dff8a6 Downgrade gossip logs set to INFO level (#8288)
Testing non-finality checkpoint synced these logs showed up in my INFO grep and were noisy. INFO should only include the notifier and exceptional events. I don't see why the user would care about this info.


  Downgrade to debug


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2025-10-27 23:33:58 +00:00
chonghe
ba706ce3bf Revise logging in BlobsByRoot requests (#8296)
#7756 introduces a logging issue, where the relevant log:
da5b231720/beacon_node/network/src/network_beacon_processor/rpc_methods.rs (L380-L385)

obtains the `block_root` from `slots_by_block_root.keys()`. If the `block_root` is empty (block not found in the data availability checker), then the log will not show any block root:

`DEBUG BlobsByRoot outgoing response processed       peer_id: 16Uiu2HAmCBxs1ZFfsbAfhSA98rUUL8Q1egLPb6WpGdKZxX6HqQYX, block_root: [], returned: 4`

This PR revises to return the `block_root` in the request as a vector of block root


  


Co-Authored-By: Tan Chee Keong <tanck@sigmaprime.io>
2025-10-27 08:48:10 +00:00
Lion - dapplion
da5b231720 Prevent dropping large binary data to logs (#8290)
Testing non finalized checkpoint sync noticed this log that dumps blob data in Debug format to the logs.


  Log only block root and commitment of each blob


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2025-10-26 23:47:25 +00:00
Jimmy Chen
43c5e924d7 Add --semi-supernode support (#8254)
Addresses #8218

A simplified version of #8241 for the initial release.

I've tried to minimise the logic change in this PR, although introducing the `NodeCustodyType` enum still result in quite a bit a of diff, but the actual logic change in `CustodyContext` is quite small.

The main changes are in the `CustdoyContext` struct
* ~~combining `validator_custody_count` and `current_is_supernode` fields into a single `custody_group_count_at_head` field. We persist the cgc of the initial cli values into the `custody_group_count_at_head` field and only allow for increase (same behaviour as before).~~
* I noticed the above approach caused a backward compatibility issue, I've [made a fix](15569bc085) and changed the approach slightly (which was actually what I had originally in mind):
* when initialising, only override the  `validator_custody_count` value if either flag `--supernode` or `--semi-supernode` is used; otherwise leave it as the existing default `0`. Most other logic remains unchanged.

All existing validator custody unit tests are still all passing, and I've added additional tests to cover semi-supernode, and restoring `CustodyContext` from disk.

Note: I've added a `WARN` if the user attempts to switch to a `--semi-supernode` or `--supernode` - this currently has no effect, but once @eserilev column backfill is merged, we should be able to support this quite easily.

Things to test
- [x] cgc in metadata / enr
- [x] cgc in metrics
- [x] subscribed subnets
- [x] getBlobs endpoint


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-10-22 05:23:17 +00:00
Eitan Seri-Levi
33e21634cb Custody backfill sync (#7907)
#7603


  #### Custody backfill sync service
Similar in many ways to the current backfill service. There may be ways to unify the two services. The difficulty there is that the current backfill service tightly couples blocks and their associated blobs/data columns. Any attempts to unify the two services should be left to a separate PR in my opinion.

#### `SyncNeworkContext`
`SyncNetworkContext` manages custody sync data columns by range requests separetly from other sync RPC requests. I think this is a nice separation considering that custody backfill is its own service.

#### Data column import logic
The import logic verifies KZG committments and that the data columns block root matches the block root in the nodes store before importing columns

#### New channel to send messages to `SyncManager`
Now external services can communicate with the `SyncManager`. In this PR this channel is used to trigger a custody sync. Alternatively we may be able to use the existing `mpsc` channel that the `SyncNetworkContext` uses to communicate with the `SyncManager`. I will spend some time reviewing this.


Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>

Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2025-10-22 03:51:34 +00:00
Eitan Seri-Levi
46dde9afee Fix data column rpc request (#8247)
Fixes an issue mentioned in this comment regarding data column rpc requests:
https://github.com/sigp/lighthouse/issues/6572#issuecomment-3400076236


  


Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>

Co-Authored-By: Michael Sproul <micsproul@gmail.com>
2025-10-21 23:54:35 +00:00
Pawan Dhananjay
092aaae961 Sync cleanups (#8230)
N/A


  1. In the batch retry logic, we were failing to set the batch state to `AwaitingDownload` before attempting a retry. This PR sets it to `AwaitingDownload` before the retry and sets it back to `Downloading` if the retry suceeded in sending out a request
2. Remove all peer scoring logic from retrying and rely on just de priorotizing the failed peer. I finally concede the point to @dapplion 😄
3. Changes `block_components_by_range_request` to accept `block_peers` and `column_peers`. This is to ensure that we use the full synced peerset for requesting columns in order to avoid splitting the column peers among multiple head chains. During forward sync, we want the block peers to be the peers from the syncing chain and column peers to be all synced peers from the peerdb.
Also, fixes a typo and calls `attempt_send_awaiting_download_batches` from more places


Co-Authored-By: Pawan Dhananjay <pawandhananjay@gmail.com>
2025-10-20 11:50:00 +00:00
chonghe
2b30c96f16 Avoid attempting to serve blobs after Fulu fork (#7756)
* #7122


  


Co-Authored-By: Tan Chee Keong <tanck@sigmaprime.io>

Co-Authored-By: chonghe <44791194+chong-he@users.noreply.github.com>
2025-10-20 06:29:21 +00:00
Jimmy Chen
5886a48d96 Add max_blobs_per_block check to data column gossip validation (#8198)
Addresses this spec change
https://github.com/ethereum/consensus-specs/pull/4650

Add `max_blobs_per_block` to gossip data column check so we reject large columns before processing. (we currently do this check during processing)


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-10-15 01:52:35 +00:00
Eitan Seri-Levi
4eb89604f8 Fulu ASCII art (#8151)
Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>
2025-10-07 14:32:35 +00:00
Jimmy Chen
ff8b514b3f Remove unnecessary warning logs and update logging levels (#8145)
@michaelsproul noticed this warning on a devnet-3 node

```
Oct 01 16:37:29.896 WARN  Error when importing rpc custody columns      error: ParentUnknown { parent_root: 0xe4cc85a2137b76eb083d7076255094a90f10caaec0afc8fd36807db742f6ff13 }, block_hash: 0x43ce63b2344990f5f4d8911b8f14e3d3b6b006edc35bbc833360e667df0edef7
```

We're also seeing similar `WARN` logs for blobs on our live nodes.

It's normal to get parent unknown in lookups and it's handled here
a134d43446/beacon_node/network/src/sync/block_lookups/mod.rs (L611-L619)

These shouldn't be a `WARN`, and we also log the same error in block lookups at `DEBUG` level here:
a134d43446/beacon_node/network/src/sync/block_lookups/mod.rs (L643-L648)

So i've removed these extra WARN logs.

I've also lower the level of an `ERROR` log when unable to serve data column root requests - it's unexpected, but is unlikely to impact the nodes performance, so I think we can downgrade this.


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-10-06 19:26:37 +00:00
Lion - dapplion
ffa7b2b2b9 Only mark block lookups as pending if block is importing from gossip (#8112)
- PR https://github.com/sigp/lighthouse/pull/8045 introduced a regression of how lookup sync interacts with the da_checker.

Now in unstable block import from the HTTP API also insert the block in the da_checker while the block is being execution verified. If lookup sync finds the block in the da_checker in `NotValidated` state it expects a `GossipBlockProcessResult` message sometime later. That message is only sent after block import in gossip.

I confirmed in our node's logs for 4/4 cases of stuck lookups are caused by this sequence of events:
- Receive block through API, insert into da_checker in fn process_block in put_pre_execution_block
- Create lookup and leave in AwaitingDownload(block in processing cache) state
- Block from HTTP API finishes importing
- Lookup is left stuck

Closes https://github.com/sigp/lighthouse/issues/8104


  - https://github.com/sigp/lighthouse/pull/8110 was my initial solution attempt but we can't send the `GossipBlockProcessResult` event from the `http_api` crate without adding new channels, which seems messy.

For a given node it's rare that a lookup is created at the same time that a block is being published. This PR solves https://github.com/sigp/lighthouse/issues/8104 by allowing lookup sync to import the block twice in that case.


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2025-09-25 03:52:27 +00:00
Jimmy Chen
79b33214ea Only send data coumn subnet discovery requests after peerdas is scheduled (#8109)
#8105 (to be confirmed)

I noticed a large number of failed discovery requests after deploying latest `unstable` to some of our testnet and mainnet nodes. This is because of a recent PeerDAS change to attempt to maintain sufficient peers across data column subnets - this shouldn't be enabled on network without peerdas scheduled, otherwise it will keep retrying discovery on these subnets and never succeed.

Also removed some unused files.


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-09-25 02:52:07 +00:00
Eitan Seri-Levi
af274029e8 Run reconstruction inside a scoped rayon pool (#8075)
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
2025-09-24 06:37:34 +00:00
Jimmy Chen
78d330e4b7 Consolidate reqresp_pre_import_cache into data_availability_checker (#8045)
This PR consolidates the `reqresp_pre_import_cache` into the `data_availability_checker` for the following reasons:
- the `reqresp_pre_import_cache` suffers from the same TOCTOU bug we had with `data_availability_checker` earlier, and leads to unbounded memory leak, which we have observed over the last 6 months on some nodes.
- the `reqresp_pre_import_cache` is no longer necessary, because we now hold blocks in the `data_availability_checker` for longer since (#7961), and recent blocks can be served from the DA checker.

This PR also maintains the following functionalities
- Serving pre-executed blocks over RPC, and they're now served from the `data_availability_checker` instead.
- Using the cache for de-duplicating lookup requests.


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-09-19 07:01:13 +00:00
Jimmy Chen
4111bcb39b Use scoped rayon pool for backfill chain segment processing (#7924)
Part of #7866

- Continuation of #7921

In the above PR, we enabled rayon for batch KZG verification in chain segment processing. However, using the global rayon thread pool for backfill is likely to create resource contention with higher-priority beacon processor work.


  This PR introduces a dedicated low-priority rayon thread pool `LOW_PRIORITY_RAYON_POOL` and uses it for processing backfill chain segments.

This prevents backfill KZG verification from using the global rayon thread pool and competing with high-priority beacon processor tasks for CPU resources.

However, this PR by itself doesn't prevent CPU oversubscription because other tasks could still fill up the global rayon thread pool, and having an extra thread pool could make things worse. To address this we need the beacon
processor to coordinate total CPU allocation across all tasks, which is covered in:
- #7789


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
2025-09-18 07:10:23 +00:00
Lion - dapplion
b7d78a91e0 Don't penalize peers for extending ignored chains (#8042)
Lookup sync has a cache of block roots "failed_chains". If a peer triggers a lookup for a block or descendant of a root in failed_chains the lookup is dropped and the peer penalized. However blocks are inserted into failed_chains for a single reason:
- If a chain is longer than 32 blocks the lookup is dropped to prevent OOM risks. However the peer is not at fault, since discovering an unknown chain longer than 32 blocks is not malicious. We just drop the lookup to sync the blocks from range forward sync.

This discrepancy is probably an oversight when changing old code. Before we used to add blocks that failed too many times to process to that cache. However, we don't do that anymore. Adding a block that fails too many times to process is an optimization to save resources in rare cases where peers keep sending us invalid blocks. In case that happens, today we keep trying to process the block, downscoring the peers and eventually disconnecting them. _IF_ we found that optimization to be necessary we should merge this PR (_Stricter match of BlockError in lookup sync_) first. IMO we are fine without the failed_chains cache and the ignored_chains cache will be obsolete with [tree sync](https://github.com/sigp/lighthouse/issues/7678) as the OOM risk of long lookup chains does not exist anymore.

Closes https://github.com/sigp/lighthouse/issues/7577


  Rename `failed_chains` for `ignored_chains` and don't penalize peers that trigger lookups for those blocks


Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
2025-09-17 01:02:29 +00:00
Jimmy Chen
3de646c8b3 Enable reconstruction for nodes custodying more than 50% of columns and instrument tracing (#8052)
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-09-16 08:17:43 +00:00
Eitan Seri-Levi
4409500f63 Remove column reconstruction when processing rpc requests (#8051)
Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>
2025-09-16 05:18:25 +00:00
Eitan Seri-Levi
aef8291f94 Add max delay to reconstruction (#7976)
#7697


  If we're three seconds into the current slot just trigger reconstruction. I don't know what the correct reconstruction deadline number is, but it should probably be at least half a second before the attestation deadline


Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
2025-09-12 06:05:42 +00:00
Jimmy Chen
fb77ce9e19 Add missing event in PendingComponent span and clean up sync logs (#8033)
I was looking into some long `PendingComponents` span and noticed the block event wasn't added to the span, so it wasn't possible to see when the block was added from the trace view, this PR fixes this.

<img width="637" height="430" alt="image" src="https://github.com/user-attachments/assets/65040b1c-11e7-43ac-951b-bdfb34b665fb" />

Additionally I've noticed a lot of noises and confusion in sync logs due to the initial`peer_id` being included as part of the syncing chain span, causing all logs under the span to have that `peer_id`, which may not be accurate for some sync logs, I've removed `peer_id` from the `SyncingChain` span, and also cleaned up a bunch of spans to use `%` (display) for slots and epochs to make logs easier to read.


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-09-12 05:11:30 +00:00
hopinheimer
38205192ca Fix http api tests ci (#7943)
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Michael Sproul <micsproul@gmail.com>

Co-Authored-By: Michael Sproul <michael@sigmaprime.io>

Co-Authored-By: hopinheimer <knmanas6@gmail.com>
2025-09-10 06:46:48 +00:00
Jimmy Chen
ee734d1456 Fix stuck data column lookups by improving peer selection and retry logic (#8005)
Fixes the issue described in #7980 where Lighthouse repeatedly sends `DataColumnsByRoot` requests to the same peers that return empty responses, causing sync to get stuck.

The root cause was we don't count empty responses as failures, leading to excessive retries to unresponsive peers.


  - Track per peer attempts to limit retry attempts per peer (`MAX_CUSTODY_PEER_ATTEMPTS = 3`)
- Replaced random peer selection with hashing within each lookup to prevent splitting lookup into too many small requests and improve request batching efficiency.
- Added `single_block_lookup` root span to track all lookups created and added more debug logs:

<img width="1264" height="501" alt="image" src="https://github.com/user-attachments/assets/983629ba-b6d0-41cf-8e93-88a5b96c2f31" />


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-09-09 06:18:05 +00:00
Pawan Dhananjay
84ec209eba Allow AwaitingDownload to be a valid in-between state (#7984)
N/A


  Extracts (3) from https://github.com/sigp/lighthouse/pull/7946.

Prior to peerdas, a batch should never have been in `AwaitingDownload` state because we immediataly try to move from `AwaitingDownload` to `Downloading` state by sending batches. This was always possible as long as we had peers in the `SyncingChain` in the pre-peerdas world.

However, this is no longer the case as a batch can be stuck waiting in `AwaitingDownload` state if we have no peers to request the columns from. This PR makes `AwaitingDownload` to be an allowable in between state. If a batch is found to be in this state, then we attempt to send the batch instead of erroring like before.
Note to reviewer: We need to make sure that this doesn't lead to a bunch of batches stuck in `AwaitingDownload` state if the chain can be progressed.

Backfill already retries all batches in AwaitingDownload state so we just need to make `AwaitingDownload` a valid state during processing and validation.

This PR explicitly adds the same logic for forward sync to download batches stuck in `AwaitingDownload`.
Apart from that, we also force download of the `processing_target` when sync stops progressing. This is required in cases where `self.batches` has > `BATCH_BUFFER_SIZE` batches that are waiting to get processed but the `processing_batch` has repeatedly failed at download/processing stage. This leads to sync getting stuck and never recovering.
2025-09-04 07:39:16 +00:00
Jimmy Chen
c2a92f1a8c Maintain peers across all data column subnets (#7915)
Closes:
- #7865
- #7855

Changes extracted from earlier PR #7876

This PR fixes two main things with a few other improvements mentioned below:
- Prevent Lighthouse from repeatedly sending `DataColumnByRoot` requests to an unsynced peer, causing lookup sync to get stuck
- Allows Lighthouse to send discovery requests if there isn't enough **synced** peers in the required sampling subnets - this fixes the stuck sync scenario where there isn't enough usable peers in sampling subnet but no discovery is attempted.


  - Make peer discovery queries if custody subnet peer count drops below the minimum threshold
- Update peer pruning logic to prioritise uniform distribution across all data column subnets and avoid pruning sampling peers if the count is below the target threshold (2)
- Check sync status when making discovery requests, to make sure we don't ignore requests if there isn't enough synced peers in the required sampling subnets
- Optimise some of the `PeerDB` functions checking custody peers
- Only send lookup requests to peers that are synced or advanced
2025-09-04 05:36:20 +00:00
Jimmy Chen
a134d43446 Use rayon to speed up batch KZG verification (#7921)
Addresses #7866.


  Use Rayon to speed up batch KZG verification during range / backfill sync.

While I was analysing the traces, I also discovered a bug that resulted in only the first 128 columns in a chain segment batch being verified. This PR fixes it, so we might actually observe slower range sync due to more cells being KZG verified.

I've also updated the handling of batch KZG failure to only find the first invalid KZG column when verification fails as this gets very expensive during range/backfill sync.
2025-08-29 00:59:40 +00:00
Pawan Dhananjay
b6792d85d2 Reduce backfill batch buffer size (#7958)
N/A


  Currently, backfill is allowed to create upto 20 pending batches which is unnecessarily high imo. Forward sync also allows a max of 5 batches to be buffered at a time.  This PR reduces the batch size to match with forward sync.

Having high number of batches is a little annoying with peerdas because we try to create and send 20 requests (even though we are processing them in a rate limited manner). Requests with peerdas is a lot more heavy as we distribute requests across multiple peers leading to lot of requests that may keep getting retried. This could take resources away from processing at head.
2025-08-28 03:31:31 +00:00
Michael Sproul
d235f2c697 Delete RuntimeVariableList::from_vec (#7930)
This method is a footgun because it truncates the list. It is the source of a recent bug:

- https://github.com/sigp/lighthouse/pull/7927


  - Delete uses of `RuntimeVariableList::from_vec` and replace them with `::new` which does validation and can fail.
- Propagate errors where possible, unwrap in tests and use `expect` for obviously-safe uses (in `chain_spec.rs`).
2025-08-27 06:52:14 +00:00
Pawan Dhananjay
ccf03e1c88 Fix data columns by range returning all columns (#7942)
N/A


  In https://github.com/sigp/lighthouse/pull/7897 , we seem to have modified data columns by range to return all the columns we have for the requested epoch disregarding what columns the peer requested.
2025-08-27 05:00:34 +00:00
Jimmy Chen
8901c7417d Notify lookup after gossip data column processing resulted in an import (#7940)
When gossip data column processing completes and results in a block import, sync is currently not notified of the successful import. This is inconsistent with how blob processing and block processing both notify sync.

This fix ensures lookup sync receives block import notifications when blocks become available through gossip data column.
2025-08-27 01:32:17 +00:00
Mac L
e438691683 Add Gloas boilerplate (#7728)
Adds the required boilerplate code for the Gloas (Glamsterdam) hard fork. This allows PRs testing Gloas-candidate features to test fork transition.

This also includes de-duplication of post-Bellatrix readiness notifiers from #6797 (credit to @dapplion)
2025-08-26 02:49:48 +00:00