Commit Graph

3516 Commits

Jimmy Chen
3e78034de6 Add BEACON_PROCESSOR_WORKERS_ACTIVE_GAUGE_BY_TYPE metric (#7935)
Similar to `BEACON_PROCESSOR_WORKERS_ACTIVE_TOTAL` but this metric also records the work type.

This is useful for identifying which task a worker is stuck on (due to a deadlock or otherwise), which is usually difficult to debug in production / release mode.
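A minimal sketch of what such a per-type gauge might look like with the `prometheus` crate (the registration and naming here are illustrative, not Lighthouse's actual metrics module):

```rust
use prometheus::{register_int_gauge_vec, IntGaugeVec};
use std::sync::LazyLock;

// Hypothetical registration: a gauge labelled by work type, so a stuck
// worker can be attributed to the task it was running.
static WORKERS_ACTIVE_BY_TYPE: LazyLock<IntGaugeVec> = LazyLock::new(|| {
    register_int_gauge_vec!(
        "beacon_processor_workers_active_gauge_by_type",
        "Count of active workers, labelled by the type of work being processed",
        &["work_type"]
    )
    .expect("metric can be registered")
});

fn on_worker_start(work_type: &str) {
    WORKERS_ACTIVE_BY_TYPE.with_label_values(&[work_type]).inc();
}

fn on_worker_finish(work_type: &str) {
    WORKERS_ACTIVE_BY_TYPE.with_label_values(&[work_type]).dec();
}
```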
2025-08-26 06:46:14 +00:00
Mac L
e438691683 Add Gloas boilerplate (#7728)
Adds the required boilerplate code for the Gloas (Glamsterdam) hard fork. This allows PRs testing Gloas-candidate features to test the fork transition.

This also includes de-duplication of post-Bellatrix readiness notifiers from #6797 (credit to @dapplion)
2025-08-26 02:49:48 +00:00
Jimmy Chen
daf1c7c3af Fix RPC blocks not getting fully KZG verified (#7927)
Fix RPC blocks not getting fully KZG verified due to incorrect list truncation.
2025-08-25 16:46:16 +00:00
Jimmy Chen
747d9118ff Fix DataColumnsByRoot request limit validation bug (#7928)
Fixes #7926

This was a bug I introduced in #7890; @pawanjay176 noticed it on some running nodes and added an RPC test to confirm it.

The culprit is this line, where I failed to fill the vec to its max size, so the max size wasn't calculated properly, resulting in all `DataColumnsByRoot` requests exceeding the max size during validation:
d24a6d2a45/consensus/types/src/chain_spec.rs (L1984)

The PR fixes this and includes new regression tests for this fix.
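A self-contained sketch of the bug class (the constants and length function are illustrative stand-ins, not the actual `chain_spec.rs` code): the validation limit is derived by measuring a maximally-filled request, so an under-filled vec yields a limit that every real request exceeds.

```rust
// Illustrative constants; not the real spec values.
const MAX_REQUEST_DATA_COLUMN_IDS: usize = 128;
const BYTES_PER_ID: usize = 40; // stand-in for the encoded size of one id

// Stand-in for the SSZ-encoded length of a request containing `ids`.
fn encoded_len(ids: &[[u8; BYTES_PER_ID]]) -> usize {
    ids.len() * BYTES_PER_ID
}

fn main() {
    // BUG: measuring an under-filled vec produces a limit that is too small.
    let broken_limit = encoded_len(&vec![[0u8; BYTES_PER_ID]; 1]);
    // FIX: fill the vec to its maximum length before measuring.
    let correct_limit = encoded_len(&vec![[0u8; BYTES_PER_ID]; MAX_REQUEST_DATA_COLUMN_IDS]);
    assert!(broken_limit < correct_limit);
}
```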
2025-08-25 04:13:36 +00:00
Mac L
c41d1181d2 Use Fork variants instead of version for JsonPayload types (#7909)
With Fulu, we increment the engine API version for `get_payload` but we do not also increment `new_payload`.
In Lighthouse, we have a tight coupling between these version numbers and the Fork variants.
For example, both `get_payload_v3` and `new_payload_v3` correspond to Deneb, `v4` to Electra, etc.

However, this coupling breaks with Fulu, where only `get_payload_v5` relates to Fulu while `new_payload_v4` now also corresponds to Fulu (`new_payload_v5` does not exist). While we can work around this, it creates a confusing situation where the versions and Fork variants are out of sync.

See the following code snippet where we are using the v4 endpoint, and yet passing a `V5` payload variant: 522bd9e9c6/beacon_node/execution_layer/src/engine_api/http.rs (L849-L870)


1. Remove `new_payload_v5` as it is unused in Fulu.
2. Rename the `JsonExecutionPayload` and `JsonGetExecutionPayloadResponse` types to use Fork variants instead of version variants. This clarifies the relationship between them.
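A simplified sketch of the renaming (the payload bodies are empty stand-ins; the real types are generic and carry the full payload fields):

```rust
// Stand-in payload bodies for the forks shown (earlier forks elided).
struct PayloadDeneb;
struct PayloadElectra;
struct PayloadFulu;

// Fork-named variants make the engine API mapping explicit instead of
// implying a version correspondence that no longer holds after Fulu.
enum JsonExecutionPayload {
    Deneb(PayloadDeneb),     // get_payload_v3 / new_payload_v3
    Electra(PayloadElectra), // get_payload_v4 / new_payload_v4
    Fulu(PayloadFulu),       // get_payload_v5 / new_payload_v4 (no new_payload_v5)
}
```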
2025-08-22 09:22:41 +00:00
João Oliveira
884f30094a use DEFAULT_TARGET_PEERS for target peers everywhere (#7916)
I was going to leave this as a comment on #7877 but noticed it had already been merged.
We have `DEFAULT_TARGET_PEERS`, which was set to 50 and only used in the `Default` impl for `peer_manager`'s `Config`, which then gets overridden by `lighthouse_network::Config`'s default.
This PR unifies everything on `DEFAULT_TARGET_PEERS`.
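A sketch of the unification (types simplified; the pre-#7877 value of 50 is used purely for illustration):

```rust
/// Single source of truth for the default peer target.
pub const DEFAULT_TARGET_PEERS: usize = 50; // illustrative: the pre-#7877 value

struct PeerManagerConfig {
    target_peers: usize,
}

impl Default for PeerManagerConfig {
    fn default() -> Self {
        // Every config that needs a default peer target derives it from the
        // same constant, so the two defaults can no longer drift apart.
        Self {
            target_peers: DEFAULT_TARGET_PEERS,
        }
    }
}
```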
2025-08-22 00:24:24 +00:00
Jimmy Chen
d24a6d2a45 Prioritise StatusV2 over StatusV1 RPC protocol (#7912)
Prioritise `StatusV2` over `StatusV1` RPC protocol.

A bug discovered during devnet-4 testing and extracted from the sync fixes PR #7876.
2025-08-21 23:02:18 +00:00
João Oliveira
cee30d8ca5 Update lighthouse to the latest upstream libp2p and gossipsub (#7828) 2025-08-21 07:57:46 +00:00
Age Manning
c9ffdf7f71 Re-assess Lighthouse's peer count for Fusaka (#7877) 2025-08-21 06:12:53 +00:00
Jimmy Chen
f19d4f6af1 Implement tracing spans for data column RPC requests and responses (#7831)
#7830
2025-08-20 23:35:51 +00:00
Jimmy Chen
2d223575d6 Avoid unnecessary database lookups in data column RPC requests (#7897)
This PR is an optimisation to avoid unnecessary database lookups when a peer requests data columns that the node doesn't custody (advertised via `cgc`).

e.g. an extreme but realistic example: a full node only stores 4 custody columns by default, but it may receive a range request of 32 slots with all 128 columns. This would result in 4096 database lookups, even though the node can only return 128 (4 * 32) of them.


- Filter data column RPC requests (`DataColumnsByRoot`, `DataColumnsByRange`) to only look up columns the node custodies
- Prevent unnecessary database queries that would always fail for non-custody columns
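A minimal sketch of the filtering (column indices simplified to `u64`): intersect the requested columns with the node's custody set before touching the database.

```rust
use std::collections::HashSet;

fn columns_to_look_up(requested: &[u64], custody_columns: &HashSet<u64>) -> Vec<u64> {
    requested
        .iter()
        .copied()
        .filter(|index| custody_columns.contains(index))
        .collect()
}

fn main() {
    // A full node custodying 4 columns receives a request for all 128.
    let custody: HashSet<u64> = [0, 32, 64, 96].into_iter().collect();
    let requested: Vec<u64> = (0..128).collect();
    // Only 4 lookups per block instead of 128.
    assert_eq!(columns_to_look_up(&requested, &custody).len(), 4);
}
```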
2025-08-20 05:08:53 +00:00
Jimmy Chen
b4704eab4a Fulu update to spec v1.6.0-alpha.4 (#7890)
Fulu update to spec [v1.6.0-alpha.4](https://github.com/ethereum/consensus-specs/releases/tag/v1.6.0-alpha.4).
- Make `number_of_columns` a preset
- Optimise `get_custody_groups` to avoid computing if cgc = 128
- Add support for additional typenum values in type_dispatch macro
2025-08-20 02:05:04 +00:00
Jimmy Chen
34dd1b27ae Revise data column rpc limits and queue sizes (#7887)
Revise data column RPC limits and queue sizes. Also remove some outdated TODOs for Fulu / DAS.
2025-08-19 03:48:08 +00:00
Daniel Knopik
1fd7ead010 Do not filter validators by status if filter is an empty list (#7884)
69d2feb12a/apis/beacon/states/validators.yaml (L128-L130) says we must not filter if the filter is an empty list.


Add a check for `statuses.is_empty()`.
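A sketch of the check (statuses reduced to a toy enum): an empty filter list means "return everything", per the spec linked above.

```rust
#[derive(PartialEq)]
enum ValidatorStatus {
    ActiveOngoing,
    ExitedSlashed,
    // ... other statuses elided
}

fn matches_filter(filter: &[ValidatorStatus], status: &ValidatorStatus) -> bool {
    // An empty filter matches every validator status.
    filter.is_empty() || filter.contains(status)
}
```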
2025-08-18 07:46:37 +00:00
Michael Sproul
836c39efaa Shrink persisted fork choice data (#7805)
Closes:

- https://github.com/sigp/lighthouse/issues/7760


- [x] Remove `balances_cache` from `PersistedForkChoiceStore` (~65 MB saving on mainnet)
- [x] Remove `justified_balances` from `PersistedForkChoiceStore` (~16 MB saving on mainnet)
- [x] Remove `balances` from `ProtoArray`/`SszContainer`.
- [x] Implement zstd compression for votes
- [x] Fix bug in justified state usage
- [x] Bump schema version to V28 and implement migration.
2025-08-18 06:03:28 +00:00
Jimmy Chen
aa8cba3741 Upgrade rust-eth-kzg to 0.8.0 (#7870)
#7864

The main breaking change in v0.8.0 is the `TrustedSetup` initialisation - it now requires a JSON string via `PeerDASTrustedSetup::from_json`.
2025-08-18 02:52:39 +00:00
Age Manning
9200042910 Transition network key to hex format (#7665)
#7181


Instead of storing the network key as binary data, we store it as hex, allowing users to modify it via the file.

We can still read the old binary format; however, we will migrate binary keys to hex as the new standard.
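A hedged sketch of the load-and-migrate logic (not the actual Lighthouse code, and assuming the `hex` crate): prefer the new hex format, fall back to legacy binary, and rewrite legacy files as hex on load.

```rust
use std::fs;
use std::io;
use std::path::Path;

fn load_network_key(path: &Path) -> io::Result<Vec<u8>> {
    let raw = fs::read(path)?;
    // New format: the file contains the key hex-encoded.
    if let Ok(text) = std::str::from_utf8(&raw) {
        if let Ok(bytes) = hex::decode(text.trim()) {
            return Ok(bytes);
        }
    }
    // Legacy binary key: migrate the file to hex for future loads.
    fs::write(path, hex::encode(&raw))?;
    Ok(raw)
}
```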
2025-08-15 07:12:19 +00:00
Eitan Seri-Levi
90fa7c216e Fix ssz formatting for /light_client/updates beacon API endpoint (#7806)
#7759


We were incorrectly encoding the full response from `/light_client/updates` instead of only encoding the light client update.
2025-08-15 03:17:29 +00:00
Age Manning
ee1b0ae2ff Allow for sync state where batch is unknown (#7391) 2025-08-13 06:00:49 +00:00
chonghe
522bd9e9c6 Update Rust Edition to 2024 (#7766)
* #7749

Thanks @dknopik and @michaelsproul for your help!
2025-08-13 03:04:31 +00:00
Michael Sproul
918121e313 Fix bugs in rebasing of states prior to finalization (#7849)
Attempt to fix this error reported by `beaconcha.in` on their Hoodi archive nodes:

> {"code":500,"message":"UNHANDLED_ERROR: DBError(CacheBuildError(BeaconState(MilhouseError(OutOfBoundsIterFrom { index: 1199549, len: 1060000 }))))","stacktraces":[]}


There are only a handful of places where we call `iter_from`.

This one is safe by construction (the check immediately prior ensures `self.pubkeys.len()` is not out of bounds):

cfb1f73310/beacon_node/beacon_chain/src/validator_pubkey_cache.rs (L84-L90)

This one should also be safe, and the indexes used here would not be as large as the ones in the reported error:

cfb1f73310/consensus/state_processing/src/per_epoch_processing/single_pass.rs (L365-L368)

Which leaves one remaining usage which must be the culprit:

cfb1f73310/consensus/types/src/beacon_state.rs (L2109-L2113)

This indexing relies on the invariant that `self.pubkey_cache().len() <= self.validators.len()`. We mostly maintain that invariant, except for in `rebase_caches_on` (fixed in this PR).

The other bug is that we were calling `rebase_on_finalized` for all "hot" states, which post-v7.1.0 include states prior to the split that are required by the hdiff grid. This is how we ended up calling something like `genesis_state.rebase_on(&split_state)`, which corrupts the pubkey cache of the genesis state using the newer pubkey cache from the split state.
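A sketch of the invariant described above (generic stand-in types): iterating "validators not yet in the pubkey cache" from the cache length is only safe while the cache never outgrows the validator list.

```rust
fn uncached_validators<T>(validators: &[T], pubkey_cache_len: usize) -> Option<&[T]> {
    // Returns None when the invariant `pubkey_cache_len <= validators.len()`
    // is violated, e.g. after rebasing a newer (larger) cache onto an older
    // state, which is exactly the failure mode fixed in this PR.
    validators.get(pubkey_cache_len..)
}
```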
2025-08-12 02:19:24 +00:00
Pawan Dhananjay
80ba0b169b Backfill peer attribution (#7762)
Partly addresses https://github.com/sigp/lighthouse/issues/7744


Implement peer sync attribution for backfill sync, similar to #7733.
2025-08-12 02:11:56 +00:00
Eitan Seri-Levi
122f16776f Add metrics to track beacon processor queue times (#7808)
This PR adds a `created_timestamp` to the beacon processor send channel. When work items are sent through that channel, `try_send` forwards the work event along with the current timestamp to the beacon processor. When the work event is completed, the `Drop` impl for `SendOnDrop` records the time from work event creation to completion. Previously we only had data on how long a work event took to process, not on how long it sat in the queue beforehand.
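A simplified sketch of the mechanism (metric recording elided): the creation timestamp travels with the work event through the channel, and the guard's `Drop` impl observes creation-to-completion time, which now includes time spent queued.

```rust
use std::time::Instant;

struct SendOnDrop {
    created_timestamp: Instant, // set when the work event is created
}

impl Drop for SendOnDrop {
    fn drop(&mut self) {
        // Queue time + processing time, measured when the work completes.
        let total = self.created_timestamp.elapsed();
        // In the real code this would be observed into a histogram metric.
        let _ = total;
    }
}
```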
2025-08-12 01:06:42 +00:00
Pawan Dhananjay
4262ad3e01 Add a flag to disable getBlobs (#7853)


Add a flag to disable getBlobs. I configured the flag to disable it regardless of version because it's most likely something we use for testing anyway.
2025-08-11 23:17:00 +00:00
Jimmy Chen
40c2fd5ff4 Instrument tracing spans for block processing and import (#7816)
#7815

- Removes all existing spans, so some span fields that appear in logs, like `service_name`, may be lost.
- Instruments a few key code paths in the beacon node, starting from the **root spans** named below:

* Gossip block and blobs
  * `process_gossip_data_column_sidecar`
  * `process_gossip_blob`
  * `process_gossip_block`
* RPC block and blobs
  * `process_rpc_block`
  * `process_rpc_blobs`
  * `process_rpc_custody_columns`
* RPC blocks (range and backfill)
  * `process_chain_segment`
* `PendingComponents` lifecycle
  * `pending_components`

To test locally:
* Run Grafana and Tempo with https://github.com/sigp/lighthouse-metrics/pull/57
* Run Lighthouse BN with `--telemetry-collector-url http://localhost:4317`

Some captured traces can be found here: https://hackmd.io/@jimmygchen/r1sLOxPPeg

Removing the old spans seems to have reduced memory usage quite a lot - I think we were using them on long-running tasks, and too excessively:
(Screenshot: memory usage drop after removing the old spans - https://github.com/user-attachments/assets/5208bbe4-53b2-4ead-bc71-0b782c788669)
2025-08-08 05:32:22 +00:00
Jimmy Chen
3a02bdd94a Adjust DA checker cache size (#7825)
The current `OVERFLOW_LRU_CAPACITY` of `1024` seems a bit excessive now that we rarely store more than 1 `PendingComponents` (under normal network conditions). Additionally, given the blob count increases, the max size of `PendingComponents` has also increased and is expected to increase further.

This PR brings the max capacity of the cache down to `64`, which should still be more than enough headroom while also giving us better protection from the network.
2025-08-07 05:11:38 +00:00
Jimmy Chen
8bc6693dac Fix wrong columns getting processed on a CGC change (#7792)
This PR fixes a bug where wrong columns could get processed immediately after a CGC increase.

Scenario:
- The node's CGC increased due to additional validators attaching to it (let's say from 10 to 11).
- The new CGC is advertised and the new subnets are subscribed to immediately; however, the change won't take effect in the data availability check until the next epoch (see [this](ab0e8870b4/beacon_node/beacon_chain/src/validator_custody.rs (L93-L99))). The data availability checker still only requires 10 columns for the current epoch.
- During this time, data columns for the additional custody column (let's say column 11) may arrive via gossip, as we're already subscribed to the topic, and may be incorrectly used to satisfy the existing data availability requirement (10 columns). This results in the additional column (instead of a required one) getting persisted, causing database inconsistency.
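A sketch of the fix's intent (types simplified): a gossiped column only counts toward availability if it is in the custody set that is effective for the current epoch, not merely on a topic we happen to be subscribed to.

```rust
use std::collections::HashSet;

fn satisfies_da_requirement(
    received_columns: &HashSet<u64>,
    effective_custody: &HashSet<u64>, // still 10 columns until the next epoch
) -> bool {
    // Every *effective* custody column must be present; columns outside the
    // effective set (e.g. the freshly added column 11) do not help.
    effective_custody
        .iter()
        .all(|column| received_columns.contains(column))
}
```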
2025-08-07 00:45:04 +00:00
Daniel Ramirez-Chiquillo
9c972201bc Fix: RPC test failures (#7734)
Fixes #7735


Use `tracing::subscriber::set_default` to ensure that each test/thread has its own subscriber.
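The pattern in question: `set_default` installs a subscriber for the current thread only, for as long as the returned guard lives, so parallel tests no longer race on a single global subscriber.

```rust
#[test]
fn rpc_test_with_isolated_subscriber() {
    let subscriber = tracing_subscriber::fmt().with_test_writer().finish();
    // The guard scopes the subscriber to this thread; it is uninstalled
    // when the guard is dropped at the end of the test.
    let _guard = tracing::subscriber::set_default(subscriber);
    // ... test body: tracing output from this thread uses `subscriber`.
}
```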
2025-08-06 14:59:41 +00:00
Michael Sproul
0dcce40ccb Fix Clippy for Rust 1.90 beta (#7826)
Fix Clippy for recently released Rust 1.90 beta. There may be more changes required when Rust 1.89 stable is released in a few days, but possibly not 🤞
2025-08-05 13:52:26 +00:00
Jimmy Chen
adf6ad70f0 Update fetch blobs metrics buckets (#7823)
While looking at metrics I noticed that `beacon_blobs_from_el_expected` and `beacon_blobs_from_el_received_total` have different buckets. This PR adds more buckets to both (to prepare for Fusaka) and makes them consistent.
2025-08-01 18:27:53 +00:00
Michael Sproul
134039d014 Simplify ConfigAndPreset (#7777)
I noticed that we are serving preset values for Fulu on mainnet nodes prior to the fork. This has already gone live in v7.1.0, but should hopefully be handled in a graceful way by API consumers.

This PR _reverts_ the serving of Fulu data prior to Fulu, by serving Fulu data only if Fulu is scheduled.
2025-07-25 08:53:24 +00:00
Pawan Dhananjay
09065a851f Add builder blinded_blocks v2 (#7778)
Partially addresses https://github.com/sigp/lighthouse/issues/7381


Add blinded_blocks v2 method specified in https://github.com/ethereum/builder-specs/pull/123/
2025-07-25 08:29:19 +00:00
Jimmy Chen
2aae08a8aa Remove KZG verification on blobs fetched from the EL (#7771)
Continuation of #7713, addresses comment about skipping KZG verification on EL fetched blobs:

https://github.com/sigp/lighthouse/pull/7713#discussion_r2198542501
2025-07-25 06:49:50 +00:00
Jimmy Chen
4daa015971 Remove peer sampling code (#7768)
Peer sampling has been completely removed from the spec. This PR removes our partial implementation from the codebase.
https://github.com/ethereum/consensus-specs/pull/4393
2025-07-23 03:24:45 +00:00
Michael Sproul
ce99e0c383 Refine delayed head block logging (#7705)
Small tweak to `Delayed head block` logging to make it more representative of actual issues. Previously we used the total import delay to determine whether a block was late, but this includes the time taken for IO (and now hdiff computation) which happens _after_ the block is made attestable.

This PR changes the logic to use the attestable delay (where possible), falling back to the previous value if the block doesn't have one, e.g. if it didn't meet the conditions to make it into the attestable cache.
2025-07-23 00:29:18 +00:00
Eitan Seri-Levi
db8b6be9df Data column custody info (#7648)
#7647


Introduces a new record in the blobs db: `DataColumnCustodyInfo`.

When `DataColumnCustodyInfo` exists in the db, it indicates that a recent cgc change has occurred and/or that a custody backfill sync is currently in progress (custody backfill will be added in a separate PR). When a cgc change has occurred, `earliest_available_slot` will be equal to the slot at which the cgc change occurred. During custody backfill sync, `earliest_available_slot` should be updated incrementally as the sync progresses.

~~Note that if `advertise_false_custody_group_count` is enabled we do not add a `DataColumnCustodyInfo` record in the db as that would affect the status v2 response.~~
(See comment https://github.com/sigp/lighthouse/pull/7648#discussion_r2212403389)

~~If `DataColumnCustodyInfo` doesn't exist in the db this indicates that we have fulfilled our custody requirements up to the DA window.~~
(It now always exists, and the slot will be set to `None` once backfill is complete.)

StatusV2 now uses `DataColumnCustodyInfo` to calculate the `earliest_available_slot` if a `DataColumnCustodyInfo` record exists in the db; if its slot is `None`, we return the `oldest_block_slot`.
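A sketch of that StatusV2 logic (types simplified): the outer `Option` models whether a `DataColumnCustodyInfo` record exists, and the inner `Option` models its slot, which becomes `None` once backfill completes.

```rust
fn status_earliest_available_slot(
    custody_info: Option<Option<u64>>,
    oldest_block_slot: u64,
) -> u64 {
    match custody_info {
        // Recent cgc change or backfill in progress: report the stored slot.
        Some(Some(slot)) => slot,
        // Record absent, or backfill complete: fall back to the store.
        Some(None) | None => oldest_block_slot,
    }
}
```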
2025-07-22 13:30:30 +00:00
Jimmy Chen
b48879a566 Remove KZG verification from local block production and blobs fetched from the EL (#7713)
#7700


As described in the title, the EL already performs KZG verification on all blobs when they enter the mempool, so it's redundant to perform extra validation on blobs returned from the EL.

This PR removes:
- KZG verification for both blobs and data columns during block production
- KZG verification for data columns after the fetch engine blobs call. I have not done this for blobs because it requires extra changes to check the observed cache, and it doesn't feel like a worthwhile optimisation given the small number of blobs per block.

This PR does not remove KZG verification on the block publishing path yet.
2025-07-22 10:48:49 +00:00
Pawan Dhananjay
3f06e5dfba Fix enr loading from disk with cgc (#7754)


When building an ENR on startup, we weren't using the value in the custody context.
This resulted in the ENR value getting updated when the cgc updates and the change getting persisted, but then being set back to the default on restart.
This PR takes the value explicitly from the custody context.
2025-07-18 04:51:11 +00:00
Eitan Seri-Levi
d6de8a7484 Add additional broadcast validation tests for Fulu/PeerDAS (#7325)
Closes #6855


Add PeerDAS broadcast validation tests and fix a small bug where `sampling_columns_indices` is `None` (indicating that we've already sampled the necessary columns) and `process_gossip_data_columns` gets called.
2025-07-17 07:50:28 +00:00
Pawan Dhananjay
309c301363 Allow /validator apis to work pre-genesis (#7729)


The Lighthouse BN HTTP endpoint would return a server error pre-genesis on `validator/duties/attester` and `validator/prepare_beacon_proposer`, because `slot_clock.now()` returns `None` pre-genesis.

The Prysm VC depends on these endpoints pre-genesis and was having issues interoperating with the Lighthouse BN for this reason.

The proposer duties endpoint explicitly handles the pre-genesis case here
538067f1ff/beacon_node/http_api/src/proposer_duties.rs (L23-L28)

I see no reason why we can't make the other endpoints more flexible to work pre-genesis. This PR handles the pre-genesis case on the attester and prepare_beacon_proposer endpoints as well.
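A hedged sketch of the handling (a hypothetical helper; the real change lives in the individual endpoint handlers): before genesis the slot clock yields `None`, so clamp to the genesis slot instead of returning a server error.

```rust
fn effective_current_slot(now: Option<u64>, genesis_slot: u64) -> u64 {
    // Pre-genesis, behave as if we are at the genesis slot so duties
    // endpoints can still answer.
    now.unwrap_or(genesis_slot)
}
```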

Thanks for raising @james-prysm.
2025-07-14 06:42:55 +00:00
Pawan Dhananjay
90ff64381e Sync peer attribution (#7733)

Closes #7604


Improvements to range sync, including:

1. Contain column requests only to peers that are part of the `SyncingChain`
2. Attribute the fault to the correct peer and downscore them if they don't return the data columns for the request
3. Improve sync performance by retrying only the failed columns from other peers instead of failing the entire batch
4. Use the `earliest_available_slot` to make requests to peers that claim to have the epoch. Note: if no `earliest_available_slot` info is available, fall back to the previous logic, i.e. assume the peer has everything backfilled up to the WS checkpoint / DA boundary

Tested this on fusaka-devnet-2 with a full node and a supernode, and the recovery logic seems to work well.
Also tested this a little on mainnet.

Need to do more testing and possibly add some unit tests.
2025-07-12 00:02:30 +00:00
ethDreamer
b43e0b446c Final changes for fusaka-devnet-2 (#7655)
Closes #7467.

This PR primarily addresses [the P2P changes](https://github.com/ethereum/EIPs/pull/9840) in [fusaka-devnet-2](https://fusaka-devnet-2.ethpandaops.io/). Specifically:

* [the new `nfd` parameter added to the `ENR`](https://github.com/ethereum/EIPs/pull/9840)
* [the modified `compute_fork_digest()` changes for every BPO fork](https://github.com/ethereum/EIPs/pull/9840)

90% of this PR was hacked together as fast as possible during the Berlinterop while running between Glamsterdam debates. Luckily, it seems to work, but I was unable to be as careful in avoiding bugs as I usually am. I've cleaned up the things *I remember* wanting to come back and have a closer look at, but am still working on this.

Progress:
* [x] get it working on `fusaka-devnet-2`
* [ ] [*optional* disconnect from peers with incorrect `nfd` at the fork boundary](https://github.com/ethereum/consensus-specs/pull/4407) - Can be addressed in a future PR if necessary
* [x] first pass clean-up
* [x] fix up all the broken tests
* [x] final self-review
* [x] more thorough review from people more familiar with affected code
2025-07-10 21:32:58 +00:00
Jimmy Chen
3826fe91f4 Improve data column KZG verification metric buckets (#7717)
The current data column KZG verification buckets are not giving us useful info, as the upper bound is too low: most of the numbers for batch verification are above 70ms, so we don't know how much time it really takes.

This PR improves the buckets based on the numbers we got from testing. Exponential buckets seem like a good candidate here, given we're expecting to increase the blob count with a similar approach (possibly 2x each fork if it goes well).
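An illustrative bucket choice using the `prometheus` crate (the values are assumptions, not the PR's exact numbers): exponential buckets cover a doubling workload far better than a linear range whose upper bound sits below the observed ~70ms.

```rust
use prometheus::{exponential_buckets, histogram_opts, register_histogram, Histogram};

fn kzg_verification_histogram() -> Histogram {
    // 10 buckets starting at 5ms, doubling each time: 5ms up to ~2.56s.
    let buckets = exponential_buckets(0.005, 2.0, 10).expect("valid bucket parameters");
    register_histogram!(histogram_opts!(
        "data_column_kzg_verification_seconds",
        "Time taken to batch-verify data column KZG proofs",
        buckets
    ))
    .expect("metric can be registered")
}
```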
2025-07-10 20:59:45 +00:00
Michael Sproul
538067f1ff Merge remote-tracking branch 'origin/stable' into unstable 2025-07-10 15:53:45 +10:00
Michael Sproul
cfb1f73310 Release v7.1.0 (#7609)
Post-Pectra release for tree-states hot 🎉

Already merged to `release-v7.1.0`:

- https://github.com/sigp/lighthouse/pull/7444
- https://github.com/sigp/lighthouse/pull/6750
- https://github.com/sigp/lighthouse/pull/7437
- https://github.com/sigp/lighthouse/pull/7133
- https://github.com/sigp/lighthouse/pull/7620
- https://github.com/sigp/lighthouse/pull/7663
2025-07-10 01:44:46 +00:00
João Oliveira
8b5ccacac9 Error from RPC send_response when request doesn't exist on the active inbound requests (#7663)
Lighthouse is currently logging a lot of errors in the `RPC` behaviour whenever a response is received for a `request_id` that no longer exists in `active_inbound_requests`. This is likely due to a data race or timing issue (e.g. the peer disconnecting before the response is handled).
This PR addresses that by removing the error logging from the RPC layer. Instead, `RPC::send_response` now simply returns an `Err`, shifting the responsibility to the main service. The main service can then determine whether the peer is still connected and only log an error if it is.
Thanks @ackintosh for helping debug!
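A simplified sketch of the new contract (stream handling elided): the RPC layer hands the response back as an `Err` instead of logging, and the caller decides whether the peer is still connected before treating it as an error.

```rust
use std::collections::HashMap;

struct Rpc<Resp> {
    // request_id -> inbound request state (substreams elided).
    active_inbound_requests: HashMap<u64, ()>,
    _marker: std::marker::PhantomData<Resp>,
}

impl<Resp> Rpc<Resp> {
    fn send_response(&mut self, request_id: u64, response: Resp) -> Result<(), Resp> {
        if self.active_inbound_requests.remove(&request_id).is_some() {
            // ... deliver `response` on the stored substream.
            Ok(())
        } else {
            // Likely a race: the peer disconnected before we replied.
            // No logging here; the main service decides what to do.
            Err(response)
        }
    }
}
```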
2025-07-09 14:26:51 +00:00
Michael Sproul
8e55684b06 Reintroduce --logfile with deprecation warning (#7723)
Reintroduce the `--logfile` flag with a deprecation warning so that it doesn't prevent nodes from starting. This is considered preferable to breaking node startup to force users to fix the flag, even though it means the `--logfile` flag is completely ineffective.

The flag was initially removed in:

- https://github.com/sigp/lighthouse/pull/6339
2025-07-09 08:08:27 +00:00
Michael Sproul
7b2f138ca7 Merge remote-tracking branch 'origin/stable' into release-v7.1.0 2025-07-09 11:19:16 +10:00
Michael Sproul
b9c1a2b0c0 Fix description of DB read bytes metric (#7716)
Fix a trivial typo that mixed up reads and writes.
2025-07-08 08:50:15 +00:00
Eitan Seri-Levi
bd8a2a8ffb Gossip recently computed light client data (#7023) 2025-07-08 07:07:10 +00:00