Commit Graph

1538 Commits

Author SHA1 Message Date
Michael Sproul
f04d5ecddd Another check to prevent duplicate block imports (#8050)
Attempt to address performance issues caused by importing the same block multiple times.


  - Check fork choice "after" obtaining the fork choice write lock in `BeaconChain::import_block`. We actually use an upgradable read lock, but this is semantically equivalent (the upgradable read has the advantage of not excluding regular reads).

The hope is that this change has several benefits:

1. By preventing duplicate block imports we save time repeating work inside `import_block` that is unnecessary, e.g. writing the state to disk. Although the store itself now takes some measures to avoid re-writing diffs, it is even better if we avoid a disk write entirely.
2. By returning `DuplicateFullyImported`, we reduce some duplicated work downstream. E.g. if multiple threads importing columns trigger `import_block`, now only _one_ of them will get a notification of the block import completing successfully, and only this one will run `recompute_head`. This should help avoid a situation where multiple beacon processor workers are consumed by threads blocking on the `recompute_head_lock`. However, a similar block-fest is still possible with the upgradable fork choice lock (a large number of threads can be blocked waiting for the first thread to complete block import).


Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
2025-09-16 04:10:42 +00:00
Jimmy Chen
fb77ce9e19 Add missing event in PendingComponent span and clean up sync logs (#8033)
I was looking into some long `PendingComponents` span and noticed the block event wasn't added to the span, so it wasn't possible to see when the block was added from the trace view, this PR fixes this.

<img width="637" height="430" alt="image" src="https://github.com/user-attachments/assets/65040b1c-11e7-43ac-951b-bdfb34b665fb" />

Additionally I've noticed a lot of noises and confusion in sync logs due to the initial`peer_id` being included as part of the syncing chain span, causing all logs under the span to have that `peer_id`, which may not be accurate for some sync logs, I've removed `peer_id` from the `SyncingChain` span, and also cleaned up a bunch of spans to use `%` (display) for slots and epochs to make logs easier to read.


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-09-12 05:11:30 +00:00
kevaundray
f71d69755d chore: add comment to PendingComponents (#7979)
Adds doc comment


  


Co-Authored-By: Kevaundray Wedderburn <kevtheappdev@gmail.com>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
2025-09-10 13:48:11 +00:00
Eitan Seri-Levi
caa1df6fc3 Skip column gossip verification logic during block production (#7973)
#7950


  Skip column gossip verification logic during block production as its redundant and potentially computationally expensive.


Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>

Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>

Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>

Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-09-10 12:29:56 +00:00
hopinheimer
38205192ca Fix http api tests ci (#7943)
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>

Co-Authored-By: Michael Sproul <micsproul@gmail.com>

Co-Authored-By: Michael Sproul <michael@sigmaprime.io>

Co-Authored-By: hopinheimer <knmanas6@gmail.com>
2025-09-10 06:46:48 +00:00
Jimmy Chen
8a4f6cf0d5 Instrument tracing on block production code path (#8017)
Partially #7814. Instrument block production code path.

New root spans:
* `produce_block_v3`
* `produce_block_v2`

Example traces:

<img width="518" height="432" alt="image" src="https://github.com/user-attachments/assets/a9413d25-501c-49dc-95cc-623db5988981" />


  


Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
2025-09-10 03:30:51 +00:00
Eitan Seri-Levi
8ec2640e04 Don't penalize peers if locally constructed light client data is stale (#7996)
#7994


  We seem to be penalizing peers in situations where locally constructed light client data is stale. This PR ignores incoming light client data if our locally constructed light client data isn't up to date.


Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
2025-09-05 03:23:34 +00:00
Jimmy Chen
9d2f55a399 Fix data column reconstruction error (#7998)
Addresses #7991
2025-09-04 20:17:52 +00:00
Jimmy Chen
eef02afc93 Fix data availability checker race condition causing partial data columns to be served over RPC (#7961)
Partially resolves #6439, an simpler alternative to #7931.

Race condition occurs when RPC data columns arrives after a block has been imported and removed from the DA checker:
1. Block becomes available via gossip
2. RPC columns arrive and pass fork choice check (block hasn't been imported)
3. Block import completes (removing block from DA checker)
4. RPC data columns finish verification and get imported into DA checker

This causes two issues:
1. **Partial data serving**: Already imported components get re-inserted, potentially causing LH to serve incomplete data
2. **State cache misses**: Leads to state reconstruction, holding the availability cache write lock longer and increasing race likelihood

### Proposed Changes

1. Never manually remove pending components from DA checker. Components are only removed via LRU eviction as finality advances. This makes sure we don't run into the issue described above.
2. Use `get` instead of `pop` when recovering the executed block, this prevents cache misses in race condition. This should reduce the likelihood of the race condition
3. Refactor DA checker to drop write lock as soon as components are added. This should also reduce the likelihood of the race condition

**Trade-offs:**

This solution eliminates a few nasty race conditions while allowing simplicity, with the cost of allowing block re-import (already existing).

The increase in memory in DA checker can be partially offset by a reduction in block cache size if this really comes an issue (as we now serve recent blocks from DA checker).
2025-09-02 07:18:23 +00:00
Jimmy Chen
979ed2557c Remove expect usage in kzg_utils (#7957)
Remove `expect` usage in `kzg_utils` to handle the case where EL sends us invalid proof size instead of crashing.
2025-09-01 09:21:26 +00:00
kevaundray
9cc3c0553b chore: small refactor of epoch method (#7902)
Stylistic; mostly using early returns to avoid the nested logic

Which issue # does this PR address?


  Please list or describe the changes introduced by this PR.
2025-09-01 09:21:23 +00:00
Jimmy Chen
a134d43446 Use rayon to speed up batch KZG verification (#7921)
Addresses #7866.


  Use Rayon to speed up batch KZG verification during range / backfill sync.

While I was analysing the traces, I also discovered a bug that resulted in only the first 128 columns in a chain segment batch being verified. This PR fixes it, so we might actually observe slower range sync due to more cells being KZG verified.

I've also updated the handling of batch KZG failure to only find the first invalid KZG column when verification fails as this gets very expensive during range/backfill sync.
2025-08-29 00:59:40 +00:00
Jimmy Chen
c13fb2fb46 Instrument publish_block code path (#7945)
Instrument `publish_block` code path and log dropped data columns when publishing.

Example spans (running the devnet from my laptop, so the numbers aren't great)

<img width="734" height="296" alt="image" src="https://github.com/user-attachments/assets/20620bf7-2b38-4392-aa75-9ba96d3a7f0d" />

<img width="718" height="625" alt="image" src="https://github.com/user-attachments/assets/61e1ff1c-65b5-4ad4-981a-d0fadc9829e1" />
2025-08-28 03:31:29 +00:00
Mac L
e438691683 Add Gloas boilerplate (#7728)
Adds the required boilerplate code for the Gloas (Glamsterdam) hard fork. This allows PRs testing Gloas-candidate features to test fork transition.

This also includes de-duplication of post-Bellatrix readiness notifiers from #6797 (credit to @dapplion)
2025-08-26 02:49:48 +00:00
Jimmy Chen
daf1c7c3af Fix RPC blocks not getting fully KZG verified (#7927)
Fix RPC blocks not getting fully KZG verified due to incorrect list truncation.
2025-08-25 16:46:16 +00:00
Jimmy Chen
f19d4f6af1 Implement tracing spans for data columm RPC requests and responses (#7831)
#7830
2025-08-20 23:35:51 +00:00
Jimmy Chen
2d223575d6 Avoid unnecessary database lookups in data column RPC requests (#7897)
This PR is an optimisation to avoid unnecessary database lookups when peer requests data columns that the node doesn't custody (advertised via `cgc`).

e.g. an extreme but realistic example - a full node only store 4 custody columns by default, but it may receive a range request of 32 slots with all 128 columns, and this would result in 4096 database lookups but the node is only able to get 128 (4 * 32) of them.


  - Filter data column RPC requests (`DataColumnsByRoot`, `DataColumnsByRange`) to only lookup columns the node custodies
- Prevents unnecessary database queries that would always fail for non-custody columns
2025-08-20 05:08:53 +00:00
Jimmy Chen
b4704eab4a Fulu update to spec v1.6.0-alpha.4 (#7890)
Fulu update to spec [v1.6.0-alpha.4](https://github.com/ethereum/consensus-specs/releases/tag/v1.6.0-alpha.4).
- Make `number_of_columns` a preset
- Optimise `get_custody_groups` to avoid computing if cgc = 128
- Add support for additional typenum values in type_dispatch macro
2025-08-20 02:05:04 +00:00
Jimmy Chen
34dd1b27ae Revise data column rpc limits and queue sizes (#7887)
Revise data column rpc limits and queue sizes. Also removed some outdated TODOs for Fulu / das.
2025-08-19 03:48:08 +00:00
Michael Sproul
836c39efaa Shrink persisted fork choice data (#7805)
Closes:

- https://github.com/sigp/lighthouse/issues/7760


  - [x] Remove `balances_cache` from `PersistedForkChoiceStore` (~65 MB saving on mainnet)
- [x] Remove `justified_balances` from `PersistedForkChoiceStore` (~16 MB saving on mainnet)
- [x] Remove `balances` from `ProtoArray`/`SszContainer`.
- [x] Implement zstd compression for votes
- [x] Fix bug in justified state usage
- [x] Bump schema version to V28 and implement migration.
2025-08-18 06:03:28 +00:00
Jimmy Chen
aa8cba3741 Upgrade rust-eth-kzg to 0.8.0 (#7870)
#7864

The main breaking change in v0.8.0 is the `TrustedSetup` initialisation - it now requires a json string via `PeerDASTrustedSetup::from_json`.
2025-08-18 02:52:39 +00:00
chonghe
522bd9e9c6 Update Rust Edition to 2024 (#7766)
* #7749

Thanks @dknopik and @michaelsproul for your help!
2025-08-13 03:04:31 +00:00
Pawan Dhananjay
4262ad3e01 Add a flag to disable getBlobs (#7853)
N/A


  Add a flag to disable get blobs. I configured the flag to disable it regardless of version because its most likely something we use for testing anyway.
2025-08-11 23:17:00 +00:00
Jimmy Chen
40c2fd5ff4 Instrument tracing spans for block processing and import (#7816)
#7815

- removes all existing spans, so some span fields that appear in logs like `service_name` may be lost.
- instruments a few key code paths in the beacon node, starting from **root spans** named below:

* Gossip block and blobs
* `process_gossip_data_column_sidecar`
* `process_gossip_blob`
* `process_gossip_block`
* Rpc block and blobs
* `process_rpc_block`
* `process_rpc_blobs`
* `process_rpc_custody_columns`
* Rpc blocks (range and backfill)
* `process_chain_segment`
* `PendingComponents` lifecycle
* `pending_components`

To test locally:
* Run Grafana and Tempo with https://github.com/sigp/lighthouse-metrics/pull/57
* Run Lighthouse BN with `--telemetry-collector-url http://localhost:4317`

Some captured traces can be found here: https://hackmd.io/@jimmygchen/r1sLOxPPeg

Removing the old spans seem to have reduced the memory usage quite a lot - i think we were using them on long running tasks and too excessively:
<img width="910" height="495" alt="image" src="https://github.com/user-attachments/assets/5208bbe4-53b2-4ead-bc71-0b782c788669" />
2025-08-08 05:32:22 +00:00
Jimmy Chen
3a02bdd94a Adjust DA checker cache size (#7825)
The current `OVERFLOW_LRU_CAPACITY` of `1024` seems a bit excessive now we rarely store more than 1 `PendingComponents` (under normal networking components). Additionally given the blob count increases, the max size of `PendingComponents` has also increased and is expected to increase further.

This PR brings the max capacity of the cache down to `64`, which should be more than enough headroom but also give us  better protection from the network.
2025-08-07 05:11:38 +00:00
Jimmy Chen
8bc6693dac Fix wrong columns getting processed on a CGC change (#7792)
This PR fixes a bug where wrong columns could get processed immediately after a CGC increase.

Scenario:
- The node's CGC increased due to additional validators attached to it (lets say from 10 to 11)
- The new CGC is advertised and new subnets are subscribed immediately, however the change won't be effective in the data availability check until the next epoch (See [this](ab0e8870b4/beacon_node/beacon_chain/src/validator_custody.rs (L93-L99))). Data availability checker still only require 10 columns for the current epoch.
- During this time, data columns for the additional custody column (lets say column 11) may arrive via gossip as we're already subscribed to the topic, and it may be incorrectly used to satisfy the existing data availability requirement (10 columns), and result in this additional column (instead of a required one) getting persisted, resulting in database inconsistency.
2025-08-07 00:45:04 +00:00
Michael Sproul
0dcce40ccb Fix Clippy for Rust 1.90 beta (#7826)
Fix Clippy for recently released Rust 1.90 beta. There may be more changes required when Rust 1.89 stable is released in a few days, but possibly not 🤞
2025-08-05 13:52:26 +00:00
Jimmy Chen
adf6ad70f0 Update fetch blobs metrics buckets (#7823)
While looking at metrics I noticed that `beacon_blobs_from_el_expected` and `beacon_blobs_from_el_received_total` have different buckets, this PR adds more buckets to both (to prepare for Fusaka) and make them both consistent.
2025-08-01 18:27:53 +00:00
Jimmy Chen
2aae08a8aa Remove KZG verification on blobs fetched from the EL (#7771)
Continuation of #7713, addresses comment about skipping KZG verification on EL fetched blobs:

https://github.com/sigp/lighthouse/pull/7713#discussion_r2198542501
2025-07-25 06:49:50 +00:00
Jimmy Chen
4daa015971 Remove peer sampling code (#7768)
Peer sampling has been completely removed from the spec. This PR removes our partial implementation from the codebase.
https://github.com/ethereum/consensus-specs/pull/4393
2025-07-23 03:24:45 +00:00
Michael Sproul
ce99e0c383 Refine delayed head block logging (#7705)
Small tweak to `Delayed head block` logging to make it more representative of actual issues. Previously we used the total import delay to determine whether a block was late, but this includes the time taken for IO (and now hdiff computation) which happens _after_ the block is made attestable.

This PR changes the logic to use the attestable delay (where possible) falling back to the previous value if the block doesn't have one; e.g. if it didn't meet the conditions to make it into the attestable cache.
2025-07-23 00:29:18 +00:00
Eitan Seri-Levi
db8b6be9df Data column custody info (#7648)
#7647


  Introduces a new record in the blobs db `DataColumnCustodyInfo`

When `DataColumnCustodyInfo` exists in the db this indicates that a recent cgc change has occurred and/or that a custody backfill sync is currently in progress (custody backfill will be added as a separate PR). When a cgc change has occurred `earliest_available_slot` will be equal to the slot at which the cgc change occured. During custody backfill sync`earliest_available_slot` should be updated incrementally as it progresses.

~~Note that if `advertise_false_custody_group_count` is enabled we do not add a `DataColumnCustodyInfo` record in the db as that would affect the status v2 response.~~
(See comment https://github.com/sigp/lighthouse/pull/7648#discussion_r2212403389)

~~If `DataColumnCustodyInfo` doesn't exist in the db this indicates that we have fulfilled our custody requirements up to the DA window.~~
(It now always exist, and the slot will be set to `None` once backfill is complete)

StatusV2 now uses `DataColumnCustodyInfo` to calculate the `earliest_available_slot` if a `DataColumnCustodyInfo` record exists in the db, if it's `None`, then we return the `oldest_block_slot`.
2025-07-22 13:30:30 +00:00
Jimmy Chen
b48879a566 Remove KZG verification from local block production and blobs fetched from the EL (#7713)
#7700


  As described in title, the EL already performs KZG verification on all blobs when they entered the mempool, so it's redundant to perform extra validation on blobs returned from the EL.

This PR removes
- KZG verification for both blobs and data columns during block production
- KZG verification for data columns after fetch engine blobs call. I have not done this for blobs because it requires extra changes to check the observed cache, and doesn't feel like it's a worthy optimisation given the number of blobs per block.

This PR does not remove KZG verification on the block publishing path yet.
2025-07-22 10:48:49 +00:00
ethDreamer
b43e0b446c Final changes for fusaka-devnet-2 (#7655)
Closes #7467.

This PR primarily addresses [the P2P changes](https://github.com/ethereum/EIPs/pull/9840) in [fusaka-devnet-2](https://fusaka-devnet-2.ethpandaops.io/). Specifically:

* [the new `nfd` parameter added to the `ENR`](https://github.com/ethereum/EIPs/pull/9840)
* [the modified `compute_fork_digest()` changes for every BPO fork](https://github.com/ethereum/EIPs/pull/9840)

90% of this PR was absolutely hacked together as fast as possible during the Berlinterop as fast as I could while running between Glamsterdam debates. Luckily, it seems to work. But I was unable to be as careful in avoiding bugs as I usually am. I've cleaned up the things *I remember* wanting to come back and have a closer look at. But still working on this.

Progress:
* [x] get it working on `fusaka-devnet-2`
* [ ] [*optional* disconnect from peers with incorrect `nfd` at the fork boundary](https://github.com/ethereum/consensus-specs/pull/4407) - Can be addressed in a future PR if necessary
* [x] first pass clean-up
* [x] fix up all the broken tests
* [x] final self-review
* [x] more thorough review from people more familiar with affected code
2025-07-10 21:32:58 +00:00
Jimmy Chen
3826fe91f4 Improve data column KZG verification metric buckets (#7717)
The current data column KZG verification buckets are not giving us useful info as the upper bound is too low. And we see most of numbers above 70ms for batch verification, and we don't know how much time it really takes.

This PR improves the buckets based on the numbers we got from testing. Exponential bucket seems like a good candidate here given we're expecting to increase blob count with a similar approach (possibly 2x each fork if it goes well).
2025-07-10 20:59:45 +00:00
Eitan Seri-Levi
bd8a2a8ffb Gossip recently computed light client data (#7023) 2025-07-08 07:07:10 +00:00
Michael Sproul
c7bb3b00e4 Fix lookups of the block at oldest_block_slot (#7693)
Closes:

- https://github.com/sigp/lighthouse/issues/7690


  Another checkpoint sync related fix! See issue for a description of the bug.

We fix it by just loading the block root of the `oldest_block_slot`, rather than trying to load the slot prior, which will always fail.
2025-07-02 23:40:04 +00:00
Michael Sproul
a459a9af98 Fix and test checkpoint sync from genesis (#7689)
Fix a bug involving checkpoint sync from genesis reported by Sunnyside labs.


  Ensure that the store's `anchor` is initialised prior to storing the genesis state. In the case of checkpoint sync from genesis, the genesis state will be in the _hot DB_, so we need the hot DB metadata to be initialised in order to store it.

I've extended the existing checkpoint sync tests to cover this case as well. There are some subtleties around what the `state_upper_limit` should be set to in this case. I've opted to just enable state reconstruction from the start in the test so it gets set to 0, which results in an end state more consistent with the other test cases (full state reconstruction). This is required because we can't meaningfully do any state reconstruction when the split slot is 0 (there is no range of frozen slots to reconstruct).
2025-07-02 04:50:33 +00:00
Jimmy Chen
fcc602a787 Update fulu network configs and add MIN_EPOCHS_FOR_DATA_COLUMN_SIDECARS_REQUESTS (#7646)
- #6240
- Bring built-in network configs up to date with latest consensus-spec PeerDAS configs.
- Add `MIN_EPOCHS_FOR_DATA_COLUMN_SIDECARS_REQUESTS` and use it to determine data availability window after the Fulu fork.
2025-07-02 02:38:25 +00:00
Jimmy Chen
41742ce2bd Update SAMPLES_PER_SLOT to be number of custody groups instead of data columns (#7683)
Update `SAMPLES_PER_SLOT` to be number of custody groups instead of data columns. This should have no impact on the current implementation as config currently maintains a `group:subnet:column` ratio of `1:1:1`.  **In short, this PR doesn't change anything for Fusaka, but ensures compliance with the spec and potential future changes.**

I've added separate methods to compute sampling columns and custody groups for clarity: `spec.sampling_size_columns` and `spec.sampling_size_custod_groups`

See the clarifications in this PR for more details: https://github.com/ethereum/consensus-specs/pull/4251
2025-07-02 00:08:40 +00:00
Pawan Dhananjay
e305cb1b92 Custody persist fix (#7661)
N/A


  Persist the epoch -> cgc values. This is to ensure that `ValidatorRegistrations::latest_validator_custody_requirement` always returns a `Some` value post restart assuming the `epoch_validator_custody_requirements` map has been updated in the previous runs.
2025-07-01 06:06:37 +00:00
Michael Sproul
c1f94d9b7b Test database schema stability (#7669)
This PR implements some heuristics to check for breaking database changes. The goal is to prevent accidental changes to the database schema occurring without a version bump.
2025-06-30 10:55:47 +00:00
Nikola Ristić
2d759f78be Fix beacon_chain metrics descriptions (#6576)
n/a


  Just fixed a small mistake in the metrics description.
2025-06-30 00:11:14 +00:00
Odinson
6ea5f14b39 feat: better error message for light_client/bootstrap endpoint (#7597)
Fixes #7229
2025-06-28 03:55:39 +00:00
Jimmy Chen
83cad25d98 Fix Rust 1.88 clippy errors & execution engine tests (#7657)
Fix Rust 1.88 clippy errors.
2025-06-27 18:21:17 +00:00
Pawan Dhananjay
9b1f3ed9d1 Add gossip check (#7652)
N/A


  Add an additional gossip condition.
2025-06-27 00:26:38 +00:00
chonghe
8e3c5d1524 Rust 1.89 compiler lint fix (#7644)
Fix lints for Rust 1.89 beta compiler
2025-06-25 05:33:17 +00:00
Eitan Seri-Levi
56b2d4b525 Remove instrumenting log level (#7636)
https://github.com/sigp/lighthouse/issues/7155


  Theres some additional places we set instrumenting log levels that wasn't covered in #7620
2025-06-24 06:29:10 +00:00
Pawan Dhananjay
11bcccb353 Remove all prod eth1 related code (#7133)
N/A


  After the electra fork which includes EIP 6110, the beacon node no longer needs the eth1 bridging mechanism to include new deposits as they are provided by the EL as a `deposit_request`. So after electra + a transition period where the finalized bridge deposits pre-fork are included through the old mechanism, we no longer need the elaborate machinery we had to get deposit contract data from the execution layer.

Since holesky has already forked to electra and completed the transition period, this PR basically checks to see if removing all the eth1 related logic leads to any surprises.
2025-06-23 03:00:07 +00:00
Age Manning
d50924677a Remove instrumenting log level (#7620)
I think this should resolve #7155


This removes the level field from the instrumenting we were doing across a range of functions. The level will now default to the level of the log.
2025-06-20 07:44:59 +00:00