#7181
Instead of storing the network key as binary data we store it as hex, allowing users to modify it via the file.
We can read old-binary forms, however we will migrate binary to hex as it will be the new standard.
Attempt to fix this error reported by `beaconcha.in` on their Hoodi archive nodes:
> {"code":500,"message":"UNHANDLED_ERROR: DBError(CacheBuildError(BeaconState(MilhouseError(OutOfBoundsIterFrom { index: 1199549, len: 1060000 }))))","stacktraces":[]}
There are only a handful of places where we call `iter_from`.
This one is safe by construction (the check immediately prior ensures `self.pubkeys.len()` is not out of bounds):
cfb1f73310/beacon_node/beacon_chain/src/validator_pubkey_cache.rs (L84-L90)
This one should also be safe, and the indexes used here would not be as large as the ones in the reported error:
cfb1f73310/consensus/state_processing/src/per_epoch_processing/single_pass.rs (L365-L368)
Which leaves one remaining usage which must be the culprit:
cfb1f73310/consensus/types/src/beacon_state.rs (L2109-L2113)
This indexing relies on the invariant that `self.pubkey_cache().len() <= self.validators.len()`. We mostly maintain that invariant, except for in `rebase_caches_on` (fixed in this PR).
The other bug, is that we were calling `rebase_on_finalized` for all "hot" states, which post-v7.1.0 includes states prior to the split which are required by the hdiff grid. This is how we end up calling something like `genesis_state.rebase_on(&split_state)`, which then corrupts the pubkey cache of the genesis state using the newer pubkey cache from the split state.
This PR adds a created_timestamp to the beacon processor send channel. When work items are sent through that channel `try_send` will forward the work event along with the current timestamp to the beacon processor. When the work event is completed the `Drop` impl for `SendOnDrop` will track the time it took from work event creation to its completion. Previously we only had data on how long a work event took to process, but didn't have data on how long it sat in the queue + how long it took to process.
N/A
Add a flag to disable get blobs. I configured the flag to disable it regardless of version because its most likely something we use for testing anyway.
#7815
- removes all existing spans, so some span fields that appear in logs like `service_name` may be lost.
- instruments a few key code paths in the beacon node, starting from **root spans** named below:
* Gossip block and blobs
* `process_gossip_data_column_sidecar`
* `process_gossip_blob`
* `process_gossip_block`
* Rpc block and blobs
* `process_rpc_block`
* `process_rpc_blobs`
* `process_rpc_custody_columns`
* Rpc blocks (range and backfill)
* `process_chain_segment`
* `PendingComponents` lifecycle
* `pending_components`
To test locally:
* Run Grafana and Tempo with https://github.com/sigp/lighthouse-metrics/pull/57
* Run Lighthouse BN with `--telemetry-collector-url http://localhost:4317`
Some captured traces can be found here: https://hackmd.io/@jimmygchen/r1sLOxPPeg
Removing the old spans seem to have reduced the memory usage quite a lot - i think we were using them on long running tasks and too excessively:
<img width="910" height="495" alt="image" src="https://github.com/user-attachments/assets/5208bbe4-53b2-4ead-bc71-0b782c788669" />
The current `OVERFLOW_LRU_CAPACITY` of `1024` seems a bit excessive now we rarely store more than 1 `PendingComponents` (under normal networking components). Additionally given the blob count increases, the max size of `PendingComponents` has also increased and is expected to increase further.
This PR brings the max capacity of the cache down to `64`, which should be more than enough headroom but also give us better protection from the network.
This PR fixes a bug where wrong columns could get processed immediately after a CGC increase.
Scenario:
- The node's CGC increased due to additional validators attached to it (lets say from 10 to 11)
- The new CGC is advertised and new subnets are subscribed immediately, however the change won't be effective in the data availability check until the next epoch (See [this](ab0e8870b4/beacon_node/beacon_chain/src/validator_custody.rs (L93-L99))). Data availability checker still only require 10 columns for the current epoch.
- During this time, data columns for the additional custody column (lets say column 11) may arrive via gossip as we're already subscribed to the topic, and it may be incorrectly used to satisfy the existing data availability requirement (10 columns), and result in this additional column (instead of a required one) getting persisted, resulting in database inconsistency.
Fix Clippy for recently released Rust 1.90 beta. There may be more changes required when Rust 1.89 stable is released in a few days, but possibly not 🤞
While looking at metrics I noticed that `beacon_blobs_from_el_expected` and `beacon_blobs_from_el_received_total` have different buckets, this PR adds more buckets to both (to prepare for Fusaka) and make them both consistent.
I noticed that we are serving preset values for Fulu on mainnet nodes prior to the fork. This has already gone live in v7.1.0, but should hopefully be handled in a graceful way by API consumers.
This PR _reverts_ the serving of Fulu data prior to Fulu, by serving Fulu data only if Fulu is scheduled.
Small tweak to `Delayed head block` logging to make it more representative of actual issues. Previously we used the total import delay to determine whether a block was late, but this includes the time taken for IO (and now hdiff computation) which happens _after_ the block is made attestable.
This PR changes the logic to use the attestable delay (where possible) falling back to the previous value if the block doesn't have one; e.g. if it didn't meet the conditions to make it into the attestable cache.
#7647
Introduces a new record in the blobs db `DataColumnCustodyInfo`
When `DataColumnCustodyInfo` exists in the db this indicates that a recent cgc change has occurred and/or that a custody backfill sync is currently in progress (custody backfill will be added as a separate PR). When a cgc change has occurred `earliest_available_slot` will be equal to the slot at which the cgc change occured. During custody backfill sync`earliest_available_slot` should be updated incrementally as it progresses.
~~Note that if `advertise_false_custody_group_count` is enabled we do not add a `DataColumnCustodyInfo` record in the db as that would affect the status v2 response.~~
(See comment https://github.com/sigp/lighthouse/pull/7648#discussion_r2212403389)
~~If `DataColumnCustodyInfo` doesn't exist in the db this indicates that we have fulfilled our custody requirements up to the DA window.~~
(It now always exist, and the slot will be set to `None` once backfill is complete)
StatusV2 now uses `DataColumnCustodyInfo` to calculate the `earliest_available_slot` if a `DataColumnCustodyInfo` record exists in the db, if it's `None`, then we return the `oldest_block_slot`.
#7700
As described in title, the EL already performs KZG verification on all blobs when they entered the mempool, so it's redundant to perform extra validation on blobs returned from the EL.
This PR removes
- KZG verification for both blobs and data columns during block production
- KZG verification for data columns after fetch engine blobs call. I have not done this for blobs because it requires extra changes to check the observed cache, and doesn't feel like it's a worthy optimisation given the number of blobs per block.
This PR does not remove KZG verification on the block publishing path yet.
N/A
During building an enr on startup, we weren't using the value in the custody context.
This was resulting in the enr value getting updated when the cgc updates, the change getting persisted, but getting set back to the default on restart.
This PR takes the value explicitly from the custody context.
Closes#6855
Add PeerDAS broadcast validation tests and fix a small bug where `sampling_columns_indices` is none (indicating that we've already sampled the necessary columns) and `process_gossip_data_columns` gets called
N/A
Lighthouse BN http endpoint would return a server error pre-genesis on the `validator/duties/attester` and `validator/prepare_beacon_proposer` because `slot_clock.now()` would return a `None` pre-genesis.
The prysm VC depends on the endpoints pre-genesis and was having issues interoping with the lighthouse bn because of this reason.
The proposer duties endpoint explicitly handles the pre-genesis case here
538067f1ff/beacon_node/http_api/src/proposer_duties.rs (L23-L28)
I see no reason why we can't make the other endpoints more flexible to work pre-genesis. This PR handles the pre-genesis case on the attester and prepare_beacon_proposer endpoints as well.
Thanks for raising @james-prysm.
Which issue # does this PR address?
Closes#7604
Improvements to range sync including:
1. Contain column requests only to peers that are part of the SyncingChain
2. Attribute the fault to the correct peer and downscore them if they don't return the data columns for the request
3. Improve sync performance by retrying only the failed columns from other peers instead of failing the entire batch
4. Uses the earliest_available_slot to make requests to peers that claim to have the epoch. Note: if no earliest_available_slot info is available, fallback to using previous logic i.e. assume peer has everything backfilled upto WS checkpoint/da boundary
Tested this on fusaka-devnet-2 with a full node and supernode and the recovering logic seems to works well.
Also tested this a little on mainnet.
Need to do more testing and possibly add some unit tests.
Closes#7467.
This PR primarily addresses [the P2P changes](https://github.com/ethereum/EIPs/pull/9840) in [fusaka-devnet-2](https://fusaka-devnet-2.ethpandaops.io/). Specifically:
* [the new `nfd` parameter added to the `ENR`](https://github.com/ethereum/EIPs/pull/9840)
* [the modified `compute_fork_digest()` changes for every BPO fork](https://github.com/ethereum/EIPs/pull/9840)
90% of this PR was absolutely hacked together as fast as possible during the Berlinterop as fast as I could while running between Glamsterdam debates. Luckily, it seems to work. But I was unable to be as careful in avoiding bugs as I usually am. I've cleaned up the things *I remember* wanting to come back and have a closer look at. But still working on this.
Progress:
* [x] get it working on `fusaka-devnet-2`
* [ ] [*optional* disconnect from peers with incorrect `nfd` at the fork boundary](https://github.com/ethereum/consensus-specs/pull/4407) - Can be addressed in a future PR if necessary
* [x] first pass clean-up
* [x] fix up all the broken tests
* [x] final self-review
* [x] more thorough review from people more familiar with affected code
The current data column KZG verification buckets are not giving us useful info as the upper bound is too low. And we see most of numbers above 70ms for batch verification, and we don't know how much time it really takes.
This PR improves the buckets based on the numbers we got from testing. Exponential bucket seems like a good candidate here given we're expecting to increase blob count with a similar approach (possibly 2x each fork if it goes well).
Lighthouse is currently loggign a lot errors in the `RPC` behaviour whenever a response is received for a request_id that no longer exists in active_inbound_requests. This is likely due to a data race or timing issue (e.g., the peer disconnecting before the response is handled).
This PR addresses that by removing the error logging from the RPC layer. Instead, RPC::send_response now simply returns an Err, shifting the responsibility to the main service. The main service can then determine whether the peer is still connected and only log an error if the peer remains connected.
Thanks @ackintosh for helping debug!
Reintroduce the `--logfile` flag with a deprecation warning so that it doesn't prevent nodes from starting. This is considered preferable to breaking node startups so that users fix the flag, even though it means the `--logfile` flag is completely ineffective.
The flag was initially removed in:
- https://github.com/sigp/lighthouse/pull/6339
Fixes#7155.
It turns out the issue is caused by calling a function that creates an info span (`chain.id()` here), e.g.
```rust
debug!(id = chain.id(), ?sync_type, reason = ?remove_reason, op, "Chain removed");
```
I've remove all unneeded spans, especially getter functions - there's little reasons for span and they often get used in logging. We should also revisit all the spans after the release - i think we could make them more useful than they are today.
I've let it run for a while and no longer seeing any `DEBUG` logs.
Closes:
- https://github.com/sigp/lighthouse/issues/7690
Another checkpoint sync related fix! See issue for a description of the bug.
We fix it by just loading the block root of the `oldest_block_slot`, rather than trying to load the slot prior, which will always fail.
Fix a bug involving checkpoint sync from genesis reported by Sunnyside labs.
Ensure that the store's `anchor` is initialised prior to storing the genesis state. In the case of checkpoint sync from genesis, the genesis state will be in the _hot DB_, so we need the hot DB metadata to be initialised in order to store it.
I've extended the existing checkpoint sync tests to cover this case as well. There are some subtleties around what the `state_upper_limit` should be set to in this case. I've opted to just enable state reconstruction from the start in the test so it gets set to 0, which results in an end state more consistent with the other test cases (full state reconstruction). This is required because we can't meaningfully do any state reconstruction when the split slot is 0 (there is no range of frozen slots to reconstruct).
- #6240
- Bring built-in network configs up to date with latest consensus-spec PeerDAS configs.
- Add `MIN_EPOCHS_FOR_DATA_COLUMN_SIDECARS_REQUESTS` and use it to determine data availability window after the Fulu fork.
N/A
This PR switches to using `prepare_beacon_proposer` instead of `beacon_committee_subscriptions` endpoint to register validators with the custody context.
We currently use the `beacon_committee_subscriptions` endpoint for registering validators in the custody context.
Using the subscriptions endpoint has a few disadvantages:
1. The lighthouse VC tries to optimise the number of calls it makes to this endpoint to reduce the load on the subscriptions endpoint. So we would be getting different a subset of the total number of validators in each call. This will lead to a ramp up of the validator custody units instead of a one time bump. For e.g. see these logs
```
Jun 30 22:36:05.012 DEBUG Validator count at head updated old_count: 0, new_count: 19
Jun 30 22:36:11.016 DEBUG Validator count at head updated old_count: 19, new_count: 24
Jun 30 22:36:17.017 DEBUG Validator count at head updated old_count: 24, new_count: 27
Jun 30 22:36:23.020 DEBUG Validator count at head updated old_count: 27, new_count: 32
Jun 30 22:36:29.016 DEBUG Validator count at head updated old_count: 32, new_count: 36
Jun 30 22:36:35.005 DEBUG Validator count at head updated old_count: 36, new_count: 42
Jun 30 22:36:41.014 DEBUG Validator count at head updated old_count: 42, new_count: 44
Jun 30 22:36:47.017 DEBUG Validator count at head updated old_count: 44, new_count: 46
Jun 30 22:36:53.007 DEBUG Validator count at head updated old_count: 46, new_count: 48
Jun 30 22:36:59.009 DEBUG Validator count at head updated old_count: 48, new_count: 49
Jun 30 22:37:05.014 DEBUG Validator count at head updated old_count: 49, new_count: 50
Jun 30 22:37:11.007 DEBUG Validator count at head updated old_count: 50, new_count: 53
Jun 30 22:37:17.007 DEBUG Validator count at head updated old_count: 53, new_count: 55
Jun 30 22:37:35.008 DEBUG Validator count at head updated old_count: 55, new_count: 58
Jun 30 22:37:41.007 DEBUG Validator count at head updated old_count: 58, new_count: 59
Jun 30 22:37:53.010 DEBUG Validator count at head updated old_count: 59, new_count: 60
Jun 30 22:38:05.013 DEBUG Validator count at head updated old_count: 60, new_count: 61
Jun 30 22:38:23.006 DEBUG Validator count at head updated old_count: 61, new_count: 62
Jun 30 22:38:29.009 DEBUG Validator count at head updated old_count: 62, new_count: 63
Jun 30 22:38:41.009 DEBUG Validator count at head updated old_count: 63, new_count: 64
```
2. Different VCs would probably have different behaviours in terms of sending subscriptions
In contrast, the `prepare_beacon_proposer` endpoint usage would be more standard across different VCs without any filtering of validators. Not doing so could mean potentially missing proposals so VCs are incentivised to make this call on any change in the validators managed by them.
Lighthouse calls this endpoint every slot.
Update `SAMPLES_PER_SLOT` to be number of custody groups instead of data columns. This should have no impact on the current implementation as config currently maintains a `group:subnet:column` ratio of `1:1:1`. **In short, this PR doesn't change anything for Fusaka, but ensures compliance with the spec and potential future changes.**
I've added separate methods to compute sampling columns and custody groups for clarity: `spec.sampling_size_columns` and `spec.sampling_size_custod_groups`
See the clarifications in this PR for more details: https://github.com/ethereum/consensus-specs/pull/4251
N/A
Persist the epoch -> cgc values. This is to ensure that `ValidatorRegistrations::latest_validator_custody_requirement` always returns a `Some` value post restart assuming the `epoch_validator_custody_requirements` map has been updated in the previous runs.
This PR implements some heuristics to check for breaking database changes. The goal is to prevent accidental changes to the database schema occurring without a version bump.
The `sync_contributions_across_fork_with_skip_slots` test have been quite flaky recently on CI, we suspect it might be caused by the recent introduction of a `default` timeout in #7400, and CI is failing to consistently process those http requests within 1s likely due to limited resources.
This PR increases the `default` timeout to 2s in tests to avoid flaky tests, but keeps the remaining timeout the same (1s).
https://github.com/sigp/lighthouse/actions/runs/15965113170/job/45023976021
```
FAIL [ 8.945s] http_api::bn_http_api_tests fork_tests::sync_contributions_across_fork_with_skip_slots
stdout ───
running 1 test
test fork_tests::sync_contributions_across_fork_with_skip_slots ... FAILED
failures:
failures:
fork_tests::sync_contributions_across_fork_with_skip_slots
test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 175 filtered out; finished in 8.91s
stderr ───
thread 'fork_tests::sync_contributions_across_fork_with_skip_slots' panicked at beacon_node/http_api/tests/fork_tests.rs:239:10:
called `Result::unwrap()` on an `Err` value: HttpClient(url: http://127.0.0.1:41793/, kind: timeout, detail: operation timed out)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
When we removed the eth1 data, I wrote a v25 schema upgrade to delete the data on disk:
- https://github.com/sigp/lighthouse/pull/7133
However, I forgot to update the current schema version, so this change was never actioned.
This PR updates the current schema version to v25 so that the migration runs.
This bug was first found and partially fixed by @VolodymyrBg in #7317 - this PR applies the same fix everywhere else.
The old logic updated the waker when it already matched the context, and did nothing when it was stale:
```rust
if waker.will_wake(cx.waker()) {
self.waker = Some(cx.waker().clone());
}
```
This is the wrong way around. We only want to update the waker if it doesn't match the current context:
```rust
if !waker.will_wake(cx.waker()) {
self.waker = Some(cx.waker().clone());
}
```
I don't think we've ever noticed any issues, but it’s a subtle bug that could lead to missed wakeups.