Follow-up to the bug fixed in:
- https://github.com/sigp/lighthouse/pull/8121
This fixes the root cause of that bug, which was introduced by me in:
- https://github.com/sigp/lighthouse/pull/8101
Lion identified the issue here:
- https://github.com/sigp/lighthouse/pull/8101#discussion_r2382710356
In the methods that compute the proposer shuffling decision root, ensure we don't use lookahead for the Fulu fork epoch itself. This is accomplished by checking if Fulu is enabled at `epoch - 1`, i.e. if `epoch > fulu_fork_epoch`.
I haven't updated the methods that _compute_ shufflings to use these new corrected bounds (e.g. `BeaconState::compute_proposer_indices`), although we could make this change in future. The `get_beacon_proposer_indices` method already gracefully handles the Fulu boundary case by using the `proposer_lookahead` field (if initialised).
Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
As identified by a researcher during the Fusaka security competition, we were computing the proposer index incorrectly in some places by computing without lookahead.
- [x] Add "low level" checks to computation functions in `consensus/types` to ensure they error cleanly
- [x] Re-work the determination of proposer shuffling decision roots, which are now fork aware.
- [x] Re-work and simplify the beacon proposer cache to be fork-aware.
- [x] Optimise `with_proposer_cache` to use `OnceCell`.
- [x] All tests passing.
- [x] Resolve all remaining `FIXME(sproul)`s.
- [x] Unit tests for `ProtoBlock::proposer_shuffling_root_for_child_block`.
- [x] End-to-end regression test.
- [x] Test on pre-Fulu network.
- [x] Test on post-Fulu network.
Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
- PR https://github.com/sigp/lighthouse/pull/8045 introduced a regression of how lookup sync interacts with the da_checker.
Now in unstable block import from the HTTP API also insert the block in the da_checker while the block is being execution verified. If lookup sync finds the block in the da_checker in `NotValidated` state it expects a `GossipBlockProcessResult` message sometime later. That message is only sent after block import in gossip.
I confirmed in our node's logs for 4/4 cases of stuck lookups are caused by this sequence of events:
- Receive block through API, insert into da_checker in fn process_block in put_pre_execution_block
- Create lookup and leave in AwaitingDownload(block in processing cache) state
- Block from HTTP API finishes importing
- Lookup is left stuck
Closes https://github.com/sigp/lighthouse/issues/8104
- https://github.com/sigp/lighthouse/pull/8110 was my initial solution attempt but we can't send the `GossipBlockProcessResult` event from the `http_api` crate without adding new channels, which seems messy.
For a given node it's rare that a lookup is created at the same time that a block is being published. This PR solves https://github.com/sigp/lighthouse/issues/8104 by allowing lookup sync to import the block twice in that case.
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
The default debug output of these types contains a lot of unnecessary noise making it hard to read.
This PR removes the type and extra fields from debug output to make logs easier to read.
`len` could be potentially useful in some cases, but this gives us flexibility to only log it separately if we need it.
Related PR in `ssz_types`:
- https://github.com/sigp/ssz_types/pull/57
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
Fixes the issue described in #7980 where Lighthouse repeatedly sends `DataColumnsByRoot` requests to the same peers that return empty responses, causing sync to get stuck.
The root cause was we don't count empty responses as failures, leading to excessive retries to unresponsive peers.
- Track per peer attempts to limit retry attempts per peer (`MAX_CUSTODY_PEER_ATTEMPTS = 3`)
- Replaced random peer selection with hashing within each lookup to prevent splitting lookup into too many small requests and improve request batching efficiency.
- Added `single_block_lookup` root span to track all lookups created and added more debug logs:
<img width="1264" height="501" alt="image" src="https://github.com/user-attachments/assets/983629ba-b6d0-41cf-8e93-88a5b96c2f31" />
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
* #6610
- [x] Add `beacon_committee_selections` endpoint
- [x] Test beacon committee aggregator and confirmed working
- [x] Add `sync_committee_selections` endpoint
- [x] Test sync committee aggregator and confirmed working
This method is a footgun because it truncates the list. It is the source of a recent bug:
- https://github.com/sigp/lighthouse/pull/7927
- Delete uses of `RuntimeVariableList::from_vec` and replace them with `::new` which does validation and can fail.
- Propagate errors where possible, unwrap in tests and use `expect` for obviously-safe uses (in `chain_spec.rs`).
Adds the required boilerplate code for the Gloas (Glamsterdam) hard fork. This allows PRs testing Gloas-candidate features to test fork transition.
This also includes de-duplication of post-Bellatrix readiness notifiers from #6797 (credit to @dapplion)
Fixes#7926
This was a bug I introduced in #7890 and @pawanjay176 noticed it on some running nodes, and added a rpc test to confirm it.
The culprit is this line, where I failed to fill the vec to it's max size, so it doesn't calculate the max size properly, resulting in all `DataColumnByRoot` requests exceeding the max size during validation:
d24a6d2a45/consensus/types/src/chain_spec.rs (L1984)
The PR fixes this and includes new regression tests for this fix.
- Remove explicit `env_logger` usage from `state_processing` tests and `lcli`.
- Set up tracing correctly for `lcli` (I've checked that we can see logs after this change).
- I didn't do anything to set up logging for the `state_processing` tests, as these are rarely run manually (they never fail). We could add `test_logger` in there on an as-needed basis.
Re-export `context_deserialize_derive` inside of `context_deserialize` so they are both available from the same interface, which matches how popular crates (like `serde`) handle this.
This also nests both crates inside a new `context_deserialize` directory which will make it easier to eventually spin out into a different repo (if/when) we decide to do that (plus I prefer it aesthetically).
Attempt to fix this error reported by `beaconcha.in` on their Hoodi archive nodes:
> {"code":500,"message":"UNHANDLED_ERROR: DBError(CacheBuildError(BeaconState(MilhouseError(OutOfBoundsIterFrom { index: 1199549, len: 1060000 }))))","stacktraces":[]}
There are only a handful of places where we call `iter_from`.
This one is safe by construction (the check immediately prior ensures `self.pubkeys.len()` is not out of bounds):
cfb1f73310/beacon_node/beacon_chain/src/validator_pubkey_cache.rs (L84-L90)
This one should also be safe, and the indexes used here would not be as large as the ones in the reported error:
cfb1f73310/consensus/state_processing/src/per_epoch_processing/single_pass.rs (L365-L368)
Which leaves one remaining usage which must be the culprit:
cfb1f73310/consensus/types/src/beacon_state.rs (L2109-L2113)
This indexing relies on the invariant that `self.pubkey_cache().len() <= self.validators.len()`. We mostly maintain that invariant, except for in `rebase_caches_on` (fixed in this PR).
The other bug, is that we were calling `rebase_on_finalized` for all "hot" states, which post-v7.1.0 includes states prior to the split which are required by the hdiff grid. This is how we end up calling something like `genesis_state.rebase_on(&split_state)`, which then corrupts the pubkey cache of the genesis state using the newer pubkey cache from the split state.
#7815
- removes all existing spans, so some span fields that appear in logs like `service_name` may be lost.
- instruments a few key code paths in the beacon node, starting from **root spans** named below:
* Gossip block and blobs
* `process_gossip_data_column_sidecar`
* `process_gossip_blob`
* `process_gossip_block`
* Rpc block and blobs
* `process_rpc_block`
* `process_rpc_blobs`
* `process_rpc_custody_columns`
* Rpc blocks (range and backfill)
* `process_chain_segment`
* `PendingComponents` lifecycle
* `pending_components`
To test locally:
* Run Grafana and Tempo with https://github.com/sigp/lighthouse-metrics/pull/57
* Run Lighthouse BN with `--telemetry-collector-url http://localhost:4317`
Some captured traces can be found here: https://hackmd.io/@jimmygchen/r1sLOxPPeg
Removing the old spans seem to have reduced the memory usage quite a lot - i think we were using them on long running tasks and too excessively:
<img width="910" height="495" alt="image" src="https://github.com/user-attachments/assets/5208bbe4-53b2-4ead-bc71-0b782c788669" />
This PR fixes a bug where wrong columns could get processed immediately after a CGC increase.
Scenario:
- The node's CGC increased due to additional validators attached to it (lets say from 10 to 11)
- The new CGC is advertised and new subnets are subscribed immediately, however the change won't be effective in the data availability check until the next epoch (See [this](ab0e8870b4/beacon_node/beacon_chain/src/validator_custody.rs (L93-L99))). Data availability checker still only require 10 columns for the current epoch.
- During this time, data columns for the additional custody column (lets say column 11) may arrive via gossip as we're already subscribed to the topic, and it may be incorrectly used to satisfy the existing data availability requirement (10 columns), and result in this additional column (instead of a required one) getting persisted, resulting in database inconsistency.
- In shuffling, a the raw_pivot (u64) is cast to a usize which will break on 32 bit systems. Now it is modulo'ed with the list_size first then cast to a usize.
- ruint doesn't implement shifting with u64's on 32-bit arch. Since `prefix_bits` is u8 and NODE_ID_BITS = 256, we use them as u32's instead.
See: https://docs.rs/ruint/latest/src/ruint/bits.rs.html#711
Fix Clippy for recently released Rust 1.90 beta. There may be more changes required when Rust 1.89 stable is released in a few days, but possibly not 🤞
I noticed that we are serving preset values for Fulu on mainnet nodes prior to the fork. This has already gone live in v7.1.0, but should hopefully be handled in a graceful way by API consumers.
This PR _reverts_ the serving of Fulu data prior to Fulu, by serving Fulu data only if Fulu is scheduled.
Alternative to:
- https://github.com/sigp/lighthouse/pull/7758
Serve the `blob_schedule` field on `/eth/v1/config/spec` _only_ when Fulu is enabled. If the blob schedule is empty, we will still serve it as `[]`, so long as Fulu is enabled.
N/A
Serializes the blob_schedule in ascending order to match other clients. This is needed to keep the output of `eth/v1/config/spec` http endpoint consistent across clients.
cc @barnabasbusa
Closes#7467.
This PR primarily addresses [the P2P changes](https://github.com/ethereum/EIPs/pull/9840) in [fusaka-devnet-2](https://fusaka-devnet-2.ethpandaops.io/). Specifically:
* [the new `nfd` parameter added to the `ENR`](https://github.com/ethereum/EIPs/pull/9840)
* [the modified `compute_fork_digest()` changes for every BPO fork](https://github.com/ethereum/EIPs/pull/9840)
90% of this PR was absolutely hacked together as fast as possible during the Berlinterop as fast as I could while running between Glamsterdam debates. Luckily, it seems to work. But I was unable to be as careful in avoiding bugs as I usually am. I've cleaned up the things *I remember* wanting to come back and have a closer look at. But still working on this.
Progress:
* [x] get it working on `fusaka-devnet-2`
* [ ] [*optional* disconnect from peers with incorrect `nfd` at the fork boundary](https://github.com/ethereum/consensus-specs/pull/4407) - Can be addressed in a future PR if necessary
* [x] first pass clean-up
* [x] fix up all the broken tests
* [x] final self-review
* [x] more thorough review from people more familiar with affected code
Update `c-kzg` from `v1` to `v2`. My motivation here is that `alloy-consensus` now uses `c-kzg` in `v2` and this results in a conflict when using lighthouse in combination with latest alloy. I tried also to disable the `czkg` feature in alloy, but the conflict persisted.
See here for the alloy update to `c-kzg v2`: https://github.com/alloy-rs/alloy/pull/2240
Error:
```
error: failed to select a version for `c-kzg`.
...
versions that meet the requirements `^1` are: 1.0.3, 1.0.2, 1.0.0
the package `c-kzg` links to the native library `ckzg`, but it conflicts with a previous package which links to `ckzg` as well:
package `c-kzg v2.1.0`
... which satisfies dependency `c-kzg = "^2.1"` of package `alloy-consensus v0.13.0`
... which satisfies dependency `alloy-consensus = "^0.13.0"` of package ...
...
```
- Upgrade `alloy-consensus` to `0.14.0` and disable all default features
- Upgrade `c-kzg` to `v2.1.0`
- Upgrade `alloy-primitives` to `1.0.0`
- Adapt the code to the new API `c-kzg`
- There is now `NO_PRECOMPUTE` as my understand from https://github.com/ethereum/c-kzg-4844/pull/545/files we should use `0` here as `new_from_trusted_setup_no_precomp` does not precomp. But maybe it is misleading. For all other places I used `RECOMMENDED_PRECOMP_WIDTH` because `8` is matching the recommendation.
- `BYTES_PER_G1_POINT` and `BYTES_PER_G2_POINT` are no longer public in `c-kzg`
- I adapted two tests that checking for the `Attestation` bitfield size. But I could not pinpoint to what has changed and why now 8 bytes less. I would be happy about any hint, and if this is correct. I found related a PR here: https://github.com/sigp/lighthouse/pull/6915
- Use same fields names, in json, as well as `c-kzg` and `rust_eth_kzg` for `g1_monomial`, `g1_lagrange`, and `g2_monomial`
- #6240
- Bring built-in network configs up to date with latest consensus-spec PeerDAS configs.
- Add `MIN_EPOCHS_FOR_DATA_COLUMN_SIDECARS_REQUESTS` and use it to determine data availability window after the Fulu fork.
Update `SAMPLES_PER_SLOT` to be number of custody groups instead of data columns. This should have no impact on the current implementation as config currently maintains a `group:subnet:column` ratio of `1:1:1`. **In short, this PR doesn't change anything for Fusaka, but ensures compliance with the spec and potential future changes.**
I've added separate methods to compute sampling columns and custody groups for clarity: `spec.sampling_size_columns` and `spec.sampling_size_custod_groups`
See the clarifications in this PR for more details: https://github.com/ethereum/consensus-specs/pull/4251
#6970
This allows for us to receive `SingleAttestation` over gossip and process it without converting. There is still a conversion to `Attestation` as a final step in the attestation verification process, but by then the `SingleAttestation` is fully verified.
I've also fully removed the `submitPoolAttestationsV1` endpoint as its been deprecated
I've also pre-emptively deprecated supporting `Attestation` in `submitPoolAttestationsV2` endpoint. See here for more info: https://github.com/ethereum/beacon-APIs/pull/531
I tried to the minimize the diff here by only making the "required" changes. There are some unnecessary complexities with the way we manage the different attestation verification wrapper types. We could probably consolidate this to one wrapper type and refactor this even further. We could leave that to a separate PR if we feel like cleaning things up in the future.
Note that I've also updated the test harness to always submit `SingleAttestation` regardless of fork variant. I don't see a problem in that approach and it allows us to delete more code :)
Resolves#6767
This PR implements a basic version of validator custody.
- It introduces a new `CustodyContext` object which contains info regarding number of validators attached to a node and the custody count they contribute to the cgc.
- The `CustodyContext` is added in the da_checker and has methods for returning the current cgc and the number of columns to sample at head. Note that the logic for returning the cgc existed previously in the network globals.
- To estimate the number of validators attached, we use the `beacon_committee_subscriptions` endpoint. This might overestimate the number of validators actually publishing attestations from the node in the case of multi BN setups. We could also potentially use the `publish_attestations` endpoint to get a more conservative estimate at a later point.
- Anytime there's a change in the `custody_group_count` due to addition/removal of validators, the custody context should send an event on a broadcast channnel. The only subscriber for the channel exists in the network service which simply subscribes to more subnets. There can be additional subscribers in sync that will start a backfill once the cgc changes.
TODO
- [ ] **NOT REQUIRED:** Currently, the logic only handles an increase in validator count and does not handle a decrease. We should ideally unsubscribe from subnets when the cgc has decreased.
- [ ] **NOT REQUIRED:** Add a service in the `CustodyContext` that emits an event once `MIN_EPOCHS_FOR_BLOB_SIDECARS_REQUESTS ` passes after updating the current cgc. This event should be picked up by a subscriber which updates the enr and metadata.
- [x] Add more tests
Fix clippy lints for `rustc` 1.87
clippy complains about `BeaconChainError` being too large. I went on a bit of a boxing spree because of this. We may instead want to `Box` some of the `BeaconChainError` variants?
- Create trait `ValidatorStore` with all functions used by the `validator_services`
- Make `validator_services` generic on `S: ValidatorStore`
- Introduce `LighthouseValidatorStore`, which has identical functionality to the old `ValidatorStore`
- Remove dependencies (especially `environment`) from `validator_services` and `beacon_node_fallback` in order to be able to cleanly use them in Anchor