Take 2 of #8390.
Fixes the race condition properly instead of propagating the error. I think this is a better alternative, and doesn't seem to look that bad.
* Lift node id loading or generation from `NetworkService ` startup to the `ClientBuilder`, so that it can be used to compute custody columns for the beacon chain without waiting for Network bootstrap.
I've considered and implemented a few alternatives:
1. passing `node_id` to beacon chain builder and compute columns when creating `CustodyContext`. This approach isn't good for separation of concerns and isn't great for testability
2. passing `ordered_custody_groups` to beacon chain. `CustodyContext` only uses this to compute ordered custody columns, so we might as well lift this logic out, so we don't have to do error handling in `CustodyContext` construction. Less tests to update;.
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
The beacon API spec wasn't updated to use the Fulu definition of `dependent_root` for the proposer duties endpoint. No other client updated their logic, so to retain backwards compatibility the decision has been made to continue using the block root at the end of epoch `N - 1`, and introduce a new v2 endpoint down the track to use the correct dependent root.
Eth R&D discussion: https://discord.com/channels/595666850260713488/598292067260825641/1433036715848765562
Change the behaviour of the v1 endpoint back to using the last slot of `N - 1` rather than the last slot of `N - 2`. This introduces the possibility of dependent root false positives (the root can change without changing the shuffling), but causes the least compatibility issues with other clients.
Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
#7603
#### Custody backfill sync service
Similar in many ways to the current backfill service. There may be ways to unify the two services. The difficulty there is that the current backfill service tightly couples blocks and their associated blobs/data columns. Any attempts to unify the two services should be left to a separate PR in my opinion.
#### `SyncNeworkContext`
`SyncNetworkContext` manages custody sync data columns by range requests separetly from other sync RPC requests. I think this is a nice separation considering that custody backfill is its own service.
#### Data column import logic
The import logic verifies KZG committments and that the data columns block root matches the block root in the nodes store before importing columns
#### New channel to send messages to `SyncManager`
Now external services can communicate with the `SyncManager`. In this PR this channel is used to trigger a custody sync. Alternatively we may be able to use the existing `mpsc` channel that the `SyncNetworkContext` uses to communicate with the `SyncManager`. I will spend some time reviewing this.
Co-Authored-By: Eitan Seri-Levi <eserilev@ucsc.edu>
Co-Authored-By: Eitan Seri- Levi <eserilev@gmail.com>
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
Addressing more review comments from:
- https://github.com/sigp/lighthouse/pull/8101
I've also tweaked a few more things that I think are minor bugs.
- Instrument `ensure_state_can_determine_proposers_for_epoch`
- Fix `block_root` usage in `compute_proposer_duties_from_head`. This was a regression introduced in 8101 😬 .
- Update the `state_advance_timer` to prime the next-epoch proposer cache post-Fulu.
Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
#7894
Moves the `Interchange` format from `slashing_protection` and thus removes the dependency on `slashing_protection` from `eth2` which can now just depend on the slimmer `eip_3076` crate.
Co-Authored-By: Mac L <mjladson@pm.me>
Update `ForkName::latest_stable` to Fulu, reflecting our plan to stabilise Fulu in the immediate future!
This will lead to some more tests running with Fulu rather than Electra.
Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
Use quoted integers for `state.proposer_lookahead` when serializing JSON. This is standard for all integer fields, but was missed for the newly added proposer lookahead. I noticed this issue while inspecting the head state on a local devnet.
I'm glad we found this before someone reported it :P
Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
- [x] Remove the unnecessary `_MILLIS` suffix from `MAXIMUM_GOSSIP_CLOCK_DISPARITY`
- [x] Add missing Deneb preset `KZG_COMMITMENT_INCLUSION_PROOF_DEPTH`, not to be confused with `KZG_COMMITMENTS_INCLUSION_PROOF_DEPTH` (plural) from Fulu...
Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
Follow-up to the bug fixed in:
- https://github.com/sigp/lighthouse/pull/8121
This fixes the root cause of that bug, which was introduced by me in:
- https://github.com/sigp/lighthouse/pull/8101
Lion identified the issue here:
- https://github.com/sigp/lighthouse/pull/8101#discussion_r2382710356
In the methods that compute the proposer shuffling decision root, ensure we don't use lookahead for the Fulu fork epoch itself. This is accomplished by checking if Fulu is enabled at `epoch - 1`, i.e. if `epoch > fulu_fork_epoch`.
I haven't updated the methods that _compute_ shufflings to use these new corrected bounds (e.g. `BeaconState::compute_proposer_indices`), although we could make this change in future. The `get_beacon_proposer_indices` method already gracefully handles the Fulu boundary case by using the `proposer_lookahead` field (if initialised).
Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
As identified by a researcher during the Fusaka security competition, we were computing the proposer index incorrectly in some places by computing without lookahead.
- [x] Add "low level" checks to computation functions in `consensus/types` to ensure they error cleanly
- [x] Re-work the determination of proposer shuffling decision roots, which are now fork aware.
- [x] Re-work and simplify the beacon proposer cache to be fork-aware.
- [x] Optimise `with_proposer_cache` to use `OnceCell`.
- [x] All tests passing.
- [x] Resolve all remaining `FIXME(sproul)`s.
- [x] Unit tests for `ProtoBlock::proposer_shuffling_root_for_child_block`.
- [x] End-to-end regression test.
- [x] Test on pre-Fulu network.
- [x] Test on post-Fulu network.
Co-Authored-By: Michael Sproul <michael@sigmaprime.io>
- PR https://github.com/sigp/lighthouse/pull/8045 introduced a regression of how lookup sync interacts with the da_checker.
Now in unstable block import from the HTTP API also insert the block in the da_checker while the block is being execution verified. If lookup sync finds the block in the da_checker in `NotValidated` state it expects a `GossipBlockProcessResult` message sometime later. That message is only sent after block import in gossip.
I confirmed in our node's logs for 4/4 cases of stuck lookups are caused by this sequence of events:
- Receive block through API, insert into da_checker in fn process_block in put_pre_execution_block
- Create lookup and leave in AwaitingDownload(block in processing cache) state
- Block from HTTP API finishes importing
- Lookup is left stuck
Closes https://github.com/sigp/lighthouse/issues/8104
- https://github.com/sigp/lighthouse/pull/8110 was my initial solution attempt but we can't send the `GossipBlockProcessResult` event from the `http_api` crate without adding new channels, which seems messy.
For a given node it's rare that a lookup is created at the same time that a block is being published. This PR solves https://github.com/sigp/lighthouse/issues/8104 by allowing lookup sync to import the block twice in that case.
Co-Authored-By: dapplion <35266934+dapplion@users.noreply.github.com>
The default debug output of these types contains a lot of unnecessary noise making it hard to read.
This PR removes the type and extra fields from debug output to make logs easier to read.
`len` could be potentially useful in some cases, but this gives us flexibility to only log it separately if we need it.
Related PR in `ssz_types`:
- https://github.com/sigp/ssz_types/pull/57
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
Fixes the issue described in #7980 where Lighthouse repeatedly sends `DataColumnsByRoot` requests to the same peers that return empty responses, causing sync to get stuck.
The root cause was we don't count empty responses as failures, leading to excessive retries to unresponsive peers.
- Track per peer attempts to limit retry attempts per peer (`MAX_CUSTODY_PEER_ATTEMPTS = 3`)
- Replaced random peer selection with hashing within each lookup to prevent splitting lookup into too many small requests and improve request batching efficiency.
- Added `single_block_lookup` root span to track all lookups created and added more debug logs:
<img width="1264" height="501" alt="image" src="https://github.com/user-attachments/assets/983629ba-b6d0-41cf-8e93-88a5b96c2f31" />
Co-Authored-By: Jimmy Chen <jchen.tc@gmail.com>
Co-Authored-By: Jimmy Chen <jimmy@sigmaprime.io>
* #6610
- [x] Add `beacon_committee_selections` endpoint
- [x] Test beacon committee aggregator and confirmed working
- [x] Add `sync_committee_selections` endpoint
- [x] Test sync committee aggregator and confirmed working
This method is a footgun because it truncates the list. It is the source of a recent bug:
- https://github.com/sigp/lighthouse/pull/7927
- Delete uses of `RuntimeVariableList::from_vec` and replace them with `::new` which does validation and can fail.
- Propagate errors where possible, unwrap in tests and use `expect` for obviously-safe uses (in `chain_spec.rs`).
Adds the required boilerplate code for the Gloas (Glamsterdam) hard fork. This allows PRs testing Gloas-candidate features to test fork transition.
This also includes de-duplication of post-Bellatrix readiness notifiers from #6797 (credit to @dapplion)
Fixes#7926
This was a bug I introduced in #7890 and @pawanjay176 noticed it on some running nodes, and added a rpc test to confirm it.
The culprit is this line, where I failed to fill the vec to it's max size, so it doesn't calculate the max size properly, resulting in all `DataColumnByRoot` requests exceeding the max size during validation:
d24a6d2a45/consensus/types/src/chain_spec.rs (L1984)
The PR fixes this and includes new regression tests for this fix.
- Remove explicit `env_logger` usage from `state_processing` tests and `lcli`.
- Set up tracing correctly for `lcli` (I've checked that we can see logs after this change).
- I didn't do anything to set up logging for the `state_processing` tests, as these are rarely run manually (they never fail). We could add `test_logger` in there on an as-needed basis.
Re-export `context_deserialize_derive` inside of `context_deserialize` so they are both available from the same interface, which matches how popular crates (like `serde`) handle this.
This also nests both crates inside a new `context_deserialize` directory which will make it easier to eventually spin out into a different repo (if/when) we decide to do that (plus I prefer it aesthetically).
Attempt to fix this error reported by `beaconcha.in` on their Hoodi archive nodes:
> {"code":500,"message":"UNHANDLED_ERROR: DBError(CacheBuildError(BeaconState(MilhouseError(OutOfBoundsIterFrom { index: 1199549, len: 1060000 }))))","stacktraces":[]}
There are only a handful of places where we call `iter_from`.
This one is safe by construction (the check immediately prior ensures `self.pubkeys.len()` is not out of bounds):
cfb1f73310/beacon_node/beacon_chain/src/validator_pubkey_cache.rs (L84-L90)
This one should also be safe, and the indexes used here would not be as large as the ones in the reported error:
cfb1f73310/consensus/state_processing/src/per_epoch_processing/single_pass.rs (L365-L368)
Which leaves one remaining usage which must be the culprit:
cfb1f73310/consensus/types/src/beacon_state.rs (L2109-L2113)
This indexing relies on the invariant that `self.pubkey_cache().len() <= self.validators.len()`. We mostly maintain that invariant, except for in `rebase_caches_on` (fixed in this PR).
The other bug, is that we were calling `rebase_on_finalized` for all "hot" states, which post-v7.1.0 includes states prior to the split which are required by the hdiff grid. This is how we end up calling something like `genesis_state.rebase_on(&split_state)`, which then corrupts the pubkey cache of the genesis state using the newer pubkey cache from the split state.
#7815
- removes all existing spans, so some span fields that appear in logs like `service_name` may be lost.
- instruments a few key code paths in the beacon node, starting from **root spans** named below:
* Gossip block and blobs
* `process_gossip_data_column_sidecar`
* `process_gossip_blob`
* `process_gossip_block`
* Rpc block and blobs
* `process_rpc_block`
* `process_rpc_blobs`
* `process_rpc_custody_columns`
* Rpc blocks (range and backfill)
* `process_chain_segment`
* `PendingComponents` lifecycle
* `pending_components`
To test locally:
* Run Grafana and Tempo with https://github.com/sigp/lighthouse-metrics/pull/57
* Run Lighthouse BN with `--telemetry-collector-url http://localhost:4317`
Some captured traces can be found here: https://hackmd.io/@jimmygchen/r1sLOxPPeg
Removing the old spans seem to have reduced the memory usage quite a lot - i think we were using them on long running tasks and too excessively:
<img width="910" height="495" alt="image" src="https://github.com/user-attachments/assets/5208bbe4-53b2-4ead-bc71-0b782c788669" />
This PR fixes a bug where wrong columns could get processed immediately after a CGC increase.
Scenario:
- The node's CGC increased due to additional validators attached to it (lets say from 10 to 11)
- The new CGC is advertised and new subnets are subscribed immediately, however the change won't be effective in the data availability check until the next epoch (See [this](ab0e8870b4/beacon_node/beacon_chain/src/validator_custody.rs (L93-L99))). Data availability checker still only require 10 columns for the current epoch.
- During this time, data columns for the additional custody column (lets say column 11) may arrive via gossip as we're already subscribed to the topic, and it may be incorrectly used to satisfy the existing data availability requirement (10 columns), and result in this additional column (instead of a required one) getting persisted, resulting in database inconsistency.
- In shuffling, a the raw_pivot (u64) is cast to a usize which will break on 32 bit systems. Now it is modulo'ed with the list_size first then cast to a usize.
- ruint doesn't implement shifting with u64's on 32-bit arch. Since `prefix_bits` is u8 and NODE_ID_BITS = 256, we use them as u32's instead.
See: https://docs.rs/ruint/latest/src/ruint/bits.rs.html#711
Fix Clippy for recently released Rust 1.90 beta. There may be more changes required when Rust 1.89 stable is released in a few days, but possibly not 🤞
I noticed that we are serving preset values for Fulu on mainnet nodes prior to the fork. This has already gone live in v7.1.0, but should hopefully be handled in a graceful way by API consumers.
This PR _reverts_ the serving of Fulu data prior to Fulu, by serving Fulu data only if Fulu is scheduled.
Alternative to:
- https://github.com/sigp/lighthouse/pull/7758
Serve the `blob_schedule` field on `/eth/v1/config/spec` _only_ when Fulu is enabled. If the blob schedule is empty, we will still serve it as `[]`, so long as Fulu is enabled.
N/A
Serializes the blob_schedule in ascending order to match other clients. This is needed to keep the output of `eth/v1/config/spec` http endpoint consistent across clients.
cc @barnabasbusa
Closes#7467.
This PR primarily addresses [the P2P changes](https://github.com/ethereum/EIPs/pull/9840) in [fusaka-devnet-2](https://fusaka-devnet-2.ethpandaops.io/). Specifically:
* [the new `nfd` parameter added to the `ENR`](https://github.com/ethereum/EIPs/pull/9840)
* [the modified `compute_fork_digest()` changes for every BPO fork](https://github.com/ethereum/EIPs/pull/9840)
90% of this PR was absolutely hacked together as fast as possible during the Berlinterop as fast as I could while running between Glamsterdam debates. Luckily, it seems to work. But I was unable to be as careful in avoiding bugs as I usually am. I've cleaned up the things *I remember* wanting to come back and have a closer look at. But still working on this.
Progress:
* [x] get it working on `fusaka-devnet-2`
* [ ] [*optional* disconnect from peers with incorrect `nfd` at the fork boundary](https://github.com/ethereum/consensus-specs/pull/4407) - Can be addressed in a future PR if necessary
* [x] first pass clean-up
* [x] fix up all the broken tests
* [x] final self-review
* [x] more thorough review from people more familiar with affected code
Update `c-kzg` from `v1` to `v2`. My motivation here is that `alloy-consensus` now uses `c-kzg` in `v2` and this results in a conflict when using lighthouse in combination with latest alloy. I tried also to disable the `czkg` feature in alloy, but the conflict persisted.
See here for the alloy update to `c-kzg v2`: https://github.com/alloy-rs/alloy/pull/2240
Error:
```
error: failed to select a version for `c-kzg`.
...
versions that meet the requirements `^1` are: 1.0.3, 1.0.2, 1.0.0
the package `c-kzg` links to the native library `ckzg`, but it conflicts with a previous package which links to `ckzg` as well:
package `c-kzg v2.1.0`
... which satisfies dependency `c-kzg = "^2.1"` of package `alloy-consensus v0.13.0`
... which satisfies dependency `alloy-consensus = "^0.13.0"` of package ...
...
```
- Upgrade `alloy-consensus` to `0.14.0` and disable all default features
- Upgrade `c-kzg` to `v2.1.0`
- Upgrade `alloy-primitives` to `1.0.0`
- Adapt the code to the new API `c-kzg`
- There is now `NO_PRECOMPUTE` as my understand from https://github.com/ethereum/c-kzg-4844/pull/545/files we should use `0` here as `new_from_trusted_setup_no_precomp` does not precomp. But maybe it is misleading. For all other places I used `RECOMMENDED_PRECOMP_WIDTH` because `8` is matching the recommendation.
- `BYTES_PER_G1_POINT` and `BYTES_PER_G2_POINT` are no longer public in `c-kzg`
- I adapted two tests that checking for the `Attestation` bitfield size. But I could not pinpoint to what has changed and why now 8 bytes less. I would be happy about any hint, and if this is correct. I found related a PR here: https://github.com/sigp/lighthouse/pull/6915
- Use same fields names, in json, as well as `c-kzg` and `rust_eth_kzg` for `g1_monomial`, `g1_lagrange`, and `g2_monomial`
- #6240
- Bring built-in network configs up to date with latest consensus-spec PeerDAS configs.
- Add `MIN_EPOCHS_FOR_DATA_COLUMN_SIDECARS_REQUESTS` and use it to determine data availability window after the Fulu fork.
Update `SAMPLES_PER_SLOT` to be number of custody groups instead of data columns. This should have no impact on the current implementation as config currently maintains a `group:subnet:column` ratio of `1:1:1`. **In short, this PR doesn't change anything for Fusaka, but ensures compliance with the spec and potential future changes.**
I've added separate methods to compute sampling columns and custody groups for clarity: `spec.sampling_size_columns` and `spec.sampling_size_custod_groups`
See the clarifications in this PR for more details: https://github.com/ethereum/consensus-specs/pull/4251