## Issue Addressed
While reviewing #4801 I noticed that our use of `take_while` in the block replayer means that if a state root iterator _with gaps_ is provided, some additonal state roots will be dropped unnecessarily. In practice the impact is small, because once there's _one_ state root miss, the whole tree hash cache needs to be built anyway, and subsequent misses are less costly. However this was still a little inefficient, so I figured it's better to fix it.
## Proposed Changes
Use [`peeking_take_while`](https://docs.rs/itertools/latest/itertools/trait.Itertools.html#method.peeking_take_while) to avoid consuming the next element when checking whether it satisfies the slot predicate.
## Additional Info
There's a gist here that shows the basic dynamics in isolation: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=40b623cc0febf9ed51705d476ab140c5. Changing the `peeking_take_while` to a `take_while` causes the assert to fail. Similarly I've added a new test `block_replayer_peeking_state_roots` which fails if the same change is applied inside `get_state_root`.
Attempting to improve our CI speeds as its recently been a pain point.
Major changes:
- Use a github action to pull stable/nightly rust rather than building it each run
- Shift test suite to `nexttest` https://github.com/nextest-rs/nextest for CI
UPDATE:
So I've iterated on some changes, and although I think its still not optimal I think this is a good base to start from. Some extra things in this PR:
- Shifted where we pull rust from. We're now using this thing: https://github.com/moonrepo/setup-rust . It's got some interesting cache's built in, but was not seeing the gains that Jimmy managed to get. In either case tho, it can pull rust, cargofmt, clippy, cargo nexttest all in < 5s. So I think it's worthwhile.
- I've grouped a few of the check-like tests into a single test called `code-test`. Although we were using github runners in parallel which may be faster, it just seems wasteful. There were like 4-5 tests, where we would pull lighthouse, compile it, then run an action, like clippy, cargo-audit or fmt. I've grouped these into a single action, so we only compile lighthouse once, then in each step we run the checks. This avoids compiling lighthouse like 5 times.
- Ive made doppelganger tests run on our local machines to avoid pulling foundry, building and making lcli which are all now baked into the images.
- We have sccache and do not incremental compile lighthouse
Misc bonus things:
- Cargo update
- Fix web3 signer openssl keys which is required after a cargo update
- Use mock_instant in an LRU cache test to avoid non-deterministic test
- Remove race condition in building web3signer tests
There's still some things we could improve on. Such as downloading the EF tests every run and the web3-signer binary, but I've left these to be out of scope of this PR. I think the above are meaningful improvements.
Co-authored-by: Paul Hauner <paul@paulhauner.com>
Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: antondlr <anton@delaruelle.net>
## Issue Addressed
N/A
## Proposed Changes
Add lints for rust 1.71
[3789134](3789134ae2) is probably the one that needs most attention as it changes beacon state code. I changed the `is_in_inactivity_leak ` function to return a `ArithError` as not all consumers of that function work well with a `BeaconState::Error`.
## Issue Addressed
Addresses #4234
## Proposed Changes
- Skip withdrawals processing in an inconsistent state replay.
- Repurpose `StateRootStrategy`: rename to `StateProcessingStrategy` and always skip withdrawals if using `StateProcessingStrategy::Inconsistent`
- Add a test to reproduce the scenario
Co-authored-by: Jimmy Chen <jimmy@sigmaprime.io>
## Issue Addressed
NA
## Proposed Changes
Similar to #4181 but without the version bump and a more nuanced fix.
Patches the high CPU usage seen after the Capella fork which was caused by processing exits when there are skip slots.
## Additional Info
~~This is an imperfect solution that will cause us to drop some exits at the fork boundary. This is tracked at #4184.~~
## Issue Addressed
Closes https://github.com/sigp/lighthouse/issues/2371
## Proposed Changes
Backport some changes from `tree-states` that remove duplicated calculations of the `proposer_index`.
With this change the proposer index should be calculated only once for each block, and then plumbed through to every place it is required.
## Additional Info
In future I hope to add more data to the consensus context that is cached on a per-epoch basis, like the effective balances of validators and the base rewards.
There are some other changes to remove indexing in tests that were also useful for `tree-states` (the `tree-states` types don't implement `Index`).
## Issue Addressed
NA
## Proposed Changes
Adds a test that was written whilst doing some testing. This PR does not make changes to production code, it just adds a test for already existing functionality.
## Additional Info
NA
## Overview
This rather extensive PR achieves two primary goals:
1. Uses the finalized/justified checkpoints of fork choice (FC), rather than that of the head state.
2. Refactors fork choice, block production and block processing to `async` functions.
Additionally, it achieves:
- Concurrent forkchoice updates to the EL and cache pruning after a new head is selected.
- Concurrent "block packing" (attestations, etc) and execution payload retrieval during block production.
- Concurrent per-block-processing and execution payload verification during block processing.
- The `Arc`-ification of `SignedBeaconBlock` during block processing (it's never mutated, so why not?):
- I had to do this to deal with sending blocks into spawned tasks.
- Previously we were cloning the beacon block at least 2 times during each block processing, these clones are either removed or turned into cheaper `Arc` clones.
- We were also `Box`-ing and un-`Box`-ing beacon blocks as they moved throughout the networking crate. This is not a big deal, but it's nice to avoid shifting things between the stack and heap.
- Avoids cloning *all the blocks* in *every chain segment* during sync.
- It also has the potential to clean up our code where we need to pass an *owned* block around so we can send it back in the case of an error (I didn't do much of this, my PR is already big enough 😅)
- The `BeaconChain::HeadSafetyStatus` struct was removed. It was an old relic from prior merge specs.
For motivation for this change, see https://github.com/sigp/lighthouse/pull/3244#issuecomment-1160963273
## Changes to `canonical_head` and `fork_choice`
Previously, the `BeaconChain` had two separate fields:
```
canonical_head: RwLock<Snapshot>,
fork_choice: RwLock<BeaconForkChoice>
```
Now, we have grouped these values under a single struct:
```
canonical_head: CanonicalHead {
cached_head: RwLock<Arc<Snapshot>>,
fork_choice: RwLock<BeaconForkChoice>
}
```
Apart from ergonomics, the only *actual* change here is wrapping the canonical head snapshot in an `Arc`. This means that we no longer need to hold the `cached_head` (`canonical_head`, in old terms) lock when we want to pull some values from it. This was done to avoid deadlock risks by preventing functions from acquiring (and holding) the `cached_head` and `fork_choice` locks simultaneously.
## Breaking Changes
### The `state` (root) field in the `finalized_checkpoint` SSE event
Consider the scenario where epoch `n` is just finalized, but `start_slot(n)` is skipped. There are two state roots we might in the `finalized_checkpoint` SSE event:
1. The state root of the finalized block, which is `get_block(finalized_checkpoint.root).state_root`.
4. The state root at slot of `start_slot(n)`, which would be the state from (1), but "skipped forward" through any skip slots.
Previously, Lighthouse would choose (2). However, we can see that when [Teku generates that event](de2b2801c8/data/beaconrestapi/src/main/java/tech/pegasys/teku/beaconrestapi/handlers/v1/events/EventSubscriptionManager.java (L171-L182)) it uses [`getStateRootFromBlockRoot`](de2b2801c8/data/provider/src/main/java/tech/pegasys/teku/api/ChainDataProvider.java (L336-L341)) which uses (1).
I have switched Lighthouse from (2) to (1). I think it's a somewhat arbitrary choice between the two, where (1) is easier to compute and is consistent with Teku.
## Notes for Reviewers
I've renamed `BeaconChain::fork_choice` to `BeaconChain::recompute_head`. Doing this helped ensure I broke all previous uses of fork choice and I also find it more descriptive. It describes an action and can't be confused with trying to get a reference to the `ForkChoice` struct.
I've changed the ordering of SSE events when a block is received. It used to be `[block, finalized, head]` and now it's `[block, head, finalized]`. It was easier this way and I don't think we were making any promises about SSE event ordering so it's not "breaking".
I've made it so fork choice will run when it's first constructed. I did this because I wanted to have a cached version of the last call to `get_head`. Ensuring `get_head` has been run *at least once* means that the cached values doesn't need to wrapped in an `Option`. This was fairly simple, it just involved passing a `slot` to the constructor so it knows *when* it's being run. When loading a fork choice from the store and a slot clock isn't handy I've just used the `slot` that was saved in the `fork_choice_store`. That seems like it would be a faithful representation of the slot when we saved it.
I added the `genesis_time: u64` to the `BeaconChain`. It's small, constant and nice to have around.
Since we're using FC for the fin/just checkpoints, we no longer get the `0x00..00` roots at genesis. You can see I had to remove a work-around in `ef-tests` here: b56be3bc2. I can't find any reason why this would be an issue, if anything I think it'll be better since the genesis-alias has caught us out a few times (0x00..00 isn't actually a real root). Edit: I did find a case where the `network` expected the 0x00..00 alias and patched it here: 3f26ac3e2.
You'll notice a lot of changes in tests. Generally, tests should be functionally equivalent. Here are the things creating the most diff-noise in tests:
- Changing tests to be `tokio::async` tests.
- Adding `.await` to fork choice, block processing and block production functions.
- Refactor of the `canonical_head` "API" provided by the `BeaconChain`. E.g., `chain.canonical_head.cached_head()` instead of `chain.canonical_head.read()`.
- Wrapping `SignedBeaconBlock` in an `Arc`.
- In the `beacon_chain/tests/block_verification`, we can't use the `lazy_static` `CHAIN_SEGMENT` variable anymore since it's generated with an async function. We just generate it in each test, not so efficient but hopefully insignificant.
I had to disable `rayon` concurrent tests in the `fork_choice` tests. This is because the use of `rayon` and `block_on` was causing a panic.
Co-authored-by: Mac L <mjladson@pm.me>
## Issue Addressed
Successor to #2431
## Proposed Changes
* Add a `BlockReplayer` struct to abstract over the intricacies of calling `per_slot_processing` and `per_block_processing` while avoiding unnecessary tree hashing.
* Add a variant of the forwards state root iterator that does not require an `end_state`.
* Use the `BlockReplayer` when reconstructing states in the database. Use the efficient forwards iterator for frozen states.
* Refactor the iterators to remove `Arc<HotColdDB>` (this seems to be neater than making _everything_ an `Arc<HotColdDB>` as I did in #2431).
Supplying the state roots allow us to avoid building a tree hash cache at all when reconstructing historic states, which saves around 1 second flat (regardless of `slots-per-restore-point`). This is a small percentage of worst-case state load times with 200K validators and SPRP=2048 (~15s vs ~16s) but a significant speed-up for more frequent restore points: state loads with SPRP=32 should be now consistently <500ms instead of 1.5s (a ~3x speedup).
## Additional Info
Required by https://github.com/sigp/lighthouse/pull/2628
## Issue Addressed
New rust lints
## Proposed Changes
- Boxing some enum variants
- removing some unused fields (is the validator lockfile unused? seemed so to me)
## Additional Info
- some error fields were marked as dead code but are logged out in areas
- left some dead fields in our ef test code because I assume they are useful for debugging?
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Issue Addressed
NA
## Proposed Changes
This PR is near-identical to https://github.com/sigp/lighthouse/pull/2652, however it is to be merged into `unstable` instead of `merge-f2f`. Please see that PR for reasoning.
I'm making this duplicate PR to merge to `unstable` in an effort to shrink the diff between `unstable` and `merge-f2f` by doing smaller, lead-up PRs.
## Additional Info
NA
## Proposed Changes
Remove the remaining Altair `FIXME`s from consensus land.
1. Implement tree hash caching for the participation lists. This required some light type manipulation, including removing the `TreeHash` bound from `CachedTreeHash` which was purely descriptive.
2. Plumb the proposer index through Altair attestation processing, to avoid calculating it for _every_ attestation (potentially 128ms on large networks). This duplicates some work from #2431, but with the aim of getting it in sooner, particularly for the Altair devnets.
3. Removes two FIXMEs related to `superstruct` and cloning, which are unlikely to be particularly detrimental and will be tracked here instead: https://github.com/sigp/superstruct/issues/5
## Proposed Changes
Implement the consensus changes necessary for the upcoming Altair hard fork.
## Additional Info
This is quite a heavy refactor, with pivotal types like the `BeaconState` and `BeaconBlock` changing from structs to enums. This ripples through the whole codebase with field accesses changing to methods, e.g. `state.slot` => `state.slot()`.
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Issue Addressed
`test_dht_persistence` failing
## Proposed Changes
Bind `NetworkService::start` to an underscore prefixed variable rather than `_`. `_` was causing it to be dropped immediately
This was failing 5/100 times before this update, but I haven't been able to get it to fail after updating it
Co-authored-by: realbigsean <seananderson33@gmail.com>