* Add timing for block availability
* Attestation metrics analysis
* Prettier printing
* Add some metrics and timings to track late blocks
* Update to latest unstable
* fmt
* Merge latest unstable
* Small tweaks
* Try pushing blob timing down into verification
* Simplify for clippy
* Move processing cache out of DA
* Merge branch 'sigp/unstable' into non-da-processing-cach
* Merge branch 'unstable' of https://github.com/sigp/lighthouse into non-da-processing-cache
* remove unused file, remove outdated TODO, add is_deneb check to missing blob id calculations
* Merge branch 'unstable' of https://github.com/sigp/lighthouse into non-da-processing-cache
* fix lints
* check the da cache and the attester cache in responding to RPC requests
* use the processing cache instead
* update comment
* add da cache metrics
* rename early attester cache method
* Merge branch 'unstable' of https://github.com/sigp/lighthouse into check-da-cache-in-rpc-response
* make rustup update run on the runners
* Revert "make rustup update run on the runners"
This reverts commit d097e9bfa8.
* add processing and processed caching to the DA checker
* move processing cache out of critical cache
* get it compiling
* fix lints
* add docs to `AvailabilityView`
* some self review
* fix lints
* fix beacon chain tests
* cargo fmt
* make availability view easier to implement, start on testing
* move child component cache and finish test
* cargo fix
* cargo fix
* cargo fix
* fmt and lint
* make blob commitments not optional, rename some caches, add missing blobs struct
* Update beacon_node/beacon_chain/src/data_availability_checker/processing_cache.rs
Co-authored-by: ethDreamer <37123614+ethDreamer@users.noreply.github.com>
* marks review feedback and other general cleanup
* cargo fix
* improve availability view docs
* some renames
* some renames and docs
* fix should delay lookup logic
* get rid of some wrapper methods
* fix up single lookup changes
* add a couple docs
* add single blob merge method and improve process_... docs
* update some names
* lints
* fix merge
* remove blob indices from lookup creation log
* remove blob indices from lookup creation log
* delayed lookup logging improvement
* check fork choice before doing any blob processing
* remove unused dep
* Update beacon_node/beacon_chain/src/data_availability_checker/availability_view.rs
Co-authored-by: Michael Sproul <micsproul@gmail.com>
* Update beacon_node/beacon_chain/src/data_availability_checker/availability_view.rs
Co-authored-by: Michael Sproul <micsproul@gmail.com>
* Update beacon_node/beacon_chain/src/data_availability_checker/availability_view.rs
Co-authored-by: Michael Sproul <micsproul@gmail.com>
* Update beacon_node/beacon_chain/src/data_availability_checker/availability_view.rs
Co-authored-by: Michael Sproul <micsproul@gmail.com>
* Update beacon_node/network/src/sync/block_lookups/delayed_lookup.rs
Co-authored-by: Michael Sproul <micsproul@gmail.com>
* remove duplicate deps
* use gen range in random blobs geneartor
* rename processing cache fields
* require block root in rpc block construction and check block root consistency
* send peers as vec in single message
* spawn delayed lookup service from network beacon processor
* fix tests
---------
Co-authored-by: ethDreamer <37123614+ethDreamer@users.noreply.github.com>
Co-authored-by: Michael Sproul <micsproul@gmail.com>
* Low hanging fruits
* Remove unnecessary todo
I think it's fine to not handle this since the calling functions handle the error.
No specific reason imo to handle it in the function as well.
* Rename BlobError to GossipBlobError
I feel this signified better what the error is for. The BlobError was only for failures when gossip
verifying a blob. We cannot get this error when doing rpc validation
* Remove the BlockError::BlobValidation variant
This error was only there to appease gossip verification before publish.
It's unclear how to peer score this error since this cannot actually occur during any
block verification flows.
This commit introuduces an additional error type BlockContentsError to better represent the
Error type
* Add docs for peer scoring (or lack thereof) of AvailabilityCheck errors
* I do not see a non-convoluted way of doing this. Okay to have some redundant code here
* Removing this to catch the failure red handed
* Fix compilation
* Cannot be deleted because some tests assume the trait impl
Also useful to have around for testing in the future imo
* Add some metrics and logs
* Only process `Imported` variant in sync_methods
The only additional thing for other variants that might be useful is logging. We can do that
later if required
* Convert to TryFrom
Not really sure where this would be used, but just did what the comment says.
Could consider just returning the Block variant for a deneb block in the From version
* Unlikely to change now
* This is fine as this is max_rpc_size per rpc chunk (for blobs, it would be 128kb max)
* Log count instead of individual blobs, can delete log later if it becomes too annoying.
* Add block production blob verification timer
* Extend block_straemer test to deneb
* Remove dbg statement
* Fix tests
* some blob reprocessing work
* remove ForceBlockLookup
* reorder enum match arms in sync manager
* a lot more reprocessing work
* impl logic for triggerng blob lookups along with block lookups
* deal with rpc blobs in groups per block in the da checker. don't cache missing blob ids in the da checker.
* make single block lookup generic
* more work
* add delayed processing logic and combine some requests
* start fixing some compile errors
* fix compilation in main block lookup mod
* much work
* get things compiling
* parent blob lookups
* fix compile
* revert red/stevie changes
* fix up sync manager delay message logic
* add peer usefulness enum
* should remove lookup refactor
* consolidate retry error handling
* improve peer scoring during certain failures in parent lookups
* improve retry code
* drop parent lookup if either req has a peer disconnect during download
* refactor single block processed method
* processing peer refactor
* smol bugfix
* fix some todos
* fix lints
* fix lints
* fix compile in lookup tests
* fix lints
* fix lints
* fix existing block lookup tests
* renamings
* fix after merge
* cargo fmt
* compilation fix in beacon chain tests
* fix
* refactor lookup tests to work with multiple forks and response types
* make tests into macros
* wrap availability check error
* fix compile after merge
* add random blobs
* start fixing up lookup verify error handling
* some bug fixes and the start of deneb only tests
* make tests work for all forks
* track information about peer source
* error refactoring
* improve peer scoring
* fix test compilation
* make sure blobs are sent for processing after stream termination, delete copied tests
* add some tests and fix a bug
* smol bugfixes and moar tests
* add tests and fix some things
* compile after merge
* lots of refactoring
* retry on invalid block/blob
* merge unknown parent messages before current slot lookup
* get tests compiling
* penalize blob peer on invalid blobs
* Check disk on in-memory cache miss
* Update beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs
* Update beacon_node/network/src/sync/network_context.rs
Co-authored-by: Divma <26765164+divagant-martian@users.noreply.github.com>
* fix bug in matching blocks and blobs in range sync
* pr feedback
* fix conflicts
* upgrade logs from warn to crit when we receive incorrect response in range
* synced_and_connected_within_tolerance -> should_search_for_block
* remove todo
* add data gas used and update excess data gas to u64
* Fix Broken Overflow Tests
* payload verification with commitments
* fix merge conflicts
* restore payload file
* Restore payload file
* remove todo
* add max blob commitments per block
* c-kzg lib update
* Fix ef tests
* Abstract over minimal/mainnet spec in kzg crate
* Start integrating new KZG
* checkpoint sync without alignment
* checkpoint sync without alignment
* add import
* add import
* query for checkpoint state by slot rather than state root (teku doesn't serve by state root)
* query for checkpoint state by slot rather than state root (teku doesn't serve by state root)
* loosen check
* get state first and query by most recent block root
* Revert "loosen check"
This reverts commit 069d13dd63.
* get state first and query by most recent block root
* merge max blobs change
* simplify delay logic
* rename unknown parent sync message variants
* rename parameter, block_slot -> slot
* add some docs to the lookup module
* use interval instead of sleep
* drop request if blocks and blobs requests both return `None` for `Id`
* clean up `find_single_lookup` logic
* add lookup source enum
* clean up `find_single_lookup` logic
* add docs to find_single_lookup_request
* move LookupSource our of param where unnecessary
* remove unnecessary todo
* query for block by `state.latest_block_header.slot`
* fix lint
* fix merge transition ef tests
* fix test
* fix test
* fix observed blob sidecars test
* Add some metrics (#33)
* fix protocol limits for blobs by root
* Update Engine API for 1:1 Structure Method
* make beacon chain tests to fix devnet 6 changes
* get ckzg working and fix some tests
* fix remaining tests
* fix lints
* Fix KZG linking issues
* remove unused dep
* lockfile
* test fixes
* remove dbgs
* remove unwrap
* cleanup tx generator
* small fixes
* fixing fixes
* more self reivew
* more self review
* refactor genesis header initialization
* refactor mock el instantiations
* fix compile
* fix network test, make sure they run for each fork
* pr feedback
* fix last test (hopefully)
---------
Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>
Co-authored-by: Mark Mackey <mark@sigmaprime.io>
Co-authored-by: Divma <26765164+divagant-martian@users.noreply.github.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
Resolves#3238
## Proposed Changes
Please list or describe the changes introduced by this PR.
## Additional Info
Please provide any additional information. For example, future considerations
or information useful for reviewers.
## Issue Addressed
NA
## Proposed Changes
Adds metrics to track validators that are submitting equivocating (but not slashable) sync messages. This follows on from some research we've been doing in a separate fork of LH.
## Additional Info
@jimmygchen and @michaelsproul have already run their eyes over this so it should be easy to get into v4.2.0, IMO.
## Issue Addressed
Cleans up all the remnants of 4844 in capella. This makes sure when 4844 is reviewed there is nothing we are missing because it got included here
## Proposed Changes
drop a bomb on every 4844 thing
## Additional Info
Merge process I did (locally) is as follows:
- squash merge to produce one commit
- in new branch off unstable with the squashed commit create a `git revert HEAD` commit
- merge that new branch onto 4844 with `--strategy ours`
- compare local 4844 to remote 4844 and make sure the diff is empty
- enjoy
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Proposed Changes
With proposer boosting implemented (#2822) we have an opportunity to re-org out late blocks.
This PR adds three flags to the BN to control this behaviour:
* `--disable-proposer-reorgs`: turn aggressive re-orging off (it's on by default).
* `--proposer-reorg-threshold N`: attempt to orphan blocks with less than N% of the committee vote. If this parameter isn't set then N defaults to 20% when the feature is enabled.
* `--proposer-reorg-epochs-since-finalization N`: only attempt to re-org late blocks when the number of epochs since finalization is less than or equal to N. The default is 2 epochs, meaning re-orgs will only be attempted when the chain is finalizing optimally.
For safety Lighthouse will only attempt a re-org under very specific conditions:
1. The block being proposed is 1 slot after the canonical head, and the canonical head is 1 slot after its parent. i.e. at slot `n + 1` rather than building on the block from slot `n` we build on the block from slot `n - 1`.
2. The current canonical head received less than N% of the committee vote. N should be set depending on the proposer boost fraction itself, the fraction of the network that is believed to be applying it, and the size of the largest entity that could be hoarding votes.
3. The current canonical head arrived after the attestation deadline from our perspective. This condition was only added to support suppression of forkchoiceUpdated messages, but makes intuitive sense.
4. The block is being proposed in the first 2 seconds of the slot. This gives it time to propagate and receive the proposer boost.
## Additional Info
For the initial idea and background, see: https://github.com/ethereum/consensus-specs/pull/2353#issuecomment-950238004
There is also a specification for this feature here: https://github.com/ethereum/consensus-specs/pull/3034
Co-authored-by: Michael Sproul <micsproul@gmail.com>
Co-authored-by: pawan <pawandhananjay@gmail.com>
## Issue Addressed
Implementing the light_client_gossip topics but I'm not there yet.
Which issue # does this PR address?
Partially #3651
## Proposed Changes
Add light client gossip topics.
Please list or describe the changes introduced by this PR.
I'm going to Implement light_client_finality_update and light_client_optimistic_update gossip topics. Currently I've attempted the former and I'm seeking feedback.
## Additional Info
I've only implemented the light_client_finality_update topic because I wanted to make sure I was on the correct path. Also checking that the gossiped LightClientFinalityUpdate is the same as the locally constructed one is not implemented because caching the updates will make this much easier. Could someone give me some feedback on this please?
Please provide any additional information. For example, future considerations
or information useful for reviewers.
Co-authored-by: GeemoCandama <104614073+GeemoCandama@users.noreply.github.com>
## Issue Addressed
Closes https://github.com/sigp/lighthouse/issues/2327
## Proposed Changes
This is an extension of some ideas I implemented while working on `tree-states`:
- Cache the indexed attestations from blocks in the `ConsensusContext`. Previously we were re-computing them 3-4 times over.
- Clean up `import_block` by splitting each part into `import_block_XXX`.
- Move some stuff off hot paths, specifically:
- Relocate non-essential tasks that were running between receiving the payload verification status and priming the early attester cache. These tasks are moved after the cache priming:
- Attestation observation
- Validator monitor updates
- Slasher updates
- Updating the shuffling cache
- Fork choice attestation observation now happens at the end of block verification in parallel with payload verification (this seems to save 5-10ms).
- Payload verification now happens _before_ advancing the pre-state and writing it to disk! States were previously being written eagerly and adding ~20-30ms in front of verifying the execution payload. State catchup also sometimes takes ~500ms if we get a cache miss and need to rebuild the tree hash cache.
The remaining task that's taking substantial time (~20ms) is importing the block to fork choice. I _think_ this is because of pull-tips, and we should be able to optimise it out with a clever total active balance cache in the state (which would be computed in parallel with payload verification). I've decided to leave that for future work though. For now it can be observed via the new `beacon_block_processing_post_exec_pre_attestable_seconds` metric.
Co-authored-by: Michael Sproul <micsproul@gmail.com>
## Issue Addressed
NA
## Proposed Changes
I have observed scenarios on Goerli where Lighthouse was receiving attestations which reference the same, un-cached shuffling on multiple threads at the same time. Lighthouse was then loading the same state from database and determining the shuffling on multiple threads at the same time. This is unnecessary load on the disk and RAM.
This PR modifies the shuffling cache so that each entry can be either:
- A committee
- A promise for a committee (i.e., a `crossbeam_channel::Receiver`)
Now, in the scenario where we have thread A and thread B simultaneously requesting the same un-cached shuffling, we will have the following:
1. Thread A will take the write-lock on the shuffling cache, find that there's no cached committee and then create a "promise" (a `crossbeam_channel::Sender`) for a committee before dropping the write-lock.
1. Thread B will then be allowed to take the write-lock for the shuffling cache and find the promise created by thread A. It will block the current thread waiting for thread A to fulfill that promise.
1. Thread A will load the state from disk, obtain the shuffling, send it down the channel, insert the entry into the cache and then continue to verify the attestation.
1. Thread B will then receive the shuffling from the receiver, be un-blocked and then continue to verify the attestation.
In the case where thread A fails to generate the shuffling and drops the sender, the next time that specific shuffling is requested we will detect that the channel is disconnected and return a `None` entry for that shuffling. This will cause the shuffling to be re-calculated.
## Additional Info
NA
## Issue Addressed
#3285
## Proposed Changes
Adds support for specifying histogram with buckets and adds new metric buckets for metrics mentioned in issue.
## Additional Info
Need some help for the buckets.
Co-authored-by: Michael Sproul <micsproul@gmail.com>
## Issue Addressed
Upcoming spec change https://github.com/ethereum/consensus-specs/pull/2878
## Proposed Changes
1. Run fork choice at the start of every slot, and wait for this run to complete before proposing a block.
2. As an optimisation, also run fork choice 3/4 of the way through the slot (at 9s), _dequeueing attestations for the next slot_.
3. Remove the fork choice run from the state advance timer that occurred before advancing the state.
## Additional Info
### Block Proposal Accuracy
This change makes us more likely to propose on top of the correct head in the presence of re-orgs with proposer boost in play. The main scenario that this change is designed to address is described in the linked spec issue.
### Attestation Accuracy
This change _also_ makes us more likely to attest to the correct head. Currently in the case of a skipped slot at `slot` we only run fork choice 9s into `slot - 1`. This means the attestations from `slot - 1` aren't taken into consideration, and any boost applied to the block from `slot - 1` is not removed (it should be). In the language of the linked spec issue, this means we are liable to attest to C, even when the majority voting weight has already caused a re-org to B.
### Why remove the call before the state advance?
If we've run fork choice at the start of the slot then it has already dequeued all the attestations from the previous slot, which are the only ones eligible to influence the head in the current slot. Running fork choice again is unnecessary (unless we run it for the next slot and try to pre-empt a re-org, but I don't currently think this is a great idea).
### Performance
Based on Prater testing this adds about 5-25ms of runtime to block proposal times, which are 500-1000ms on average (and spike to 5s+ sometimes due to state handling issues 😢 ). I believe this is a small enough penalty to enable it by default, with the option to disable it via the new flag `--fork-choice-before-proposal-timeout 0`. Upcoming work on block packing and state representation will also reduce block production times in general, while removing the spikes.
### Implementation
Fork choice gets invoked at the start of the slot via the `per_slot_task` function called from the slot timer. It then uses a condition variable to signal to block production that fork choice has been updated. This is a bit funky, but it seems to work. One downside of the timer-based approach is that it doesn't happen automatically in most of the tests. The test added by this PR has to trigger the run manually.
## Issue Addressed
This PR fixes the unnecessary `WARN Single block lookup failed` messages described here:
https://github.com/sigp/lighthouse/pull/2866#issuecomment-1008442640
## Proposed Changes
Add a new cache to the `BeaconChain` that tracks the block roots of blocks from before finalization. These could be blocks from the canonical chain (which might need to be read from disk), or old pre-finalization blocks that have been forked out.
The cache also stores a set of block roots for in-progress single block lookups, which duplicates some of the information from sync's `single_block_lookups` hashmap:
a836e180f9/beacon_node/network/src/sync/manager.rs (L192-L196)
On a live node you can confirm that the cache is working by grepping logs for the message: `Rejected attestation to finalized block`.
## Issue Addressed
NA
## Proposed Changes
In https://github.com/sigp/lighthouse/pull/2832 we made some changes to the `SnapshotCache` to help deal with the one-block reorgs seen on mainnet (and testnets).
I believe the change in #2832 is good and we should keep it, but I think that in its present form it is causing the `SnapshotCache` to hold onto states that it doesn't need anymore. For example, a skip slot will result in one more `BeaconSnapshot` being stored in the cache.
This PR adds a new type of pruning that happens after a block is inserted to the cache. We will remove any snapshot from the cache that is a *grandparent* of the block being imported. Since we know the grandparent has two valid blocks built atop it, it is not at risk from a one-block re-org.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
Introduces a cache to attestation to produce atop blocks which will become the head, but are not fully imported (e.g., not inserted into the database).
Whilst attesting to a block before it's imported is rather easy, if we're going to produce that attestation then we also need to be able to:
1. Verify that attestation.
1. Respond to RPC requests for the `beacon_block_root`.
Attestation verification (1) is *partially* covered. Since we prime the shuffling cache before we insert the block into the early attester cache, we should be fine for all typical use-cases. However, it is possible that the cache is washed out before we've managed to insert the state into the database and then attestation verification will fail with a "missing beacon state"-type error.
Providing the block via RPC (2) is also partially covered, since we'll check the database *and* the early attester cache when responding a blocks-by-root request. However, we'll still omit the block from blocks-by-range requests (until the block lands in the DB). I *think* this is fine, since there's no guarantee that we return all blocks for those responses.
Another important consideration is whether or not the *parent* of the early attester block is available in the databse. If it were not, we might fail to respond to blocks-by-root request that are iterating backwards to collect a chain of blocks. I argue that *we will always have the parent of the early attester block in the database.* This is because we are holding the fork-choice write-lock when inserting the block into the early attester cache and we do not drop that until the block is in the database.
## Proposed Changes
In the event of a late block, keep the block in the snapshot cache by cloning it. This helps us process new blocks quickly in the event the late block was re-org'd.
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
Fix max packet sizes
Fix max_payload_size function
Add merge block test
Fix max size calculation; fix up test
Clear comments
Add a payload_size_function
Use safe arith for payload calculation
Return an error if block too big in block production
Separate test to check if block is over limit
## Proposed Changes
Add several metrics for the number of attestations in the op pool. These give us a way to observe the number of valid, non-trivial attestations during block packing rather than just the size of the entire op pool.
## Proposed Changes
Clone the proposer pubkeys during backfill signature verification to reduce the time that the pubkey cache lock is held for. Cloning such a small number of pubkeys has negligible impact on the total running time, but greatly reduces lock contention.
On a Ryzen 5950X, the setup step seems to take around 180us regardless of whether the key is cloned or not, while the verification takes 7ms. When Lighthouse is limited to 10% of one core using `sudo cpulimit --pid <pid> --limit 10` the total time jumps up to 800ms, but the setup step remains only 250us. This means that under heavy load this PR could cut the time the lock is held for from 800ms to 250us, which is a huge saving of 99.97%!
## Issue Addressed
Closes#2528
## Proposed Changes
- Add `BlockTimesCache` to provide block timing information to `BeaconChain`. This allows additional metrics to be calculated for blocks that are set as head too late.
- Thread the `seen_timestamp` of blocks received from RPC responses (except blocks from syncing) through to the sync manager, similar to what is done for blocks from gossip.
## Additional Info
This provides the following additional metrics:
- `BEACON_BLOCK_OBSERVED_SLOT_START_DELAY_TIME`
- The delay between the start of the slot and when the block was first observed.
- `BEACON_BLOCK_IMPORTED_OBSERVED_DELAY_TIME`
- The delay between when the block was first observed and when the block was imported.
- `BEACON_BLOCK_HEAD_IMPORTED_DELAY_TIME`
- The delay between when the block was imported and when the block was set as head.
The metric `BEACON_BLOCK_IMPORTED_SLOT_START_DELAY_TIME` was removed.
A log is produced when a block is set as head too late, e.g.:
```
Aug 27 03:46:39.006 DEBG Delayed head block set_as_head_delay: Some(21.731066ms), imported_delay: Some(119.929934ms), observed_delay: Some(3.864596988s), block_delay: 4.006257988s, slot: 1931331, proposer_index: 24294, block_root: 0x937602c89d3143afa89088a44bdf4b4d0d760dad082abacb229495c048648a9e, service: beacon
```
## Issue Addressed
Resolves#2552
## Proposed Changes
Offers some improvement in inclusion distance calculation in the validator monitor.
When registering an attestation from a block, instead of doing `block.slot() - attesstation.data.slot()` to get the inclusion distance, we now pass the parent block slot from the beacon chain and do `parent_slot.saturating_sub(attestation.data.slot())`. This allows us to give best effort inclusion distance in scenarios where the attestation was included right after a skip slot. Note that this does not give accurate results in scenarios where the attestation was included few blocks after the skip slot.
In this case, if the attestation slot was `b1` and was included in block `b2` with a skip slot in between, we would get the inclusion delay as 0 (by ignoring the skip slot) which is the best effort inclusion delay.
```
b1 <- missed <- b2
```
Here, if the attestation slot was `b1` and was included in block `b3` with a skip slot and valid block `b2` in between, then we would get the inclusion delay as 2 instead of 1 (by ignoring the skip slot).
```
b1 <- missed <- b2 <- b3
```
A solution for the scenario 2 would be to count number of slots between included slot and attestation slot ignoring the skip slots in the beacon chain and pass the value to the validator monitor. But I'm concerned that it could potentially lead to db accesses for older blocks in extreme cases.
This PR also uses the validator monitor data for logging per epoch inclusion distance. This is useful as we won't get inclusion data in post-altair summaries.
Co-authored-by: Michael Sproul <micsproul@gmail.com>
## Issue Addressed
NA
## Proposed Changes
Adds the ability to verify batches of aggregated/unaggregated attestations from the network.
When the `BeaconProcessor` finds there are messages in the aggregated or unaggregated attestation queues, it will first check the length of the queue:
- `== 1` verify the attestation individually.
- `>= 2` take up to 64 of those attestations and verify them in a batch.
Notably, we only perform batch verification if the queue has a backlog. We don't apply any artificial delays to attestations to try and force them into batches.
### Batching Details
To assist with implementing batches we modify `beacon_chain::attestation_verification` to have two distinct categories for attestations:
- *Indexed* attestations: those which have passed initial validation and were valid enough for us to derive an `IndexedAttestation`.
- *Verified* attestations: those attestations which were indexed *and also* passed signature verification. These are well-formed, interesting messages which were signed by validators.
The batching functions accept `n` attestations and then return `n` attestation verification `Result`s, where those `Result`s can be any combination of `Ok` or `Err`. In other words, we attempt to verify as many attestations as possible and return specific per-attestation results so peer scores can be updated, if required.
When we batch verify attestations, we first try to map all those attestations to *indexed* attestations. If any of those attestations were able to be indexed, we then perform batch BLS verification on those indexed attestations. If the batch verification succeeds, we convert them into *verified* attestations, disabling individual signature checking. If the batch fails, we convert to verified attestations with individual signature checking enabled.
Ultimately, we optimistically try to do a batch verification of attestation signatures and fall-back to individual verification if it fails. This opens an attach vector for "poisoning" the attestations and causing us to waste a batch verification. I argue that peer scoring should do a good-enough job of defending against this and the typical-case gains massively outweigh the worst-case losses.
## Additional Info
Before this PR, attestation verification took the attestations by value (instead of by reference). It turns out that this was unnecessary and, in my opinion, resulted in some undesirable ergonomics (e.g., we had to pass the attestation back in the `Err` variant to avoid clones). In this PR I've modified attestation verification so that it now takes a reference.
I refactored the `beacon_chain/tests/attestation_verification.rs` tests so they use a builder-esque "tester" struct instead of a weird macro. It made it easier for me to test individual/batch with the same set of tests and I think it was a nice tidy-up. Notably, I did this last to try and make sure my new refactors to *actual* production code would pass under the existing test suite.
## Issue Addressed
N/A
## Proposed Changes
Add functionality in the validator monitor to provide sync committee related metrics for monitored validators.
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
NA
## Proposed Changes
Missed head votes on attestations is a well-known issue. The primary cause is a block getting set as the head *after* the attestation deadline.
This PR aims to shorten the overall time between "block received" and "block set as head" by:
1. Persisting the head and fork choice *after* setting the canonical head
- Informal measurements show this takes ~200ms
1. Pruning the op pool *after* setting the canonical head.
1. No longer persisting the op pool to disk during `BeaconChain::fork_choice`
- Informal measurements show this can take up to 1.2s.
I also add some metrics to help measure the effect of these changes.
Persistence changes like this run the risk of breaking assumptions downstream. However, I have considered these risks and I think we're fine here. I will describe my reasoning for each change.
## Reasoning
### Change 1: Persisting the head and fork choice *after* setting the canonical head
For (1), although the function is called `persist_head_and_fork_choice`, it only persists:
- Fork choice
- Head tracker
- Genesis block root
Since `BeaconChain::fork_choice_internal` does not modify these values between the original time we were persisting it and the current time, I assert that the change I've made is non-substantial in terms of what ends up on-disk. There's the possibility that some *other* thread has modified fork choice in the extra time we've given it, but that's totally fine.
Since the only time we *read* those values from disk is during startup, I assert that this has no impact during runtime.
### Change 2: Pruning the op pool after setting the canonical head
Similar to the argument above, we don't modify the op pool during `BeaconChain::fork_choice_internal` so it shouldn't matter when we prune. This change should be non-substantial.
### Change 3: No longer persisting the op pool to disk during `BeaconChain::fork_choice`
This change *is* substantial. With the proposed changes, we'll only be persisting the op pool to disk when we shut down cleanly (i.e., the `BeaconChain` gets dropped). This means we'll save disk IO and time during usual operation, but a `kill -9` or similar "crash" will probably result in an out-of-date op pool when we reboot. An out-of-date op pool can only have an impact when producing blocks or aggregate attestations/sync committees.
I think it's pretty reasonable that a crash might result in an out-of-date op pool, since:
- Crashes are fairly rare. Practically the only time I see LH suffer a full crash is when the OOM killer shows up, and that's a very serious event.
- It's generally quite rare to produce a block/aggregate immediately after a reboot. Just a few slots of runtime is probably enough to have a decent-enough op pool again.
## Additional Info
Credits to @macladson for the timings referenced here.
## Issue Addressed
NA
## Proposed Changes
When testing our (not-yet-released) Doppelganger implementation, I noticed that we aren't detecting attestations included in blocks (only those on the gossip network).
This is because during [block processing](e8c0d1f19b/beacon_node/beacon_chain/src/beacon_chain.rs (L2168)) we only update the `observed_attestations` cache with each attestation, but not the `observed_attesters` cache. This is the correct behaviour when we consider the [p2p spec](https://github.com/ethereum/eth2.0-specs/blob/v1.0.1/specs/phase0/p2p-interface.md):
> [IGNORE] There has been no other valid attestation seen on an attestation subnet that has an identical attestation.data.target.epoch and participating validator index.
We're doing the right thing here and still allowing attestations on gossip that we've seen in a block. However, this doesn't work so nicely for Doppelganger.
To resolve this, I've taken the following steps:
- Add a `observed_block_attesters` cache.
- Rename `observed_attesters` to `observed_gossip_attesters`.
## TODO
- [x] Add a test to ensure a validator that's been seen in a block attestation (but not a gossip attestation) returns `true` for `BeaconChain::validator_seen_at_epoch`.
- [x] Add a test to ensure `observed_block_attesters` isn't polluted via gossip attestations and vice versa.
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Issue Addressed
- Resolves#2169
## Proposed Changes
Adds the `AttesterCache` to allow validators to produce attestations for older slots. Presently, some arbitrary restrictions can force validators to receive an error when attesting to a slot earlier than the present one. This can cause attestation misses when there is excessive load on the validator client or time sync issues between the VC and BN.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
This PR addresses two things:
1. Allows the `ValidatorMonitor` to work with Altair states.
1. Optimizes `altair::process_epoch` (see [code](https://github.com/paulhauner/lighthouse/blob/participation-cache/consensus/state_processing/src/per_epoch_processing/altair/participation_cache.rs) for description)
## Breaking Changes
The breaking changes in this PR revolve around one premise:
*After the Altair fork, it's not longer possible (given only a `BeaconState`) to identify if a validator had *any* attestation included during some epoch. The best we can do is see if that validator made the "timely" source/target/head flags.*
Whilst this seems annoying, it's not actually too bad. Finalization is based upon "timely target" attestations, so that's really the most important thing. Although there's *some* value in knowing if a validator had *any* attestation included, it's far more important to know about "timely target" participation, since this is what affects finality and justification.
For simplicity and consistency, I've also removed the ability to determine if *any* attestation was included from metrics and API endpoints. Now, all Altair and non-Altair states will simply report on the head/target attestations.
The following section details where we've removed fields and provides replacement values.
### Breaking Changes: Prometheus Metrics
Some participation metrics have been removed and replaced. Some were removed since they are no longer relevant to Altair (e.g., total attesting balance) and others replaced with gwei values instead of pre-computed values. This provides more flexibility at display-time (e.g., Grafana).
The following metrics were added as replacements:
- `beacon_participation_prev_epoch_head_attesting_gwei_total`
- `beacon_participation_prev_epoch_target_attesting_gwei_total`
- `beacon_participation_prev_epoch_source_attesting_gwei_total`
- `beacon_participation_prev_epoch_active_gwei_total`
The following metrics were removed:
- `beacon_participation_prev_epoch_attester`
- instead use `beacon_participation_prev_epoch_source_attesting_gwei_total / beacon_participation_prev_epoch_active_gwei_total`.
- `beacon_participation_prev_epoch_target_attester`
- instead use `beacon_participation_prev_epoch_target_attesting_gwei_total / beacon_participation_prev_epoch_active_gwei_total`.
- `beacon_participation_prev_epoch_head_attester`
- instead use `beacon_participation_prev_epoch_head_attesting_gwei_total / beacon_participation_prev_epoch_active_gwei_total`.
The `beacon_participation_prev_epoch_attester` endpoint has been removed. Users should instead use the pre-existing `beacon_participation_prev_epoch_target_attester`.
### Breaking Changes: HTTP API
The `/lighthouse/validator_inclusion/{epoch}/{validator_id}` endpoint loses the following fields:
- `current_epoch_attesting_gwei` (use `current_epoch_target_attesting_gwei` instead)
- `previous_epoch_attesting_gwei` (use `previous_epoch_target_attesting_gwei` instead)
The `/lighthouse/validator_inclusion/{epoch}/{validator_id}` endpoint lose the following fields:
- `is_current_epoch_attester` (use `is_current_epoch_target_attester` instead)
- `is_previous_epoch_attester` (use `is_previous_epoch_target_attester` instead)
- `is_active_in_current_epoch` becomes `is_active_unslashed_in_current_epoch`.
- `is_active_in_previous_epoch` becomes `is_active_unslashed_in_previous_epoch`.
## Additional Info
NA
## TODO
- [x] Deal with total balances
- [x] Update validator_inclusion API
- [ ] Ensure `beacon_participation_prev_epoch_target_attester` and `beacon_participation_prev_epoch_head_attester` work before Altair
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Issue Addressed
NA
## Proposed Changes
Adds a metric to see how many set bits are in the sync aggregate for each beacon block being imported.
## Additional Info
NA
## Proposed Changes
Implement the consensus changes necessary for the upcoming Altair hard fork.
## Additional Info
This is quite a heavy refactor, with pivotal types like the `BeaconState` and `BeaconBlock` changing from structs to enums. This ripples through the whole codebase with field accesses changing to methods, e.g. `state.slot` => `state.slot()`.
Co-authored-by: realbigsean <seananderson33@gmail.com>