Use async code when interacting with EL (#3244)

## Overview

This rather extensive PR achieves two primary goals:

1. Uses the finalized/justified checkpoints of fork choice (FC), rather than those of the head state.
2. Refactors fork choice, block production and block processing to `async` functions.

Additionally, it achieves:

- Concurrent forkchoice updates to the EL and cache pruning after a new head is selected.
- Concurrent "block packing" (attestations, etc.) and execution payload retrieval during block production (see the sketch after this list).
- Concurrent per-block-processing and execution payload verification during block processing.
- The `Arc`-ification of `SignedBeaconBlock` during block processing (it's never mutated, so why not?):
    - I had to do this to deal with sending blocks into spawned tasks.
    - Previously we were cloning the beacon block at least twice during each block processing; these clones are now either removed or turned into cheaper `Arc` clones.
    - We were also `Box`-ing and un-`Box`-ing beacon blocks as they moved throughout the networking crate. This is not a big deal, but it's nice to avoid shifting things between the stack and heap.
    - Avoids cloning *all the blocks* in *every chain segment* during sync.
    - It also has the potential to clean up our code where we need to pass an *owned* block around so we can send it back in the case of an error (I didn't do much of this; my PR is already big enough 😅).
- The `BeaconChain::HeadSafetyStatus` struct was removed. It was a relic of prior merge specs.
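
To make the concurrency bullets concrete, here is a minimal sketch of the pattern using `tokio::join!`. The task names are invented stand-ins, not the PR's actual functions:

```rust
async fn get_execution_payload() -> u64 {
    1 // stand-in for the EL round trip
}

async fn pack_attestations() -> u64 {
    2 // stand-in for "block packing"
}

#[tokio::main]
async fn main() {
    // Both futures make progress concurrently; the EL call no longer
    // serializes behind block packing (or vice versa).
    let (payload, attestations) = tokio::join!(get_execution_payload(), pack_attestations());
    println!("payload={payload} attestations={attestations}");
}
```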

For motivation for this change, see https://github.com/sigp/lighthouse/pull/3244#issuecomment-1160963273

## Changes to `canonical_head` and `fork_choice`

Previously, the `BeaconChain` had two separate fields:

```
canonical_head: RwLock<Snapshot>,
fork_choice: RwLock<BeaconForkChoice>
```

Now, we have grouped these values under a single struct:

```
canonical_head: CanonicalHead {
  cached_head: RwLock<Arc<Snapshot>>,
  fork_choice: RwLock<BeaconForkChoice>
} 
```

Apart from ergonomics, the only *actual* change here is wrapping the canonical head snapshot in an `Arc`. This means that we no longer need to hold the `cached_head` (`canonical_head`, in old terms) lock when we want to pull some values from it. This was done to avoid deadlock risks by preventing functions from acquiring (and holding) the `cached_head` and `fork_choice` locks simultaneously.
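
For illustration, a self-contained sketch of the pattern using std's `RwLock` and a stand-in `Snapshot` (not Lighthouse's actual types):

```rust
use std::sync::{Arc, RwLock};

struct Snapshot {
    beacon_block_root: u64, // stand-in for a `Hash256`
}

fn main() {
    let cached_head: RwLock<Arc<Snapshot>> =
        RwLock::new(Arc::new(Snapshot { beacon_block_root: 42 }));

    // Clone the `Arc` and release the guard immediately; the snapshot stays
    // alive without us holding the lock.
    let head = Arc::clone(&cached_head.read().unwrap());

    // Slow work (or taking the `fork_choice` lock) can now happen here
    // without `cached_head` being held, removing that deadlock ordering.
    println!("head root: {}", head.beacon_block_root);
}
```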

## Breaking Changes

### The `state` (root) field in the `finalized_checkpoint` SSE event

Consider the scenario where epoch `n` is just finalized, but `start_slot(n)` is skipped. There are two state roots we might use in the `finalized_checkpoint` SSE event:

1. The state root of the finalized block, which is `get_block(finalized_checkpoint.root).state_root`.
2. The state root at `start_slot(n)`, which would be the state from (1), but "skipped forward" through any skip slots.

Previously, Lighthouse would choose (2). However, we can see that when [Teku generates that event](de2b2801c8/data/beaconrestapi/src/main/java/tech/pegasys/teku/beaconrestapi/handlers/v1/events/EventSubscriptionManager.java (L171-L182)) it uses [`getStateRootFromBlockRoot`](de2b2801c8/data/provider/src/main/java/tech/pegasys/teku/api/ChainDataProvider.java (L336-L341)) which uses (1).

I have switched Lighthouse from (2) to (1). I think it's a somewhat arbitrary choice between the two, where (1) is easier to compute and is consistent with Teku.
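
Schematically, with hypothetical helpers (not the real Lighthouse API):

```rust
struct Block {
    state_root: u64,
}

fn get_block(_root: u64) -> Block {
    Block { state_root: 0xaa } // stand-in lookup by block root
}

fn state_root_at_slot(_slot: u64) -> u64 {
    0xbb // (1)'s state, advanced through the skipped slots
}

fn main() {
    let finalized_root = 0x01;
    let start_slot_of_n = 64;

    // (1): the state root of the finalized block itself (now used; matches Teku).
    let option_1 = get_block(finalized_root).state_root;
    // (2): the state root at `start_slot(n)` (the old Lighthouse behaviour).
    let option_2 = state_root_at_slot(start_slot_of_n);

    // They differ precisely when `start_slot(n)` is skipped.
    assert_ne!(option_1, option_2);
}
```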

## Notes for Reviewers

I've renamed `BeaconChain::fork_choice` to `BeaconChain::recompute_head`. Doing this helped ensure I broke all previous uses of fork choice, and I also find it more descriptive: it describes an action and can't be confused with trying to get a reference to the `ForkChoice` struct.

I've changed the ordering of SSE events when a block is received. It used to be `[block, finalized, head]` and now it's `[block, head, finalized]`. It was easier this way and I don't think we were making any promises about SSE event ordering so it's not "breaking".

I've made it so fork choice will run when it's first constructed. I did this because I wanted to have a cached version of the last call to `get_head`. Ensuring `get_head` has been run *at least once* means the cached values don't need to be wrapped in an `Option`. This was fairly simple; it just involved passing a `slot` to the constructor so it knows *when* it's being run. When loading fork choice from the store and a slot clock isn't handy, I've just used the `slot` that was saved in the `fork_choice_store`. That seems like a faithful representation of the slot at which we saved it.
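
A minimal sketch of the idea, with invented names:

```rust
struct ForkChoice {
    cached_head: u64, // always populated: no `Option` needed
}

impl ForkChoice {
    fn new(current_slot: u64) -> Self {
        let mut fc = Self { cached_head: 0 };
        // Run `get_head` at least once, so the cache is valid from birth.
        let head = fc.get_head(current_slot);
        fc.cached_head = head;
        fc
    }

    fn get_head(&mut self, _slot: u64) -> u64 {
        42 // stand-in for the real head computation
    }
}

fn main() {
    let fc = ForkChoice::new(0);
    println!("cached head from construction: {}", fc.cached_head);
}
```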

I added the `genesis_time: u64` to the `BeaconChain`. It's small, constant and nice to have around.
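
For example, the kind of small helper it enables (a sketch, not code from this PR):

```rust
// Hypothetical helper: derive the current slot from the cached `genesis_time`.
fn slot_at(genesis_time: u64, now_unix: u64, seconds_per_slot: u64) -> u64 {
    now_unix.saturating_sub(genesis_time) / seconds_per_slot
}

fn main() {
    // Mainnet genesis time, 36 seconds in, 12-second slots => slot 3.
    assert_eq!(slot_at(1_606_824_023, 1_606_824_023 + 36, 12), 3);
}
```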

Since we're using FC for the finalized/justified checkpoints, we no longer get the `0x00..00` roots at genesis. You can see I had to remove a work-around in `ef-tests` here: b56be3bc2. I can't find any reason why this would be an issue; if anything, I think it'll be better, since the genesis alias has caught us out a few times (`0x00..00` isn't actually a real root). Edit: I did find a case where the `network` expected the `0x00..00` alias and patched it here: 3f26ac3e2.

You'll notice a lot of changes in tests. Generally, tests should be functionally equivalent. Here are the things creating the most diff-noise in tests:
- Changing tests to be async `tokio` tests.
- Adding `.await` to fork choice, block processing and block production functions.
- Refactor of the `canonical_head` "API" provided by the `BeaconChain`. E.g., `chain.canonical_head.cached_head()` instead of `chain.canonical_head.read()`.
- Wrapping `SignedBeaconBlock` in an `Arc`.
- In the `beacon_chain/tests/block_verification` tests, we can't use the `lazy_static` `CHAIN_SEGMENT` variable anymore, since it's generated with an async function. We just generate it in each test (see the sketch after this list); less efficient, but hopefully insignificant.
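
Here's a self-contained sketch of that constraint, with a stand-in segment type rather than the real `CHAIN_SEGMENT`: a `lazy_static!` initializer is a plain (non-async) block, so it can't `.await`, and each test builds the segment itself instead.

```rust
// Stand-in for generating a chain segment; the real one is an async fn,
// which a `lazy_static!` initializer cannot `.await`.
async fn build_segment() -> Vec<u64> {
    (0..8).collect()
}

#[tokio::test]
async fn processes_segment() {
    let segment = build_segment().await; // regenerated in each test
    assert_eq!(segment.len(), 8);
}
```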

I had to disable the `rayon`-based concurrent tests in the `fork_choice` tests, since the combination of `rayon` and `block_on` was causing a panic.
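
The failure mode is the classic nested-`block_on` panic; whether this exact shape is what the tests hit is my assumption, but `tokio` does panic if you `block_on` from inside an async execution context:

```rust
use tokio::runtime::Handle;

#[tokio::main]
async fn main() {
    // Panics: `Handle::block_on` cannot be called from within an async
    // execution context. `rayon` can run closures inline on the calling
    // thread, which is one way test code can end up doing exactly this.
    Handle::current().block_on(async {});
}
```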

Co-authored-by: Mac L <mjladson@pm.me>
Commit be4e261e74 (parent e5212f1320), authored by Paul Hauner on 2022-07-03 05:36:50 +00:00.
106 changed files with 6515 additions and 4538 deletions.

What follows is an excerpt of the diff to the beacon processor module:


```diff
@@ -52,6 +52,7 @@ use lighthouse_network::{
 use logging::TimeLatch;
 use slog::{crit, debug, error, trace, warn, Logger};
 use std::collections::VecDeque;
+use std::future::Future;
 use std::pin::Pin;
 use std::sync::{Arc, Weak};
 use std::task::Context;
@@ -386,7 +387,7 @@ impl<T: BeaconChainTypes> WorkEvent<T> {
         message_id: MessageId,
         peer_id: PeerId,
         peer_client: Client,
-        block: Box<SignedBeaconBlock<T::EthSpec>>,
+        block: Arc<SignedBeaconBlock<T::EthSpec>>,
         seen_timestamp: Duration,
     ) -> Self {
         Self {
@@ -490,7 +491,7 @@ impl<T: BeaconChainTypes> WorkEvent<T> {
     /// Create a new `Work` event for some block, where the result from computation (if any) is
     /// sent to the other side of `result_tx`.
     pub fn rpc_beacon_block(
-        block: Box<SignedBeaconBlock<T::EthSpec>>,
+        block: Arc<SignedBeaconBlock<T::EthSpec>>,
         seen_timestamp: Duration,
         process_type: BlockProcessType,
     ) -> Self {
@@ -507,7 +508,7 @@ impl<T: BeaconChainTypes> WorkEvent<T> {
     /// Create a new work event to import `blocks` as a beacon chain segment.
     pub fn chain_segment(
         process_id: ChainSegmentProcessId,
-        blocks: Vec<SignedBeaconBlock<T::EthSpec>>,
+        blocks: Vec<Arc<SignedBeaconBlock<T::EthSpec>>>,
     ) -> Self {
         Self {
             drop_during_sync: false,
@@ -654,7 +655,7 @@ pub enum Work<T: BeaconChainTypes> {
         message_id: MessageId,
         peer_id: PeerId,
         peer_client: Client,
-        block: Box<SignedBeaconBlock<T::EthSpec>>,
+        block: Arc<SignedBeaconBlock<T::EthSpec>>,
         seen_timestamp: Duration,
     },
     DelayedImportBlock {
@@ -691,13 +692,13 @@ pub enum Work<T: BeaconChainTypes> {
         seen_timestamp: Duration,
     },
     RpcBlock {
-        block: Box<SignedBeaconBlock<T::EthSpec>>,
+        block: Arc<SignedBeaconBlock<T::EthSpec>>,
         seen_timestamp: Duration,
         process_type: BlockProcessType,
     },
     ChainSegment {
         process_id: ChainSegmentProcessId,
-        blocks: Vec<SignedBeaconBlock<T::EthSpec>>,
+        blocks: Vec<Arc<SignedBeaconBlock<T::EthSpec>>>,
     },
     Status {
         peer_id: PeerId,
```
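
These `Box` → `Arc` swaps are what make the chain-segment point above cheap: cloning an `Arc` bumps a refcount instead of copying the block. A self-contained analog with a stand-in payload:

```rust
use std::sync::Arc;

fn main() {
    let block = Arc::new(vec![0u8; 1_000_000]); // stand-in for a large block

    // What `Vec<SignedBeaconBlock>` forced during sync: a deep copy.
    let _deep = (*block).clone(); // copies ~1 MB

    // What `Vec<Arc<SignedBeaconBlock>>` allows: a pointer copy.
    let _cheap = Arc::clone(&block);
    assert_eq!(Arc::strong_count(&block), 2);
}
```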
```diff
@@ -1307,15 +1308,6 @@ impl<T: BeaconChainTypes> BeaconProcessor<T> {
         let idle_tx = toolbox.idle_tx;
         let work_reprocessing_tx = toolbox.work_reprocessing_tx;
 
-        // Wrap the `idle_tx` in a struct that will fire the idle message whenever it is dropped.
-        //
-        // This helps ensure that the worker is always freed in the case of an early exit or panic.
-        // As such, this instantiation should happen as early in the function as possible.
-        let send_idle_on_drop = SendOnDrop {
-            tx: idle_tx,
-            log: self.log.clone(),
-        };
-
         let work_id = work.str_id();
         let worker_timer =
             metrics::start_timer_vec(&metrics::BEACON_PROCESSOR_WORKER_TIME, &[work_id]);
@@ -1325,6 +1317,16 @@ impl<T: BeaconChainTypes> BeaconProcessor<T> {
             &[work.str_id()],
         );
 
+        // Wrap the `idle_tx` in a struct that will fire the idle message whenever it is dropped.
+        //
+        // This helps ensure that the worker is always freed in the case of an early exit or panic.
+        // As such, this instantiation should happen as early in the function as possible.
+        let send_idle_on_drop = SendOnDrop {
+            tx: idle_tx,
+            _worker_timer: worker_timer,
+            log: self.log.clone(),
+        };
+
         let worker_id = self.current_workers;
         self.current_workers = self.current_workers.saturating_add(1);
@@ -1338,7 +1340,6 @@ impl<T: BeaconChainTypes> BeaconProcessor<T> {
             return;
         };
 
-        let log = self.log.clone();
         let executor = self.executor.clone();
         let worker = Worker {
```
The final large hunk (`@@ -1357,252 +1358,308 @@`) rewrites the worker dispatch. Previously, every `Work` variant was handled inside a single `executor.spawn_blocking` closure, which captured the `worker_timer`, ran the `match`, emitted a `"Beacon processor worker done"` `trace!`, and then explicitly dropped `send_idle_on_drop` (with a comment warning that dropping it early would over-spawn workers). Now each arm hands its work to a `TaskSpawner`: blocking work goes through `spawn_blocking`, `async` work through `spawn_async`, and the blocks-by-range/blocks-by-roots handlers (which manage the `SendOnDrop` themselves) through `spawn_blocking_with_manual_send_idle`. The rewritten dispatch:

```rust
        let task_spawner = TaskSpawner {
            executor: executor.clone(),
            send_idle_on_drop,
        };

        let sub_executor = executor;
        match work {
            /*
             * Individual unaggregated attestation verification.
             */
            Work::GossipAttestation {
                message_id,
                peer_id,
                attestation,
                subnet_id,
                should_import,
                seen_timestamp,
            } => task_spawner.spawn_blocking(move || {
                worker.process_gossip_attestation(
                    message_id,
                    peer_id,
                    attestation,
                    subnet_id,
                    should_import,
                    Some(work_reprocessing_tx),
                    seen_timestamp,
                )
            }),
            /*
             * Batched unaggregated attestation verification.
             */
            Work::GossipAttestationBatch { packages } => task_spawner.spawn_blocking(|| {
                worker.process_gossip_attestation_batch(packages, Some(work_reprocessing_tx))
            }),
            /*
             * Individual aggregated attestation verification.
             */
            Work::GossipAggregate {
                message_id,
                peer_id,
                aggregate,
                seen_timestamp,
            } => task_spawner.spawn_blocking(move || {
                worker.process_gossip_aggregate(
                    message_id,
                    peer_id,
                    aggregate,
                    Some(work_reprocessing_tx),
                    seen_timestamp,
                )
            }),
            /*
             * Batched aggregated attestation verification.
             */
            Work::GossipAggregateBatch { packages } => task_spawner.spawn_blocking(|| {
                worker.process_gossip_aggregate_batch(packages, Some(work_reprocessing_tx))
            }),
            /*
             * Verification for beacon blocks received on gossip.
             */
            Work::GossipBlock {
                message_id,
                peer_id,
                peer_client,
                block,
                seen_timestamp,
            } => task_spawner.spawn_async(async move {
                worker
                    .process_gossip_block(
                        message_id,
                        peer_id,
                        peer_client,
                        block,
                        work_reprocessing_tx,
                        duplicate_cache,
                        seen_timestamp,
                    )
                    .await
            }),
            /*
             * Import for blocks that we received earlier than their intended slot.
             */
            Work::DelayedImportBlock {
                peer_id,
                block,
                seen_timestamp,
            } => task_spawner.spawn_async(worker.process_gossip_verified_block(
                peer_id,
                *block,
                work_reprocessing_tx,
                seen_timestamp,
            )),
            /*
             * Voluntary exits received on gossip.
             */
            Work::GossipVoluntaryExit {
                message_id,
                peer_id,
                voluntary_exit,
            } => task_spawner.spawn_blocking(move || {
                worker.process_gossip_voluntary_exit(message_id, peer_id, *voluntary_exit)
            }),
            /*
             * Proposer slashings received on gossip.
             */
            Work::GossipProposerSlashing {
                message_id,
                peer_id,
                proposer_slashing,
            } => task_spawner.spawn_blocking(move || {
                worker.process_gossip_proposer_slashing(message_id, peer_id, *proposer_slashing)
            }),
            /*
             * Attester slashings received on gossip.
             */
            Work::GossipAttesterSlashing {
                message_id,
                peer_id,
                attester_slashing,
            } => task_spawner.spawn_blocking(move || {
                worker.process_gossip_attester_slashing(message_id, peer_id, *attester_slashing)
            }),
            /*
             * Sync committee message verification.
             */
            Work::GossipSyncSignature {
                message_id,
                peer_id,
                sync_signature,
                subnet_id,
                seen_timestamp,
            } => task_spawner.spawn_blocking(move || {
                worker.process_gossip_sync_committee_signature(
                    message_id,
                    peer_id,
                    *sync_signature,
                    subnet_id,
                    seen_timestamp,
                )
            }),
            /*
             * Sync contribution verification.
             */
            Work::GossipSyncContribution {
                message_id,
                peer_id,
                sync_contribution,
                seen_timestamp,
            } => task_spawner.spawn_blocking(move || {
                worker.process_sync_committee_contribution(
                    message_id,
                    peer_id,
                    *sync_contribution,
                    seen_timestamp,
                )
            }),
            /*
             * Verification for beacon blocks received during syncing via RPC.
             */
            Work::RpcBlock {
                block,
                seen_timestamp,
                process_type,
            } => task_spawner.spawn_async(worker.process_rpc_block(
                block,
                seen_timestamp,
                process_type,
                work_reprocessing_tx,
                duplicate_cache,
            )),
            /*
             * Verification for a chain segment (multiple blocks).
             */
            Work::ChainSegment { process_id, blocks } => task_spawner
                .spawn_async(async move { worker.process_chain_segment(process_id, blocks).await }),
            /*
             * Processing of Status Messages.
             */
            Work::Status { peer_id, message } => {
                task_spawner.spawn_blocking(move || worker.process_status(peer_id, message))
            }
            /*
             * Processing of range syncing requests from other peers.
             */
            Work::BlocksByRangeRequest {
                peer_id,
                request_id,
                request,
            } => task_spawner.spawn_blocking_with_manual_send_idle(move |send_idle_on_drop| {
                worker.handle_blocks_by_range_request(
                    sub_executor,
                    send_idle_on_drop,
                    peer_id,
                    request_id,
                    request,
                )
            }),
            /*
             * Processing of blocks by roots requests from other peers.
             */
            Work::BlocksByRootsRequest {
                peer_id,
                request_id,
                request,
            } => task_spawner.spawn_blocking_with_manual_send_idle(move |send_idle_on_drop| {
                worker.handle_blocks_by_root_request(
                    sub_executor,
                    send_idle_on_drop,
                    peer_id,
                    request_id,
                    request,
                )
            }),
            Work::UnknownBlockAttestation {
                message_id,
                peer_id,
                attestation,
                subnet_id,
                should_import,
                seen_timestamp,
            } => task_spawner.spawn_blocking(move || {
                worker.process_gossip_attestation(
                    message_id,
                    peer_id,
                    attestation,
                    subnet_id,
                    should_import,
                    None, // Do not allow this attestation to be re-processed beyond this point.
                    seen_timestamp,
                )
            }),
            Work::UnknownBlockAggregate {
                message_id,
                peer_id,
                aggregate,
                seen_timestamp,
            } => task_spawner.spawn_blocking(move || {
                worker.process_gossip_aggregate(
                    message_id,
                    peer_id,
                    aggregate,
                    None,
                    seen_timestamp,
                )
            }),
        };
```

And the new `TaskSpawner`, which takes over responsibility for dropping the `SendOnDrop` (and hence freeing the worker) once the task completes:

```rust
/// Spawns tasks that are either:
///
/// - Blocking (i.e. intensive methods that shouldn't run on the core `tokio` executor)
/// - Async (i.e. `async` methods)
///
/// Takes a `SendOnDrop` and ensures it is dropped after the task completes. This frees the beacon
/// processor worker so a new task can be started.
struct TaskSpawner {
    executor: TaskExecutor,
    send_idle_on_drop: SendOnDrop,
}

impl TaskSpawner {
    /// Spawn an async task, dropping the `SendOnDrop` after the task has completed.
    fn spawn_async(self, task: impl Future<Output = ()> + Send + 'static) {
        self.executor.spawn(
            async {
                task.await;
                drop(self.send_idle_on_drop)
            },
            WORKER_TASK_NAME,
        )
    }

    /// Spawn a blocking task, dropping the `SendOnDrop` after the task has completed.
    fn spawn_blocking<F>(self, task: F)
    where
        F: FnOnce() + Send + 'static,
    {
        self.executor.spawn_blocking(
            || {
                task();
                drop(self.send_idle_on_drop)
            },
            WORKER_TASK_NAME,
        )
    }

    /// Spawn a blocking task, passing the `SendOnDrop` into the task.
    ///
    /// ## Notes
    ///
    /// Users must ensure the `SendOnDrop` is dropped at the appropriate time!
    pub fn spawn_blocking_with_manual_send_idle<F>(self, task: F)
    where
        F: FnOnce(SendOnDrop) + Send + 'static,
    {
        self.executor.spawn_blocking(
            || {
                task(self.send_idle_on_drop);
            },
            WORKER_TASK_NAME,
        )
    }
}
```
Finally, `SendOnDrop` gains the `_worker_timer` field seen above:

```diff
@@ -1618,6 +1675,8 @@ impl<T: BeaconChainTypes> BeaconProcessor<T> {
 /// https://doc.rust-lang.org/std/ops/trait.Drop.html#panics
 pub struct SendOnDrop {
     tx: mpsc::Sender<()>,
+    // The field is unused, but it's here to ensure the timer is dropped once the task has finished.
+    _worker_timer: Option<metrics::HistogramTimer>,
     log: Logger,
 }
```
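
As an aside, here is a self-contained analog of the `SendOnDrop` idea using only `std` types (not Lighthouse's actual struct): because `Drop` runs on early return and during panic unwinding, the idle message fires no matter how the worker task ends.

```rust
use std::sync::mpsc;

struct SendOnDrop {
    tx: mpsc::Sender<()>,
}

impl Drop for SendOnDrop {
    fn drop(&mut self) {
        // Fires when the guard goes out of scope: normal completion, early
        // return, or panic unwind all free the worker slot.
        let _ = self.tx.send(());
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    {
        let _send_idle_on_drop = SendOnDrop { tx };
        // ... worker does its thing ...
    } // guard dropped here; the idle message is sent
    assert!(rx.recv().is_ok());
}
```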