Handle early blocks (#2155)

## Issue Addressed

NA

## Problem this PR addresses

There's an issue where Lighthouse is banning a lot of peers due to the following sequence of events:

1. Gossip block 0xabc arrives ~200ms early
    - It is propagated across the network, with respect to [`MAXIMUM_GOSSIP_CLOCK_DISPARITY`](https://github.com/ethereum/eth2.0-specs/blob/v1.0.0/specs/phase0/p2p-interface.md#why-is-there-maximum_gossip_clock_disparity-when-validating-slot-ranges-of-messages-in-gossip-subnets).
    - However, it is not imported to our database since the block is early.
2. Attestations for 0xabc arrive, but the block was not imported.
    - The peer that sent the attestation is down-voted.
        - Each unknown-block attestation causes a score loss of 1, the peer is banned at -100.
        - When the peer is on an attestation subnet there can be hundreds of attestations, so the peer is banned quickly (before the missed block can be obtained via rpc).

## Potential solutions

I can think of three solutions to this:

1. Wait for attestation-queuing (#635) to arrive and solve this.
    - Easy
    - Not immediate fix.
    - Whilst this would work, I don't think it's a perfect solution for this particular issue, rather (3) is better.
1. Allow importing blocks with a tolerance of `MAXIMUM_GOSSIP_CLOCK_DISPARITY`.
    - Easy
    - ~~I have implemented this, for now.~~
1. If a block is verified for gossip propagation (i.e., signature verified) and it's within `MAXIMUM_GOSSIP_CLOCK_DISPARITY`, then queue it to be processed at the start of the appropriate slot.
    - More difficult
    - Feels like the best solution, I will try to implement this.
    
    
**This PR takes approach (3).**

## Changes included

- Implement the `block_delay_queue`, based upon a [`DelayQueue`](https://docs.rs/tokio-util/0.6.3/tokio_util/time/delay_queue/struct.DelayQueue.html) which can store blocks until it's time to import them.
- Add a new `DelayedImportBlock` variant to the `beacon_processor::WorkEvent` enum to handle this new event.
- In the `BeaconProcessor`, refactor a `tokio::select!` to a struct with an explicit `Stream` implementation. I experienced some issues with `tokio::select!` in the block delay queue and I also found it hard to debug. I think this explicit implementation is nicer and functionally equivalent (apart from the fact that `tokio::select!` randomly chooses futures to poll, whereas now we're deterministic).
- Add a testing framework to the `beacon_processor` module that tests this new block delay logic. I also tested a handful of other operations in the beacon processor (attns, slashings, exits) since it was super easy to copy-pasta the code from the `http_api` tester.
    - To implement these tests I added the concept of an optional `work_journal_tx` to the `BeaconProcessor` which will spit out a log of events. I used this in the tests to ensure that things were happening as I expect.
    - The tests are a little racey, but it's hard to avoid that when testing timing-based code. If we see CI failures I can revise. I haven't observed *any* failures due to races on my machine or on CI yet.
    - To assist with testing I allowed for directly setting the time on the `ManualSlotClock`.
- I gave the `beacon_processor::Worker` a `Toolbox` for two reasons; (a) it avoids changing tons of function sigs when you want to pass a new object to the worker and (b) it seemed cute.
This commit is contained in:
Paul Hauner
2021-02-24 03:08:52 +00:00
parent 399d073ab4
commit a764c3b247
19 changed files with 1161 additions and 115 deletions

View File

@@ -0,0 +1,499 @@
#![cfg(not(debug_assertions))] // Tests are too slow in debug.
#![cfg(test)]
use crate::beacon_processor::*;
use crate::{service::NetworkMessage, sync::SyncMessage};
use beacon_chain::{
test_utils::{AttestationStrategy, BeaconChainHarness, BlockStrategy, EphemeralHarnessType},
BeaconChain, MAXIMUM_GOSSIP_CLOCK_DISPARITY,
};
use discv5::enr::{CombinedKey, EnrBuilder};
use environment::{null_logger, Environment, EnvironmentBuilder};
use eth2_libp2p::{rpc::methods::MetaData, types::EnrBitfield, MessageId, NetworkGlobals, PeerId};
use slot_clock::SlotClock;
use std::cmp;
use std::iter::Iterator;
use std::sync::Arc;
use std::time::Duration;
use tokio::runtime::Runtime;
use tokio::sync::mpsc;
use types::{
test_utils::generate_deterministic_keypairs, Attestation, AttesterSlashing, MainnetEthSpec,
ProposerSlashing, SignedBeaconBlock, SignedVoluntaryExit, SubnetId,
};
type E = MainnetEthSpec;
type T = EphemeralHarnessType<E>;
const SLOTS_PER_EPOCH: u64 = 32;
const VALIDATOR_COUNT: usize = SLOTS_PER_EPOCH as usize;
const SMALL_CHAIN: u64 = 2;
const LONG_CHAIN: u64 = SLOTS_PER_EPOCH * 2;
const TCP_PORT: u16 = 42;
const UDP_PORT: u16 = 42;
const SEQ_NUMBER: u64 = 0;
/// The default time to wait for `BeaconProcessor` events.
const STANDARD_TIMEOUT: Duration = Duration::from_secs(10);
/// Provides utilities for testing the `BeaconProcessor`.
struct TestRig {
chain: Arc<BeaconChain<T>>,
next_block: SignedBeaconBlock<E>,
attestations: Vec<(Attestation<E>, SubnetId)>,
attester_slashing: AttesterSlashing<E>,
proposer_slashing: ProposerSlashing,
voluntary_exit: SignedVoluntaryExit,
beacon_processor_tx: mpsc::Sender<WorkEvent<T>>,
work_journal_rx: mpsc::Receiver<String>,
_network_rx: mpsc::UnboundedReceiver<NetworkMessage<E>>,
_sync_rx: mpsc::UnboundedReceiver<SyncMessage<E>>,
environment: Option<Environment<E>>,
}
/// This custom drop implementation ensures that we shut down the tokio runtime gracefully. Without
/// it, tests will hang indefinitely.
impl Drop for TestRig {
fn drop(&mut self) {
// Causes the beacon processor to shutdown.
self.beacon_processor_tx = mpsc::channel(MAX_WORK_EVENT_QUEUE_LEN).0;
self.environment.take().unwrap().shutdown_on_idle();
}
}
impl TestRig {
pub fn new(chain_length: u64) -> Self {
let mut harness = BeaconChainHarness::new(
MainnetEthSpec,
generate_deterministic_keypairs(VALIDATOR_COUNT),
);
harness.advance_slot();
for _ in 0..chain_length {
harness.extend_chain(
1,
BlockStrategy::OnCanonicalHead,
AttestationStrategy::AllValidators,
);
harness.advance_slot();
}
let head = harness.chain.head().unwrap();
assert_eq!(
harness.chain.slot().unwrap(),
head.beacon_block.slot() + 1,
"precondition: current slot is one after head"
);
let (next_block, _next_state) =
harness.make_block(head.beacon_state.clone(), harness.chain.slot().unwrap());
let attestations = harness
.get_unaggregated_attestations(
&AttestationStrategy::AllValidators,
&head.beacon_state,
head.beacon_block_root,
harness.chain.slot().unwrap(),
)
.into_iter()
// .map(|vec| vec.into_iter().map(|(attestation, _subnet_id)| attestation))
.flatten()
.collect::<Vec<_>>();
assert!(
!attestations.is_empty(),
"precondition: attestations for testing"
);
let attester_slashing = harness.make_attester_slashing(vec![0, 1]);
let proposer_slashing = harness.make_proposer_slashing(2);
let voluntary_exit = harness.make_voluntary_exit(3, harness.chain.epoch().unwrap());
// Changing this *after* the chain has been initialized is a bit cheeky, but it shouldn't
// cause issue.
//
// This allows for testing voluntary exits without building out a massive chain.
harness.chain.spec.shard_committee_period = 2;
let chain = Arc::new(harness.chain);
let (network_tx, _network_rx) = mpsc::unbounded_channel();
let log = null_logger().unwrap();
let (beacon_processor_tx, beacon_processor_rx) = mpsc::channel(MAX_WORK_EVENT_QUEUE_LEN);
let (sync_tx, _sync_rx) = mpsc::unbounded_channel();
// Default metadata
let meta_data = MetaData {
seq_number: SEQ_NUMBER,
attnets: EnrBitfield::<MainnetEthSpec>::default(),
};
let enr_key = CombinedKey::generate_secp256k1();
let enr = EnrBuilder::new("v4").build(&enr_key).unwrap();
let network_globals = Arc::new(NetworkGlobals::new(
enr,
TCP_PORT,
UDP_PORT,
meta_data,
vec![],
&log,
));
let mut environment = EnvironmentBuilder::mainnet()
.null_logger()
.unwrap()
.multi_threaded_tokio_runtime()
.unwrap()
.build()
.unwrap();
let executor = environment.core_context().executor;
let (work_journal_tx, work_journal_rx) = mpsc::channel(16_364);
BeaconProcessor {
beacon_chain: Arc::downgrade(&chain),
network_tx,
sync_tx,
network_globals,
executor,
max_workers: cmp::max(1, num_cpus::get()),
current_workers: 0,
log: log.clone(),
}
.spawn_manager(beacon_processor_rx, Some(work_journal_tx));
Self {
chain,
next_block,
attestations,
attester_slashing,
proposer_slashing,
voluntary_exit,
beacon_processor_tx,
work_journal_rx,
_network_rx,
_sync_rx,
environment: Some(environment),
}
}
pub fn enqueue_gossip_block(&self) {
self.beacon_processor_tx
.try_send(WorkEvent::gossip_beacon_block(
junk_message_id(),
junk_peer_id(),
Box::new(self.next_block.clone()),
Duration::from_secs(0),
))
.unwrap();
}
pub fn enqueue_unaggregated_attestation(&self) {
let (attestation, subnet_id) = self.attestations.first().unwrap().clone();
self.beacon_processor_tx
.try_send(WorkEvent::unaggregated_attestation(
junk_message_id(),
junk_peer_id(),
attestation,
subnet_id,
true,
Duration::from_secs(0),
))
.unwrap();
}
pub fn enqueue_gossip_attester_slashing(&self) {
self.beacon_processor_tx
.try_send(WorkEvent::gossip_attester_slashing(
junk_message_id(),
junk_peer_id(),
Box::new(self.attester_slashing.clone()),
))
.unwrap();
}
pub fn enqueue_gossip_proposer_slashing(&self) {
self.beacon_processor_tx
.try_send(WorkEvent::gossip_proposer_slashing(
junk_message_id(),
junk_peer_id(),
Box::new(self.proposer_slashing.clone()),
))
.unwrap();
}
pub fn enqueue_gossip_voluntary_exit(&self) {
self.beacon_processor_tx
.try_send(WorkEvent::gossip_voluntary_exit(
junk_message_id(),
junk_peer_id(),
Box::new(self.voluntary_exit.clone()),
))
.unwrap();
}
fn runtime(&mut self) -> Arc<Runtime> {
self.environment
.as_mut()
.unwrap()
.core_context()
.executor
.runtime()
.upgrade()
.unwrap()
}
/// Assert that the `BeaconProcessor` doesn't produce any events in the given `duration`.
pub fn assert_no_events_for(&mut self, duration: Duration) {
self.runtime().block_on(async {
tokio::select! {
_ = tokio::time::sleep(duration) => (),
event = self.work_journal_rx.recv() => panic!(
"received {:?} within {:?} when expecting no events",
event,
duration
),
}
})
}
/// Assert that the `BeaconProcessor` event journal is as `expected`.
///
/// ## Note
///
/// We won't attempt to listen for any more than `expected.len()` events. As such, it makes sense
/// to use the `NOTHING_TO_DO` event to ensure that execution has completed.
pub fn assert_event_journal(&mut self, expected: &[&str]) {
let events = self.runtime().block_on(async {
let mut events = vec![];
let drain_future = async {
loop {
match self.work_journal_rx.recv().await {
Some(event) => {
events.push(event);
// Break as soon as we collect the desired number of events.
if events.len() >= expected.len() {
break;
}
}
None => break,
}
}
};
// Drain the expected number of events from the channel, or time out and give up.
tokio::select! {
_ = tokio::time::sleep(STANDARD_TIMEOUT) => panic!(
"timeout ({:?}) expired waiting for events. expected {:?} but got {:?}",
STANDARD_TIMEOUT,
expected,
events
),
_ = drain_future => {},
}
events
});
assert_eq!(
events,
expected
.into_iter()
.map(|s| s.to_string())
.collect::<Vec<_>>()
);
}
}
fn junk_peer_id() -> PeerId {
PeerId::random()
}
fn junk_message_id() -> MessageId {
MessageId::new(&[])
}
/// Blocks that arrive early should be queued for later processing.
#[test]
fn import_gossip_block_acceptably_early() {
let mut rig = TestRig::new(SMALL_CHAIN);
let slot_start = rig
.chain
.slot_clock
.start_of(rig.next_block.slot())
.unwrap();
rig.chain
.slot_clock
.set_current_time(slot_start - MAXIMUM_GOSSIP_CLOCK_DISPARITY);
assert_eq!(
rig.chain.slot().unwrap(),
rig.next_block.slot() - 1,
"chain should be at the correct slot"
);
rig.enqueue_gossip_block();
rig.assert_event_journal(&[GOSSIP_BLOCK, WORKER_FREED, NOTHING_TO_DO]);
// Note: this section of the code is a bit race-y. We're assuming that we can set the slot clock
// and check the head in the time between the block arrived early and when its due for
// processing.
//
// If this causes issues we might be able to make the block delay queue add a longer delay for
// processing, instead of just MAXIMUM_GOSSIP_CLOCK_DISPARITY. Speak to @paulhauner if this test
// starts failing.
rig.chain.slot_clock.set_slot(rig.next_block.slot().into());
assert!(
rig.chain.head().unwrap().beacon_block_root != rig.next_block.canonical_root(),
"block not yet imported"
);
rig.assert_event_journal(&[DELAYED_IMPORT_BLOCK, WORKER_FREED, NOTHING_TO_DO]);
assert_eq!(
rig.chain.head().unwrap().beacon_block_root,
rig.next_block.canonical_root(),
"block should be imported and become head"
);
}
/// Blocks that are *too* early shouldn't get into the delay queue.
#[test]
fn import_gossip_block_unacceptably_early() {
let mut rig = TestRig::new(SMALL_CHAIN);
let slot_start = rig
.chain
.slot_clock
.start_of(rig.next_block.slot())
.unwrap();
rig.chain
.slot_clock
.set_current_time(slot_start - MAXIMUM_GOSSIP_CLOCK_DISPARITY - Duration::from_millis(1));
assert_eq!(
rig.chain.slot().unwrap(),
rig.next_block.slot() - 1,
"chain should be at the correct slot"
);
rig.enqueue_gossip_block();
rig.assert_event_journal(&[GOSSIP_BLOCK, WORKER_FREED, NOTHING_TO_DO]);
// Waiting for 5 seconds is a bit arbtirary, however it *should* be long enough to ensure the
// block isn't imported.
rig.assert_no_events_for(Duration::from_secs(5));
assert!(
rig.chain.head().unwrap().beacon_block_root != rig.next_block.canonical_root(),
"block should not be imported"
);
}
/// Blocks that arrive on-time should be processed normally.
#[test]
fn import_gossip_block_at_current_slot() {
let mut rig = TestRig::new(SMALL_CHAIN);
assert_eq!(
rig.chain.slot().unwrap(),
rig.next_block.slot(),
"chain should be at the correct slot"
);
rig.enqueue_gossip_block();
rig.assert_event_journal(&[GOSSIP_BLOCK, WORKER_FREED, NOTHING_TO_DO]);
assert_eq!(
rig.chain.head().unwrap().beacon_block_root,
rig.next_block.canonical_root(),
"block should be imported and become head"
);
}
/// Ensure a valid attestation can be imported.
#[test]
fn import_gossip_attestation() {
let mut rig = TestRig::new(SMALL_CHAIN);
let initial_attns = rig.chain.naive_aggregation_pool.read().num_attestations();
rig.enqueue_unaggregated_attestation();
rig.assert_event_journal(&[GOSSIP_ATTESTATION, WORKER_FREED, NOTHING_TO_DO]);
assert_eq!(
rig.chain.naive_aggregation_pool.read().num_attestations(),
initial_attns + 1,
"op pool should have one more attestation"
);
}
/// Ensure a bunch of valid operations can be imported.
#[test]
fn import_misc_gossip_ops() {
// Exits need the long chain so validators aren't too young to exit.
let mut rig = TestRig::new(LONG_CHAIN);
/*
* Attester slashing
*/
let initial_attester_slashings = rig.chain.op_pool.num_attester_slashings();
rig.enqueue_gossip_attester_slashing();
rig.assert_event_journal(&[GOSSIP_ATTESTER_SLASHING, WORKER_FREED, NOTHING_TO_DO]);
assert_eq!(
rig.chain.op_pool.num_attester_slashings(),
initial_attester_slashings + 1,
"op pool should have one more attester slashing"
);
/*
* Proposer slashing
*/
let initial_proposer_slashings = rig.chain.op_pool.num_proposer_slashings();
rig.enqueue_gossip_proposer_slashing();
rig.assert_event_journal(&[GOSSIP_PROPOSER_SLASHING, WORKER_FREED, NOTHING_TO_DO]);
assert_eq!(
rig.chain.op_pool.num_proposer_slashings(),
initial_proposer_slashings + 1,
"op pool should have one more proposer slashing"
);
/*
* Voluntary exit
*/
let initial_voluntary_exits = rig.chain.op_pool.num_voluntary_exits();
rig.enqueue_gossip_voluntary_exit();
rig.assert_event_journal(&[GOSSIP_VOLUNTARY_EXIT, WORKER_FREED, NOTHING_TO_DO]);
assert_eq!(
rig.chain.op_pool.num_voluntary_exits(),
initial_voluntary_exits + 1,
"op pool should have one more exit"
);
}