Pause sync when EE is offline (#3428)

## Issue Addressed

#3032

## Proposed Changes

Pause sync when the EE is offline. The changes have three main parts:
- Online/offline notification system
- Pause sync
- Resume sync

#### Online/offline notification system
- The engine state is now guarded behind a new struct `State` that ensures every change is correctly notified. Notifications are only sent if the state changes. The new `State` is behind a `RwLock` (as before) as the synchronization mechanism.
- The actual notification channel is a [tokio::sync::watch](https://docs.rs/tokio/latest/tokio/sync/watch/index.html), which keeps only the latest value in the receiver. This way we don't need to worry about message ordering.
- Sync waits for state changes concurrently with normal messages.
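
The "notify only on change" rule can be sketched as follows. This is a minimal illustration, not the code in this PR: the `EngineState` variants, the `update` and `get` methods, and the `SharedState` alias are assumptions, and a `std::sync::mpsc` channel stands in for `tokio::sync::watch` so that the example needs no external crates. Unlike `watch`, an `mpsc` receiver queues every change instead of keeping only the latest value.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::RwLock;

/// Hypothetical engine states; only the online/offline distinction matters here.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum EngineState {
    Online,
    Offline,
}

/// Hypothetical wrapper: the state can only be changed through `update`,
/// so a change can never happen without a notification being sent.
pub struct State {
    current: EngineState,
    notifier: Sender<EngineState>,
}

/// The wrapper itself still sits behind a lock, as described above.
pub type SharedState = RwLock<State>;

impl State {
    /// Creates the state together with the receiving end of the notifications.
    pub fn new(initial: EngineState) -> (Self, Receiver<EngineState>) {
        let (notifier, receiver) = channel();
        let state = Self {
            current: initial,
            notifier,
        };
        (state, receiver)
    }

    /// Sets the state and notifies only if the value actually changed.
    /// Returns `true` if a notification was attempted.
    pub fn update(&mut self, new_state: EngineState) -> bool {
        if self.current == new_state {
            return false;
        }
        self.current = new_state;
        // A send error only means the receiver was dropped; ignore it.
        let _ = self.notifier.send(new_state);
        true
    }

    /// Returns the current state.
    pub fn get(&self) -> EngineState {
        self.current
    }
}
```

Setting the same state twice produces a single notification, which is what lets the receiver treat every message as a real transition.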

#### Pause Sync
Sync has four components, and pausing is done differently in each:
- **Block lookups**: Disabled while in this state. We drop current requests and don't search for new blocks. Block lookups are infrequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Parent lookups**: Disabled while in this state. We drop current requests and don't search for new parents. Parent lookups are even less frequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Range**: Chains don't send batches to the beacon processor for processing. This is done by guarding the channel to the beacon processor and giving access to it only if the EE is responsive. I find this the simplest and most powerful approach: we don't need to deal with new sync states, and chain segments that are added while the EE is offline follow the same logic without needing to synchronize shared state among them. Another advantage of a passive pause over an active pause is that we keep tracking the actively advertised chain segments, so on resume we don't need to re-evaluate all our peers.
- **Backfill**: Not affected by EE state, so it is not paused.
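
The channel guard used for range sync can be sketched roughly as below. The method name `processor_channel_if_enabled` appears in this diff; everything else (the `NetworkContext` struct, its fields, `set_engine_online`, and the use of `std::sync::mpsc::SyncSender` in place of the tokio channel) is an assumption made to keep the example self-contained.

```rust
use std::sync::mpsc::SyncSender;

/// Hypothetical, reduced stand-in for the sync network context.
/// `W` plays the role of the work event type.
pub struct NetworkContext<W> {
    /// Channel to the beacon processor. Kept private so that callers can
    /// only reach it through the guard below.
    beacon_processor_send: SyncSender<W>,
    /// Assumed flag, updated from the engine state notifications.
    execution_engine_online: bool,
}

impl<W> NetworkContext<W> {
    /// Starts in the online state.
    pub fn new(beacon_processor_send: SyncSender<W>) -> Self {
        Self {
            beacon_processor_send,
            execution_engine_online: true,
        }
    }

    /// Records the latest known engine state.
    pub fn set_engine_online(&mut self, online: bool) {
        self.execution_engine_online = online;
    }

    /// Hands out the processor channel only while the engine is online.
    /// Callers that get `None` hold or drop their work.
    pub fn processor_channel_if_enabled(&self) -> Option<&SyncSender<W>> {
        if self.execution_engine_online {
            Some(&self.beacon_processor_send)
        } else {
            None
        }
    }
}
```

Because every sender has to go through the guard, pausing needs no extra sync state: a chain created while the EE is offline simply gets `None` like every other chain.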

#### Resume Sync
- **Block lookups**: Enabled again.
- **Parent lookups**: Enabled again.
- **Range**: Active resume. Since the only thing range does to pause is withhold batches from processing, resume makes all chains that hold ready-for-processing batches send them.
- **Backfill**: Not affected by EE state, so there is nothing to resume.
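
The passive pause and active resume for range sync could look roughly like this. It is a sketch under stated assumptions: `Chain`, `BatchId`, `batch_ready`, `send_ready`, `pending`, and `resume` are hypothetical names, batches are reduced to plain ids, and a `std::sync::mpsc::SyncSender` stands in for the channel to the beacon processor.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::SyncSender;

/// Hypothetical batch identifier; real batches carry blocks.
pub type BatchId = u64;

/// Hypothetical, reduced syncing chain that only tracks batches which are
/// downloaded and waiting to be processed.
pub struct Chain {
    ready_batches: VecDeque<BatchId>,
}

impl Chain {
    pub fn new() -> Self {
        Self {
            ready_batches: VecDeque::new(),
        }
    }

    /// Called when a batch finishes downloading. `processor` is `None`
    /// while the engine is offline, in which case the batch is held.
    pub fn batch_ready(&mut self, id: BatchId, processor: Option<&SyncSender<BatchId>>) -> usize {
        self.ready_batches.push_back(id);
        self.send_ready(processor)
    }

    /// Sends held batches in order and returns how many were sent.
    pub fn send_ready(&mut self, processor: Option<&SyncSender<BatchId>>) -> usize {
        let processor = match processor {
            Some(processor) => processor,
            None => return 0, // Paused: keep holding the batches.
        };
        let mut sent = 0;
        while let Some(id) = self.ready_batches.pop_front() {
            if processor.try_send(id).is_err() {
                // Processor full or gone: put the batch back and stop.
                self.ready_batches.push_front(id);
                break;
            }
            sent += 1;
        }
        sent
    }

    /// Number of batches still held.
    pub fn pending(&self) -> usize {
        self.ready_batches.len()
    }
}

/// Active resume: every chain flushes what it was holding.
pub fn resume(chains: &mut [Chain], processor: &SyncSender<BatchId>) -> usize {
    chains
        .iter_mut()
        .map(|chain| chain.send_ready(Some(processor)))
        .sum()
}
```

Chains keep their bookkeeping while paused, so resume only has to flush held batches rather than rebuild chains from peer status.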

## Additional Info

**QUESTION**: Originally I made this notify and change on synced state, but @pawanjay176, in talks with @paulhauner, concluded we only need to check online/offline states. The upcheck function mentions extra checks to have a very up-to-date sync status to aid the networking stack. However, the only need the networking stack would have is this one. I added a TODO to review whether the extra check can be removed.

Next gen of #3094

Will work best with #3439 

Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>
Divma
2022-08-24 23:34:56 +00:00
parent aab4a8d2f2
commit 8c69d57c2c
14 changed files with 574 additions and 328 deletions


@@ -5,11 +5,10 @@ use beacon_chain::{BeaconChainTypes, BlockError};
use fnv::FnvHashMap;
use lighthouse_network::{PeerAction, PeerId};
use lru_cache::LRUTimeCache;
use slog::{crit, debug, error, trace, warn, Logger};
use slog::{debug, error, trace, warn, Logger};
use smallvec::SmallVec;
use std::sync::Arc;
use store::{Hash256, SignedBeaconBlock};
use tokio::sync::mpsc;
use crate::beacon_processor::{ChainSegmentProcessId, WorkEvent};
use crate::metrics;
@@ -36,7 +35,7 @@ const SINGLE_BLOCK_LOOKUP_MAX_ATTEMPTS: u8 = 3;
pub(crate) struct BlockLookups<T: BeaconChainTypes> {
/// A collection of parent block lookups.
parent_queue: SmallVec<[ParentLookup<T::EthSpec>; 3]>,
parent_queue: SmallVec<[ParentLookup<T>; 3]>,
/// A cache of failed chain lookups to prevent duplicate searches.
failed_chains: LRUTimeCache<Hash256>,
@@ -47,22 +46,18 @@ pub(crate) struct BlockLookups<T: BeaconChainTypes> {
/// The flag allows us to determine if the peer returned data or sent us nothing.
single_block_lookups: FnvHashMap<Id, SingleBlockRequest<SINGLE_BLOCK_LOOKUP_MAX_ATTEMPTS>>,
/// A multi-threaded, non-blocking processor for applying messages to the beacon chain.
beacon_processor_send: mpsc::Sender<WorkEvent<T>>,
/// The logger for the import manager.
log: Logger,
}
impl<T: BeaconChainTypes> BlockLookups<T> {
pub fn new(beacon_processor_send: mpsc::Sender<WorkEvent<T>>, log: Logger) -> Self {
pub fn new(log: Logger) -> Self {
Self {
parent_queue: Default::default(),
failed_chains: LRUTimeCache::new(Duration::from_secs(
FAILED_CHAINS_CACHE_EXPIRY_SECONDS,
)),
single_block_lookups: Default::default(),
beacon_processor_send,
log,
}
}
@@ -71,12 +66,7 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
/// Searches for a single block hash. If the blocks parent is unknown, a chain of blocks is
/// constructed.
pub fn search_block(
&mut self,
hash: Hash256,
peer_id: PeerId,
cx: &mut SyncNetworkContext<T::EthSpec>,
) {
pub fn search_block(&mut self, hash: Hash256, peer_id: PeerId, cx: &mut SyncNetworkContext<T>) {
// Do not re-request a block that is already being requested
if self
.single_block_lookups
@@ -113,7 +103,7 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
&mut self,
block: Arc<SignedBeaconBlock<T::EthSpec>>,
peer_id: PeerId,
cx: &mut SyncNetworkContext<T::EthSpec>,
cx: &mut SyncNetworkContext<T>,
) {
let block_root = block.canonical_root();
let parent_root = block.parent_root();
@@ -147,18 +137,16 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
peer_id: PeerId,
block: Option<Arc<SignedBeaconBlock<T::EthSpec>>>,
seen_timestamp: Duration,
cx: &mut SyncNetworkContext<T::EthSpec>,
cx: &mut SyncNetworkContext<T>,
) {
let mut request = match self.single_block_lookups.entry(id) {
Entry::Occupied(req) => req,
Entry::Vacant(_) => {
if block.is_some() {
crit!(
debug!(
self.log,
"Block returned for single block lookup not present"
);
#[cfg(debug_assertions)]
panic!("block returned for single block lookup not present");
}
return;
}
@@ -172,6 +160,7 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
block,
seen_timestamp,
BlockProcessType::SingleBlock { id },
cx,
)
.is_err()
{
@@ -212,7 +201,7 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
peer_id: PeerId,
block: Option<Arc<SignedBeaconBlock<T::EthSpec>>>,
seen_timestamp: Duration,
cx: &mut SyncNetworkContext<T::EthSpec>,
cx: &mut SyncNetworkContext<T>,
) {
let mut parent_lookup = if let Some(pos) = self
.parent_queue
@@ -236,6 +225,7 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
block,
seen_timestamp,
BlockProcessType::ParentLookup { chain_hash },
cx,
)
.is_ok()
{
@@ -289,7 +279,7 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
/* Error responses */
#[allow(clippy::needless_collect)] // false positive
pub fn peer_disconnected(&mut self, peer_id: &PeerId, cx: &mut SyncNetworkContext<T::EthSpec>) {
pub fn peer_disconnected(&mut self, peer_id: &PeerId, cx: &mut SyncNetworkContext<T>) {
/* Check disconnection for single block lookups */
// better written after https://github.com/rust-lang/rust/issues/59618
let remove_retry_ids: Vec<Id> = self
@@ -345,7 +335,7 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
&mut self,
id: Id,
peer_id: PeerId,
cx: &mut SyncNetworkContext<T::EthSpec>,
cx: &mut SyncNetworkContext<T>,
) {
if let Some(pos) = self
.parent_queue
@@ -365,7 +355,7 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
);
}
pub fn single_block_lookup_failed(&mut self, id: Id, cx: &mut SyncNetworkContext<T::EthSpec>) {
pub fn single_block_lookup_failed(&mut self, id: Id, cx: &mut SyncNetworkContext<T>) {
if let Some(mut request) = self.single_block_lookups.remove(&id) {
request.register_failure_downloading();
trace!(self.log, "Single block lookup failed"; "block" => %request.hash);
@@ -388,15 +378,12 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
&mut self,
id: Id,
result: BlockProcessResult<T::EthSpec>,
cx: &mut SyncNetworkContext<T::EthSpec>,
cx: &mut SyncNetworkContext<T>,
) {
let mut req = match self.single_block_lookups.remove(&id) {
Some(req) => req,
None => {
#[cfg(debug_assertions)]
panic!("block processed for single block lookup not present");
#[cfg(not(debug_assertions))]
return crit!(
return debug!(
self.log,
"Block processed for single block lookup not present"
);
@@ -476,7 +463,7 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
&mut self,
chain_hash: Hash256,
result: BlockProcessResult<T::EthSpec>,
cx: &mut SyncNetworkContext<T::EthSpec>,
cx: &mut SyncNetworkContext<T>,
) {
let (mut parent_lookup, peer_id) = if let Some((pos, peer)) = self
.parent_queue
@@ -489,13 +476,7 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
}) {
(self.parent_queue.remove(pos), peer)
} else {
#[cfg(debug_assertions)]
panic!(
"Process response for a parent lookup request that was not found. Chain_hash: {}",
chain_hash
);
#[cfg(not(debug_assertions))]
return crit!(self.log, "Process response for a parent lookup request that was not found"; "chain_hash" => %chain_hash);
return debug!(self.log, "Process response for a parent lookup request that was not found"; "chain_hash" => %chain_hash);
};
match &result {
@@ -524,14 +505,22 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
}
BlockProcessResult::Ok
| BlockProcessResult::Err(BlockError::BlockIsAlreadyKnown { .. }) => {
// Check if the beacon processor is available
let beacon_processor_send = match cx.processor_channel_if_enabled() {
Some(channel) => channel,
None => {
return trace!(
self.log,
"Dropping parent chain segment that was ready for processing.";
parent_lookup
);
}
};
let chain_hash = parent_lookup.chain_hash();
let blocks = parent_lookup.chain_blocks();
let process_id = ChainSegmentProcessId::ParentLookup(chain_hash);
match self
.beacon_processor_send
.try_send(WorkEvent::chain_segment(process_id, blocks))
{
match beacon_processor_send.try_send(WorkEvent::chain_segment(process_id, blocks)) {
Ok(_) => {
self.parent_queue.push(parent_lookup);
}
@@ -595,7 +584,7 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
&mut self,
chain_hash: Hash256,
result: BatchProcessResult,
cx: &mut SyncNetworkContext<T::EthSpec>,
cx: &mut SyncNetworkContext<T>,
) {
let parent_lookup = if let Some(pos) = self
.parent_queue
@@ -604,12 +593,6 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
{
self.parent_queue.remove(pos)
} else {
#[cfg(debug_assertions)]
panic!(
"Chain process response for a parent lookup request that was not found. Chain_hash: {}",
chain_hash
);
#[cfg(not(debug_assertions))]
return debug!(self.log, "Chain process response for a parent lookup request that was not found"; "chain_hash" => %chain_hash);
};
@@ -645,25 +628,34 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
block: Arc<SignedBeaconBlock<T::EthSpec>>,
duration: Duration,
process_type: BlockProcessType,
cx: &mut SyncNetworkContext<T>,
) -> Result<(), ()> {
trace!(self.log, "Sending block for processing"; "block" => %block.canonical_root(), "process" => ?process_type);
let event = WorkEvent::rpc_beacon_block(block, duration, process_type);
if let Err(e) = self.beacon_processor_send.try_send(event) {
error!(
self.log,
"Failed to send sync block to processor";
"error" => ?e
);
return Err(());
match cx.processor_channel_if_enabled() {
Some(beacon_processor_send) => {
trace!(self.log, "Sending block for processing"; "block" => %block.canonical_root(), "process" => ?process_type);
let event = WorkEvent::rpc_beacon_block(block, duration, process_type);
if let Err(e) = beacon_processor_send.try_send(event) {
error!(
self.log,
"Failed to send sync block to processor";
"error" => ?e
);
Err(())
} else {
Ok(())
}
}
None => {
trace!(self.log, "Dropping block ready for processing. Beacon processor not available"; "block" => %block.canonical_root());
Err(())
}
}
Ok(())
}
fn request_parent(
&mut self,
mut parent_lookup: ParentLookup<T::EthSpec>,
cx: &mut SyncNetworkContext<T::EthSpec>,
mut parent_lookup: ParentLookup<T>,
cx: &mut SyncNetworkContext<T>,
) {
match parent_lookup.request_parent(cx) {
Err(e) => {
@@ -710,4 +702,14 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
self.parent_queue.len() as i64,
);
}
/// Drops all the single block requests and returns how many requests were dropped.
pub fn drop_single_block_requests(&mut self) -> usize {
self.single_block_lookups.drain().len()
}
/// Drops all the parent chain requests and returns how many requests were dropped.
pub fn drop_parent_chain_requests(&mut self) -> usize {
self.parent_queue.drain(..).len()
}
}


@@ -1,6 +1,7 @@
use beacon_chain::BeaconChainTypes;
use lighthouse_network::PeerId;
use std::sync::Arc;
use store::{EthSpec, Hash256, SignedBeaconBlock};
use store::{Hash256, SignedBeaconBlock};
use strum::IntoStaticStr;
use crate::sync::{
@@ -18,11 +19,11 @@ pub(crate) const PARENT_FAIL_TOLERANCE: u8 = 5;
pub(crate) const PARENT_DEPTH_TOLERANCE: usize = SLOT_IMPORT_TOLERANCE * 2;
/// Maintains a sequential list of parents to lookup and the lookup's current state.
pub(crate) struct ParentLookup<T: EthSpec> {
pub(crate) struct ParentLookup<T: BeaconChainTypes> {
/// The root of the block triggering this parent request.
chain_hash: Hash256,
/// The blocks that have currently been downloaded.
downloaded_blocks: Vec<Arc<SignedBeaconBlock<T>>>,
downloaded_blocks: Vec<Arc<SignedBeaconBlock<T::EthSpec>>>,
/// Request of the last parent.
current_parent_request: SingleBlockRequest<PARENT_FAIL_TOLERANCE>,
/// Id of the last parent request.
@@ -50,14 +51,14 @@ pub enum RequestError {
NoPeers,
}
impl<T: EthSpec> ParentLookup<T> {
pub fn contains_block(&self, block: &SignedBeaconBlock<T>) -> bool {
impl<T: BeaconChainTypes> ParentLookup<T> {
pub fn contains_block(&self, block: &SignedBeaconBlock<T::EthSpec>) -> bool {
self.downloaded_blocks
.iter()
.any(|d_block| d_block.as_ref() == block)
}
pub fn new(block: Arc<SignedBeaconBlock<T>>, peer_id: PeerId) -> Self {
pub fn new(block: Arc<SignedBeaconBlock<T::EthSpec>>, peer_id: PeerId) -> Self {
let current_parent_request = SingleBlockRequest::new(block.parent_root(), peer_id);
Self {
@@ -92,7 +93,7 @@ impl<T: EthSpec> ParentLookup<T> {
self.current_parent_request.check_peer_disconnected(peer_id)
}
pub fn add_block(&mut self, block: Arc<SignedBeaconBlock<T>>) {
pub fn add_block(&mut self, block: Arc<SignedBeaconBlock<T::EthSpec>>) {
let next_parent = block.parent_root();
self.downloaded_blocks.push(block);
self.current_parent_request.hash = next_parent;
@@ -119,7 +120,7 @@ impl<T: EthSpec> ParentLookup<T> {
self.current_parent_request_id = None;
}
pub fn chain_blocks(&mut self) -> Vec<Arc<SignedBeaconBlock<T>>> {
pub fn chain_blocks(&mut self) -> Vec<Arc<SignedBeaconBlock<T::EthSpec>>> {
std::mem::take(&mut self.downloaded_blocks)
}
@@ -127,9 +128,9 @@ impl<T: EthSpec> ParentLookup<T> {
/// the processing result of the block.
pub fn verify_block(
&mut self,
block: Option<Arc<SignedBeaconBlock<T>>>,
block: Option<Arc<SignedBeaconBlock<T::EthSpec>>>,
failed_chains: &mut lru_cache::LRUTimeCache<Hash256>,
) -> Result<Option<Arc<SignedBeaconBlock<T>>>, VerifyError> {
) -> Result<Option<Arc<SignedBeaconBlock<T::EthSpec>>>, VerifyError> {
let block = self.current_parent_request.verify_block(block)?;
// check if the parent of this block isn't in the failed cache. If it is, this chain should
@@ -189,7 +190,7 @@ impl From<super::single_block_lookup::LookupRequestError> for RequestError {
}
}
impl<T: EthSpec> slog::KV for ParentLookup<T> {
impl<T: BeaconChainTypes> slog::KV for ParentLookup<T> {
fn serialize(
&self,
record: &slog::Record,


@@ -12,6 +12,7 @@ use lighthouse_network::{NetworkGlobals, Request};
use slog::{Drain, Level};
use slot_clock::SystemTimeSlotClock;
use store::MemoryStore;
use tokio::sync::mpsc;
use types::test_utils::{SeedableRng, TestRandom, XorShiftRng};
use types::MinimalEthSpec as E;
@@ -26,7 +27,7 @@ struct TestRig {
const D: Duration = Duration::new(0, 0);
impl TestRig {
fn test_setup(log_level: Option<Level>) -> (BlockLookups<T>, SyncNetworkContext<E>, Self) {
fn test_setup(log_level: Option<Level>) -> (BlockLookups<T>, SyncNetworkContext<T>, Self) {
let log = {
let decorator = slog_term::TermDecorator::new().build();
let drain = slog_term::FullFormat::new(decorator).build().fuse();
@@ -47,15 +48,13 @@ impl TestRig {
network_rx,
rng,
};
let bl = BlockLookups::new(
beacon_processor_tx,
log.new(slog::o!("component" => "block_lookups")),
);
let bl = BlockLookups::new(log.new(slog::o!("component" => "block_lookups")));
let cx = {
let globals = Arc::new(NetworkGlobals::new_test_globals(&log));
SyncNetworkContext::new(
network_tx,
globals,
beacon_processor_tx,
log.new(slog::o!("component" => "network_context")),
)
};