Super Silky Smooth Syncs, like a Sir (#1628)

## Issue Addressed
In principle, this closes #1551, but in general these are improvements for performance, maintainability and readability. The logic for the optimistic sync is actually simple.

## Proposed Changes
There are miscellaneous things here:
- Remove unnecessary `BatchProcessResult::Partial` to simplify the batch validation logic
- Make batches a state machine. This is done to ensure batch state transitions respect our logic (this was previously done by moving batches between `Vec`s) and to ease the cognitive load of the `SyncingChain` struct
- Move most batch-related logic to the batch
- Remove `PendingBatches` in favor of a map of peers to their batches. This is to avoid duplicating peers inside the chain (peer_pool and pending_batches)
- Add `must_use` decoration to the `ProcessingResult` so that chains that request to be removed are handled accordingly. This also means that chains are now removed in more places than before to account for unhandled cases
- Store batches in a sorted map (`BTreeMap`). Access is not O(1), but since the number of _active_ batches is bounded this should be fast, and it saves performing hashing ops. Batches are indexed by the epoch they start at. The map is sorted to easily handle chain advancements (range logic)
- Produce the chain Id from the identifying fields: target root and target slot. This guarantees there can't be duplicated chains and lets us consistently search chains by either Id or checkpoint
- Fix chain_id not being present in all chain loggers
- Handle the mega-edge case where the processor's work queue is full and the batch can't be sent. In this case the chain would lose the blocks and remain in a "syncing" state, waiting for a result that won't arrive, effectively stalling sync.
- When a batch imports blocks, or the chain starts syncing with a local finalized epoch greater than the chain's start epoch, the chain is advanced instead of reset. This avoids losing download progress and validates batches faster. This also means that the old `start_epoch` now means "current first unvalidated batch", so it represents the progress of the chain more accurately.
- Batch status peers from the same chain to reduce Arc access.
- A couple of cases where the retry counters for a batch were not updated/checked are now handled via the batch state machine. Basically, if we forget to do it now, we will know.
- Do not send back the blocks from the processor to the batch. Instead register the attempt before sending the blocks (does not count as failed)
- When re-requesting a batch, try to avoid not only the last failed peer, but all previous failed peers.
- Optimize requesting batches ahead in the buffer by shuffling idle peers just once (this is just addressing a couple of old TODOs in the code)
- In chain_collection, store chains by their id in a map
- Include a mapping from request_ids to (chain, batch) that requested the batch to avoid the double O(n) search on block responses
- Other stuff:
  - impl `slog::KV` for batches
  - impl `slog::KV` for syncing chains
  - PSA: when logging, we can use `%thing` if `thing` implements `Display`. Same for `?` and `Debug`
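
The batch state machine described in the list above can be sketched in isolation. This is a simplified, hedged illustration rather than the code in this commit: `Peer` and `Block` are stand-in types (assumptions replacing `PeerId` and `SignedBeaconBlock<T>`), only the download half of the lifecycle is modelled, and range checks on the received blocks are omitted. It shows the "poison" pattern: each transition takes the old state by value, and any transition that forgets to write a new state back leaves the batch visibly `Poisoned`.

```rust
// Stand-in types (assumptions for this sketch, not the real ones).
type Peer = String;
type Block = u64;

/// Download attempts allowed before the batch is considered failed.
const MAX_DOWNLOAD_ATTEMPTS: usize = 5;

#[derive(Debug, PartialEq)]
enum State {
    AwaitingDownload,
    Downloading(Peer, Vec<Block>),
    AwaitingProcessing(Peer, Vec<Block>),
    /// Intermediate marker; only observable if a transition forgot to set a state.
    Poisoned,
    Failed,
}

impl State {
    /// Takes ownership of the current state, leaving `Poisoned` behind.
    fn poison(&mut self) -> State {
        std::mem::replace(self, State::Poisoned)
    }
}

struct Batch {
    failed_download_attempts: Vec<Peer>,
    state: State,
}

impl Batch {
    fn new() -> Self {
        Batch {
            failed_download_attempts: Vec::new(),
            state: State::AwaitingDownload,
        }
    }

    fn start_downloading_from_peer(&mut self, peer: Peer) {
        match self.state.poison() {
            State::AwaitingDownload => self.state = State::Downloading(peer, Vec::new()),
            other => unreachable!("Starting download in wrong state: {:?}", other),
        }
    }

    fn add_block(&mut self, block: Block) {
        match self.state.poison() {
            State::Downloading(peer, mut blocks) => {
                blocks.push(block);
                self.state = State::Downloading(peer, blocks);
            }
            other => unreachable!("Add block in wrong state: {:?}", other),
        }
    }

    /// Marks the batch as ready for processing and returns the number of blocks.
    fn download_completed(&mut self) -> usize {
        match self.state.poison() {
            State::Downloading(peer, blocks) => {
                let received = blocks.len();
                self.state = State::AwaitingProcessing(peer, blocks);
                received
            }
            other => unreachable!("Download completed in wrong state: {:?}", other),
        }
    }

    /// Registers the failed peer; the caller must inspect the resulting state.
    #[must_use = "Batch may have failed"]
    fn download_failed(&mut self) -> &State {
        match self.state.poison() {
            State::Downloading(peer, _) => {
                self.failed_download_attempts.push(peer);
                self.state = if self.failed_download_attempts.len() >= MAX_DOWNLOAD_ATTEMPTS {
                    State::Failed
                } else {
                    State::AwaitingDownload
                };
                &self.state
            }
            other => unreachable!("Download failed in wrong state: {:?}", other),
        }
    }
}
```

Because the retry bookkeeping lives inside the transition itself, a caller cannot mark a download as failed without the attempt being counted, and `#[must_use]` makes the compiler warn if the resulting state is ignored.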

### Optimistic syncing:
First try the batch that contains the current head. If the batch imports any block, advance the chain. Otherwise, if this optimistic batch is inside the current processing window, leave it there for future use; if not, drop it. The failure tolerance for this batch is the same as usual for downloading, but it gets just one attempt at processing.
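
The decision for an optimistic batch can be written as a small pure function. This is only a sketch of the rule stated above, not the code in this commit: `OptimisticOutcome`, `handle_optimistic_batch` and the representation of the processing window as a half-open range of epochs are all hypothetical names and assumptions made for illustration.

```rust
use std::ops::Range;

/// What to do with an optimistic batch once it has been processed.
/// (Hypothetical type for this sketch.)
#[derive(Debug, PartialEq)]
enum OptimisticOutcome {
    /// The batch imported at least one block: advance the chain.
    AdvanceChain,
    /// Nothing was imported, but the batch lies inside the processing
    /// window, so keep it for future use.
    KeepForLater,
    /// Nothing was imported and the batch is outside the window: drop it.
    Discard,
}

/// Applies the optimistic sync rule. `processing_window` is assumed to be
/// a half-open range of epochs, and `batch_epoch` the epoch the batch starts at.
fn handle_optimistic_batch(
    imported_any_block: bool,
    batch_epoch: u64,
    processing_window: &Range<u64>,
) -> OptimisticOutcome {
    if imported_any_block {
        OptimisticOutcome::AdvanceChain
    } else if processing_window.contains(&batch_epoch) {
        OptimisticOutcome::KeepForLater
    } else {
        OptimisticOutcome::Discard
    }
}
```

For example, with a window of epochs `10..14`, a batch at epoch 12 that imports nothing would be kept, while one at epoch 20 that imports nothing would be discarded.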



Co-authored-by: Age Manning <Age@AgeManning.com>
divma
2020-09-23 06:29:55 +00:00
parent 80e52a0263
commit b8013b7b2c
13 changed files with 1480 additions and 1199 deletions

@@ -1,35 +1,274 @@
use super::chain::EPOCHS_PER_BATCH;
use eth2_libp2p::rpc::methods::*;
use eth2_libp2p::rpc::methods::BlocksByRangeRequest;
use eth2_libp2p::PeerId;
use fnv::FnvHashMap;
use ssz::Encode;
use std::cmp::min;
use std::cmp::Ordering;
use std::collections::hash_map::Entry;
use std::collections::{HashMap, HashSet};
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::ops::Sub;
use types::{Epoch, EthSpec, SignedBeaconBlock, Slot};
/// A collection of sequential blocks that are requested from peers in a single RPC request.
#[derive(PartialEq, Debug)]
pub struct Batch<T: EthSpec> {
/// The requested start epoch of the batch.
pub start_epoch: Epoch,
/// The requested end slot of batch, exclusive.
pub end_slot: Slot,
/// The `Attempts` that have been made to send us this batch.
pub attempts: Vec<Attempt>,
/// The peer that is currently assigned to the batch.
pub current_peer: PeerId,
/// The number of retries this batch has undergone due to a failed request.
/// This occurs when peers do not respond or we get an RPC error.
pub retries: u8,
/// The number of times this batch has attempted to be re-downloaded and re-processed. This
/// occurs when a batch has been received but cannot be processed.
pub reprocess_retries: u8,
/// The blocks that have been downloaded.
pub downloaded_blocks: Vec<SignedBeaconBlock<T>>,
/// The number of times to retry a batch before it is considered failed.
const MAX_BATCH_DOWNLOAD_ATTEMPTS: u8 = 5;
/// Invalid batches are attempted to be re-downloaded from other peers. If a batch cannot be processed
/// after `MAX_BATCH_PROCESSING_ATTEMPTS` times, it is considered faulty.
const MAX_BATCH_PROCESSING_ATTEMPTS: u8 = 3;
/// A segment of a chain.
pub struct BatchInfo<T: EthSpec> {
/// Start slot of the batch.
start_slot: Slot,
/// End slot of the batch.
end_slot: Slot,
/// The `Attempts` that have been made and failed to send us this batch.
failed_processing_attempts: Vec<Attempt>,
/// The number of download retries this batch has undergone due to a failed request.
failed_download_attempts: Vec<PeerId>,
/// State of the batch.
state: BatchState<T>,
}
/// Current state of a batch
pub enum BatchState<T: EthSpec> {
/// The batch has failed either downloading or processing, but can be requested again.
AwaitingDownload,
/// The batch is being downloaded.
Downloading(PeerId, Vec<SignedBeaconBlock<T>>),
/// The batch has been completely downloaded and is ready for processing.
AwaitingProcessing(PeerId, Vec<SignedBeaconBlock<T>>),
/// The batch is being processed.
Processing(Attempt),
/// The batch was successfully processed and is waiting to be validated.
///
/// It is not sufficient to process a batch successfully to consider it correct. This is
/// because batches could be erroneously empty, or incomplete. Therefore, a batch is considered
/// valid, only if the next sequential batch imports at least a block.
AwaitingValidation(Attempt),
/// Intermediate state for inner state handling.
Poisoned,
/// The batch has maxed out the allowed attempts for either downloading or processing. It
/// cannot be recovered.
Failed,
}
impl<T: EthSpec> BatchState<T> {
/// Helper function for poisoning a state.
pub fn poison(&mut self) -> BatchState<T> {
std::mem::replace(self, BatchState::Poisoned)
}
}
impl<T: EthSpec> BatchInfo<T> {
/// Batches are downloaded excluding the first block of the epoch assuming it has already been
/// downloaded.
///
/// For example:
///
/// Epoch boundary | |
/// ... | 30 | 31 | 32 | 33 | 34 | ... | 61 | 62 | 63 | 64 | 65 |
/// Batch 1 | Batch 2 | Batch 3
pub fn new(start_epoch: &Epoch, num_of_epochs: u64) -> Self {
let start_slot = start_epoch.start_slot(T::slots_per_epoch()) + 1;
let end_slot = start_slot + num_of_epochs * T::slots_per_epoch();
BatchInfo {
start_slot,
end_slot,
failed_processing_attempts: Vec::new(),
failed_download_attempts: Vec::new(),
state: BatchState::AwaitingDownload,
}
}
/// Gives a list of peers from which this batch has had a failed download or processing
/// attempt.
pub fn failed_peers(&self) -> HashSet<PeerId> {
let mut peers = HashSet::with_capacity(
self.failed_processing_attempts.len() + self.failed_download_attempts.len(),
);
for attempt in &self.failed_processing_attempts {
peers.insert(attempt.peer_id.clone());
}
for download in &self.failed_download_attempts {
peers.insert(download.clone());
}
peers
}
pub fn current_peer(&self) -> Option<&PeerId> {
match &self.state {
BatchState::AwaitingDownload | BatchState::Failed => None,
BatchState::Downloading(peer_id, _)
| BatchState::AwaitingProcessing(peer_id, _)
| BatchState::Processing(Attempt { peer_id, .. })
| BatchState::AwaitingValidation(Attempt { peer_id, .. }) => Some(&peer_id),
BatchState::Poisoned => unreachable!("Poisoned batch"),
}
}
pub fn to_blocks_by_range_request(&self) -> BlocksByRangeRequest {
BlocksByRangeRequest {
start_slot: self.start_slot.into(),
count: self.end_slot.sub(self.start_slot).into(),
step: 1,
}
}
pub fn state(&self) -> &BatchState<T> {
&self.state
}
pub fn attempts(&self) -> &[Attempt] {
&self.failed_processing_attempts
}
/// Adds a block to a downloading batch.
pub fn add_block(&mut self, block: SignedBeaconBlock<T>) {
match self.state.poison() {
BatchState::Downloading(peer, mut blocks) => {
blocks.push(block);
self.state = BatchState::Downloading(peer, blocks)
}
other => unreachable!("Add block for batch in wrong state: {:?}", other),
}
}
/// Marks the batch as ready to be processed if the blocks are in the range. The number of
/// received blocks is returned, or the wrong batch end on failure
#[must_use = "Batch may have failed"]
pub fn download_completed(
&mut self,
) -> Result<
usize, /* Received blocks */
(
Slot, /* expected slot */
Slot, /* received slot */
&BatchState<T>,
),
> {
match self.state.poison() {
BatchState::Downloading(peer, blocks) => {
// verify that blocks are in range
if let Some(last_slot) = blocks.last().map(|b| b.slot()) {
// the batch is non-empty
let first_slot = blocks[0].slot();
let failed_range = if first_slot < self.start_slot {
Some((self.start_slot, first_slot))
} else if self.end_slot < last_slot {
Some((self.end_slot, last_slot))
} else {
None
};
if let Some(range) = failed_range {
// this is a failed download, register the attempt and check if the batch
// can be tried again
self.failed_download_attempts.push(peer);
self.state = if self.failed_download_attempts.len()
>= MAX_BATCH_DOWNLOAD_ATTEMPTS as usize
{
BatchState::Failed
} else {
// drop the blocks
BatchState::AwaitingDownload
};
return Err((range.0, range.1, &self.state));
}
}
let received = blocks.len();
self.state = BatchState::AwaitingProcessing(peer, blocks);
Ok(received)
}
other => unreachable!("Download completed for batch in wrong state: {:?}", other),
}
}
#[must_use = "Batch may have failed"]
pub fn download_failed(&mut self) -> &BatchState<T> {
match self.state.poison() {
BatchState::Downloading(peer, _) => {
// register the attempt and check if the batch can be tried again
self.failed_download_attempts.push(peer);
self.state = if self.failed_download_attempts.len()
>= MAX_BATCH_DOWNLOAD_ATTEMPTS as usize
{
BatchState::Failed
} else {
// drop the blocks
BatchState::AwaitingDownload
};
&self.state
}
other => unreachable!("Download failed for batch in wrong state: {:?}", other),
}
}
pub fn start_downloading_from_peer(&mut self, peer: PeerId) {
match self.state.poison() {
BatchState::AwaitingDownload => {
self.state = BatchState::Downloading(peer, Vec::new());
}
other => unreachable!("Starting download for batch in wrong state: {:?}", other),
}
}
pub fn start_processing(&mut self) -> Vec<SignedBeaconBlock<T>> {
match self.state.poison() {
BatchState::AwaitingProcessing(peer, blocks) => {
self.state = BatchState::Processing(Attempt::new(peer, &blocks));
blocks
}
other => unreachable!("Start processing for batch in wrong state: {:?}", other),
}
}
#[must_use = "Batch may have failed"]
pub fn processing_completed(&mut self, was_sucessful: bool) -> &BatchState<T> {
match self.state.poison() {
BatchState::Processing(attempt) => {
self.state = if !was_sucessful {
// register the failed attempt
self.failed_processing_attempts.push(attempt);
// check if the batch can be downloaded again
if self.failed_processing_attempts.len()
>= MAX_BATCH_PROCESSING_ATTEMPTS as usize
{
BatchState::Failed
} else {
BatchState::AwaitingDownload
}
} else {
BatchState::AwaitingValidation(attempt)
};
&self.state
}
other => unreachable!("Processing completed for batch in wrong state: {:?}", other),
}
}
#[must_use = "Batch may have failed"]
pub fn validation_failed(&mut self) -> &BatchState<T> {
match self.state.poison() {
BatchState::AwaitingValidation(attempt) => {
self.failed_processing_attempts.push(attempt);
// check if the batch can be downloaded again
self.state = if self.failed_processing_attempts.len()
>= MAX_BATCH_PROCESSING_ATTEMPTS as usize
{
BatchState::Failed
} else {
BatchState::AwaitingDownload
};
&self.state
}
other => unreachable!("Validation failed for batch in wrong state: {:?}", other),
}
}
}
/// Represents a peer's attempt and providing the result for this batch.
@@ -43,131 +282,61 @@ pub struct Attempt {
pub hash: u64,
}
impl<T: EthSpec> Eq for Batch<T> {}
impl<T: EthSpec> Batch<T> {
pub fn new(start_epoch: Epoch, end_slot: Slot, peer_id: PeerId) -> Self {
Batch {
start_epoch,
end_slot,
attempts: Vec::new(),
current_peer: peer_id,
retries: 0,
reprocess_retries: 0,
downloaded_blocks: Vec::new(),
}
}
pub fn start_slot(&self) -> Slot {
// batches are shifted by 1
self.start_epoch.start_slot(T::slots_per_epoch()) + 1
}
pub fn end_slot(&self) -> Slot {
self.end_slot
}
pub fn to_blocks_by_range_request(&self) -> BlocksByRangeRequest {
let start_slot = self.start_slot();
BlocksByRangeRequest {
start_slot: start_slot.into(),
count: min(
T::slots_per_epoch() * EPOCHS_PER_BATCH,
self.end_slot.sub(start_slot).into(),
),
step: 1,
}
}
/// This gets a hash that represents the blocks currently downloaded. This allows comparing a
/// previously downloaded batch of blocks with a new downloaded batch of blocks.
pub fn hash(&self) -> u64 {
// the hash used is the ssz-encoded list of blocks
impl Attempt {
#[allow(clippy::ptr_arg)]
fn new<T: EthSpec>(peer_id: PeerId, blocks: &Vec<SignedBeaconBlock<T>>) -> Self {
let mut hasher = std::collections::hash_map::DefaultHasher::new();
self.downloaded_blocks.as_ssz_bytes().hash(&mut hasher);
hasher.finish()
blocks.as_ssz_bytes().hash(&mut hasher);
let hash = hasher.finish();
Attempt { peer_id, hash }
}
}
impl<T: EthSpec> Ord for Batch<T> {
fn cmp(&self, other: &Self) -> Ordering {
self.start_epoch.cmp(&other.start_epoch)
impl<T: EthSpec> slog::KV for &mut BatchInfo<T> {
fn serialize(
&self,
record: &slog::Record,
serializer: &mut dyn slog::Serializer,
) -> slog::Result {
slog::KV::serialize(*self, record, serializer)
}
}
impl<T: EthSpec> PartialOrd for Batch<T> {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
Some(self.cmp(other))
impl<T: EthSpec> slog::KV for BatchInfo<T> {
fn serialize(
&self,
record: &slog::Record,
serializer: &mut dyn slog::Serializer,
) -> slog::Result {
use slog::Value;
Value::serialize(&self.start_slot, record, "start_slot", serializer)?;
Value::serialize(
&(self.end_slot - 1), // NOTE: The -1 shows inclusive blocks
record,
"end_slot",
serializer,
)?;
serializer.emit_usize("downloaded", self.failed_download_attempts.len())?;
serializer.emit_usize("processed", self.failed_processing_attempts.len())?;
serializer.emit_str("state", &format!("{:?}", self.state))?;
slog::Result::Ok(())
}
}
/// A structure that contains a mapping of pending batch requests, that also keeps track of which
/// peers are currently making batch requests.
///
/// This is used to optimise searches for idle peers (peers that have no outbound batch requests).
pub struct PendingBatches<T: EthSpec> {
/// The current pending batches.
batches: FnvHashMap<usize, Batch<T>>,
/// A mapping of peers to the number of pending requests.
peer_requests: HashMap<PeerId, HashSet<usize>>,
}
impl<T: EthSpec> PendingBatches<T> {
pub fn new() -> Self {
PendingBatches {
batches: FnvHashMap::default(),
peer_requests: HashMap::new(),
}
}
pub fn insert(&mut self, request_id: usize, batch: Batch<T>) -> Option<Batch<T>> {
let peer_request = batch.current_peer.clone();
self.peer_requests
.entry(peer_request)
.or_insert_with(HashSet::new)
.insert(request_id);
self.batches.insert(request_id, batch)
}
pub fn remove(&mut self, request_id: usize) -> Option<Batch<T>> {
if let Some(batch) = self.batches.remove(&request_id) {
if let Entry::Occupied(mut entry) = self.peer_requests.entry(batch.current_peer.clone())
{
entry.get_mut().remove(&request_id);
if entry.get().is_empty() {
entry.remove();
}
impl<T: EthSpec> std::fmt::Debug for BatchState<T> {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
BatchState::Processing(_) => f.write_str("Processing"),
BatchState::AwaitingValidation(_) => f.write_str("AwaitingValidation"),
BatchState::AwaitingDownload => f.write_str("AwaitingDownload"),
BatchState::Failed => f.write_str("Failed"),
BatchState::AwaitingProcessing(ref peer, ref blocks) => {
write!(f, "AwaitingProcessing({}, {} blocks)", peer, blocks.len())
}
Some(batch)
} else {
None
BatchState::Downloading(peer, blocks) => {
write!(f, "Downloading({}, {} blocks)", peer, blocks.len())
}
BatchState::Poisoned => f.write_str("Poisoned"),
}
}
/// The number of current pending batch requests.
pub fn len(&self) -> usize {
self.batches.len()
}
/// Adds a block to the batches if the request id exists. Returns None if there is no batch
/// matching the request id.
pub fn add_block(&mut self, request_id: usize, block: SignedBeaconBlock<T>) -> Option<()> {
let batch = self.batches.get_mut(&request_id)?;
batch.downloaded_blocks.push(block);
Some(())
}
/// Returns true if there the peer does not exist in the peer_requests mapping. Indicating it
/// has no pending outgoing requests.
pub fn peer_is_idle(&self, peer_id: &PeerId) -> bool {
self.peer_requests.get(peer_id).is_none()
}
/// Removes a batch for a given peer.
pub fn remove_batch_by_peer(&mut self, peer_id: &PeerId) -> Option<Batch<T>> {
let request_ids = self.peer_requests.get(peer_id)?;
let request_id = *request_ids.iter().next()?;
self.remove(request_id)
}
}