Handle processing results of non faulty batches (#3439)

## Issue Addressed
Solves #3390 

After checking some logs that @pawanjay176 collected, we concluded that this happened because we blacklisted a chain after trying it "too much". In every occurrence, "too much" seems to mean that we got too many download failures. This built up very slowly, precisely because a batch is allowed to stay alive for a very long time when penalties are not counted while the ee is offline. The error, then, was not that the batch failed because of offline ee errors, but that we blacklisted a chain because of download errors, which can be pinned on the peer but not on the chain. This PR fixes that.

## Proposed Changes

Adds a missing piece of logic so that a chain is not blacklisted if it fails because of errors that can't be attributed to objectively bad behavior from the peer. The issue at hand occurred when new peers arrived claiming a head that had been wrongfully blacklisted, even though the original peers participating in the chain were not penalized.
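
The rule above can be sketched in isolation. This is a minimal illustration, not the code in this PR: the variant names `Success`, `FaultyFailure` and `NonFaultyFailure` and their field names come from the diff, but the field types, the `Penalty` enum, the `Lookups` struct, and the use of plain integers for chain hashes and peer ids are all assumptions made to keep the example self-contained.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the peer penalty type (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Penalty {
    LowTolerance,
}

/// The three outcomes named in the diff. Field types are assumed.
#[allow(dead_code)] // some fields exist only to mirror the diff
#[derive(Debug)]
pub enum BatchProcessResult {
    Success { was_non_empty: bool },
    FaultyFailure { imported_blocks: bool, penalty: Penalty },
    NonFaultyFailure,
}

/// Hypothetical, simplified lookup state: a blacklist of chains and a
/// record of the penalties that would have been reported.
#[derive(Default)]
pub struct Lookups {
    pub failed_chains: HashSet<u64>,
    pub reports: Vec<(u32, Penalty)>,
}

impl Lookups {
    /// Blacklist the chain and penalize its peers only when the failure
    /// can be attributed to the peers.
    pub fn on_chain_processed(
        &mut self,
        chain_hash: u64,
        used_peers: &[u32],
        result: BatchProcessResult,
    ) {
        match result {
            BatchProcessResult::Success { .. } => {
                // Nothing to do.
            }
            BatchProcessResult::FaultyFailure { penalty, .. } => {
                self.failed_chains.insert(chain_hash);
                for &peer_id in used_peers {
                    self.reports.push((peer_id, penalty));
                }
            }
            BatchProcessResult::NonFaultyFailure => {
                // Not attributable to the peers: no blacklist, no penalty.
            }
        }
    }
}
```

Under these assumptions, calling `on_chain_processed` with `NonFaultyFailure` leaves both `failed_chains` and `reports` empty, so a new peer claiming the same head can still be tried later.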

Another notable change concerns previously processed batches. We needed to consider a batch invalid if it processed correctly but its next non-empty batch failed processing. Now, since a batch can fail processing in non-faulty ways, there is no need to mark previous batches as invalid in that case.

Improves some logging as well.

## Additional Info

We should do this regardless of pausing sync on ee offline/unsynced state, because I think it's almost impossible to ensure that a processing result arrives in a predictable order relative to a synced notification from the ee. Doing this handles what I think are inevitable data races when we actually pause sync.

This also fixes a return that reports which batch failed, which caused us some confusion when checking the logs.
This commit is contained in:
Divma
2022-08-12 00:56:38 +00:00
parent a476ae4907
commit f4ffa9e0b4
12 changed files with 298 additions and 274 deletions


@@ -11,7 +11,7 @@ use std::sync::Arc;
 use store::{Hash256, SignedBeaconBlock};
 use tokio::sync::mpsc;
-use crate::beacon_processor::{ChainSegmentProcessId, FailureMode, WorkEvent};
+use crate::beacon_processor::{ChainSegmentProcessId, WorkEvent};
 use crate::metrics;
 use self::{
@@ -610,35 +610,26 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
                 chain_hash
             );
             #[cfg(not(debug_assertions))]
-            return crit!(self.log, "Chain process response for a parent lookup request that was not found"; "chain_hash" => %chain_hash);
+            return debug!(self.log, "Chain process response for a parent lookup request that was not found"; "chain_hash" => %chain_hash);
         };
         debug!(self.log, "Parent chain processed"; "chain_hash" => %chain_hash, "result" => ?result);
         match result {
-            BatchProcessResult::Success(_) => {
+            BatchProcessResult::Success { .. } => {
                 // nothing to do.
             }
-            BatchProcessResult::Failed {
+            BatchProcessResult::FaultyFailure {
                 imported_blocks: _,
-                peer_action,
-                mode,
+                penalty,
             } => {
-                if let FailureMode::ExecutionLayer { pause_sync: _ } = mode {
-                    debug!(
-                        self.log,
-                        "Chain segment processing failed. Execution layer is offline";
-                        "chain_hash" => %chain_hash,
-                        "error" => ?mode
-                    );
-                } else {
-                    self.failed_chains.insert(parent_lookup.chain_hash());
-                    if let Some(peer_action) = peer_action {
-                        for &peer_id in parent_lookup.used_peers() {
-                            cx.report_peer(peer_id, peer_action, "parent_chain_failure")
-                        }
-                    }
+                self.failed_chains.insert(parent_lookup.chain_hash());
+                for &peer_id in parent_lookup.used_peers() {
+                    cx.report_peer(peer_id, penalty, "parent_chain_failure")
                 }
             }
+            BatchProcessResult::NonFaultyFailure => {
+                // We might request this chain again if there is need but otherwise, don't try again
+            }
         }
         metrics::set_gauge(


@@ -284,7 +284,10 @@ fn test_parent_lookup_happy_path() {
     // Processing succeeds, now the rest of the chain should be sent for processing.
     bl.parent_block_processed(chain_hash, BlockError::BlockIsAlreadyKnown.into(), &mut cx);
     rig.expect_parent_chain_process();
-    bl.parent_chain_processed(chain_hash, BatchProcessResult::Success(true), &mut cx);
+    let process_result = BatchProcessResult::Success {
+        was_non_empty: true,
+    };
+    bl.parent_chain_processed(chain_hash, process_result, &mut cx);
     assert_eq!(bl.parent_queue.len(), 0);
 }
@@ -318,7 +321,10 @@ fn test_parent_lookup_wrong_response() {
     // Processing succeeds, now the rest of the chain should be sent for processing.
     bl.parent_block_processed(chain_hash, Ok(()).into(), &mut cx);
     rig.expect_parent_chain_process();
-    bl.parent_chain_processed(chain_hash, BatchProcessResult::Success(true), &mut cx);
+    let process_result = BatchProcessResult::Success {
+        was_non_empty: true,
+    };
+    bl.parent_chain_processed(chain_hash, process_result, &mut cx);
     assert_eq!(bl.parent_queue.len(), 0);
 }
@@ -347,7 +353,10 @@ fn test_parent_lookup_empty_response() {
     // Processing succeeds, now the rest of the chain should be sent for processing.
     bl.parent_block_processed(chain_hash, Ok(()).into(), &mut cx);
     rig.expect_parent_chain_process();
-    bl.parent_chain_processed(chain_hash, BatchProcessResult::Success(true), &mut cx);
+    let process_result = BatchProcessResult::Success {
+        was_non_empty: true,
+    };
+    bl.parent_chain_processed(chain_hash, process_result, &mut cx);
     assert_eq!(bl.parent_queue.len(), 0);
 }
@@ -375,7 +384,10 @@ fn test_parent_lookup_rpc_failure() {
     // Processing succeeds, now the rest of the chain should be sent for processing.
     bl.parent_block_processed(chain_hash, Ok(()).into(), &mut cx);
     rig.expect_parent_chain_process();
-    bl.parent_chain_processed(chain_hash, BatchProcessResult::Success(true), &mut cx);
+    let process_result = BatchProcessResult::Success {
+        was_non_empty: true,
+    };
+    bl.parent_chain_processed(chain_hash, process_result, &mut cx);
     assert_eq!(bl.parent_queue.len(), 0);
 }