Mirror of https://github.com/sigp/lighthouse.git, synced 2026-03-15 10:52:43 +00:00
Implement el_offline and use it in the VC (#4295)
## Issue Addressed
Closes https://github.com/sigp/lighthouse/issues/4291, part of #3613.
## Proposed Changes
- Implement the `el_offline` field on `/eth/v1/node/syncing`. We set `el_offline=true` if:
- The EL's internal status is `Offline` or `AuthFailed`, _or_
- The most recent call to `newPayload` resulted in an error (more on this in a moment).
- Use the `el_offline` field in the VC to mark nodes with offline ELs as _unsynced_. These nodes will still be used, but only after synced nodes.
- Overhaul the usage of `RequireSynced` so that `::No` is used almost everywhere. The `--allow-unsynced` flag was broken and had the opposite of its intended effect, so it has been deprecated.
- Add tests for the EL being offline on the upcheck call, and being offline due to the newPayload check.
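The `el_offline` rule in the first bullet can be sketched in a few lines. This is a hypothetical re-creation, not Lighthouse's actual types: `EngineStateSketch`, `ExecutionStatus`, and the field names are illustrative.

```rust
// Hypothetical sketch of the `el_offline` determination described above.
// The type and field names are illustrative, not Lighthouse's actual code.

#[derive(Clone, Copy)]
enum EngineStateSketch {
    Online,
    Offline,
    Syncing,
    AuthFailed,
}

struct ExecutionStatus {
    state: EngineStateSketch,
    /// Whether the most recent `newPayload` call returned an error.
    last_new_payload_errored: bool,
}

impl ExecutionStatus {
    /// `el_offline` is true if the engine is `Offline` or `AuthFailed`,
    /// or if the latest `newPayload` call failed.
    fn el_offline(&self) -> bool {
        matches!(
            self.state,
            EngineStateSketch::Offline | EngineStateSketch::AuthFailed
        ) || self.last_new_payload_errored
    }
}
```

Note that a `Syncing` EL is deliberately *not* reported as offline: an EL that is reachable but catching up is a different (and more recoverable) condition than one that cannot be spoken to at all.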
## Why track `newPayload` errors?
Tracking the EL's online/offline status is too coarse-grained to be useful in practice, because:
- If the EL is timing out on some calls, it's unlikely to time out on the `upcheck` call, which is _just_ `eth_syncing`. Every failed call is followed by an upcheck [here](693886b941/beacon_node/execution_layer/src/engines.rs (L372-L380)), which would mask the failure and keep the status _online_.
- The `newPayload` call is the most likely to time out. It's the call in which ELs tend to do most of their work (often 1-2 seconds), with `forkchoiceUpdated` usually returning much faster (<50ms).
- If `newPayload` is failing consistently (e.g. timing out) then this is a good indication that either the node's EL is in trouble, or the network as a whole is. In the first case, validator clients _should_ prefer other BNs if they have any available. In the second case, all of their BNs will likely report `el_offline` and they'll just have to proceed with trying to use them.
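The resulting VC behaviour (prefer healthy nodes, but keep degraded ones as a last resort) can be sketched as a simple stable sort. This is an illustrative toy, not Lighthouse's actual fallback code; `NodeStatus` and `sort_by_preference` are made-up names.

```rust
// Illustrative sketch (hypothetical types, not Lighthouse's fallback code)
// of the ordering described above: a node whose EL is offline is treated as
// unsynced, so it sorts after fully healthy nodes but remains usable.

struct NodeStatus {
    name: &'static str,
    is_synced: bool,
    el_offline: bool,
}

fn sort_by_preference(mut nodes: Vec<NodeStatus>) -> Vec<NodeStatus> {
    // Key is "is degraded": `false < true`, and `sort_by_key` is stable,
    // so healthy nodes come first, each group in its original order.
    nodes.sort_by_key(|n| !n.is_synced || n.el_offline);
    nodes
}
```

Because the degraded nodes are sorted last rather than filtered out, a VC whose every BN reports `el_offline` still has candidates to try, matching the "network as a whole is in trouble" case above.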
## Additional Changes
- Add a utility method, `ForkName::latest`, which is convenient for test writing and probably other things too.
- Delete some stale comments from when we used to support multiple execution nodes.
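A `latest()` helper of this shape might look roughly like the toy version below. The enum name and variant list here are illustrative and will lag the real `ForkName` in Lighthouse; the point is only the pattern of deriving "latest" from an ordered list so it stays correct as forks are appended.

```rust
// Toy re-creation of a `latest()` convenience on a fork-name enum.
// `ForkNameSketch` and its variants are illustrative, not Lighthouse's
// actual `ForkName`.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ForkNameSketch {
    Base,
    Altair,
    Merge,
    Capella,
}

impl ForkNameSketch {
    /// All forks in activation order.
    fn list_all() -> Vec<Self> {
        vec![Self::Base, Self::Altair, Self::Merge, Self::Capella]
    }

    /// The most recent fork. Handy in tests that want "the newest
    /// everything" without naming a specific fork.
    fn latest() -> Self {
        *Self::list_all().last().expect("list_all is never empty")
    }
}
```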
```diff
@@ -28,7 +28,7 @@ const UPDATE_REQUIRED_LOG_HINT: &str = "this VC or the remote BN may need updati
 /// too early, we risk switching nodes between the time of publishing an attestation and publishing
 /// an aggregate; this may result in a missed aggregation. If we set this time too late, we risk not
 /// having the correct nodes up and running prior to the start of the slot.
-const SLOT_LOOKAHEAD: Duration = Duration::from_secs(1);
+const SLOT_LOOKAHEAD: Duration = Duration::from_secs(2);
 
 /// Indicates a measurement of latency between the VC and a BN.
 pub struct LatencyMeasurement {
@@ -52,7 +52,7 @@ pub fn start_fallback_updater_service<T: SlotClock + 'static, E: EthSpec>(
 
     let future = async move {
         loop {
-            beacon_nodes.update_unready_candidates().await;
+            beacon_nodes.update_all_candidates().await;
 
             let sleep_time = beacon_nodes
                 .slot_clock
@@ -385,33 +385,21 @@ impl<T: SlotClock, E: EthSpec> BeaconNodeFallback<T, E> {
         n
     }
 
-    /// Loop through any `self.candidates` that we don't think are online, compatible or synced and
-    /// poll them to see if their status has changed.
-    ///
-    /// We do not poll nodes that are synced to avoid sending additional requests when everything is
-    /// going smoothly.
-    pub async fn update_unready_candidates(&self) {
-        let mut futures = Vec::new();
-        for candidate in &self.candidates {
-            // There is a potential race condition between having the read lock and the write
-            // lock. The worst case of this race is running `try_become_ready` twice, which is
-            // acceptable.
-            //
-            // Note: `RequireSynced` is always set to false here. This forces us to recheck the sync
-            // status of nodes that were previously not-synced.
-            if candidate.status(RequireSynced::Yes).await.is_err() {
-                // There exists a race-condition that could result in `refresh_status` being called
-                // when the status does not require refreshing anymore. This is deemed an
-                // acceptable inefficiency.
-                futures.push(candidate.refresh_status(
-                    self.slot_clock.as_ref(),
-                    &self.spec,
-                    &self.log,
-                ));
-            }
-        }
+    /// Loop through ALL candidates in `self.candidates` and update their sync status.
+    ///
+    /// It is possible for a node to return an unsynced status while continuing to serve
+    /// low quality responses. To route around this it's best to poll all connected beacon nodes.
+    /// A previous implementation of this function polled only the unavailable BNs.
+    pub async fn update_all_candidates(&self) {
+        let futures = self
+            .candidates
+            .iter()
+            .map(|candidate| {
+                candidate.refresh_status(self.slot_clock.as_ref(), &self.spec, &self.log)
+            })
+            .collect::<Vec<_>>();
 
-        //run all updates concurrently and ignore results
+        // run all updates concurrently and ignore errors
        let _ = future::join_all(futures).await;
     }
```