Improved peer management (#2993)

## Issue Addressed

I noticed some excess and unnecessary discovery queries in some logs. We were pruning our peers down to our outbound target and then having some of them disconnect. Once we are below this threshold we try to find more peers, even if we are at our peer limit. The request is futile because we have no more peer slots.

This PR corrects this issue and advances the pruning mechanism to favour subnet peers. 

An overview of the new logic:
- We prune peers down to a target outbound peer count which is higher than the minimum outbound peer count.
- We only search for more peers if there is room to do so and we are below the minimum outbound peer count, not the target. This gives us some buffer for peers to disconnect. The buffer is currently 10%.
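
The two rules above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `PeerLimits`, its fields, and both method names are hypothetical, and how the 10% buffer is derived is left to the caller.

```rust
/// Hypothetical peer-count limits. Names are illustrative, not Lighthouse's.
#[derive(Debug, Clone, Copy)]
struct PeerLimits {
    /// Hard cap on the number of connected peers.
    max_peers: usize,
    /// Below this many outbound peers we want to discover more.
    min_outbound: usize,
    /// Pruning never takes outbound peers below this count.
    /// Assumed to be set above `min_outbound` to leave a buffer.
    target_outbound: usize,
}

impl PeerLimits {
    /// Discovery is only worthwhile when we are below the minimum outbound
    /// count AND there is a free slot for whatever peer we find.
    fn should_discover(&self, connected: usize, outbound: usize) -> bool {
        outbound < self.min_outbound && connected < self.max_peers
    }

    /// Number of outbound peers pruning may still remove before reaching
    /// the outbound target.
    fn prunable_outbound(&self, outbound: usize) -> usize {
        outbound.saturating_sub(self.target_outbound)
    }
}
```

Because pruning stops at `target_outbound` while discovery only starts below `min_outbound`, a few peers can disconnect after a prune without immediately triggering a new query.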

The modified pruning logic is documented in the code but for reference it should do the following:
- Prune peers with bad scores first
- If we need to prune more peers, then prune peers that are subscribed to a long-lived subnet
- If we still need to prune peers, then prune peers on whichever subnets we have the highest density of peers, which should drive towards a uniform distribution of peers across all subnets.
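
The last step, density-based pruning, might look something like the sketch below. It is a simplified illustration under stated assumptions, not the implementation in this PR: the function name is hypothetical, peers are plain `u32` ids, each peer is assumed to sit on exactly one subnet, and ties are broken by the lowest subnet id purely to keep the result deterministic.

```rust
use std::collections::HashMap;

/// Picks up to `excess` peers to prune, always taking the next peer from
/// whichever subnet currently has the most remaining peers.
/// `subnet_peers` maps a subnet id to the (hypothetical) ids of peers on it.
fn prune_by_subnet_density(
    mut subnet_peers: HashMap<u64, Vec<u32>>,
    excess: usize,
) -> Vec<u32> {
    let mut pruned = Vec::with_capacity(excess);
    while pruned.len() < excess {
        // Find the most populated non-empty subnet.
        // Ties go to the lowest subnet id (an arbitrary choice for this sketch).
        let densest = subnet_peers
            .iter()
            .filter(|(_, peers)| !peers.is_empty())
            .max_by(|a, b| a.1.len().cmp(&b.1.len()).then(b.0.cmp(a.0)))
            .map(|(subnet, _)| *subnet);
        // No peers left anywhere: stop early.
        let Some(subnet) = densest else { break };
        if let Some(peer) = subnet_peers.get_mut(&subnet).and_then(|p| p.pop()) {
            pruned.push(peer);
        }
    }
    pruned
}
```

Removing one peer at a time from the densest subnet levels the per-subnet counts, which is the uniformity the description aims for.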

This will need a bit of testing as it modifies some significant peer management behaviours in Lighthouse.
Age Manning
2022-02-18 02:36:43 +00:00
parent da4ca024f1
commit 3ebb8b0244
7 changed files with 876 additions and 81 deletions


@@ -29,6 +29,9 @@ const BANNED_PEERS_PER_IP_THRESHOLD: usize = 5;
/// Relative factor of peers that are allowed to have a negative gossipsub score without penalizing
/// them in lighthouse.
const ALLOWED_NEGATIVE_GOSSIPSUB_FACTOR: f32 = 0.1;
/// The time we allow peers to be in the dialing state in our PeerDb before we revert them to a
/// disconnected state.
const DIAL_TIMEOUT: u64 = 15;
/// Storage of known peers, their reputation and information
pub struct PeerDB<TSpec: EthSpec> {
@@ -322,6 +325,32 @@ impl<TSpec: EthSpec> PeerDB<TSpec> {
    /* Mutability */
    /// Cleans up the connection state of dialing peers.
    // Libp2p dials peer-ids, but sometimes the response is from another peer-id or libp2p
    // returns dial errors without a peer-id attached. This function reverts peers that have a
    // dialing status longer than DIAL_TIMEOUT seconds to a disconnected status. This is important because
    // we count the number of dialing peers in our inbound connections.
    pub fn cleanup_dialing_peers(&mut self) {
        let peers_to_disconnect: Vec<_> = self
            .peers
            .iter()
            .filter_map(|(peer_id, info)| {
                if let PeerConnectionStatus::Dialing { since } = info.connection_status() {
                    if (*since) + std::time::Duration::from_secs(DIAL_TIMEOUT)
                        < std::time::Instant::now()
                    {
                        return Some(*peer_id);
                    }
                }
                None
            })
            .collect();
        for peer_id in peers_to_disconnect {
            self.update_connection_state(&peer_id, NewConnectionState::Disconnected);
        }
    }
    /// Allows the sync module to update the sync status of peers. Returns None if the peer doesn't
    /// exist and returns Some(bool) representing if the sync state was modified.
    pub fn update_sync_status(