mirror of
https://github.com/sigp/lighthouse.git
synced 2026-03-15 02:42:38 +00:00
Optimise attestation selection proof signing (#4033)
## Issue Addressed
Closes #3963 (hopefully)
## Proposed Changes
Compute attestation selection proofs gradually each slot rather than in a single `join_all` at the start of each epoch. On a machine with 5k validators this replaces 5k tasks signing 5k proofs with 1 task that signs 5k/32 ~= 160 proofs each slot.
Based on testing with Goerli validators this seems to reduce the average time to produce a signature by preventing Tokio and the OS from falling over each other trying to run hundreds of threads. My testing so far has been with local keystores, which run on a dynamic pool of up to 512 OS threads because they use [`spawn_blocking`](https://docs.rs/tokio/1.11.0/tokio/task/fn.spawn_blocking.html) (and we haven't changed the default).
An earlier version of this PR hyper-optimised the time-per-signature metric to the detriment of the entire system's performance (see the reverted commits). The current PR is conservative in that it avoids touching the attestation service at all. I think there's more optimising to do here, but we can come back for that in a future PR rather than expanding the scope of this one.
The new algorithm for attestation selection proofs is:
- We sign a small batch of selection proofs each slot, for slots up to 8 slots in the future. On average we'll sign one slot's worth of proofs per slot, with an 8 slot lookahead.
- The batch is signed halfway through the slot when there is unlikely to be contention for signature production (blocks are <4s, attestations are ~4-6 seconds, aggregates are 8s+).
## Performance Data
_See first comment for updated graphs_.
Graph of median signing times before this PR:

Graph of update attesters metric (includes selection proof signing) before this PR:

Median signing time after this PR (prototype from 12:00, updated version from 13:30):

99th percentile on signing times (bounded attestation signing from 16:55, now removed):

Attester map update timing after this PR:

Selection proof signings per second change:

## Link to late blocks
I believe this is related to the slow block signings because logs from Stakely in #3963 show these two logs almost 5 seconds apart:
> Feb 23 18:56:23.978 INFO Received unsigned block, slot: 5862880, service: block, module: validator_client::block_service:393
> Feb 23 18:56:28.552 INFO Publishing signed block, slot: 5862880, service: block, module: validator_client::block_service:416
The only thing that happens between those two logs is the signing of the block:
0fb58a680d/validator_client/src/block_service.rs (L410-L414)
Helpfully, Stakely noticed this issue without any Lighthouse BNs in the mix, which pointed to a clear issue in the VC.
## TODO
- [x] Further testing on testnet infrastructure.
- [x] Make the attestation signing parallelism configurable.
This commit is contained in:
@@ -32,6 +32,7 @@ pub const PROPOSER_DUTIES_HTTP_GET: &str = "proposer_duties_http_get";
|
||||
pub const VALIDATOR_ID_HTTP_GET: &str = "validator_id_http_get";
|
||||
pub const SUBSCRIPTIONS_HTTP_POST: &str = "subscriptions_http_post";
|
||||
pub const UPDATE_PROPOSERS: &str = "update_proposers";
|
||||
pub const ATTESTATION_SELECTION_PROOFS: &str = "attestation_selection_proofs";
|
||||
pub const SUBSCRIPTIONS: &str = "subscriptions";
|
||||
pub const LOCAL_KEYSTORE: &str = "local_keystore";
|
||||
pub const WEB3SIGNER: &str = "web3signer";
|
||||
@@ -172,6 +173,10 @@ lazy_static::lazy_static! {
|
||||
"Duration to obtain a signature",
|
||||
&["type"]
|
||||
);
|
||||
pub static ref BLOCK_SIGNING_TIMES: Result<Histogram> = try_create_histogram(
|
||||
"vc_block_signing_times_seconds",
|
||||
"Duration to obtain a signature for a block",
|
||||
);
|
||||
|
||||
pub static ref ATTESTATION_DUTY: Result<IntGaugeVec> = try_create_int_gauge_vec(
|
||||
"vc_attestation_duty_slot",
|
||||
|
||||
Reference in New Issue
Block a user