mirror of
https://github.com/sigp/lighthouse.git
synced 2026-04-17 12:58:31 +00:00
Implement database temp states to reduce memory usage (#1798)
## Issue Addressed Closes #800 Closes #1713 ## Proposed Changes Implement the temporary state storage algorithm described in #800. Specifically: * Add `DBColumn::BeaconStateTemporary`, for storing 0-length temporary marker values. * Store intermediate states immediately as they are created, marked temporary. Delete the temporary flag if the block is processed successfully. * Add a garbage collection process to delete leftover temporary states on start-up. * Bump the database schema version to 2 so that a DB with temporary states can't accidentally be used with older versions of the software. The auto-migration is a no-op, but puts in place some infra that we can use for future migrations (e.g. #1784) ## Additional Info There are two known race conditions, one potentially causing permanent faults (hopefully rare), and the other insignificant. ### Race 1: Permanent state marked temporary EDIT: this has been fixed by the addition of a lock around the relevant critical section There are 2 threads that are trying to store 2 different blocks that share some intermediate states (e.g. they both skip some slots from the current head). Consider this sequence of events: 1. Thread 1 checks if state `s` already exists, and seeing that it doesn't, prepares an atomic commit of `(s, s_temporary_flag)`. 2. Thread 2 does the same, but also gets as far as committing the state txn, finishing the processing of its block, and _deleting_ the temporary flag. 3. Thread 1 is (finally) scheduled again, and marks `s` as temporary with its transaction. 4. a) The process is killed, or thread 1's block fails verification and the temp flag is not deleted. This is a permanent failure! Any attempt to load state `s` will fail... hope it isn't on the main chain! Alternatively (4b) happens... b) Thread 1 finishes, and re-deletes the temporary flag. In this case the failure is transient, state `s` will disappear temporarily, but will come back once thread 1 finishes running. I _hope_ that steps 1-3 only happen very rarely, and 4a even more rarely. It's hard to know This once again begs the question of why we're using LevelDB (#483), when it clearly doesn't care about atomicity! A ham-fisted fix would be to wrap the hot and cold DBs in locks, which would bring us closer to how other DBs handle read-write transactions. E.g. [LMDB only allows one R/W transaction at a time](https://docs.rs/lmdb/0.8.0/lmdb/struct.Environment.html#method.begin_rw_txn). ### Race 2: Temporary state returned from `get_state` I don't think this race really matters, but in `load_hot_state`, if another thread stores a state between when we call `load_state_temporary_flag` and when we call `load_hot_state_summary`, then we could end up returning that state even though it's only a temporary state. I can't think of any case where this would be relevant, and I suspect if it did come up, it would be safe/recoverable (having data is safer than _not_ having data). This could be fixed by using a LevelDB read snapshot, but that would require substantial changes to how we read all our values, so I don't think it's worth it right now.
This commit is contained in:
@@ -1498,7 +1498,7 @@ impl<T: BeaconChainTypes> BeaconChain<T> {
|
||||
let block_root = fully_verified_block.block_root;
|
||||
let mut state = fully_verified_block.state;
|
||||
let current_slot = self.slot()?;
|
||||
let mut ops = fully_verified_block.intermediate_states;
|
||||
let mut ops = fully_verified_block.confirmation_db_batch;
|
||||
|
||||
let attestation_observation_timer =
|
||||
metrics::start_timer(&metrics::BLOCK_PROCESSING_ATTESTATION_OBSERVATION);
|
||||
@@ -1623,13 +1623,16 @@ impl<T: BeaconChainTypes> BeaconChain<T> {
|
||||
|
||||
let db_write_timer = metrics::start_timer(&metrics::BLOCK_PROCESSING_DB_WRITE);
|
||||
|
||||
// Store all the states between the parent block state and this block's slot, the block and state.
|
||||
ops.push(StoreOp::PutBlock(block_root.into(), signed_block.clone()));
|
||||
ops.push(StoreOp::PutState(
|
||||
block.state_root.into(),
|
||||
Cow::Borrowed(&state),
|
||||
// Store the block and its state, and execute the confirmation batch for the intermediate
|
||||
// states, which will delete their temporary flags.
|
||||
ops.push(StoreOp::PutBlock(
|
||||
block_root,
|
||||
Box::new(signed_block.clone()),
|
||||
));
|
||||
ops.push(StoreOp::PutState(block.state_root, &state));
|
||||
let txn_lock = self.store.hot_db.begin_rw_transaction();
|
||||
self.store.do_atomically(ops)?;
|
||||
drop(txn_lock);
|
||||
|
||||
// The fork choice write-lock is dropped *after* the on-disk database has been updated.
|
||||
// This prevents inconsistency between the two at the expense of concurrency.
|
||||
|
||||
Reference in New Issue
Block a user