Implement database temp states to reduce memory usage (#1798)

## Issue Addressed

Closes #800
Closes #1713

## Proposed Changes

Implement the temporary state storage algorithm described in #800. Specifically:

* Add `DBColumn::BeaconStateTemporary`, for storing 0-length temporary marker values.
* Store intermediate states immediately as they are created, marked temporary. Delete the temporary flag if the block is processed successfully.
* Add a garbage collection process to delete leftover temporary states on start-up.
* Bump the database schema version to 2 so that a DB with temporary states can't accidentally be used with older versions of the software. The auto-migration is a no-op, but puts in place some infra that we can use for future migrations (e.g. #1784).
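
For illustration, the lifecycle in the list above can be sketched with a toy in-memory store. This is a minimal sketch, not the code from this PR: `ToyStore`, `Column`, `Root` and all method names other than `remove_garbage` are hypothetical stand-ins, and the real store writes the state and its flag in one atomic batch rather than as two map inserts.

```rust
use std::collections::BTreeMap;

/// Hypothetical column tags; the real store uses `DBColumn` variants.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum Column {
    State,
    StateTemporary,
}

/// Stand-in for `Hash256`.
type Root = [u8; 4];

/// Toy in-memory key-value store keyed by (column, root).
#[derive(Default)]
struct ToyStore {
    kv: BTreeMap<(Column, Root), Vec<u8>>,
}

impl ToyStore {
    /// Store a state together with a 0-length temporary marker.
    fn put_state_temporary(&mut self, root: Root, state: Vec<u8>) {
        self.kv.insert((Column::State, root), state);
        self.kv.insert((Column::StateTemporary, root), vec![]);
    }

    /// Called once the block has been processed successfully.
    fn delete_temporary_flag(&mut self, root: &Root) {
        self.kv.remove(&(Column::StateTemporary, *root));
    }

    /// Temporary states are hidden from readers.
    fn get_state(&self, root: &Root) -> Option<&Vec<u8>> {
        if self.kv.contains_key(&(Column::StateTemporary, *root)) {
            return None;
        }
        self.kv.get(&(Column::State, *root))
    }

    /// Start-up pass: delete every state that still carries a marker,
    /// along with the marker itself. Returns the number of states removed.
    fn remove_garbage(&mut self) -> usize {
        let leftovers: Vec<Root> = self
            .kv
            .keys()
            .filter(|(col, _)| *col == Column::StateTemporary)
            .map(|(_, root)| *root)
            .collect();
        for root in &leftovers {
            self.kv.remove(&(Column::State, *root));
            self.kv.remove(&(Column::StateTemporary, *root));
        }
        leftovers.len()
    }
}
```

In this sketch a state whose block was imported has its flag deleted and becomes visible through `get_state`, while a state whose block was abandoned stays hidden until `remove_garbage` deletes it on the next start-up.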

## Additional Info

There are two known race conditions, one potentially causing permanent faults (hopefully rare), and the other insignificant.

### Race 1: Permanent state marked temporary

EDIT: this has been fixed by the addition of a lock around the relevant critical section.

Suppose two threads are trying to store two different blocks that share some intermediate states (e.g. both blocks skip some slots from the current head). Consider this sequence of events:

1. Thread 1 checks if state `s` already exists, and seeing that it doesn't, prepares an atomic commit of `(s, s_temporary_flag)`.
2. Thread 2 does the same, but also gets as far as committing the state txn, finishing the processing of its block, and _deleting_ the temporary flag.
3. Thread 1 is (finally) scheduled again, and marks `s` as temporary with its transaction.
4.
    a) The process is killed, or thread 1's block fails verification and the temp flag is not deleted. This is a permanent failure! Any attempt to load state `s` will fail... hope it isn't on the main chain! Alternatively (4b) happens...
    b) Thread 1 finishes, and re-deletes the temporary flag. In this case the failure is transient: state `s` will disappear temporarily, but will come back once thread 1 finishes running.

I _hope_ that steps 1-3 only happen very rarely, and 4a even more rarely. It's hard to know.

This once again raises the question of why we're using LevelDB (#483), when it clearly doesn't care about atomicity! A ham-fisted fix would be to wrap the hot and cold DBs in locks, which would bring us closer to how other DBs handle read-write transactions. E.g. [LMDB only allows one R/W transaction at a time](https://docs.rs/lmdb/0.8.0/lmdb/struct.Environment.html#method.begin_rw_txn).
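
As a rough illustration of why holding a lock across the critical section closes Race 1, here is a minimal sketch. It assumes a toy store where a single mutex guards both the states and the temporary flags; `LockedStore`, `Db` and `put_state_if_absent` are hypothetical names and do not reflect how the lock is actually wired into the store.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

/// Stand-in for `Hash256`.
type Root = u64;

#[derive(Default)]
struct Db {
    states: HashMap<Root, Vec<u8>>,
    temporary: HashSet<Root>,
}

/// Hypothetical store: one mutex serialises the check-then-commit sequence.
#[derive(Default)]
struct LockedStore {
    db: Mutex<Db>,
}

impl LockedStore {
    /// Store `state` as temporary unless it already exists.
    /// Returns `true` if this call wrote the state.
    fn put_state_if_absent(&self, root: Root, state: Vec<u8>) -> bool {
        // The guard is held across both the existence check and the commit,
        // so no other thread can commit `root` in between (steps 1-3 above).
        let mut db = self.db.lock().unwrap();
        if db.states.contains_key(&root) {
            return false;
        }
        db.states.insert(root, state);
        db.temporary.insert(root);
        true
    }

    /// Called once the block has been processed successfully.
    fn delete_temporary_flag(&self, root: Root) {
        self.db.lock().unwrap().temporary.remove(&root);
    }

    fn is_temporary(&self, root: Root) -> bool {
        self.db.lock().unwrap().temporary.contains(&root)
    }
}
```

With this shape, a thread that arrives after another thread has committed `s` and deleted its flag sees that `s` exists and skips the write, so it can no longer re-mark a permanent state as temporary.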

### Race 2: Temporary state returned from `get_state`

I don't think this race really matters. In `load_hot_state`, if another thread stores a state between our call to `load_state_temporary_flag` and our call to `load_hot_state_summary`, we could end up returning that state even though it's only a temporary state. I can't think of any case where this would be relevant, and I suspect that if it did come up, it would be safe/recoverable (having data is safer than _not_ having data).

This could be fixed by using a LevelDB read snapshot, but that would require substantial changes to how we read all our values, so I don't think it's worth it right now.
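
To make the window concrete, here is a small sketch that replays the interleaving deterministically. It is not the LevelDB snapshot API: the "snapshot" is emulated by cloning a toy in-memory map, and `ToyStore`, `load_racy`, `load_snapshot` and the `interleave` callback (which stands in for the other thread) are all hypothetical.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::RwLock;

/// Stand-in for `Hash256`.
type Root = u64;

#[derive(Default, Clone)]
struct Db {
    states: HashMap<Root, Vec<u8>>,
    temporary: HashSet<Root>,
}

#[derive(Default)]
struct ToyStore {
    db: RwLock<Db>,
}

impl ToyStore {
    /// Store a state together with its temporary flag.
    fn put_state_temporary(&self, root: Root, state: Vec<u8>) {
        let mut db = self.db.write().unwrap();
        db.states.insert(root, state);
        db.temporary.insert(root);
    }

    /// Two independent reads. `interleave` runs between them, standing in
    /// for another thread that stores a state at the worst possible moment.
    fn load_racy(&self, root: Root, interleave: impl FnOnce()) -> Option<Vec<u8>> {
        if self.db.read().unwrap().temporary.contains(&root) {
            return None;
        }
        interleave();
        self.db.read().unwrap().states.get(&root).cloned()
    }

    /// Both reads go through one point-in-time copy of the data, so a write
    /// that lands after the copy is taken is invisible to this call.
    fn load_snapshot(&self, root: Root, interleave: impl FnOnce()) -> Option<Vec<u8>> {
        let snapshot = self.db.read().unwrap().clone();
        interleave();
        if snapshot.temporary.contains(&root) {
            return None;
        }
        snapshot.states.get(&root).cloned()
    }
}
```

If `interleave` stores a temporary state for the requested root, `load_racy` returns that state while `load_snapshot` returns `None`, which is the difference a read snapshot would make.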
Michael Sproul
2020-10-23 01:27:51 +00:00
parent 66f0cf4430
commit acd49d988d
14 changed files with 343 additions and 96 deletions


@@ -5,6 +5,7 @@ use crate::config::StoreConfig;
use crate::forwards_iter::HybridForwardsBlockRootsIterator;
use crate::impls::beacon_state::{get_full_state, store_full_state};
use crate::iter::{ParentRootBlockIterator, StateRootsIterator};
use crate::leveldb_store::BytesKey;
use crate::leveldb_store::LevelDB;
use crate::memory_store::MemoryStore;
use crate::metadata::{
@@ -15,6 +16,7 @@ use crate::{
get_key_for_col, DBColumn, Error, ItemStore, KeyValueStoreOp, PartialBeaconState, StoreItem,
StoreOp,
};
use leveldb::iterator::LevelDBIterator;
use lru::LruCache;
use parking_lot::{Mutex, RwLock};
use slog::{debug, error, info, trace, warn, Logger};
@@ -46,8 +48,6 @@ pub enum BlockReplay {
/// intermittent "restore point" states pre-finalization.
#[derive(Debug)]
pub struct HotColdDB<E: EthSpec, Hot: ItemStore<E>, Cold: ItemStore<E>> {
/// The schema version. Loaded from disk on initialization.
schema_version: SchemaVersion,
/// The slot and state root at the point where the database is split between hot and cold.
///
/// States with slots less than `split.slot` are in the cold DB, while states with slots
@@ -73,8 +73,8 @@ pub struct HotColdDB<E: EthSpec, Hot: ItemStore<E>, Cold: ItemStore<E>> {
#[derive(Debug, PartialEq)]
pub enum HotColdDBError {
UnsupportedSchemaVersion {
software_version: SchemaVersion,
disk_version: SchemaVersion,
target_version: SchemaVersion,
current_version: SchemaVersion,
},
/// Recoverable error indicating that the database freeze point couldn't be updated
/// due to the finalized block not lying on an epoch boundary (should be infrequent).
@@ -101,6 +101,9 @@ pub enum HotColdDBError {
slots_per_epoch: u64,
},
RestorePointBlockHashError(BeaconStateError),
IterationError {
unexpected_key: BytesKey,
},
}
impl<E: EthSpec> HotColdDB<E, MemoryStore<E>, MemoryStore<E>> {
@@ -112,7 +115,6 @@ impl<E: EthSpec> HotColdDB<E, MemoryStore<E>, MemoryStore<E>> {
Self::verify_slots_per_restore_point(config.slots_per_restore_point)?;
let db = HotColdDB {
schema_version: CURRENT_SCHEMA_VERSION,
split: RwLock::new(Split::default()),
cold_db: MemoryStore::open(),
hot_db: MemoryStore::open(),
@@ -141,7 +143,6 @@ impl<E: EthSpec> HotColdDB<E, LevelDB<E>, LevelDB<E>> {
Self::verify_slots_per_restore_point(config.slots_per_restore_point)?;
let db = HotColdDB {
schema_version: CURRENT_SCHEMA_VERSION,
split: RwLock::new(Split::default()),
cold_db: LevelDB::open(cold_path)?,
hot_db: LevelDB::open(hot_path)?,
@@ -153,15 +154,15 @@ impl<E: EthSpec> HotColdDB<E, LevelDB<E>, LevelDB<E>> {
};
// Ensure that the schema version of the on-disk database matches the software.
// In the future, this would be the spot to hook in auto-migration, etc.
// If the version is mismatched, an automatic migration will be attempted.
if let Some(schema_version) = db.load_schema_version()? {
if schema_version != CURRENT_SCHEMA_VERSION {
return Err(HotColdDBError::UnsupportedSchemaVersion {
software_version: CURRENT_SCHEMA_VERSION,
disk_version: schema_version,
}
.into());
}
debug!(
db.log,
"Attempting schema migration";
"from_version" => schema_version.as_u64(),
"to_version" => CURRENT_SCHEMA_VERSION.as_u64(),
);
db.migrate_schema(schema_version, CURRENT_SCHEMA_VERSION)?;
} else {
db.store_schema_version(CURRENT_SCHEMA_VERSION)?;
}
@@ -178,14 +179,41 @@ impl<E: EthSpec> HotColdDB<E, LevelDB<E>, LevelDB<E>> {
info!(
db.log,
"Hot-Cold DB initialized";
"version" => db.schema_version.0,
"version" => CURRENT_SCHEMA_VERSION.as_u64(),
"split_slot" => split.slot,
"split_state" => format!("{:?}", split.state_root)
);
*db.split.write() = split;
}
// Finally, run a garbage collection pass.
db.remove_garbage()?;
Ok(db)
}
/// Return an iterator over the state roots of all temporary states.
pub fn iter_temporary_state_roots<'a>(
&'a self,
) -> impl Iterator<Item = Result<Hash256, Error>> + 'a {
let column = DBColumn::BeaconStateTemporary;
let start_key =
BytesKey::from_vec(get_key_for_col(column.into(), Hash256::zero().as_bytes()));
let keys_iter = self.hot_db.keys_iter();
keys_iter.seek(&start_key);
keys_iter
.take_while(move |key| key.matches_column(column))
.map(move |bytes_key| {
bytes_key.remove_column(column).ok_or_else(|| {
HotColdDBError::IterationError {
unexpected_key: bytes_key,
}
.into()
})
})
}
}
impl<E: EthSpec, Hot: ItemStore<E>, Cold: ItemStore<E>> HotColdDB<E, Hot, Cold> {
@@ -391,39 +419,41 @@ impl<E: EthSpec, Hot: ItemStore<E>, Cold: ItemStore<E>> HotColdDB<E, Hot, Cold>
let mut key_value_batch = Vec::with_capacity(batch.len());
for op in batch {
match op {
StoreOp::PutBlock(block_hash, block) => {
let untyped_hash: Hash256 = (*block_hash).into();
key_value_batch.push(block.as_kv_store_op(untyped_hash));
StoreOp::PutBlock(block_root, block) => {
key_value_batch.push(block.as_kv_store_op(*block_root));
}
StoreOp::PutState(state_hash, state) => {
let untyped_hash: Hash256 = (*state_hash).into();
self.store_hot_state(&untyped_hash, state, &mut key_value_batch)?;
StoreOp::PutState(state_root, state) => {
self.store_hot_state(state_root, state, &mut key_value_batch)?;
}
StoreOp::PutStateSummary(state_hash, summary) => {
let untyped_hash: Hash256 = (*state_hash).into();
key_value_batch.push(summary.as_kv_store_op(untyped_hash));
StoreOp::PutStateSummary(state_root, summary) => {
key_value_batch.push(summary.as_kv_store_op(*state_root));
}
StoreOp::DeleteBlock(block_hash) => {
let untyped_hash: Hash256 = (*block_hash).into();
let key =
get_key_for_col(DBColumn::BeaconBlock.into(), untyped_hash.as_bytes());
StoreOp::PutStateTemporaryFlag(state_root) => {
key_value_batch.push(TemporaryFlag.as_kv_store_op(*state_root));
}
StoreOp::DeleteStateTemporaryFlag(state_root) => {
let db_key =
get_key_for_col(TemporaryFlag::db_column().into(), state_root.as_bytes());
key_value_batch.push(KeyValueStoreOp::DeleteKey(db_key));
}
StoreOp::DeleteBlock(block_root) => {
let key = get_key_for_col(DBColumn::BeaconBlock.into(), block_root.as_bytes());
key_value_batch.push(KeyValueStoreOp::DeleteKey(key));
}
StoreOp::DeleteState(state_hash, slot) => {
let untyped_hash: Hash256 = (*state_hash).into();
let state_summary_key = get_key_for_col(
DBColumn::BeaconStateSummary.into(),
untyped_hash.as_bytes(),
);
StoreOp::DeleteState(state_root, slot) => {
let state_summary_key =
get_key_for_col(DBColumn::BeaconStateSummary.into(), state_root.as_bytes());
key_value_batch.push(KeyValueStoreOp::DeleteKey(state_summary_key));
if *slot % E::slots_per_epoch() == 0 {
if slot.map_or(true, |slot| slot % E::slots_per_epoch() == 0) {
let state_key =
get_key_for_col(DBColumn::BeaconState.into(), untyped_hash.as_bytes());
get_key_for_col(DBColumn::BeaconState.into(), state_root.as_bytes());
key_value_batch.push(KeyValueStoreOp::DeleteKey(state_key));
}
}
@@ -440,18 +470,20 @@ impl<E: EthSpec, Hot: ItemStore<E>, Cold: ItemStore<E>> HotColdDB<E, Hot, Cold>
for op in &batch {
match op {
StoreOp::PutBlock(block_hash, block) => {
let untyped_hash: Hash256 = (*block_hash).into();
guard.put(untyped_hash, block.clone());
StoreOp::PutBlock(block_root, block) => {
guard.put(*block_root, (**block).clone());
}
StoreOp::PutState(_, _) => (),
StoreOp::PutStateSummary(_, _) => (),
StoreOp::DeleteBlock(block_hash) => {
let untyped_hash: Hash256 = (*block_hash).into();
guard.pop(&untyped_hash);
StoreOp::PutStateTemporaryFlag(_) => (),
StoreOp::DeleteStateTemporaryFlag(_) => (),
StoreOp::DeleteBlock(block_root) => {
guard.pop(block_root);
}
StoreOp::DeleteState(_, _) => (),
@@ -500,6 +532,12 @@ impl<E: EthSpec, Hot: ItemStore<E>, Cold: ItemStore<E>> HotColdDB<E, Hot, Cold>
) -> Result<Option<BeaconState<E>>, Error> {
metrics::inc_counter(&metrics::BEACON_STATE_HOT_GET_COUNT);
// If the state is marked as temporary, do not return it. It will become visible
// only once its transaction commits and deletes its temporary flag.
if self.load_state_temporary_flag(state_root)?.is_some() {
return Ok(None);
}
if let Some(HotStateSummary {
slot,
latest_block_root,
@@ -785,7 +823,7 @@ impl<E: EthSpec, Hot: ItemStore<E>, Cold: ItemStore<E>> HotColdDB<E, Hot, Cold>
}
/// Store the database schema version.
fn store_schema_version(&self, schema_version: SchemaVersion) -> Result<(), Error> {
pub(crate) fn store_schema_version(&self, schema_version: SchemaVersion) -> Result<(), Error> {
self.hot_db.put(&SCHEMA_VERSION_KEY, &schema_version)
}
@@ -846,6 +884,17 @@ impl<E: EthSpec, Hot: ItemStore<E>, Cold: ItemStore<E>> HotColdDB<E, Hot, Cold>
self.hot_db.get(state_root)
}
/// Load the temporary flag for a state root, if one exists.
///
/// Returns `Some` if the state is temporary, or `None` if the state is permanent or does not
/// exist -- you should call `load_hot_state_summary` to find out which.
pub fn load_state_temporary_flag(
&self,
state_root: &Hash256,
) -> Result<Option<TemporaryFlag>, Error> {
self.hot_db.get(state_root)
}
/// Check that the restore point frequency is valid.
///
/// Specifically, check that it is:
@@ -937,7 +986,7 @@ pub fn migrate_database<E: EthSpec, Hot: ItemStore<E>, Cold: ItemStore<E>>(
store.cold_db.do_atomically(cold_db_ops)?;
// Delete the old summary, and the full state if we lie on an epoch boundary.
hot_db_ops.push(StoreOp::DeleteState(state_root.into(), slot));
hot_db_ops.push(StoreOp::DeleteState(state_root, Some(slot)));
}
// Warning: Critical section. We have to take care not to put any of the two databases in an
@@ -1107,3 +1156,20 @@ impl StoreItem for RestorePointHash {
Ok(Self::from_ssz_bytes(bytes)?)
}
}
#[derive(Debug, Clone, Copy, Default)]
pub struct TemporaryFlag;
impl StoreItem for TemporaryFlag {
fn db_column() -> DBColumn {
DBColumn::BeaconStateTemporary
}
fn as_store_bytes(&self) -> Vec<u8> {
vec![]
}
fn from_store_bytes(_: &[u8]) -> Result<Self, Error> {
Ok(TemporaryFlag)
}
}