mirror of
https://github.com/bitcoinresearchkit/brk.git
synced 2026-06-10 23:13:33 -07:00
indexer: snapshot
This commit is contained in:
@@ -1,63 +1,79 @@
|
||||
# brk_indexer
|
||||
|
||||
Full Bitcoin blockchain indexer for fast analytics queries.
|
||||
Parses and indexes the entire Bitcoin blockchain so you can look up any block, transaction, input, output, or address by index in O(1).
|
||||
|
||||
## What It Enables
|
||||
## How It's Organized
|
||||
|
||||
Transform raw Bitcoin blockchain data into indexed vectors and key-value stores optimized for analytics. Query any block, transaction, address, or UTXO without scanning the chain.
|
||||
Every entity gets a sequential index in blockchain order:
|
||||
|
||||
## Key Features
|
||||
- Block 0, 1, 2, ... → **height**
|
||||
- Transaction 0, 1, 2, ... → **txindex**
|
||||
- Input 0, 1, 2, ... → **txinindex**
|
||||
- Output 0, 1, 2, ... → **txoutindex**
|
||||
- Address 0, 1, 2, ... → **addressindex** (per address type)
|
||||
|
||||
- **Multi-phase block processing**: Parallel TXID computation, input/output processing, sequential finalization
|
||||
- **Address indexing**: Maps addresses to their transaction history and UTXOs per address type
|
||||
- **UTXO tracking**: Live outpoint→value lookups, address→unspent outputs
|
||||
- **Reorg handling**: Automatic rollback to valid chain state on reorganization
|
||||
- **Collision detection**: Validates rapidhash-based prefix lookups against known duplicate TXIDs
|
||||
- **Incremental snapshots**: Periodic checkpoints for crash recovery
|
||||
Data is stored in append-only vectors keyed by these indexes. Each block also stores the first index of each entity type it contains (e.g. `first_txindex`, `first_txoutindex`), so you can find all transactions, inputs, outputs, and addresses in any block in O(1).
|
||||
|
||||
## Core API
|
||||
## What's Indexed
|
||||
|
||||
```rust,ignore
|
||||
let mut indexer = Indexer::forced_import(&outputs_dir)?;
|
||||
### Per Block (keyed by height)
|
||||
|
||||
// Index new blocks
|
||||
let starting_indexes = indexer.index(&blocks, &client, &exit)?;
|
||||
- Block hash, timestamp, difficulty, size, weight
|
||||
|
||||
// Access indexed data
|
||||
let txindex = indexer.stores.txidprefix_to_txindex.get(&txid_prefix)?;
|
||||
let blockhash = indexer.vecs.blocks.blockhash.get(height)?;
|
||||
```
|
||||
### Per Transaction (keyed by txindex)
|
||||
|
||||
## Data Structures
|
||||
- Txid, version, locktime, base size, total size, RBF flag, block height
|
||||
|
||||
**Vecs** (append-only vectors):
|
||||
- `blocks`: `blockhash`, `timestamp`, `difficulty`, `total_size`, `weight`
|
||||
- `transactions`: `txid`, `first_txinindex`, `first_txoutindex`
|
||||
- `inputs`: `outpoint`, `txindex`
|
||||
- `outputs`: `value`, `outputtype`, `typeindex`, `txindex`
|
||||
- `addresses`: Per-type `p2pkhbytes`, `p2shbytes`, `p2wpkhbytes`, etc.
|
||||
### Per Input (keyed by txinindex)
|
||||
|
||||
**Stores** (key-value lookups):
|
||||
- `txidprefix_to_txindex` - TXID lookup via 10-byte prefix
|
||||
- `blockhashprefix_to_height` - Block lookup via 4-byte prefix
|
||||
- `addresstype_to_addresshash_to_addressindex` - Address lookup per type
|
||||
- `addresstype_to_addressindex_and_unspentoutpoint` - Live UTXO set per address
|
||||
- Spent outpoint, containing txindex, and the spent output's type and address index
|
||||
|
||||
## Processing Pipeline
|
||||
### Per Output (keyed by txoutindex)
|
||||
|
||||
1. **Block metadata**: Store blockhash, difficulty, timestamp
|
||||
2. **Compute TXIDs**: Parallel SHA256d across transactions
|
||||
3. **Process inputs**: Lookup spent outpoints, resolve address info
|
||||
4. **Process outputs**: Extract addresses, assign type indexes
|
||||
5. **Finalize**: Sequential store updates, UTXO set mutations
|
||||
6. **Commit**: Periodic flush to disk
|
||||
- Value in satoshis, script type, address index within that type, containing txindex
|
||||
|
||||
Script types: P2PK (compressed/uncompressed), P2PKH, P2SH, P2WPKH, P2WSH, P2TR, P2A, P2MS, OP_RETURN, Empty, Unknown
|
||||
|
||||
### Per Address (keyed by addressindex, one set per type)
|
||||
|
||||
- Raw address bytes (20-65 bytes depending on type: pubkey, pubkey hash, script hash, witness program, etc.)
|
||||
|
||||
Address types each get their own index space: P2PK65, P2PK33, P2PKH, P2SH, P2WPKH, P2WSH, P2TR, P2A
|
||||
|
||||
### Per Non-Address Script (OP_RETURN, P2MS, Empty, Unknown)
|
||||
|
||||
- Containing txindex
|
||||
|
||||
## Key-Value Stores
|
||||
|
||||
On top of the vectors, key-value stores enable lookups that aren't sequential:
|
||||
|
||||
| Store | Purpose |
|
||||
|-------|---------|
|
||||
| txid prefix → txindex | Look up a transaction by its txid |
|
||||
| block hash prefix → height | Look up a block by its hash |
|
||||
| address hash → addressindex | Look up an address (per type) |
|
||||
| addressindex + txindex | All transactions involving an address |
|
||||
| addressindex + outpoint | Unspent outputs for an address (live UTXO set) |
|
||||
| height → coinbase tag | Miner-embedded message per block |
|
||||
|
||||
## How It Works
|
||||
|
||||
1. **Block metadata** — store block hash, difficulty, timestamp, size, weight
|
||||
2. **Compute TXIDs** — parallel SHA256d across all transactions
|
||||
3. **Process outputs** — classify script types, extract addresses, detect new unique addresses
|
||||
4. **Process inputs** — resolve spent outpoints, look up address info
|
||||
5. **Finalize** — update address stores, UTXO set mutations, push all vectors
|
||||
6. **Snapshot** — periodic flush to disk for crash recovery
|
||||
|
||||
Reorg handling is built-in: on chain reorganization, the indexer rolls back to the last valid state.
|
||||
|
||||
## Performance
|
||||
|
||||
| Machine | Time | Disk | Peak Disk | Memory | Peak Memory |
|
||||
|---------|------|------|-----------|--------|-------------|
|
||||
| MBP M3 Pro (36GB, internal SSD) | 3h | 247 GB | 314 GB | 5.2 GB | 11 GB |
|
||||
| Mac Mini M4 (16GB, external SSD) | 4.9h | 233 GB | 303 GB | 5.4 GB | 11 GB |
|
||||
| Version | Machine | Time | Disk | Peak Disk | Memory | Peak Memory |
|
||||
|---------|---------|------|------|-----------|--------|-------------|
|
||||
| v0.2.0-pre | MBP M3 Pro (36GB, internal SSD) | 2h40 | 239 GB | 302 GB | 5.9 GB | 13 GB |
|
||||
| v0.1.0-alpha.0 | Mac Mini M4 (16GB, external SSD) | 4.9h | 233 GB | 303 GB | 5.4 GB | 11 GB |
|
||||
|
||||
Full benchmark data: [bitcoinresearchkit/benches](https://github.com/bitcoinresearchkit/benches/tree/main/brk_indexer)
|
||||
|
||||
@@ -67,8 +83,7 @@ Use [mimalloc v3](https://crates.io/crates/mimalloc) as the global allocator to
|
||||
|
||||
## Built On
|
||||
|
||||
- `vecdb` for append-only vectors
|
||||
- `brk_cohort` for address type handling
|
||||
- `vecdb` for append-only vectors — integer-compressed (`PcoVec`) or raw bytes (`BytesVec`)
|
||||
- `brk_iterator` for block iteration
|
||||
- `brk_store` for key-value storage
|
||||
- `brk_store` for key-value storage (fjall LSM)
|
||||
- `brk_types` for domain types
|
||||
|
||||
@@ -19,7 +19,9 @@ impl<'a> BlockProcessor<'a> {
|
||||
.par_iter()
|
||||
.enumerate()
|
||||
.map(|(index, tx)| {
|
||||
let txid = Txid::from(tx.compute_txid());
|
||||
let (btc_txid, base_size, total_size) =
|
||||
self.block.compute_tx_id_and_sizes(index);
|
||||
let txid = Txid::from(btc_txid);
|
||||
let txid_prefix = TxidPrefix::from(&txid);
|
||||
|
||||
let prev_txindex_opt = if will_check_collisions {
|
||||
@@ -37,8 +39,8 @@ impl<'a> BlockProcessor<'a> {
|
||||
txid,
|
||||
txid_prefix,
|
||||
prev_txindex_opt,
|
||||
base_size: tx.base_size() as u32,
|
||||
total_size: tx.total_size() as u32,
|
||||
base_size,
|
||||
total_size,
|
||||
})
|
||||
})
|
||||
.collect()
|
||||
|
||||
@@ -70,7 +70,7 @@ impl<'a> BlockProcessor<'a> {
|
||||
let prev_addressbytes = self.vecs.get_addressbytes_by_type(
|
||||
addresstype,
|
||||
typeindex,
|
||||
self.readers.addressbytes.get_unwrap(addresstype),
|
||||
&self.readers.addressbytes,
|
||||
)
|
||||
.ok_or(Error::Internal("Missing addressbytes"))?;
|
||||
|
||||
|
||||
@@ -1,43 +1,55 @@
|
||||
use brk_cohort::ByAddressType;
|
||||
use vecdb::Reader;
|
||||
use brk_types::{
|
||||
OutputType, P2AAddressIndex, P2ABytes, P2PK33AddressIndex, P2PK33Bytes, P2PK65AddressIndex,
|
||||
P2PK65Bytes, P2PKHAddressIndex, P2PKHBytes, P2SHAddressIndex, P2SHBytes, P2TRAddressIndex,
|
||||
P2TRBytes, P2WPKHAddressIndex, P2WPKHBytes, P2WSHAddressIndex, P2WSHBytes, TxIndex,
|
||||
TxOutIndex, Txid, TypeIndex,
|
||||
};
|
||||
use vecdb::{BytesStrategy, VecReader};
|
||||
|
||||
use crate::Vecs;
|
||||
|
||||
pub struct AddressReaders {
|
||||
pub p2pk65: VecReader<P2PK65AddressIndex, P2PK65Bytes, BytesStrategy<P2PK65Bytes>>,
|
||||
pub p2pk33: VecReader<P2PK33AddressIndex, P2PK33Bytes, BytesStrategy<P2PK33Bytes>>,
|
||||
pub p2pkh: VecReader<P2PKHAddressIndex, P2PKHBytes, BytesStrategy<P2PKHBytes>>,
|
||||
pub p2sh: VecReader<P2SHAddressIndex, P2SHBytes, BytesStrategy<P2SHBytes>>,
|
||||
pub p2wpkh: VecReader<P2WPKHAddressIndex, P2WPKHBytes, BytesStrategy<P2WPKHBytes>>,
|
||||
pub p2wsh: VecReader<P2WSHAddressIndex, P2WSHBytes, BytesStrategy<P2WSHBytes>>,
|
||||
pub p2tr: VecReader<P2TRAddressIndex, P2TRBytes, BytesStrategy<P2TRBytes>>,
|
||||
pub p2a: VecReader<P2AAddressIndex, P2ABytes, BytesStrategy<P2ABytes>>,
|
||||
}
|
||||
|
||||
/// Readers for vectors that need to be accessed during block processing.
|
||||
/// These provide consistent snapshots for reading while the main vectors are being modified.
|
||||
///
|
||||
/// All fields use `VecReader` which caches the mmap base pointer for O(1)
|
||||
/// random access without recomputing `region.start() + HEADER_OFFSET` per read.
|
||||
pub struct Readers {
|
||||
pub txid: Reader,
|
||||
pub txindex_to_first_txoutindex: Reader,
|
||||
pub txoutindex_to_outputtype: Reader,
|
||||
pub txoutindex_to_typeindex: Reader,
|
||||
pub addressbytes: ByAddressType<Reader>,
|
||||
pub txid: VecReader<TxIndex, Txid, BytesStrategy<Txid>>,
|
||||
pub txindex_to_first_txoutindex:
|
||||
VecReader<TxIndex, TxOutIndex, BytesStrategy<TxOutIndex>>,
|
||||
pub txoutindex_to_outputtype:
|
||||
VecReader<TxOutIndex, OutputType, BytesStrategy<OutputType>>,
|
||||
pub txoutindex_to_typeindex:
|
||||
VecReader<TxOutIndex, TypeIndex, BytesStrategy<TypeIndex>>,
|
||||
pub addressbytes: AddressReaders,
|
||||
}
|
||||
|
||||
impl Readers {
|
||||
pub fn new(vecs: &Vecs) -> Self {
|
||||
Self {
|
||||
txid: vecs.transactions.txid.create_reader(),
|
||||
txindex_to_first_txoutindex: vecs.transactions.first_txoutindex.create_reader(),
|
||||
txoutindex_to_outputtype: vecs.outputs.outputtype.create_reader(),
|
||||
txoutindex_to_typeindex: vecs.outputs.typeindex.create_reader(),
|
||||
addressbytes: ByAddressType {
|
||||
p2pk65: vecs
|
||||
.addresses
|
||||
.p2pk65bytes
|
||||
.create_reader(),
|
||||
p2pk33: vecs
|
||||
.addresses
|
||||
.p2pk33bytes
|
||||
.create_reader(),
|
||||
p2pkh: vecs.addresses.p2pkhbytes.create_reader(),
|
||||
p2sh: vecs.addresses.p2shbytes.create_reader(),
|
||||
p2wpkh: vecs
|
||||
.addresses
|
||||
.p2wpkhbytes
|
||||
.create_reader(),
|
||||
p2wsh: vecs.addresses.p2wshbytes.create_reader(),
|
||||
p2tr: vecs.addresses.p2trbytes.create_reader(),
|
||||
p2a: vecs.addresses.p2abytes.create_reader(),
|
||||
txid: vecs.transactions.txid.reader(),
|
||||
txindex_to_first_txoutindex: vecs.transactions.first_txoutindex.reader(),
|
||||
txoutindex_to_outputtype: vecs.outputs.outputtype.reader(),
|
||||
txoutindex_to_typeindex: vecs.outputs.typeindex.reader(),
|
||||
addressbytes: AddressReaders {
|
||||
p2pk65: vecs.addresses.p2pk65bytes.reader(),
|
||||
p2pk33: vecs.addresses.p2pk33bytes.reader(),
|
||||
p2pkh: vecs.addresses.p2pkhbytes.reader(),
|
||||
p2sh: vecs.addresses.p2shbytes.reader(),
|
||||
p2wpkh: vecs.addresses.p2wpkhbytes.reader(),
|
||||
p2wsh: vecs.addresses.p2wshbytes.reader(),
|
||||
p2tr: vecs.addresses.p2trbytes.reader(),
|
||||
p2a: vecs.addresses.p2abytes.reader(),
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
@@ -8,10 +8,11 @@ use brk_types::{
|
||||
};
|
||||
use rayon::prelude::*;
|
||||
use vecdb::{
|
||||
AnyStoredVec, BytesVec, Database, WritableVec, ImportableVec, PcoVec, Reader, ReadableVec,
|
||||
AnyStoredVec, BytesVec, Database, WritableVec, ImportableVec, PcoVec, ReadableVec,
|
||||
Stamp, VecIndex,
|
||||
};
|
||||
|
||||
use crate::AddressReaders;
|
||||
use crate::parallel_import;
|
||||
|
||||
#[derive(Clone, Traversable)]
|
||||
@@ -164,46 +165,46 @@ impl AddressesVecs {
|
||||
.into_par_iter()
|
||||
}
|
||||
|
||||
/// Get address bytes by output type, using the reader for the specific address type.
|
||||
/// Get address bytes by output type, using the cached VecReader for the specific address type.
|
||||
/// Returns None if the index doesn't exist yet.
|
||||
pub fn get_bytes_by_type(
|
||||
&self,
|
||||
addresstype: OutputType,
|
||||
typeindex: TypeIndex,
|
||||
reader: &Reader,
|
||||
readers: &AddressReaders,
|
||||
) -> Option<AddressBytes> {
|
||||
match addresstype {
|
||||
OutputType::P2PK65 => self
|
||||
.p2pk65bytes
|
||||
.get_pushed_or_read(typeindex.into(), reader)
|
||||
.get_pushed_or_read(typeindex.into(), &readers.p2pk65)
|
||||
.map(AddressBytes::from),
|
||||
OutputType::P2PK33 => self
|
||||
.p2pk33bytes
|
||||
.get_pushed_or_read(typeindex.into(), reader)
|
||||
.get_pushed_or_read(typeindex.into(), &readers.p2pk33)
|
||||
.map(AddressBytes::from),
|
||||
OutputType::P2PKH => self
|
||||
.p2pkhbytes
|
||||
.get_pushed_or_read(typeindex.into(), reader)
|
||||
.get_pushed_or_read(typeindex.into(), &readers.p2pkh)
|
||||
.map(AddressBytes::from),
|
||||
OutputType::P2SH => self
|
||||
.p2shbytes
|
||||
.get_pushed_or_read(typeindex.into(), reader)
|
||||
.get_pushed_or_read(typeindex.into(), &readers.p2sh)
|
||||
.map(AddressBytes::from),
|
||||
OutputType::P2WPKH => self
|
||||
.p2wpkhbytes
|
||||
.get_pushed_or_read(typeindex.into(), reader)
|
||||
.get_pushed_or_read(typeindex.into(), &readers.p2wpkh)
|
||||
.map(AddressBytes::from),
|
||||
OutputType::P2WSH => self
|
||||
.p2wshbytes
|
||||
.get_pushed_or_read(typeindex.into(), reader)
|
||||
.get_pushed_or_read(typeindex.into(), &readers.p2wsh)
|
||||
.map(AddressBytes::from),
|
||||
OutputType::P2TR => self
|
||||
.p2trbytes
|
||||
.get_pushed_or_read(typeindex.into(), reader)
|
||||
.get_pushed_or_read(typeindex.into(), &readers.p2tr)
|
||||
.map(AddressBytes::from),
|
||||
OutputType::P2A => self
|
||||
.p2abytes
|
||||
.get_pushed_or_read(typeindex.into(), reader)
|
||||
.get_pushed_or_read(typeindex.into(), &readers.p2a)
|
||||
.map(AddressBytes::from),
|
||||
_ => unreachable!("get_bytes_by_type called with non-address type"),
|
||||
}
|
||||
|
||||
@@ -4,7 +4,9 @@ use brk_error::Result;
|
||||
use brk_traversable::Traversable;
|
||||
use brk_types::{AddressBytes, AddressHash, Height, OutputType, TypeIndex, Version};
|
||||
use rayon::prelude::*;
|
||||
use vecdb::{AnyStoredVec, Database, Reader, Stamp};
|
||||
use vecdb::{AnyStoredVec, Database, Stamp};
|
||||
|
||||
use crate::AddressReaders;
|
||||
|
||||
const PAGE_SIZE: usize = 4096;
|
||||
|
||||
@@ -150,10 +152,10 @@ impl Vecs {
|
||||
&self,
|
||||
addresstype: OutputType,
|
||||
typeindex: TypeIndex,
|
||||
reader: &Reader,
|
||||
readers: &AddressReaders,
|
||||
) -> Option<AddressBytes> {
|
||||
self.addresses
|
||||
.get_bytes_by_type(addresstype, typeindex, reader)
|
||||
.get_bytes_by_type(addresstype, typeindex, readers)
|
||||
}
|
||||
|
||||
pub fn push_bytes_if_needed(&mut self, index: TypeIndex, bytes: AddressBytes) -> Result<()> {
|
||||
|
||||
Reference in New Issue
Block a user