How SLAF Works¶
SLAF (Sparse Lazy Array Format) is a high-performance format for single-cell data that combines the power of SQL with lazy evaluation. It's designed to solve the performance and scalability challenges of traditional single-cell data formats.
Key Design Principles¶
1. SQL-Native Design with Relational Schema¶
SLAF is built on a SQL-native relational schema that enables direct SQL queries while providing lazy AnnData/Scanpy equivalences for seamless migration:
Relational Schema¶
SLAF stores single-cell data in three core tables:
cells
table: Cell metadata, QC metrics, and annotations withcell_id
(string) andcell_integer_id
(integer)genes
table: Gene metadata, annotations, and feature information withgene_id
(string) andgene_integer_id
(integer)expression
table: Sparse expression matrix withcell_integer_id
,gene_integer_id
, andvalue
columns
The expression table uses integer IDs for efficiency, so you need to JOIN with metadata tables to get string identifiers.
This relational design enables direct SQL queries for everything:
# Direct SQL for complex aggregations
results = slaf.query("""
SELECT cell_type,
COUNT(*) as cell_count,
AVG(total_counts) as avg_counts,
SUM(e.value) as total_expression
FROM cells c
JOIN expression e ON c.cell_integer_id = e.cell_integer_id
WHERE batch = 'batch1' AND n_genes_by_counts >= 500
GROUP BY cell_type
ORDER BY cell_count DESC
""")
# Window functions for advanced analysis
ranked_genes = slaf.query("""
SELECT g.gene_id,
c.cell_type,
e.value,
ROW_NUMBER() OVER (
PARTITION BY c.cell_type
ORDER BY e.value DESC
) as rank
FROM expression e
JOIN cells c ON e.cell_integer_id = c.cell_integer_id
JOIN genes g ON e.gene_integer_id = g.gene_integer_id
WHERE g.gene_id IN ('MS4A1', 'CD3D', 'CD8A')
""")
Lazy AnnData/Scanpy Equivalences¶
For users migrating from Scanpy, SLAF provides drop-in lazy equivalents:
# Load as LazyAnnData (no data loaded yet)
adata = read_slaf("data.slaf")
# Use familiar Scanpy-style operations
subset = adata[adata.obs.cell_type == "T cells", :] # This is lazy
# Access expression data lazily
expression = subset.X.compute() # Only loads the subset
# Use Scanpy preprocessing (lazy)
from slaf.scanpy import pp
pp.normalize_total(adata, target_sum=1e4, inplace=True)
pp.log1p(adata)
pp.highly_variable_genes(adata)
expression = adata[cell_ids, gene_ids].X.compute()
Seamless Switching Between Interfaces¶
You can switch between SQL and AnnData interfaces as needed:
# Start with AnnData interface
lazy_adata = read_slaf("data.slaf")
t_cells = lazy_adata[lazy_adata.obs.cell_type == "T cells", :]
# Switch to SQL for complex operations
t_cells_slaf = t_cells.slaf # Access underlying SLAFArray object
complex_query_result = t_cells_slaf.query("""
SELECT g.gene_id,
COUNT(*) as expressing_cells,
AVG(e.value) as mean_expression
FROM expression e
JOIN genes g ON e.gene_integer_id = g.gene_integer_id
WHERE e.cell_integer_id IN (
SELECT cell_integer_id FROM cells
WHERE cell_type = 'T cells'
)
GROUP BY g.gene_id
HAVING expressing_cells >= 10
ORDER BY mean_expression DESC
""")
# Back to AnnData for visualization
import scanpy as sc
t_cells_as_adata = t_cells.compute() # Convert to native scanpy
sc.pl.umap(t_cells_as_adata, color='leiden')
Benefits:
- SQL-native: Direct access to relational database capabilities
- SQL-native: Complex aggregations and window functions
- Migration-friendly: Drop-in replacement for existing Scanpy workflows
- Flexible: Switch between SQL and AnnData interfaces as needed
2. Polars-Like: OLAP Database with Pushdown Filters¶
SLAF leverages modern OLAP databases and pushdown filters on optimized storage formats rather than in-memory operations:
cells
table: Cell metadata and QC metricsgenes
table: Gene metadata and annotationsexpression
table: Sparse expression matrix data
Like Polars, SLAF pushes complex operations down to the query engine:
# Metadata-only filtering without loading expression data
filtered_cells = slaf.filter_cells(n_genes_by_counts=">=500")
high_quality = slaf.filter_cells(
n_genes_by_counts=">=1000",
pct_counts_mt="<=10"
)
Benefits:
- Memory efficient: Only load metadata when filtering
- Faster metadata filtering vs h5ad as datasets scale
3. Zarr-Like: Lazy Loading of Sparse Matrices¶
SLAF provides lazy loading of sparse matrices from cloud storage with concurrent access patterns:
# No data loaded yet - just metadata
adata = read_slaf("data.slaf")
# Lazy slicing like Zarr chunked arrays
subset = adata[adata.obs.cell_type == "T cells", :]
single_cell = adata.get_cell_expression("AAACCTGAGAAACCAT-1")
gene_expression = adata.get_gene_expression("MS4A1")
# Data loaded only when .compute() is called
expression = subset.X.compute()
Benefits:
- Memory efficient for submatrix operations
- Concurrent access: Multiple readers can access different slices
- Cloud-native: Direct access to data in S3/GCS without downloading
- Chunked processing: Handle datasets larger than RAM
4. Dask-Like: Lazy Computation Graphs¶
SLAF enables building computational graphs of operations that execute lazily on demand:
# Build lazy computation graph
adata = LazyAnnData("data.slaf")
# Each operation is lazy - no data loaded yet
pp.calculate_qc_metrics(adata, inplace=True)
pp.filter_cells(adata, min_counts=500, min_genes=200, inplace=True)
pp.normalize_total(adata, target_sum=1e4, inplace=True)
pp.log1p(adata)
# Execute only on the slice of interest
expression = adata.X[cell_ids, gene_ids].compute()
Benefits:
- Complex pipelines: Build preprocessing workflows impossible with eager processing
- Composable: Chain operations without intermediate materialization
- Memory efficient: Only process the slice you need
- Scalable: Handle datasets that would cause memory explosions
5. Advanced Query Optimization¶
SLAF includes sophisticated query optimization to overcome current limitations of LanceDB:
# Adaptive batching for large scattered ID sets
batched_query = QueryOptimizer.build_optimized_query(
entity_ids=large_id_list,
entity_type="cell",
use_adaptive_batching=True
)
# CTE optimization for complex queries
cte_query = QueryOptimizer.build_cte_query(
entity_ids=scattered_ids,
entity_type="gene"
)
Key optimizations:
- Submatrix optimization: Efficient slicing for complex selectors
- Adaptive batching: Optimize query patterns based on ID distribution
- Range vs IN optimization: Choose BETWEEN vs IN clauses intelligently
6. Foundation Model Training Support¶
SLAF's versatile SQL combined with OLAP-optimized query engine enables window function queries that directly support tokenization and streaming dataloaders:
# Streaming tokenization for transformer models
from slaf.ml.dataloaders import SLAFDataLoader
# Create production-ready DataLoader
dataloader = SLAFDataLoader(
slaf_array=slaf_array,
tokenizer_type="geneformer",
batch_size=32,
max_genes=2048,
vocab_size=50000
)
# Stream batches for training
for batch in dataloader:
input_ids = batch["input_ids"] # Already tokenized
attention_mask = batch["attention_mask"]
cell_ids = batch["cell_ids"]
# Your training code here
# High-throughput dataloading
# 15k cells/sec peak throughput
# 30M tokens/sec for large batches
Benefits:
- Streaming architecture: Supports asynchronous pre-fetching and concurrent streaming
- GPU-optimized: Batch sizes up to 2048 cells with high GPU utilization
- Multi-node ready: Shard-aware streaming for distributed training
- Foundation model support: Direct integration with scGPT, Geneformer, etc.
Comparison with Other Formats¶
Feature | SLAF | H5AD | Zarr | SOMA |
---|---|---|---|---|
Storage | ||||
Cloud-Native | ✅ | ❌ | ✅ | ✅ |
Sparse Arrays | ✅ | ✅ | ❌ | ✅ |
Chunked Reads | ✅ | ❌ | ✅ | ✅ |
Schema Evolution | ✅ | ❌ | ✅ | ✅ |
Compute | ||||
SQL Queries | ✅ | ❌ | ❌ | ❌ |
Optimized Query Engine | ✅ | ❌ | ❌ | ✅ |
Random Access | ✅ | ✅ | ✅ | ✅ |
Use Cases | ||||
Scanpy Integration | ✅ | ✅ | ✅ | ❌ |
Lazy Computation | ✅ | ❌ | ❌ | ❌ |
Tokenizers | ✅ | ❌ | ❌ | ❌ |
Dataloaders | ✅ | ❌ | ❌ | ❌ |
Embeddings Support | 🔄 | ❌ | ❌ | ❌ |
Visualization Support | 🔄 | ✅ | ❌ | ❌ |
Legend:
- ✅ = Supported
- ❌ = Not supported
- 🔄 = Coming soon
Next Steps¶
- Learn about Migrating to SLAF
- Explore SQL Queries
- See Examples for real-world usage