How SLAF Works¶
SLAF (Sparse Lazy Array Format) is a high-performance format for single-cell data that combines the power of SQL with lazy evaluation. It's designed to solve the performance and scalability challenges of traditional single-cell data formats.
Key Design Principles¶
1. SQL-Native Design with Relational Schema¶
SLAF is built on a SQL-native relational schema that enables direct SQL queries while providing lazy AnnData/Scanpy equivalences for seamless migration:
Relational Schema¶
SLAF stores single-cell data in three core tables:
- `cells` table: Cell metadata, QC metrics, and annotations, with `cell_id` (string) and `cell_integer_id` (integer)
- `genes` table: Gene metadata, annotations, and feature information, with `gene_id` (string) and `gene_integer_id` (integer)
- `expression` table: Sparse expression matrix with `cell_integer_id`, `gene_integer_id`, and `value` columns
The expression table uses integer IDs for efficiency, so you need to JOIN with metadata tables to get string identifiers.
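For instance, a minimal lookup join (a sketch that uses only the `slaf.query` entry point and the schema described above) maps integer IDs back to string identifiers:
# Recover string identifiers for a handful of raw expression records
id_mapped = slaf.query("""
    SELECT c.cell_id, g.gene_id, e.value
    FROM expression e
    JOIN cells c ON e.cell_integer_id = c.cell_integer_id
    JOIN genes g ON e.gene_integer_id = g.gene_integer_id
    LIMIT 10
""")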
This relational design enables direct SQL queries for everything:
# Direct SQL for complex aggregations
results = slaf.query("""
    SELECT cell_type,
           COUNT(*) as cell_count,
           AVG(total_counts) as avg_counts,
           SUM(e.value) as total_expression
    FROM cells c
    JOIN expression e ON c.cell_integer_id = e.cell_integer_id
    WHERE batch = 'batch1' AND n_genes_by_counts >= 500
    GROUP BY cell_type
    ORDER BY cell_count DESC
""")
# Window functions for advanced analysis
ranked_genes = slaf.query("""
    SELECT g.gene_id,
           c.cell_type,
           e.value,
           ROW_NUMBER() OVER (
               PARTITION BY c.cell_type
               ORDER BY e.value DESC
           ) as rank
    FROM expression e
    JOIN cells c ON e.cell_integer_id = c.cell_integer_id
    JOIN genes g ON e.gene_integer_id = g.gene_integer_id
    WHERE g.gene_id IN ('MS4A1', 'CD3D', 'CD8A')
""")
Lazy AnnData/Scanpy Equivalences¶
For users migrating from Scanpy, SLAF provides drop-in lazy equivalents:
# Load as LazyAnnData (no data loaded yet)
adata = read_slaf("data.slaf")
# Use familiar Scanpy-style operations
subset = adata[adata.obs.cell_type == "T cells", :] # This is lazy
# Access expression data lazily
expression = subset.X.compute() # Only loads the subset
# Use Scanpy preprocessing (lazy)
from slaf.scanpy import pp
pp.normalize_total(adata, target_sum=1e4, inplace=True)
pp.log1p(adata)
pp.highly_variable_genes(adata)
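# Materialize only the slice of interest (cell_ids / gene_ids are example ID lists)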
expression = adata[cell_ids, gene_ids].X.compute()
Seamless Switching Between Interfaces¶
You can switch between SQL and AnnData interfaces as needed:
# Start with AnnData interface
lazy_adata = read_slaf("data.slaf")
t_cells = lazy_adata[lazy_adata.obs.cell_type == "T cells", :]
# Switch to SQL for complex operations
t_cells_slaf = t_cells.slaf # Access underlying SLAFArray object
complex_query_result = t_cells_slaf.query("""
    SELECT g.gene_id,
           COUNT(*) as expressing_cells,
           AVG(e.value) as mean_expression
    FROM expression e
    JOIN genes g ON e.gene_integer_id = g.gene_integer_id
    WHERE e.cell_integer_id IN (
        SELECT cell_integer_id FROM cells
        WHERE cell_type = 'T cells'
    )
    GROUP BY g.gene_id
    HAVING expressing_cells >= 10
    ORDER BY mean_expression DESC
""")
# Back to AnnData for visualization
import scanpy as sc
t_cells_as_adata = t_cells.compute()  # Convert to an in-memory AnnData for plotting
sc.pl.umap(t_cells_as_adata, color='leiden')
Benefits:
- SQL-native: Direct access to relational database capabilities, including complex aggregations and window functions
- Migration-friendly: Drop-in replacement for existing Scanpy workflows
- Flexible: Switch between SQL and AnnData interfaces as needed
2. Polars-Like: OLAP Database with Pushdown Filters¶
SLAF leverages modern OLAP databases and pushdown filters on optimized storage formats rather than in-memory operations:
- `cells` table: Cell metadata and QC metrics
- `genes` table: Gene metadata and annotations
- `expression` table: Sparse expression matrix data
Like Polars, SLAF pushes complex operations down to the query engine:
# Metadata-only filtering without loading expression data
filtered_cells = slaf.filter_cells(n_genes_by_counts=">=500")
high_quality = slaf.filter_cells(
    n_genes_by_counts=">=1000",
    pct_counts_mt="<=10"
)
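Conceptually, a filter like this pushes down to a metadata-only scan. A roughly equivalent query (an illustrative sketch, not the engine's actual plan) touches only the cells table and never reads expression data:
# Metadata-only scan: the expression table is never touched
high_quality_cells = slaf.query("""
    SELECT cell_id, cell_type, n_genes_by_counts, pct_counts_mt
    FROM cells
    WHERE n_genes_by_counts >= 1000
      AND pct_counts_mt <= 10
""")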
Benefits:
- Memory efficient: Only load metadata when filtering
- Scalable: Metadata filtering remains faster than h5ad as datasets grow
3. Zarr-Like: Lazy Loading of Sparse Matrices¶
SLAF provides lazy loading of sparse matrices from cloud storage with concurrent access patterns:
# No data loaded yet - just metadata
adata = read_slaf("data.slaf")
# Lazy slicing like Zarr chunked arrays
subset = adata[adata.obs.cell_type == "T cells", :]
single_cell = adata.get_cell_expression("AAACCTGAGAAACCAT-1")
gene_expression = adata.get_gene_expression("MS4A1")
# Data loaded only when .compute() is called
expression = subset.X.compute()
Benefits:
- Memory efficient for submatrix operations
- Concurrent access: Multiple readers can access different slices
- Cloud-native: Direct access to data in S3/GCS without downloading
- Chunked processing: Handle datasets larger than RAM
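The last point deserves a concrete sketch. Assuming LazyAnnData exposes `shape` and AnnData-style integer-range slicing (an assumption, not shown above), chunked processing of a dataset larger than RAM might look like:
# Process cells in fixed-size windows so peak memory stays bounded
# (assumes integer-range slicing on LazyAnnData, AnnData-style)
n_cells = adata.shape[0]
chunk_size = 50_000  # hypothetical chunk size; tune to available memory
for start in range(0, n_cells, chunk_size):
    chunk = adata[start : start + chunk_size, :].X.compute()
    # ... per-chunk computation goes here (e.g. summaries, feature counts) ...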
4. Dask-Like: Lazy Computation Graphs¶
SLAF enables building computational graphs of operations that execute lazily on demand:
# Build lazy computation graph
adata = LazyAnnData("data.slaf")
# Each operation is lazy - no data loaded yet
pp.calculate_qc_metrics(adata, inplace=True)
pp.filter_cells(adata, min_counts=500, min_genes=200, inplace=True)
pp.normalize_total(adata, target_sum=1e4, inplace=True)
pp.log1p(adata)
# Execute only on the slice of interest
expression = adata.X[cell_ids, gene_ids].compute()
Benefits:
- Complex pipelines: Build preprocessing workflows that would be impractical with eager execution
- Composable: Chain operations without intermediate materialization
- Memory efficient: Only process the slice you need
- Scalable: Handle datasets that would cause memory explosions
5. Advanced Query Optimization¶
SLAF includes sophisticated query optimization to overcome current limitations of LanceDB:
# Adaptive batching for large scattered ID sets
batched_query = QueryOptimizer.build_optimized_query(
    entity_ids=large_id_list,
    entity_type="cell",
    use_adaptive_batching=True
)
# CTE optimization for complex queries
cte_query = QueryOptimizer.build_cte_query(
    entity_ids=scattered_ids,
    entity_type="gene"
)
Key optimizations:
- Submatrix optimization: Efficient slicing for complex selectors
- Adaptive batching: Optimize query patterns based on ID distribution
- Range vs IN optimization: Choose BETWEEN vs IN clauses intelligently
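To make the range-vs-IN idea concrete, here is an illustrative sketch (the function name build_id_predicate is hypothetical; this is not the actual QueryOptimizer implementation): contiguous runs of integer IDs become BETWEEN clauses, while scattered IDs fall back to an IN list:
def build_id_predicate(ids, column="cell_integer_id", min_run=3):
    """Illustrative sketch: build a compact SQL predicate from a set of integer IDs."""
    ids = sorted(set(ids))
    runs, start = [], ids[0]
    for prev, curr in zip(ids, ids[1:] + [None]):
        if curr != prev + 1:  # run ends here
            runs.append((start, prev))
            start = curr
    clauses, scattered = [], []
    for lo, hi in runs:
        if hi - lo + 1 >= min_run:
            clauses.append(f"{column} BETWEEN {lo} AND {hi}")  # contiguous run
        else:
            scattered.extend(range(lo, hi + 1))  # too short to be worth a range
    if scattered:
        clauses.append(f"{column} IN ({', '.join(map(str, scattered))})")
    return " OR ".join(clauses)

# build_id_predicate([1, 2, 3, 4, 10, 42])
# -> "cell_integer_id BETWEEN 1 AND 4 OR cell_integer_id IN (10, 42)"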
6. Foundation Model Training Support¶
SLAF's SQL interface, combined with an OLAP-optimized query engine, enables window-function queries that directly support tokenization and streaming dataloaders:
# Streaming tokenization for transformer models
from slaf.ml.dataloaders import SLAFDataLoader
# Create production-ready DataLoader
dataloader = SLAFDataLoader(
    slaf_array=slaf_array,
    tokenizer_type="geneformer",
    batch_size=32,
    max_genes=2048,
    vocab_size=50000
)
# Stream batches for training
for batch in dataloader:
    input_ids = batch["input_ids"]  # Already tokenized
    attention_mask = batch["attention_mask"]
    cell_ids = batch["cell_ids"]
    # Your training code here
The streaming dataloader delivers high throughput: 15k cells/sec peak and 30M tokens/sec for large batches.
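Under the hood, rank-based tokenization (as used by Geneformer-style models) maps naturally onto a window-function query. A hedged sketch, using the schema above (the dataloader's actual internal query may differ), ranks genes within each cell by expression and keeps at most `max_genes` per cell:
# Illustrative sketch: rank genes within each cell by expression;
# the rank order becomes the token sequence
ranked = slaf.query("""
    SELECT cell_integer_id, gene_integer_id, gene_rank
    FROM (
        SELECT cell_integer_id,
               gene_integer_id,
               ROW_NUMBER() OVER (
                   PARTITION BY cell_integer_id
                   ORDER BY value DESC
               ) as gene_rank
        FROM expression
    ) AS ranked_expression
    WHERE gene_rank <= 2048  -- keep at most max_genes genes per cell
""")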
Benefits:
- Streaming architecture: Supports asynchronous pre-fetching and concurrent streaming
- GPU-optimized: Batch sizes up to 2048 cells with high GPU utilization
- Multi-node ready: Shard-aware streaming for distributed training
- Foundation model support: Direct integration with scGPT, Geneformer, etc.
Comparison with Other Formats¶
| Feature | SLAF | H5AD | Zarr | SOMA |
|---|---|---|---|---|
| Storage | | | | |
| Cloud-Native | ✅ | ❌ | ✅ | ✅ |
| Sparse Arrays | ✅ | ✅ | ❌ | ✅ |
| Chunked Reads | ✅ | ❌ | ✅ | ✅ |
| Schema Evolution | ✅ | ❌ | ✅ | ✅ |
| Compute | | | | |
| SQL Queries | ✅ | ❌ | ❌ | ❌ |
| Optimized Query Engine | ✅ | ❌ | ❌ | ✅ |
| Random Access | ✅ | ✅ | ✅ | ✅ |
| Use Cases | | | | |
| Scanpy Integration | ✅ | ✅ | ✅ | ❌ |
| Lazy Computation | ✅ | ❌ | ❌ | ❌ |
| Tokenizers | ✅ | ❌ | ❌ | ❌ |
| Dataloaders | ✅ | ❌ | ❌ | ❌ |
| Embeddings Support | 🔄 | ❌ | ❌ | ❌ |
| Visualization Support | 🔄 | ✅ | ❌ | ❌ |
Legend:
- ✅ = Supported
- ❌ = Not supported
- 🔄 = Coming soon
Next Steps¶
- Learn about Migrating to SLAF
- Explore SQL Queries
- See Examples for real-world usage