
How SLAF Works

SLAF (Sparse Lazy Array Format) is a high-performance format for single-cell data that combines the power of SQL with lazy evaluation. It's designed to solve the performance and scalability challenges of traditional single-cell data formats.

Key Design Principles

1. SQL-Native Design with Relational Schema

SLAF is built on a SQL-native relational schema that enables direct SQL queries while providing lazy AnnData/Scanpy equivalences for seamless migration:

Relational Schema

SLAF stores single-cell data in three core tables:

  • cells table: Cell metadata, QC metrics, and annotations with cell_id (string) and cell_integer_id (integer)
  • genes table: Gene metadata, annotations, and feature information with gene_id (string) and gene_integer_id (integer)
  • expression table: Sparse expression matrix with cell_integer_id, gene_integer_id, and value columns

The expression table uses integer IDs for efficiency, so you need to JOIN with metadata tables to get string identifiers.
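
For example, recovering string identifiers from expression records is a straightforward three-way JOIN (a minimal sketch using the schema above):

# Map integer IDs back to string identifiers for a few expression records
records = slaf.query("""
    SELECT c.cell_id, g.gene_id, e.value
    FROM expression e
    JOIN cells c ON e.cell_integer_id = c.cell_integer_id
    JOIN genes g ON e.gene_integer_id = g.gene_integer_id
    LIMIT 5
""")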

This relational design enables direct SQL queries for everything:

# Direct SQL for complex aggregations
results = slaf.query("""
    SELECT cell_type,
           COUNT(*) as cell_count,
           AVG(total_counts) as avg_counts,
           SUM(e.value) as total_expression
    FROM cells c
    JOIN expression e ON c.cell_integer_id = e.cell_integer_id
    WHERE batch = 'batch1' AND n_genes_by_counts >= 500
    GROUP BY cell_type
    ORDER BY cell_count DESC
""")

# Window functions for advanced analysis
ranked_genes = slaf.query("""
    SELECT g.gene_id,
           c.cell_type,
           e.value,
           ROW_NUMBER() OVER (
               PARTITION BY c.cell_type
               ORDER BY e.value DESC
           ) as rank
    FROM expression e
    JOIN cells c ON e.cell_integer_id = c.cell_integer_id
    JOIN genes g ON e.gene_integer_id = g.gene_integer_id
    WHERE g.gene_id IN ('MS4A1', 'CD3D', 'CD8A')
""")

Lazy AnnData/Scanpy Equivalences

For users migrating from Scanpy, SLAF provides drop-in lazy equivalents:

# Load as LazyAnnData (no data loaded yet)
adata = read_slaf("data.slaf")

# Use familiar Scanpy-style operations
subset = adata[adata.obs.cell_type == "T cells", :] # This is lazy

# Access expression data lazily
expression = subset.X.compute()  # Only loads the subset

# Use Scanpy preprocessing (lazy)
from slaf.scanpy import pp
pp.normalize_total(adata, target_sum=1e4, inplace=True)
pp.log1p(adata)
pp.highly_variable_genes(adata)
expression = adata[cell_ids, gene_ids].X.compute()

Seamless Switching Between Interfaces

You can switch between SQL and AnnData interfaces as needed:

# Start with AnnData interface
lazy_adata = read_slaf("data.slaf")
t_cells = lazy_adata[lazy_adata.obs.cell_type == "T cells", :]

# Switch to SQL for complex operations
t_cells_slaf = t_cells.slaf  # Access underlying SLAFArray object
complex_query_result = t_cells_slaf.query("""
    SELECT g.gene_id,
           COUNT(*) as expressing_cells,
           AVG(e.value) as mean_expression
    FROM expression e
    JOIN genes g ON e.gene_integer_id = g.gene_integer_id
    WHERE e.cell_integer_id IN (
        SELECT cell_integer_id FROM cells
        WHERE cell_type = 'T cells'
    )
    GROUP BY g.gene_id
    HAVING expressing_cells >= 10
    ORDER BY mean_expression DESC
""")

# Back to AnnData for visualization
import scanpy as sc
t_cells_as_adata = t_cells.compute()  # Materialize as an in-memory AnnData
sc.pl.umap(t_cells_as_adata, color='leiden')

Benefits:

  • SQL-native: Direct access to relational database capabilities, including complex aggregations and window functions
  • Migration-friendly: Drop-in replacement for existing Scanpy workflows
  • Flexible: Switch between SQL and AnnData interfaces as needed

2. Polars-Like: OLAP Database with Pushdown Filters

SLAF leverages a modern OLAP query engine and pushdown filters on an optimized storage format rather than in-memory operations. Queries against the three core tables are pushed down to the engine:

  • cells table: Cell metadata and QC metrics
  • genes table: Gene metadata and annotations
  • expression table: Sparse expression matrix data

Like Polars, SLAF pushes complex operations down to the query engine:

# Metadata-only filtering without loading expression data
filtered_cells = slaf.filter_cells(n_genes_by_counts=">=500")
high_quality = slaf.filter_cells(
    n_genes_by_counts=">=1000",
    pct_counts_mt="<=10"
)
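
For intuition, the first filter above behaves roughly like the following metadata-only SQL, evaluated entirely against the cells table without touching expression data (a sketch of the idea, not the exact query SLAF generates):

# Equivalent metadata-only predicate, pushed down to the cells table (sketch)
filtered_cells = slaf.query("""
    SELECT *
    FROM cells
    WHERE n_genes_by_counts >= 500
""")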

Benefits:

  • Memory efficient: Only load metadata when filtering
  • Fast at scale: Faster metadata filtering than h5ad as datasets scale

3. Zarr-Like: Lazy Loading of Sparse Matrices

SLAF provides lazy loading of sparse matrices from cloud storage with concurrent access patterns:

# No data loaded yet - just metadata
adata = read_slaf("data.slaf")

# Lazy slicing like Zarr chunked arrays
subset = adata[adata.obs.cell_type == "T cells", :]
single_cell = adata.get_cell_expression("AAACCTGAGAAACCAT-1")
gene_expression = adata.get_gene_expression("MS4A1")

# Data loaded only when .compute() is called
expression = subset.X.compute()

Benefits:

  • Memory efficient for submatrix operations
  • Concurrent access: Multiple readers can access different slices
  • Cloud-native: Direct access to data in S3/GCS without downloading
  • Chunked processing: Handle datasets larger than RAM (see the sketch below)
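
A minimal sketch of that chunked pattern, assuming the lazy object exposes a shape attribute and positional slicing (assumptions, not confirmed here), with a hypothetical chunk size and a placeholder process() function:

# Process a dataset larger than RAM in fixed-size cell chunks (sketch)
adata = read_slaf("large_data.slaf")
chunk_size = 10_000  # hypothetical chunk size

for start in range(0, adata.shape[0], chunk_size):
    # Only the current chunk is materialized in memory
    chunk = adata[start : start + chunk_size, :].X.compute()
    process(chunk)  # placeholder for your per-chunk analysis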

4. Dask-Like: Lazy Computation Graphs

SLAF enables building computational graphs of operations that execute lazily on demand:

# Build lazy computation graph
adata = LazyAnnData("data.slaf")

# Each operation is lazy - no data loaded yet
pp.calculate_qc_metrics(adata, inplace=True)
pp.filter_cells(adata, min_counts=500, min_genes=200, inplace=True)
pp.normalize_total(adata, target_sum=1e4, inplace=True)
pp.log1p(adata)

# Execute only on the slice of interest
expression = adata.X[cell_ids, gene_ids].compute()

Benefits:

  • Complex pipelines: Build preprocessing workflows impossible with eager processing
  • Composable: Chain operations without intermediate materialization
  • Memory efficient: Only process the slice you need
  • Scalable: Handle datasets that would cause memory explosions

5. Advanced Query Optimization

SLAF includes sophisticated query optimization to overcome current limitations of LanceDB:

# Adaptive batching for large scattered ID sets
batched_query = QueryOptimizer.build_optimized_query(
    entity_ids=large_id_list,
    entity_type="cell",
    use_adaptive_batching=True
)

# CTE optimization for complex queries
cte_query = QueryOptimizer.build_cte_query(
    entity_ids=scattered_ids,
    entity_type="gene"
)

Key optimizations:

  • Submatrix optimization: Efficient slicing for complex selectors
  • Adaptive batching: Optimize query patterns based on ID distribution
  • Range vs IN optimization: Choose BETWEEN vs IN clauses intelligently (sketched below)
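
The range-vs-IN decision can be illustrated with a small sketch; the consecutiveness check and predicate builder below are illustrative, not SLAF's actual heuristic:

# Illustrative sketch of choosing BETWEEN vs IN for a set of integer IDs
def build_id_predicate(ids: list[int], column: str) -> str:
    ids = sorted(ids)
    if ids == list(range(ids[0], ids[-1] + 1)):
        # Consecutive IDs scan a single contiguous range
        return f"{column} BETWEEN {ids[0]} AND {ids[-1]}"
    # Scattered IDs fall back to an explicit IN list
    return f"{column} IN ({', '.join(str(i) for i in ids)})"

build_id_predicate([3, 4, 5, 6], "cell_integer_id")  # cell_integer_id BETWEEN 3 AND 6
build_id_predicate([3, 9, 42], "cell_integer_id")    # cell_integer_id IN (3, 9, 42)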

6. Foundation Model Training Support

SLAF's expressive SQL, combined with an OLAP-optimized query engine, supports the window-function queries that directly power tokenization and streaming dataloaders:

# Streaming tokenization for transformer models
from slaf.ml.dataloaders import SLAFDataLoader

# Create production-ready DataLoader
dataloader = SLAFDataLoader(
    slaf_array=slaf_array,
    tokenizer_type="geneformer",
    batch_size=32,
    max_genes=2048,
    vocab_size=50000
)

# Stream batches for training
for batch in dataloader:
    input_ids = batch["input_ids"]      # Already tokenized
    attention_mask = batch["attention_mask"]
    cell_ids = batch["cell_ids"]
    # Your training code here

# High-throughput dataloading
# 15k cells/sec peak throughput
# 30M tokens/sec for large batches
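
Under the hood, rank-based tokenization (the Geneformer-style ordering of genes by expression within each cell) maps naturally onto a window-function query. A sketch of the idea, not the exact query the dataloader issues, using a hypothetical batch of the first 32 cells:

# Rank genes within each cell by expression - the core of rank-based tokenization (sketch)
ranked = slaf_array.query("""
    SELECT cell_integer_id,
           gene_integer_id,
           ROW_NUMBER() OVER (
               PARTITION BY cell_integer_id
               ORDER BY value DESC
           ) as gene_rank
    FROM expression
    WHERE cell_integer_id BETWEEN 0 AND 31
""")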

Benefits:

  • Streaming architecture: Supports asynchronous pre-fetching and concurrent streaming
  • GPU-optimized: Batch sizes up to 2048 cells with high GPU utilization
  • Multi-node ready: Shard-aware streaming for distributed training
  • Foundation model support: Direct integration with scGPT, Geneformer, etc.

Comparison with Other Formats

SLAF compares with H5AD, Zarr, and SOMA along three axes:

  • Storage: cloud-native access, sparse arrays, chunked reads, and schema evolution
  • Compute: SQL queries, an optimized query engine, and random access
  • Use cases: Scanpy integration, lazy computation, tokenizers, and dataloaders, with embeddings and visualization support coming soon for SLAF

Next Steps