Bioinformatics Benchmarks: SLAF vs Traditional Formats¶
SLAF provides dramatic performance improvements over traditional single-cell data formats for common bioinformatics operations. This document presents comprehensive benchmarks comparing SLAF against h5ad (AnnData) and TileDB SOMA across realistic bioinformatics workflows.
Overview¶
These benchmarks demonstrate SLAF's performance advantages in three key areas:
- Metadata Filtering - Cell and gene filtering operations
- Expression Queries - Retrieving expression data for specific cells/genes
- Preprocessing Pipelines - Scanpy-based preprocessing workflows
Test Dataset: synthetic_50k_processed¶
- Cells: 49,955 cells
- Genes: 25,000 genes
- Input File Size: ~722MB h5ad file
Hardware Configuration¶
- Machine: Apple MacBook Pro with M1 Max
- Memory: 32 GB RAM
- Storage: 1 TB NVMe SSD (local disk)
- OS: macOS 13.6.1
- Python: 3.12.0
Cell Filtering Benchmarks¶
Cell filtering is a fundamental operation in single-cell analysis, used for quality control, cell type selection, and data subsetting.
Performance Results¶
| Scenario | h5ad Total (ms) | SLAF Total (ms) | TileDB Total (ms) | SLAF vs h5ad | SLAF vs TileDB | Description |
|---|---|---|---|---|---|---|
| S1 | 530.0 | 2.9 | 20.9 | 183.6x | 7.3x | Cells with >=500 genes |
| S2 | 169.7 | 2.0 | 21.3 | 83.3x | 10.5x | High UMI count (total_counts > 2000) |
| S3 | 170.7 | 1.9 | 21.3 | 92.2x | 11.5x | Mitochondrial fraction < 0.1 |
| S4 | 177.1 | 2.0 | 18.6 | 86.7x | 9.1x | Complex multi-condition filter |
| S5 | 186.6 | 2.8 | 18.1 | 67.2x | 6.5x | Cell type annotation filter |
| S6 | 171.4 | 2.0 | 20.9 | 86.3x | 10.5x | Cells from batch_1 |
| S7 | 207.0 | 2.3 | 23.8 | 89.4x | 10.3x | Cells in clusters 0,1 from batch_1 |
| S8 | 170.9 | 2.1 | 23.0 | 79.6x | 10.7x | High-quality cells (>=1000 genes, <=10% mt) |
| S9 | 172.1 | 2.5 | 19.3 | 70.2x | 7.9x | Cells with 800-2000 total counts |
| S10 | 173.5 | 2.1 | 20.8 | 84.6x | 10.1x | Cells with 200-1500 genes |
Average Performance:
- SLAF vs h5ad: 92.3x faster
- SLAF vs TileDB: 9.4x faster
- Memory Usage: SLAF uses 115.7x less memory than h5ad
Key Insights¶
Dramatic Performance Advantage
SLAF achieves 92.3x average speedup over h5ad for cell filtering operations, demonstrating the massive performance benefits of modern columnar storage and optimized querying.
Columnar Format Efficiency
Both SLAF and TileDB (Arrow-interoperable formats) significantly outperform h5ad, with SLAF providing an additional 9.4x advantage over TileDB through its optimized streaming architecture.
Gene Filtering Benchmarks¶
Gene filtering operations are essential for feature selection, quality control, and differential expression in single-cell analysis.
Performance Results¶
| Scenario | h5ad Total (ms) | SLAF Total (ms) | TileDB Total (ms) | SLAF vs h5ad | SLAF vs TileDB | Description |
|---|---|---|---|---|---|---|
| S1 | 43.4 | 3.0 | 22.3 | 14.6x | 7.5x | Genes expressed in >=10 cells |
| S2 | 32.3 | 1.7 | 19.5 | 19.4x | 11.7x | Genes with >=100 total counts |
| S3 | 32.1 | 1.8 | 9.3 | 17.4x | 5.0x | Genes with mean expression >=0.1 |
| S4 | 31.1 | 1.6 | 15.8 | 19.9x | 10.1x | Exclude mitochondrial genes |
| S5 | 32.7 | 1.7 | 16.9 | 19.7x | 10.2x | Highly variable genes |
| S6 | 31.7 | 2.1 | 15.8 | 15.4x | 7.7x | Non-highly variable genes |
| S7 | 31.5 | 2.0 | 18.2 | 15.8x | 9.1x | Genes in >=50 cells with >=500 total counts |
| S8 | 31.7 | 1.9 | 19.8 | 17.0x | 10.6x | Genes with 100-10000 total counts |
| S9 | 33.2 | 2.0 | 11.3 | 16.4x | 5.6x | Genes in 5-1000 cells |
Average Performance:
- SLAF vs h5ad: 17.3x faster
- SLAF vs TileDB: 8.6x faster
- Memory Usage: SLAF uses 2.2x less memory than h5ad
Key Insights¶
Consistent High Performance
SLAF maintains consistent 14x+ speedups across all gene filtering scenarios, demonstrating robust optimization of Polars operations and modern storage formats.
Memory Efficiency
Gene filtering operations show moderate memory efficiency gains, with SLAF using 2.2x less memory than h5ad for equivalent operations.
Expression Queries Benchmarks¶
Expression queries retrieve specific expression data for cells or genes, supporting analysis workflows that require targeted data access.
Performance Results¶
| Scenario | h5ad Total (ms) | SLAF Total (ms) | TileDB Total (ms) | SLAF vs h5ad | SLAF vs TileDB | Description |
|---|---|---|---|---|---|---|
| S1 | 484.5 | 16.1 | 63.8 | 30.1x | 4.0x | Single cell expression |
| S2 | 251.3 | 13.9 | 19.3 | 18.1x | 1.4x | Another single cell |
| S3 | 328.3 | 14.2 | 18.1 | 23.1x | 1.3x | Two cells |
| S4 | 233.2 | 15.5 | 19.0 | 15.1x | 1.2x | Three cells |
| S5 | 232.7 | 523.7 | 150.8 | 0.4x | 0.3x | Single gene across all cells |
| S6 | 203.4 | 442.6 | 84.2 | 0.5x | 0.2x | Another single gene |
| S7 | 256.1 | 303.0 | 97.7 | 0.8x | 0.3x | Two genes |
| S8 | 212.0 | 655.9 | 83.3 | 0.3x | 0.1x | Three genes |
| S9 | 221.4 | 22.5 | 9.9 | 9.9x | 0.4x | 100x50 submatrix |
| S10 | 168.3 | 61.9 | 12.1 | 2.7x | 0.2x | 500x100 submatrix |
| S11 | 212.2 | 63.2 | 10.8 | 3.4x | 0.2x | 500x500 submatrix |
Average Performance:
- SLAF vs h5ad: 9.5x faster
- SLAF vs TileDB: 0.9x faster
- Memory Usage: SLAF uses 154.6x less memory than h5ad
Key Insights¶
Query Optimization
SLAF's expression query performance demonstrates efficient sparse matrix operations and optimized data access patterns, achieving 9.5x average speedup over h5ad.
TileDB's Expression Query Strengths
TileDB demonstrates impressive performance for gene expression queries and submatrix operations, often outperforming both SLAF and h5ad. For single gene queries across all cells (S5-S8), TileDB shows 2.5-10x speedup over SLAF, highlighting its optimized columnar access patterns for gene-centric operations.
SLAF's Cell-Centric Advantages
SLAF maintains strong performance for cell-centric queries (S1-S4, S9-S11), achieving 15-30x speedup over h5ad for single cell and submatrix operations, while TileDB shows competitive or superior performance for larger submatrices.
Mixed Performance Profile
The benchmarks reveal a nuanced performance landscape: SLAF excels at cell-centric operations and metadata filtering, while TileDB demonstrates superior performance for gene-centric expression queries and large submatrix operations. This suggests different systems may be optimal for different analysis workflows.
Lazy Computation (Preprocessing Pipelines)¶
SLAF enables lazy computation graphs that build complex preprocessing pipelines and only execute them on the slice of interest, similar to Dask's delayed computation patterns.
Traditional Approach (Eager Processing)¶
# Each step loads data into memory
adata = sc.read_h5ad("data.h5ad", backed="r")
# QC metrics calculation
sc.pp.calculate_qc_metrics(adata, inplace=True)
# Cell filtering with min_counts and min_genes
sc.pp.filter_cells(adata, min_counts=500, min_genes=200, inplace=True)
# Gene filtering
sc.pp.filter_genes(adata, min_counts=10, min_cells=5, inplace=True)
# Normalization
sc.pp.normalize_total(adata, target_sum=1e4, inplace=True)
# Log transformation
sc.pp.log1p(adata)
# Each operation processes the entire dataset
expression = adata.X[cell_ids, gene_ids]
SLAF Approach (Lazy Computation Graph)¶
# Build lazy computation graph
adata = LazyAnnData("data.slaf") # LazyAnnData object
# QC metrics calculation (lazy)
pp.calculate_qc_metrics(adata, inplace=True)
# Cell filtering (lazy)
pp.filter_cells(adata, min_counts=500, min_genes=200, inplace=True)
# Gene filtering (lazy)
pp.filter_genes(adata, min_counts=10, min_cells=5, inplace=True)
# Normalization (lazy)
pp.normalize_total(adata, target_sum=1e4, inplace=True)
# Log transformation (lazy)
pp.log1p(adata)
# Only execute the computation on the slice of interest
expression = adata.X[cell_ids, gene_ids].compute() # LazyExpressionMatrix.compute()
Performance Results¶
| Operation | Traditional Total (ms) | SLAF Total (ms) | Total Speedup | Memory Efficiency | Description |
|---|---|---|---|---|---|
| S1 | 1150.4 | 2577.0 | 0.4x | 148.1x | Calculate QC metrics |
| S2 | 941.0 | 2045.1 | 0.5x | 1.5x | Filter cells (min_counts=500, min_genes=200) |
| S3 | 929.7 | 2058.9 | 0.5x | 1.5x | Filter cells (min_counts=100, min_genes=50) |
| S4 | 862.1 | 2184.9 | 0.4x | 1.5x | Filter cells (max_counts=10000, max_genes=3000) |
| S5 | 1208.9 | 2586.7 | 0.5x | 1.5x | Filter genes (min_counts=10, min_cells=5) |
| S6 | 1183.5 | 2516.4 | 0.5x | 1.5x | Filter genes (min_counts=20, min_cells=5) |
| S7 | 436.2 | 3675.7 | 0.1x | 1.5x | Normalize total (target_sum=1e4) |
| S8 | 410.9 | 2843.9 | 0.1x | 1.5x | Normalize total (target_sum=1e6) |
| S9 | 628.2 | 2083.7 | 0.3x | 0.0x | Log1p transformation |
| S10 | 653.9 | 1660.6 | 0.4x | 98.9x | Find highly variable genes |
| S11 | 695.6 | 1642.6 | 0.4x | 98.9x | Find top 2000 highly variable genes |
| S12 | 2359.9 | 6418.5 | 0.4x | 1.5x | QC metrics + cell filtering + gene filtering |
| S13 | 338.8 | 3417.2 | 0.1x | 2.0x | Normalize total + slice 100x50 submatrix (lazy) |
| S14 | 617.9 | 2174.4 | 0.3x | 2.0x | Log1p + slice 200x100 submatrix (lazy) |
| S15 | 627.9 | 3510.9 | 0.2x | 2.0x | Normalize + Log1p + slice 500x250 submatrix (lazy) |
| S16 | 922.6 | 3061.3 | 0.3x | 1.9x | Normalize + Log1p + mean per gene (lazy) |
| S17 | 620.4 | 3174.6 | 0.2x | 1.8x | Normalize + Log1p + variance per cell (lazy) |
Key Insight: Lazy computation enables complex preprocessing pipelines that would cause memory explosions with traditional tools. The computation cost is paid when materializing results, but the memory efficiency enables workflows impossible with eager processing. This is similar to Dask's delayed computation patterns.
For detailed migration guides, see SLAF vs h5ad Benchmarks and SLAF vs TileDB Benchmarks.