SLAF vs TileDB Performance Benchmarks¶

This document provides a comprehensive performance comparison between SLAF and TileDB SOMA across bioinformatics and machine learning workflows, focusing on modern columnar storage formats.

Overview¶

Both SLAF and TileDB represent modern approaches to single-cell data storage, using Arrow-interoperable columnar formats. However, SLAF demonstrates consistent performance advantages across metadata filtering and dataloader throughput, highlighting the benefits of its optimized streaming architecture and efficient data access patterns.

Key Performance Summary¶

Category	SLAF vs TileDB Speedup	Dataset
Conversion Performance	9.9x faster	synthetic_50k_processed
Cell Filtering	9.4x faster	synthetic_50k_processed
Gene Filtering	8.6x faster	synthetic_50k_processed
Cell-centric Queries	1.1x-1.4x faster	synthetic_50k_processed
Gene-centric Queries	2.5x-3.3x slower	synthetic_50k_processed
Submatrix Queries	2.5x-5.0x slower	synthetic_50k_processed
ML Data Loading	45.7x faster	Tahoe-100M

Modern Format Advantage

Both SLAF and TileDB significantly outperform traditional h5ad-based approaches, demonstrating the benefits of modern cloud-native storage. SLAF provides additional performance optimizations for streaming and data access.

Conversion Performance¶

Conversion from h5ad to modern formats demonstrates SLAF's efficiency in data ingestion workflows.

Input Dataset: synthetic_50k_processed (49,955 cells × 25,000 genes, ~722MB h5ad file)

Metric	SLAF	TileDB SOMA	SLAF vs TileDB Improvement
Conversion Time	1.87s	18.44s	9.9x faster
Output Size	349.2MB	561.1MB	38% smaller
Peak Memory	826.6MB	4,399.6MB	5.3x more memory efficient

Key Advantages:

9.9x faster conversion from h5ad to SLAF format
38% smaller output size compared to TileDB SOMA
5.3x more memory efficient - SLAF uses only 827MB vs TileDB's 4.4GB peak memory
Optimized chunked processing with efficient memory management

TileDB to SLAF Conversion¶

Converting from TileDB SOMA to SLAF format is straightforward and efficient:

TileDB to SLAF Performance: synthetic_50k_processed (49,955 cells × 25,000 genes)

Migration Benefits:

Fast conversion: 50k cell dataset converts in just 2 seconds
Linear scaling: Performance scales linearly with the number of cells
Simple command: slaf convert data.tiledb output.slaf

Easy Migration from TileDB

Converting from TileDB to SLAF is simple and fast. A 50k cell dataset takes only 2 seconds to convert, with linear scaling performance. The conversion preserves all data types and metadata while providing significant performance improvements for downstream analysis.

Fast Migration Path

SLAF's superior conversion performance enables rapid migration of existing h5ad datasets, with conversion times under 2 seconds for 50k cell datasets and significantly smaller output files.

Bioinformatics Benchmarks¶

Cell Filtering Performance¶

Cell filtering operations demonstrate the efficiency of modern columnar storage for metadata operations.

Scenario	TileDB Total (ms)	SLAF Total (ms)	Speedup	Description
S1	20.9	2.9	7.3x	Cells with >=500 genes
S2	21.3	2.0	10.5x	High UMI count (total_counts > 2000)
S3	21.3	1.9	11.5x	Mitochondrial fraction < 0.1
S4	18.6	2.0	9.1x	Complex multi-condition filter
S5	18.1	2.8	6.5x	Cell type annotation filter
S6	20.9	2.0	10.5x	Cells from batch_1
S7	23.8	2.3	10.3x	Cells in clusters 0,1 from batch_1
S8	23.0	2.1	10.7x	High-quality cells (>=1000 genes, <=10% mt)
S9	19.3	2.5	7.9x	Cells with 800-2000 total counts
S10	20.8	2.1	10.1x	Cells with 200-1500 genes

Average Performance: 9.4x faster

Gene Filtering Performance¶

Gene filtering operations show consistent performance advantages for SLAF's optimized Polars operations.

Scenario	TileDB Total (ms)	SLAF Total (ms)	Speedup	Description
S1	22.3	3.0	7.5x	Genes expressed in >=10 cells
S2	19.5	1.7	11.7x	Genes with >=100 total counts
S3	9.3	1.8	5.0x	Genes with mean expression >=0.1
S4	15.8	1.6	10.1x	Exclude mitochondrial genes
S5	16.9	1.7	10.2x	Highly variable genes
S6	15.8	2.1	7.7x	Non-highly variable genes
S7	18.2	2.0	9.1x	Genes in >=50 cells with >=500 total counts
S8	19.8	1.9	10.6x	Genes with 100-10000 total counts
S9	11.3	2.0	5.6x	Genes in 5-1000 cells

Average Performance: 8.6x faster

Expression Queries Performance¶

Expression queries demonstrate the efficiency of optimized sparse matrix operations, where TileDB often wins.

Scenario	TileDB Total (ms)	SLAF Total (ms)	Speedup	Description
S1	63.8	16.1	4.0x	Single cell expression
S2	19.3	13.9	1.4x	Another single cell
S3	18.1	14.2	1.3x	Two cells
S4	19.0	15.5	1.2x	Three cells
S5	150.8	523.7	0.3x	Single gene across all cells
S6	84.2	442.6	0.2x	Another single gene
S7	97.7	303.0	0.3x	Two genes
S8	83.3	655.9	0.1x	Three genes
S9	9.9	22.5	0.4x	100x50 submatrix
S10	12.1	61.9	0.2x	500x100 submatrix
S11	10.8	63.2	0.2x	500x500 submatrix

Average Performance:

SLAF wins cell-centric (1.2x-4.0x),
TileDB wins gene-centric (3.3x-10x) and submatrix (2.5x-5.0x)

Machine Learning Benchmarks¶

Raw Data Loading Performance¶

Raw data loading performance demonstrates the advantages of SLAF's optimized streaming architecture.

System	Throughput (cells/sec)	Notes
SLAF	24,587	Optimized streaming
TileDB DataLoader	518	Custom PyTorch loader

Performance Comparison: 47.5x faster

GPU-Ready Output Performance¶

SLAF provides pre-tokenized sequences ready for GPU training, while TileDB DataLoader only provides raw data.

System	Throughput (cells/sec)	Throughput (tokens/sec)	Output Type
SLAF	7,487	15,332,896	Pre-tokenized sequences
TileDB DataLoader	N/A	N/A	Raw data only

GPU Training Advantage

SLAF is the only system providing GPU-ready tokenized output, enabling efficient training of foundation models like Geneformer and scGPT.

Technical Implementation Comparison¶

Aspect	SLAF	TileDB
Storage	Arrow-interoperable columnar storage with Lance backend	Arrow-interoperable array-native storage with TileDB backend
Metadata	Polars DataFrames for efficient filtering operations	Arrow tables with Polars for filtering operations
Expression	Optimized sparse COO matrices with zero-copy access	Smart indexing for both cell and gene based slicing
ML Integration	Native PyTorch DataLoader with tokenization support	Needs third party custom PyTorch DataLoader for raw data

Key Technical Advantages¶

Optimized Streaming Architecture

SLAF's asynchronous prefetching and background processing provide significant performance advantages over TileDB's synchronous loading approach.

Enhanced ML Integration

SLAF provides native PyTorch DataLoader integration with built-in tokenization support, while TileDB requires custom loader implementation.

Memory Efficiency

SLAF's optimized memory management and zero-copy operations result in 1.5-2.3x memory efficiency gains over TileDB.

Simplified API

SLAF provides a more streamlined API for common bioinformatics operations, while TileDB requires more manual configuration.

Migration Benefits¶

Performance Improvements¶

9.4x faster bioinformatics operations
47.5x faster machine learning data loading
Faster conversion from h5ad to SLAF format, enabling rapid migration of existing datasets
Easy TileDB migration: 2-second conversion for 50k cell datasets with linear scaling

Developer Experience¶

Simplified API with familiar Polars operations
Scanpy-native lazy workflows with drop-in replacement of AnnData objects, enabling efficient lazy computation graphs for preprocessing and filtering pipelines
Enhanced ML integration with native PyTorch support
Built-in tokenization for foundation model training

Use Case Recommendations¶

Choose SLAF for:¶

High-throughput machine learning workflows requiring fast data loading
Foundation model training requiring GPU-ready tokenized sequences
Streaming applications where continuous data flow is critical
Memory-constrained environments where efficiency is important

Choose TileDB for:¶

Existing TileDB infrastructure where migration costs are high
Multi-modal data where TileDB's broader ecosystem is beneficial

Conclusion¶

SLAF provides strong performance advantages over TileDB across most benchmark categories:

Metadata filtering: 9.4x-8.6x faster (cell and gene filtering)
Machine learning data loading: 47.5x faster
Expression queries: Mixed performance (SLAF wins cell-centric, TileDB wins gene-centric)

While both systems represent modern approaches to single-cell data storage, SLAF's optimized streaming architecture, enhanced ML integration, and simplified API provide significant advantages for most use cases. The benchmarks demonstrate that SLAF's performance optimizations and developer-friendly design make it the preferred choice for high-throughput bioinformatics and machine learning workflows.

For detailed migration guidance including TileDB to SLAF conversion, see Migrating to SLAF. For comprehensive benchmark results, see Bioinformatics Benchmarks and ML Benchmarks.