ML Benchmarks: SLAF vs State-of-the-Art Dataloaders¶
SLAF delivers state-of-the-art (SOTA) data loading throughput for machine learning workflows, with roughly 2x speedups over current standards, particularly for training transformer-based single-cell foundation models. What follows are comprehensive benchmarks comparing SLAF against leading dataloaders: scDataset, AnnDataLoader, and AnnLoader.
Motivation¶
The goal of these benchmarks is to demonstrate that SLAF can stream tokens to modern GPUs fast enough to prevent idle time between training steps. For a 1B-parameter model like scGPT, "fast enough" means delivering training batches within roughly 50 ms to keep GPU utilization high. These benchmarks establish SLAF's ability to meet the throughput requirements for efficient foundation model training on massive single-cell datasets.
Dataset and Hardware¶
Dataset: Tahoe-100M¶
We downloaded one of the 7 h5ad files comprising the Tahoe-100M dataset made accessible by ARC Institute. This slice of the dataset contains 5,481,420 cells and 62,710 genes, with approximately 8B non-zero expression values. All benchmarks reported below used this dataset unless indicated otherwise.
Conversion and Optimization¶
We used the SLAF converter (see Migrating to SLAF) to convert the h5ad file to SLAF format. The Lance table fragments (Lance's term for partitions) were optimized for compression/query tradeoffs, with 50M non-zeros (rows) per fragment in the expression table. Although inherently parallelizable, the conversion currently runs as a single process and took about 10 minutes for this dataset.
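For reference, the conversion amounts to a few lines. The sketch below assumes a `SLAFConverter` class with a `convert` method as described in the Migrating to SLAF guide; the exact import path and argument names may differ in your SLAF version, and the filename is a placeholder.

```python
# Hedged sketch: convert one Tahoe-100M h5ad shard to SLAF format.
# The import path, class name, and options are assumptions based on the
# Migrating to SLAF guide; check the current SLAF docs for exact names.
from slaf.data import SLAFConverter

converter = SLAFConverter()
converter.convert(
    "tahoe_100m_shard.h5ad",   # placeholder input path
    "tahoe_100m_shard.slaf",   # output SLAF dataset
)
```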
Hardware Configuration¶
- Machine: Apple MacBook Pro with M1 Max
- Memory: 32 GB RAM
- Storage: 1 TB NVMe SSD
- OS: macOS 13.6.1
Note
These benchmarks represent performance on a high-end laptop. Production deployments on dedicated servers with faster storage may show different performance characteristics. Likewise, performance from object storage to non-colocated compute might be worse.
Internal Benchmarks¶
Methodology¶
We used a batch size of 32, 3 warmup batches, and a measurement duration of 10 seconds for all internal benchmarks.
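The measurement harness is conceptually simple; a minimal sketch (not SLAF's internal benchmark code) is shown below. The `cells_in` callback is a placeholder because the batch schema differs between raw and tokenized modes.

```python
import time

def measure_throughput(dataloader, cells_in, warmup_batches=3, duration_s=10.0):
    """Illustrative throughput harness, not SLAF's internal benchmark code.

    `cells_in` is a caller-supplied function returning the number of cells
    represented by a batch, since the batch schema varies by mode.
    """
    it = iter(dataloader)
    for _ in range(warmup_batches):   # warmup: exclude one-time setup costs
        next(it)

    total_cells = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        total_cells += cells_in(next(it))
    return total_cells / (time.perf_counter() - start)   # cells/sec
```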
Tokenization Strategy Comparison¶
We benchmarked different tokenization strategies to understand the performance impact of various preprocessing options:
Tokenization Strategy | Throughput (cells/sec) | Throughput (tokens/sec) |
---|---|---|
scGPT with binning | 6,633 | 13,598,202 |
scGPT without binning | 6,750 | 13,839,536 |
Geneformer with percentile filtering | 9,246 | 18,937,785 |
Geneformer without percentile filtering | 9,496 | 19,448,877 |
Raw mode (no tokenization) | 25,184 | N/A |
Strategy Insights
- Geneformer strategies show ~40% higher throughput than scGPT strategies
- Binning and filtering have minimal performance impact (~2% difference)
- Raw mode provides 2.8x higher throughput than tokenized modes, demonstrating the tokenization overhead
Raw Mode Performance Scaling¶
Raw mode bypasses tokenization and returns Polars DataFrames with the same schema as sparse CSR tensors, demonstrating SLAF's base data loading performance.
Batch Size | Throughput (cells/sec) | Total Cells | Measurement Time (s) |
---|---|---|---|
32 | 23,988 | 240,374 | 10.0 |
64 | 23,650 | 236,505 | 10.0 |
128 | 27,691 | 277,052 | 10.0 |
256 | 28,125 | 281,315 | 10.0 |
Optimization Validation
Raw mode throughput shows 1.2x improvement from batch size 32 to 256, demonstrating that SLAF's data loading pipeline scales efficiently with larger batch sizes while maintaining high performance.
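Because raw mode hands back Polars DataFrames in a sparse, one-row-per-non-zero layout, converting a batch into a framework-specific sparse format is a small, explicit step downstream. The sketch below is illustrative only: the column names `cell_idx`, `gene_idx`, and `value` are assumptions, not SLAF's actual schema.

```python
import polars as pl
from scipy.sparse import csr_matrix

def polars_batch_to_csr(batch: pl.DataFrame, n_cells: int, n_genes: int) -> csr_matrix:
    """Convert a raw SLAF-style batch (one row per non-zero) to scipy CSR.

    Column names are placeholders; adjust them to the real batch schema.
    """
    rows = batch["cell_idx"].to_numpy()
    cols = batch["gene_idx"].to_numpy()
    vals = batch["value"].to_numpy()
    return csr_matrix((vals, (rows, cols)), shape=(n_cells, n_genes))
```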
Fragment vs Batch Loading Comparison¶
SLAF supports two loading strategies: fragment-based and batch-based loading. Fragment-based loading processes entire Lance fragments at once, while batch-based loading processes multiple Lance batches sequentially.
Strategy | Throughput (cells/sec) | Total Cells | Total Batches |
---|---|---|---|
Fragment-Based Loading | 22,472 | 229,669 | 7,180 |
Batch-Based Loading | 24,354 | 243,554 | 8,038 |
Fragment Strategy Performance
Batch-based loading shows modestly higher throughput than fragment-based loading in this benchmark, but test-retest repeatability shows high variance. The performance difference should not be overinterpreted as it may vary significantly across different runs and hardware configurations.
Strategy Selection
Batch-based loading is the default strategy in SLAF as it has lower memory overhead. Fragment-based loading is available as an alternative via a single additional argument to the SLAFDataLoader (`by_fragment=True`) for users who prefer processing larger data chunks.
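A sketch of switching between the two strategies is shown below. Only `by_fragment=True` comes from the text above; the import paths, dataset handle, and other constructor arguments are illustrative assumptions.

```python
# Hedged sketch: only by_fragment=True is taken from the docs above;
# import paths and other arguments may differ in your SLAF version.
from slaf import SLAFArray
from slaf.ml import SLAFDataLoader

slaf = SLAFArray("tahoe_100m_shard.slaf")   # placeholder path

# Default: batch-based loading (lower memory overhead)
batch_loader = SLAFDataLoader(slaf, batch_size=32)

# Alternative: fragment-based loading (process whole Lance fragments)
fragment_loader = SLAFDataLoader(slaf, batch_size=32, by_fragment=True)
```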
Tokenized Mode: Tokens/sec Scaling¶
Tokenized mode provides pre-tokenized sequences ready for GPU training, demonstrating SLAF's end-to-end pipeline performance.
Batch Size | Throughput (cells/sec) | Throughput (tokens/sec) | Total Cells | Measurement Time (s) |
---|---|---|---|---|
32 | 9,424 | 19,299,581 | 95,157 | 10.1 |
64 | 9,526 | 19,508,342 | 95,436 | 10.0 |
128 | 9,598 | 19,657,469 | 96,038 | 10.0 |
256 | 9,655 | 19,773,029 | 96,769 | 10.0 |
Tokenization Efficiency
Token throughput remains remarkably constant across batch sizes (1.0x scaling), demonstrating that SLAF's tokenization pipeline is well-optimized and not the bottleneck. This validates that tokens/sec is the meaningful metric for GPU training workloads.
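Counting tokens/sec instead of cells/sec only changes what the harness accumulates; a minimal sketch is below, assuming each tokenized batch exposes a padded token tensor under an illustrative `input_ids` key.

```python
import time

def measure_tokens_per_sec(dataloader, duration_s=10.0):
    """Illustrative tokens/sec counter; the "input_ids" key is an assumption."""
    total_tokens = 0
    start = time.perf_counter()
    for batch in dataloader:
        total_tokens += batch["input_ids"].numel()   # padded tokens in this batch
        if time.perf_counter() - start >= duration_s:
            break
    return total_tokens / (time.perf_counter() - start)
```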
External Benchmarks¶
Alternate Dataloaders¶
We compared SLAF against three state-of-the-art dataloaders:
- AnnLoader - Experimental PyTorch DataLoader for AnnData objects from `anndata.experimental`
- AnnDataLoader - From scvi-tools, designed for training variational autoencoder (VAE)-style models
- scDataset - Recently released high-performance dataloader with multiprocessing support
Help
At the time of writing, we couldn't find a submodule called `scdl` in NVIDIA BioNeMo's PyPI package that implements the scdl dataloader; it appears to have been deprecated.
Methodology¶
To match the benchmarks from the scDataset paper as closely as possible, we used `batch_size=64` across all comparisons. For scDataset itself, we used the parameters that were optimal on our hardware (`block_size=8`, `fetch_factor=64`), which differ from those found to be optimal in the paper. However, we couldn't use `num_workers=12` out of the box, because h5ad datasets aren't picklable and PyTorch DataLoaders require picklable datasets for multiprocessing.
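For the AnnLoader baseline, the setup amounts to roughly the following (a sketch: the file path, backed mode, and shuffle setting are our assumptions, not prescribed by the benchmark).

```python
import anndata as ad
from anndata.experimental import AnnLoader

# Hedged baseline sketch; path and shuffle setting are placeholders.
adata = ad.read_h5ad("tahoe_100m_shard.h5ad", backed="r")   # backed mode keeps the ~8B non-zeros on disk
loader = AnnLoader(adata, batch_size=64, shuffle=True)

for batch in loader:
    X = batch.X   # mini-batch expression values for 64 cells
    break
```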
Tier 1: Raw Data Loading Comparison¶
Raw data loading performance measures the base throughput of each system without any tokenization overhead.
System | Throughput (cells/sec) |
---|---|
SLAF | 26,816 |
scDataset | 10,849 |
AnnDataLoader | 408 |
AnnLoader | 224 |
SOTA Performance
SLAF achieves 2.5x higher throughput than scDataset and 66x higher throughput than AnnDataLoader in raw data loading.
scDataset Performance Analysis
Our comprehensive benchmarks reveal that scDataset can achieve excellent performance with proper parameter tuning. We observed 10,849 cells/sec with optimized parameters, which is 5.4x higher than the paper's reported ~2,000 cells/sec, even without multiprocessing. Note, however, that these numbers come from entirely different systems (an M1 Max here vs. an NVIDIA DGX CPU in the paper).
However, we found significant limitations with multiprocessing due to pickling issues with h5py-backed AnnData objects. See our detailed scDataset benchmarks for complete analysis including parameter scaling and multiprocessing limitations.
Parameter Scaling Validation
Our parameter sweeps confirm scDataset's strong scaling behavior: 23.1x improvement from worst to best configuration. The `fetch_factor` parameter shows the strongest scaling (20x+ improvement), while `block_size` shows more moderate effects. This validates the design approach described in their paper, though optimal parameters may vary by hardware.
Multiprocessing Limitations
We were unable to test `num_workers > 0` due to pickling errors with h5py objects. We're still working with the scDataset team to figure out implementation differences.
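A minimal reproduction of the underlying issue (the path is a placeholder, and the exact exception type and message depend on h5py/anndata versions):

```python
import pickle
import anndata as ad

# Backed AnnData wraps an open h5py file handle, which cannot be pickled.
adata = ad.read_h5ad("tahoe_100m_shard.h5ad", backed="r")

try:
    pickle.dumps(adata)   # what PyTorch DataLoader workers must do to receive the dataset
except Exception as err:  # typically a TypeError raised by h5py
    print(f"not picklable: {err}")
```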
Tier 2: GPU-Ready Output Comparison¶
Raw data loading benchmarks are great, provided we intend to train on gene expression counts directly. However, modern foundation models like Geneformer, scGPT, Transcriptformer, or STATE construct cell sentences from tokens that represent gene identity and expression bins. Many of these workflows require dataframe-friendly operations such as sorting, windowing, ranking, and filtering. Our view is that it is much better to situate these computations in the (typically) CPU-bound dataloader than to expect the GPU in the training loop to do the heavy lifting. Accordingly, SLAF dataloaders take a tokenizer and transform raw data into training-ready token sequences.
This GPU-ready throughput measures end-to-end performance including tokenization (windowing, ranking, vocabulary mapping, and padding), which is critical for training workflows involving models that turn cells into sentences.
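As a simplified illustration of these dataframe-friendly operations, a Geneformer-style rank-value encoding can be written directly in Polars. This is not SLAF's implementation; the column names, toy vocabulary, and context length are placeholders.

```python
import polars as pl

# Toy batch: one row per non-zero (cell, gene, value) triple. All names are placeholders.
batch = pl.DataFrame({
    "cell_idx": [0, 0, 0, 1, 1],
    "gene_id":  ["TP53", "GAPDH", "ACTB", "TP53", "ACTB"],
    "value":    [3.0, 10.0, 7.0, 1.0, 5.0],
})
vocab = pl.DataFrame({"gene_id": ["TP53", "GAPDH", "ACTB"], "token_id": [1, 2, 3]})
max_len = 4  # toy context window

sentences = (
    batch.join(vocab, on="gene_id")                             # vocabulary mapping
    .sort(["cell_idx", "value"], descending=[False, True])      # rank genes by expression within each cell
    .group_by("cell_idx", maintain_order=True)
    .agg(pl.col("token_id").head(max_len).alias("input_ids"))   # windowing / truncation
)
# Padding each "input_ids" list to max_len would follow before stacking into a tensor.
```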
Even though SLAF's tokenizing dataloaders do more work, we find that their throughput exceeds scDataset's raw-data dataloader by 1.8x.
System | Throughput (cells/sec) | Throughput (tokens/sec) |
---|---|---|
SLAF | 9,607 | 19,675,243 |
GPU-Ready Cell Sentences
SLAF dataloaders provide the only GPU-ready input among the available alternatives.
Some Discrepancies¶
Tokens/sec is better than Cells/sec
Cells/sec can be a misleading metric for GPU training workloads. A better measure of throughput for minimizing GPU idle time is tokens/sec, since pre-made token sequences are ready for matrix multiplication on the GPU. SLAF's tokenized mode demonstrates this principle: cells/sec drops relative to raw mode because of tokenization overhead, but the constant tokens/sec across batch sizes shows that the tokenization pipeline is well-optimized across scales.
Scaling behaviors reveal hidden optimization opportunities
SLAF's constant scaling with batch size suggests that the loading and processing are impedance matched: loading more data per batch does not slow down throughput. Constant dataloader throughput for larger batch sizes implies that the bottleneck to batch size is not dataloading but GPU memory. In contrast, we observed that scDataset's throughput scales linearly with batch size (not shown in these results), suggesting that it is doing more work than needed at small batch sizes, and could achieve better performance with optimizations like async prefetching.
In-memory formats matter
The performance difference between AnnDataLoader (392 cells/sec) and scDataset (10,785 cells/sec) is dramatic. scDataset is smarter about batching and randomization, but since our benchmark tests both on loading from the same h5ad file, it's also important to compare dataloader outputs apples to apples: AnnDataLoader and AnnLoader return `torch.sparse_csr` tensors, whereas scDataset returns a `scipy.sparse.csr_matrix`.
In our work, we noticed different overheads when converting from a Polars DataFrame (SLAF's preferred format for raw data) to torch and scipy sparse formats, and ultimately decided to keep raw outputs in Polars. The performance gap between AnnLoader/AnnDataLoader and scDataset is almost certainly due to the overhead of converting scipy sparse arrays to torch tensors, and it is worth benchmarking more carefully to identify low-hanging fruit for optimization in both AnnLoader and AnnDataLoader.
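For context, the per-batch conversion that a scipy-based pipeline pays when feeding torch looks roughly like this; the batch shape and density below are placeholders, not measured values.

```python
import time
import numpy as np
import torch
from scipy.sparse import random as sparse_random

# Illustrative cost of converting a scipy CSR batch to a torch sparse CSR tensor.
batch = sparse_random(64, 62_710, density=0.02, format="csr", dtype=np.float32)

start = time.perf_counter()
for _ in range(100):
    torch.sparse_csr_tensor(
        torch.from_numpy(batch.indptr.astype(np.int64)),
        torch.from_numpy(batch.indices.astype(np.int64)),
        torch.from_numpy(batch.data),
        size=batch.shape,
    )
print(f"~{(time.perf_counter() - start) / 100 * 1e3:.2f} ms per batch conversion")
```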
Conclusion¶
Cloud-Native Architecture¶
While these benchmarks use local SSD, the Lance format is native to cloud storage. Early tests suggest that latency between S3 and EC2 in the same region is not appreciably different from local storage. This opens up a cloud-native, store-once, query-multiple-times zero-copy architecture that eliminates data duplication.
GPU Throughput Requirements¶
What's a good enough cells/sec rate to keep an 8×H100 node busy at roughly $2/GPU/hr? Assuming 50 ms per training step for a model like scGPT (the arithmetic is spelled out in the short calculation after this list):
- 8 GPUs × 32 cells/batch × 20 batches/sec = 5,120 cells/sec would maximize GPU utilization
- Tahoe-100M training: 100M cells ÷ 5,120 cells/sec ≈ 5.4 hours per epoch ≈ $86/epoch
- Anything faster than 5,120 cells/sec opens up multi-node training possibilities, trading off cost and wall clock time.
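The same back-of-the-envelope arithmetic as a script (the 50 ms step time and $2/GPU/hr rate are the assumptions stated above):

```python
# Back-of-the-envelope GPU feeding requirements, using the assumptions above.
gpus = 8
cells_per_batch = 32
step_time_s = 0.050            # ~50 ms per training step for a ~1B-parameter model
price_per_gpu_hr = 2.0         # assumed $/GPU/hr

batches_per_sec = 1 / step_time_s                                    # 20 steps/sec
required_cells_per_sec = gpus * cells_per_batch * batches_per_sec    # 5,120 cells/sec

epoch_cells = 100_000_000                                            # Tahoe-100M
epoch_hours = epoch_cells / required_cells_per_sec / 3600            # ~5.4 hours
epoch_cost = epoch_hours * gpus * price_per_gpu_hr                   # roughly $86-87

print(f"{required_cells_per_sec:,.0f} cells/sec, {epoch_hours:.1f} h/epoch, ${epoch_cost:.0f}/epoch")
```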
This raises the question: can we build towards a $100 scGPT model through efficient multi-node training enabled by high-throughput data loading? More on this soon!
High Concurrency and Multi-User Training¶
The Lance format's high concurrency, optimized for production multimodal data lakes with high QPS, enables not only multi-node training but multiple users training multiple models simultaneously without their own copies of the dataset. This contrasts with h5ad, which requires:
- Local storage: The dataset must be local to the CPU instance loading it for attached GPUs
- Non-concurrent access: One copy of the dataset per user
SLAF, with Lance under the hood, enables a truly scalable architecture for foundation model training on massive single-cell datasets.
These benchmarks demonstrate SLAF's position as the leading solution for high-performance single-cell data loading, enabling efficient training of foundation models on massive datasets with minimal resource requirements and maximum scalability.
SLAF is a young project with a bus factor of 1. You can help improve that by using it and contributing to it. Read about the SLAF vision in this blog post and contribute at github.com/slaf-project/slaf.