
ML Benchmarks: SLAF vs State-of-the-Art Dataloaders

SLAF provides state-of-the-art (SOTA) data loading throughput for machine learning workflows, reaching roughly 2x speedups over current standards, particularly for training transformer-based single-cell foundation models. What follows are comprehensive benchmarks comparing SLAF against leading dataloaders: scDataset, AnnDataLoader, and AnnLoader.

Motivation

The goal of these benchmarks is to demonstrate that SLAF can stream tokens to modern GPUs at a rate sufficient to prevent idle time between training loops. For a 1B-parameter model like scGPT, "fast enough" means delivering each training batch within about 50 ms to keep GPU utilization high. These benchmarks establish SLAF's ability to meet the throughput requirements for efficient foundation model training on massive single-cell datasets.

Dataset and Hardware

Dataset: Tahoe-100M

We downloaded one of the 7 h5ad files comprising the Tahoe-100M dataset made accessible by ARC Institute. This slice of the dataset contains 5,481,420 cells and 62,710 genes, with approximately 8B non-zero expression values. All benchmarks reported below used this dataset unless indicated otherwise.

Conversion and Optimization

We used the SLAF converter (see Migrating to SLAF) to convert the h5ad file to SLAF format. The Lance table fragments (Lance's term for partitions) were optimized for compression/query tradeoffs, with 50M non-zeros (rows) per fragment in the expression table. While conversion is inherently parallelizable, it currently runs as a single process and took about 10 minutes for this dataset.
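
For orientation, a minimal conversion sketch is shown below. The class name, import path, and file names are assumptions for illustration only; see Migrating to SLAF for the actual converter API.

```python
# Hypothetical sketch only: the converter class and arguments are assumptions,
# not the documented API (see "Migrating to SLAF").
from slaf.data import SLAFConverter  # assumed import path

converter = SLAFConverter()
converter.convert(
    "tahoe_100m_slice.h5ad",  # one of the 7 h5ad files comprising Tahoe-100M
    "tahoe_100m_slice.slaf",  # output SLAF dataset (Lance tables under the hood)
)
```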

Hardware Configuration

  • Machine: Apple MacBook Pro with M1 Max
  • Memory: 32 GB RAM
  • Storage: 1 TB NVMe SSD
  • OS: macOS 13.6.1

Note

These benchmarks represent performance on a high-end laptop. Production deployments on dedicated servers with faster storage may show different performance characteristics. Likewise, performance from object storage to non-colocated compute might be worse.

Internal Benchmarks

Methodology

We used a batch size of 32, 3 warmup batches, and a measurement duration of 10 seconds for all internal benchmarks.
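
As a concrete sketch of that procedure (not the exact benchmark harness), the helper below discards a fixed number of warmup batches and then counts cells over a fixed wall-clock window:

```python
import time

def measure_cells_per_sec(dataloader, batch_size=32, n_warmup=3, duration_s=10.0):
    """Cells/sec over a fixed wall-clock window, after discarding warmup batches."""
    it = iter(dataloader)
    for _ in range(n_warmup):   # warmup: exclude cold-start costs from the measurement
        next(it)

    n_batches = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        next(it)                # consume one batch
        n_batches += 1
    elapsed = time.perf_counter() - start
    return n_batches * batch_size / elapsed  # assumes every batch holds `batch_size` cells
```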

Tokenization Strategy Comparison

We benchmarked different tokenization strategies to understand the performance impact of various preprocessing options:

| Tokenization Strategy | Throughput (cells/sec) | Throughput (tokens/sec) |
| --- | --- | --- |
| scGPT with binning | 6,633 | 13,598,202 |
| scGPT without binning | 6,750 | 13,839,536 |
| Geneformer with percentile filtering | 9,246 | 18,937,785 |
| Geneformer without percentile filtering | 9,496 | 19,448,877 |
| Raw mode (no tokenization) | 25,184 | N/A |

Strategy Insights

  • Geneformer strategies show ~40% higher throughput than scGPT strategies
  • Binning and filtering have minimal performance impact (~2% difference)
  • Raw mode provides 2.8x higher throughput than tokenized modes, demonstrating the tokenization overhead
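
The two throughput columns above are linked by sequence length: dividing tokens/sec by cells/sec gives roughly 2,050 tokens per cell for every strategy, consistent with a fixed maximum sequence length (an inference from the table, not a documented SLAF setting).

```python
# Quick consistency check on the table above: tokens/sec ≈ cells/sec × sequence length.
rows = {
    "scGPT with binning":                      (6_633, 13_598_202),
    "scGPT without binning":                   (6_750, 13_839_536),
    "Geneformer with percentile filtering":    (9_246, 18_937_785),
    "Geneformer without percentile filtering": (9_496, 19_448_877),
}
for name, (cells_per_sec, tokens_per_sec) in rows.items():
    print(f"{name}: ~{tokens_per_sec / cells_per_sec:.0f} tokens/cell")
# Each strategy comes out to roughly 2,048-2,050 tokens per cell.
```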

Raw Mode Performance Scaling

Raw mode bypasses tokenization and returns Polars DataFrames whose schema mirrors that of sparse CSR tensors, demonstrating SLAF's baseline data loading performance.

| Batch Size | Throughput (cells/sec) | Total Cells | Measurement Time (s) |
| --- | --- | --- | --- |
| 32 | 23,988 | 240,374 | 10.0 |
| 64 | 23,650 | 236,505 | 10.0 |
| 128 | 27,691 | 277,052 | 10.0 |
| 256 | 28,125 | 281,315 | 10.0 |
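
If a downstream step needs an actual sparse matrix, a raw-mode batch can be assembled into a scipy CSR matrix. The column names below (cell_id, gene_id, value) are assumptions for illustration, not SLAF's documented schema.

```python
import polars as pl
from scipy.sparse import csr_matrix

# Illustrative only: a raw-mode batch holds one row per non-zero entry.
batch = pl.DataFrame({
    "cell_id": [0, 0, 1, 1, 1],
    "gene_id": [5, 9, 2, 5, 7],
    "value":   [3.0, 1.0, 2.0, 4.0, 1.0],
})

rows = batch["cell_id"].to_numpy()
cols = batch["gene_id"].to_numpy()
vals = batch["value"].to_numpy()

# Build a sparse CSR matrix for downstream use (e.g., count-based models).
n_cells = rows.max() + 1
n_genes = 62_710  # gene dimension of the Tahoe-100M slice used above
X = csr_matrix((vals, (rows, cols)), shape=(n_cells, n_genes))
```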

Optimization Validation

Raw mode throughput shows 1.2x improvement from batch size 32 to 256, demonstrating that SLAF's data loading pipeline scales efficiently with larger batch sizes while maintaining high performance.

Fragment vs Batch Loading Comparison

SLAF supports two loading strategies: fragment-based and batch-based loading. Fragment-based loading processes entire Lance fragments at once, while batch-based loading processes multiple Lance batches sequentially.

| Strategy | Throughput (cells/sec) | Total Cells | Total Batches |
| --- | --- | --- | --- |
| Fragment-Based Loading | 22,472 | 229,669 | 7,180 |
| Batch-Based Loading | 24,354 | 243,554 | 8,038 |

Fragment Strategy Performance

Batch-based loading shows modestly higher throughput than fragment-based loading in this benchmark, but repeated runs show high variance. The difference should not be overinterpreted, as it may vary significantly across runs and hardware configurations.

Strategy Selection

Batch-based loading is the default strategy in SLAF as it has lower memory overhead. Fragment-based loading is available as an alternative with just a single additional argument (by_fragment=True) to the SLAFDataLoader for users who prefer processing larger data chunks.
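
The documented switch is the single extra argument by_fragment=True; everything else in the snippet below (the import path, the dataset handle, and the other constructor arguments) is an assumption for illustration.

```python
from slaf.ml import SLAFDataLoader  # import path is an assumption

# `slaf_dataset` stands in for a previously opened SLAF dataset.
# Default: batch-based loading (lower memory overhead).
batch_loader = SLAFDataLoader(slaf_dataset, batch_size=32)

# Alternative: fragment-based loading, which processes whole Lance fragments at once.
fragment_loader = SLAFDataLoader(slaf_dataset, batch_size=32, by_fragment=True)
```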

Tokenized Mode: Tokens/sec Scaling

Tokenized mode provides pre-tokenized sequences ready for GPU training, demonstrating SLAF's end-to-end pipeline performance.

| Batch Size | Throughput (cells/sec) | Throughput (tokens/sec) | Total Cells | Measurement Time (s) |
| --- | --- | --- | --- | --- |
| 32 | 9,424 | 19,299,581 | 95,157 | 10.1 |
| 64 | 9,526 | 19,508,342 | 95,436 | 10.0 |
| 128 | 9,598 | 19,657,469 | 96,038 | 10.0 |
| 256 | 9,655 | 19,773,029 | 96,769 | 10.0 |

Tokenization Efficiency

Token throughput remains remarkably constant across batch sizes (1.0x scaling), demonstrating that SLAF's tokenization pipeline is well-optimized and not the bottleneck. This validates that tokens/sec is the meaningful metric for GPU training workloads.
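
As a sketch of how tokens/sec can be measured, the loop below counts every delivered token (including padding) over a fixed window. The batch structure (a dict with an "input_ids" tensor) is an assumption for illustration, not SLAF's documented batch format.

```python
import time

def measure_tokens_per_sec(dataloader, n_warmup=3, duration_s=10.0):
    """Tokens/sec over a fixed window, assuming each batch carries a [batch, seq_len] token tensor."""
    it = iter(dataloader)
    for _ in range(n_warmup):
        next(it)

    total_tokens = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        batch = next(it)
        input_ids = batch["input_ids"]     # key name is an assumption, not SLAF's documented API
        total_tokens += input_ids.numel()  # batch_size × sequence length, padding included
    return total_tokens / (time.perf_counter() - start)
```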

External Benchmarks

Alternate Dataloaders

We compared SLAF against three state-of-the-art dataloaders:

  1. AnnLoader - Experimental PyTorch DataLoader for AnnData objects from anndata.experimental
  2. AnnDataLoader - From scvi-tools, designed for training variational autoencoder (VAE)-style models
  3. scDataset - Recently released high-performance dataloader with multiprocessing support

Help

At the time of writing, we couldn't find the scdl dataloader submodule in NVIDIA BioNeMo's PyPI package; it appears to have been deprecated.

Methodology

To match the benchmarks from the scDataset paper as closely as possible, we used batch_size=64 across all comparisons. For scDataset itself, we used the parameters that were optimal on our hardware (block_size=8, fetch_factor=64), which differ from those found optimal in the paper. However, we couldn't use num_workers=12 out of the box: PyTorch DataLoaders rely on multiprocessing, which requires the dataset to be picklable, and h5ad-backed datasets are not.
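
To illustrate the limitation (which is not specific to scDataset): PyTorch DataLoader workers pickle the dataset object when num_workers > 0, and a backed AnnData holds an open h5py file handle that cannot be pickled. A minimal reproduction sketch, with a placeholder path:

```python
import pickle
import anndata as ad

# Backed mode keeps an open h5py file handle inside the AnnData object.
adata = ad.read_h5ad("tahoe_100m_slice.h5ad", backed="r")  # placeholder path

try:
    # PyTorch DataLoader workers need to pickle the dataset when num_workers > 0;
    # this is the step that fails for h5ad-backed loaders.
    pickle.dumps(adata)
except Exception as err:
    print(f"Cannot pickle backed AnnData: {err}")
```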

Tier 1: Raw Data Loading Comparison

Raw data loading performance measures the base throughput of each system without any tokenization overhead.

| System | Throughput (cells/sec) |
| --- | --- |
| SLAF | 26,816 |
| scDataset | 10,849 |
| AnnDataLoader | 408 |
| AnnLoader | 224 |

SOTA Performance

SLAF achieves 2.5x higher throughput than scDataset and 66x higher throughput than AnnDataLoader in raw data loading.

scDataset Performance Analysis

Our comprehensive benchmarks reveal that scDataset can achieve excellent performance with proper parameter tuning. We observed 10,849 cells/sec with optimized parameters, 5.4x higher than the ~2,000 cells/sec reported in the paper, even without multiprocessing. Note, though, that these results come from entirely different systems (Apple M1 Max vs. an NVIDIA DGX CPU).

However, we found significant limitations with multiprocessing due to pickling issues with h5py-backed AnnData objects. See our detailed scDataset benchmarks for complete analysis including parameter scaling and multiprocessing limitations.

Parameter Scaling Validation

Our parameter sweeps confirm scDataset's strong scaling behavior: 23.1x improvement from worst to best configuration. The fetch_factor parameter shows the strongest scaling (20x+ improvement), while block_size shows more moderate effects. This validates the design approach described in their paper, though optimal parameters may vary by hardware.

Multiprocessing Limitations

We were unable to test num_workers > 0 due to pickling errors with h5py objects. We're still working with the scDataset team to figure out implementation differences.

Tier 2: GPU-Ready Output Comparison

Raw data loading benchmarks are useful if we intend to train on gene expression counts directly. However, for modern foundation models like Geneformer, scGPT, Transcriptformer, or STATE, cell sentences are constructed from tokens that represent gene identity and expression bins. Many of these workflows require dataframe-friendly operations like sorting, windowing, ranking, and filtering. Our view is that it is much better to situate these computations within the (typically) CPU-bound dataloader than to expect the GPU in the training loop to do the heavy lifting. Accordingly, SLAF dataloaders take a tokenizer and transform raw data into training-ready token sequences.

This GPU-ready throughput measures end-to-end performance including tokenization (windowing, ranking, vocabulary mapping, and padding), which is what matters for training workflows whose models turn cells into sentences.
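
To make those dataframe operations concrete, here is an illustrative Geneformer-style sketch (not SLAF's actual tokenizer): rank genes within each cell by expression, keep the top max_len, map gene IDs through a vocabulary, and pad. Column names are assumptions.

```python
import polars as pl

def rank_tokenize(batch: pl.DataFrame, vocab: dict[int, int],
                  max_len: int = 2048, pad_id: int = 0) -> list[list[int]]:
    """Geneformer-style sketch: per cell, rank genes by expression and emit padded token IDs.

    Assumes a raw batch with cell_id / gene_id / value columns (names are illustrative).
    """
    ranked = (
        batch
        .sort(["cell_id", "value"], descending=[False, True])  # rank genes within each cell
        .group_by("cell_id", maintain_order=True)
        .agg(pl.col("gene_id").head(max_len))                  # window: keep top max_len genes
    )
    sequences = []
    for genes in ranked["gene_id"].to_list():
        tokens = [vocab[g] for g in genes if g in vocab]        # vocabulary mapping
        tokens += [pad_id] * (max_len - len(tokens))            # pad to fixed length
        sequences.append(tokens)
    return sequences
```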

Even though SLAF's tokenizing dataloaders do more work, we find that their throughput exceeds scDataset's raw-data dataloader by 1.8x.

| System | Throughput (cells/sec) | Throughput (tokens/sec) |
| --- | --- | --- |
| SLAF | 9,607 | 19,675,243 |

GPU-Ready Cell Sentences

SLAF dataloaders provide the only GPU-ready input among the available alternatives.

Some Discrepancies

Tokens/sec is better than Cells/sec

Cells/sec can be a misleading metric for GPU training workloads. A better measure of throughput for minimizing GPU idle time is tokens/sec, since pre-made token sequences are ready for matrix multiplication on the GPU. SLAF's tokenized mode demonstrates this principle: while cells/sec decreases relative to raw mode because of tokenization overhead, the constant tokens/sec across batch sizes shows that the tokenization pipeline is well optimized across scales.

Scaling behaviors reveal hidden optimization opportunities

SLAF's constant scaling with batch size suggests that the loading and processing are impedance matched: loading more data per batch does not slow down throughput. Constant dataloader throughput for larger batch sizes implies that the bottleneck to batch size is not dataloading but GPU memory. In contrast, we observed that scDataset's throughput scales linearly with batch size (not shown in these results), suggesting that it is doing more work than needed at small batch sizes, and could achieve better performance with optimizations like async prefetching.

In-memory formats matter

The performance difference between AnnDataLoader (392 cells/sec) and scDataset (10,785 cells/sec) is dramatic. scDataset is smarter about batching and randomization, but since our benchmark loads from h5ad in every case, it is also important to compare dataloader outputs apples to apples: AnnDataLoader and AnnLoader return torch.sparse_csr tensors, whereas scDataset returns scipy.sparse.csr_matrix objects.

In our own work, we observed different overheads when converting from Polars DataFrames (SLAF's preferred format for raw data) to torch and scipy sparse formats, and ultimately decided to keep raw outputs in Polars. The gap between AnnLoader/AnnDataLoader and scDataset is almost certainly due to the overhead of converting scipy sparse arrays to torch tensors, and it is worth benchmarking more carefully to identify low-hanging fruit for optimization in both AnnLoader and AnnDataLoader.
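
As a rough way to probe that conversion overhead, the sketch below times a scipy CSR to torch sparse CSR conversion for a single batch-sized matrix. The shape and density are arbitrary choices for illustration, not the benchmark configuration.

```python
import time
import numpy as np
import torch
from scipy import sparse

# A batch-sized sparse matrix (64 cells × 62,710 genes, arbitrary density).
X = sparse.random(64, 62_710, density=0.02, format="csr", dtype=np.float32)

start = time.perf_counter()
X_torch = torch.sparse_csr_tensor(
    torch.from_numpy(X.indptr).to(torch.int64),
    torch.from_numpy(X.indices).to(torch.int64),
    torch.from_numpy(X.data),
    size=X.shape,
)
print(f"scipy CSR -> torch sparse CSR: {(time.perf_counter() - start) * 1e3:.2f} ms")
```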

Conclusion

Cloud-Native Architecture

While these benchmarks use local SSD, the Lance format is native to cloud storage. Early tests suggest that latency between S3 and EC2 in the same region is not appreciably different from local storage. This opens up a cloud-native, store-once, query-multiple-times zero-copy architecture that eliminates data duplication.

GPU Throughput Requirements

What's a good enough cells/sec rate to keep an 8×H100 node (at roughly $2 per GPU-hour) busy? Assuming 50 ms per training loop for a model like scGPT (the arithmetic is spelled out in the sketch after this list):

  • 8 GPUs × 32 cells/batch × 20 batches/sec = 5,120 cells/sec would maximize GPU utilization
  • Tahoe-100M training: 100M cells ÷ 5,120 cells/sec = ~5.4 hours per epoch ~ $86 / epoch
  • Anything faster than 5,120 cells/sec opens up multi-node training possibilities, trading off cost and wall clock time.
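
The same arithmetic as a small script, with the assumptions ($2 per GPU-hour, 50 ms per step, 32 cells per GPU per batch) made explicit:

```python
# The back-of-the-envelope numbers above, spelled out.
gpus = 8
cells_per_batch = 32          # per GPU
steps_per_sec = 1 / 0.050     # 50 ms per training loop -> 20 batches/sec
cost_per_gpu_hour = 2.0       # assumed $ per GPU-hour

required_cells_per_sec = gpus * cells_per_batch * steps_per_sec    # 5,120 cells/sec
epoch_hours = 100e6 / required_cells_per_sec / 3600                # ~5.4 hours per epoch
epoch_cost = epoch_hours * gpus * cost_per_gpu_hour                # ~$86-87 per epoch

print(f"{required_cells_per_sec:,.0f} cells/sec, {epoch_hours:.1f} h/epoch, ${epoch_cost:.0f}/epoch")
```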

This raises the question: can we build towards a $100 scGPT model through efficient multi-node training enabled by high-throughput data loading? More on this soon!

High Concurrency and Multi-User Training

The Lance format's high concurrency, optimized for production multimodal data lakes with high QPS, enables not only multi-node training but multiple users training multiple models simultaneously without their own copies of the dataset. This contrasts with h5ad, which requires:

  1. Local storage: The dataset must be local to the CPU instance loading it for attached GPUs
  2. Non-concurrent access: One copy of the dataset per user

SLAF, with Lance under the hood, enables a truly scalable architecture for foundation model training on massive single-cell datasets.


These benchmarks demonstrate SLAF's position as the leading solution for high-performance single-cell data loading, enabling efficient training of foundation models on massive datasets with minimal resource requirements and maximum scalability.

SLAF is a young project with a bus factor of 1. You can help improve that by using it and contributing to it. Read about the SLAF vision in this blog post and contribute at github.com/slaf-project/slaf.