scDataset Benchmarks¶

This document reports comprehensive benchmark results for scDataset, a high-performance dataloader for single-cell data. These benchmarks were conducted using the Tahoe-100M dataset to evaluate scDataset's performance characteristics and scaling behavior.

Dataset and Hardware¶

Dataset: Tahoe-100M¶

Cells: 5,481,420 cells
Genes: 62,710 genes
Format: h5ad (backed mode)
Batch Size: 64 cells per batch

Hardware Configuration¶

Machine: Apple MacBook Pro with M1 Max
Memory: 32 GB RAM
Storage: 1 TB NVMe SSD
OS: macOS 13.6.1

Parameter Scaling Benchmarks¶

Methodology¶

We tested scDataset performance across different combinations of block_size and fetch_factor parameters, which are key to scDataset's performance according to their paper. All tests used:

Single worker (num_workers=0) to isolate parameter effects
10-second measurement duration
Module-level callback to avoid pickling issues
Backed AnnData to match real-world usage

Results Summary¶

Parameter Range	Best Performance	Best Configuration	Scaling Factor
`block_size`: [1, 2, 4, 8, 16, 32, 64]	10,976 cells/sec	`block_size=8, fetch_factor=64`	23.1x
`fetch_factor`: [1, 2, 4, 8, 16, 32, 64]

Key Findings¶

1. Strong Scaling with fetch_factor¶

Minimal fetch_factor (1): ~500 cells/sec
Optimal fetch_factor (64): ~10,000+ cells/sec
Scaling factor: 20x+ improvement with higher fetch_factor

2. Moderate Scaling with block_size¶

Small block_size (1-4): Good performance with high fetch_factor
Medium block_size (8-16): Optimal performance
Large block_size (32-64): Slightly reduced performance

Detailed Results Table¶

Block Size	Fetch Factor	Throughput (cells/sec)
1	1	474
1	64	10,093
4	16	5,103
8	64	10,976
16	64	10,484
32	64	9,382
64	64	10,159

Parameter Optimization

The optimal configuration for scDataset on our hardware is block_size=8, fetch_factor=64, achieving 10,976 cells/sec.

Multiprocessing Benchmarks¶

Methodology¶

We tested scDataset's multiprocessing capabilities using the optimal parameters (block_size=4, fetch_factor=16) and varying num_workers values.

Results¶

Num Workers	Throughput (cells/sec)	Status
0	5,363	✅ Success
1	N/A	❌ Pickling Error
2	N/A	❌ Pickling Error
4	N/A	❌ Pickling Error
8	N/A	❌ Pickling Error

Key Findings¶

1. Multiprocessing Limitations¶

Single worker only: scDataset works reliably with num_workers=0
Pickling errors: All multiprocessing attempts failed with "h5py objects cannot be pickled"
Callback issues: Module-level callbacks didn't resolve the pickling problem

2. Performance Comparison¶

Single worker: 5,363 cells/sec with optimal parameters
No multiprocessing scaling: Unable to test due to pickling limitations

Multiprocessing Limitation

scDataset's multiprocessing capabilities are limited by pickling issues with h5py-backed AnnData objects. This prevents the scaling benefits reported in their paper.

Comparison with Paper Results¶

Reported vs Observed Performance¶

Metric	Paper Claim	Our Results	Difference
Best throughput	~2,000 cells/sec	10,976 cells/sec	5.5x higher
Parameter scaling	Significant	23.1x scaling	Matches
Multiprocessing	12 workers	0 workers only	Limited

Possible Explanations¶

Hardware differences: Our M1 Max may be faster than the paper's hardware
Dataset differences: Different datasets may have different characteristics
Implementation differences: Different scDataset versions or configurations
Measurement methodology: Different benchmark setups

Performance Validation

Our results show that scDataset can achieve excellent performance with proper parameter tuning, even exceeding the paper's reported numbers on modern hardware.

Technical Challenges¶

1. Pickling Issues¶

Problem: h5py objects cannot be pickled for multiprocessing
Impact: Prevents multiprocessing scaling
Workaround: Use single worker with optimized parameters

2. Parameter Sensitivity¶

Problem: Performance varies dramatically with parameters
Impact: Requires careful tuning for optimal performance
Solution: Systematic parameter sweeps

Recommendations¶

For scDataset Users¶

Parameter Tuning: Always test different block_size and fetch_factor combinations
Single Worker: Use num_workers=0 to avoid pickling issues
Hardware Testing: Test on your specific hardware for optimal parameters

For Developers¶

Pickling Fix: Address h5py pickling issues for multiprocessing support
Parameter Documentation: Provide clearer guidance on parameter selection
Benchmark Suite: Include comprehensive benchmark tools

Conclusion¶

scDataset demonstrates excellent performance potential with proper parameter tuning, achieving 10,976 cells/sec in our benchmarks. However, multiprocessing limitations prevent the scaling benefits reported in their paper. The strong parameter scaling validates their design approach, but the pickling issues need to be addressed for broader adoption.

The benchmark results show that scDataset can be a viable high-performance dataloader for single-cell data, but requires careful configuration and has limitations for multiprocessing scenarios.

These benchmarks were conducted using scDataset with backed AnnData objects on the Tahoe-100M dataset. Results may vary with different datasets, hardware configurations, or scDataset versions.