Migrating to SLAF¶
Quick Start¶
Convert your single-cell data to SLAF format with just one command:
# Convert any supported format (auto-detection)
slaf convert data.h5ad output.slaf
slaf convert filtered_feature_bc_matrix/ output.slaf
slaf convert data.h5 output.slaf
slaf convert data.tiledb output.slaf
That's it! SLAF automatically detects your file format and converts it with optimized settings.
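Auto-detection follows the file extension or directory layout. As a rough illustration (the function name and dispatch rules below are assumptions for this sketch, not SLAF's actual implementation), the mapping looks like:

```python
def detect_format(path: str) -> str:
    # Illustrative sketch only -- SLAF's real detection logic may differ.
    if path.endswith(".h5ad"):
        return "h5ad"          # AnnData
    if path.endswith(".h5"):
        return "10x_h5"        # Cell Ranger H5
    if path.rstrip("/").endswith(".tiledb"):
        return "tiledb"        # TileDB SOMA
    if path.endswith("/"):
        return "10x_mtx"       # e.g. filtered_feature_bc_matrix/
    raise ValueError(f"Cannot auto-detect format for: {path}")
```

If auto-detection ever guesses wrong, the `--format` flag (covered under Advanced Options) lets you override it explicitly.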
Multi-File Conversion¶
Convert multiple files to a single SLAF dataset:
# Convert multiple files from a directory
slaf convert data_folder/ output.slaf
# Convert specific files
slaf convert file1.h5ad file2.h5ad file3.h5ad output.slaf
# Auto-detection works for all formats
slaf convert 10x_data_folder/ output.slaf
SLAF automatically:
- ✅ Validates all files are compatible
- ✅ Assigns unique cell IDs across all files
- ✅ Tracks which file each cell came from
- ✅ Combines metadata intelligently
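One way to picture the unique-ID assignment and source tracking (the suffixing scheme below is a hypothetical sketch, not SLAF's documented behavior):

```python
def assign_unique_ids(files_to_barcodes: dict) -> tuple[list, list]:
    """Sketch of per-file cell ID disambiguation plus source tracking.

    files_to_barcodes maps filename -> list of cell barcodes. The same
    barcode can appear in several files, so a per-file suffix keeps IDs
    unique; a parallel list records which file each cell came from.
    """
    ids, sources = [], []
    for file_idx, (fname, barcodes) in enumerate(files_to_barcodes.items()):
        for bc in barcodes:
            ids.append(f"{bc}_{file_idx}")  # make IDs unique across files
            sources.append(fname)           # remember the origin file
    return ids, sources
```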
Appending to Existing Datasets¶
Add new data to an existing SLAF dataset:
# Append a single file
slaf append new_data.h5ad existing.slaf
# Append multiple files from a directory
slaf append new_data_folder/ existing.slaf
# Skip validation if already validated (faster)
slaf append new_data.h5ad existing.slaf --skip-validation
Perfect for:
- Incremental data collection - Add new batches as they arrive
- Data updates - Append new samples to existing datasets
- Combining datasets - Merge related datasets over time
Supported Formats¶
SLAF supports conversion from these common single-cell formats:
- AnnData (.h5ad files) - The standard single-cell format
- 10x MTX (filtered_feature_bc_matrix directories) - Cell Ranger output
- 10x H5 (.h5 files) - Cell Ranger H5 output format
- TileDB SOMA (.tiledb directories) - High-performance single-cell format
Python API¶
from slaf.data import SLAFConverter
# Basic conversion (auto-detection)
converter = SLAFConverter()
converter.convert("data.h5ad", "output.slaf")
converter.convert("filtered_feature_bc_matrix/", "output.slaf")
converter.convert("data.h5", "output.slaf")
converter.convert("data.tiledb", "output.slaf")
# Multi-file conversion
converter.convert("data_folder/", "output.slaf") # Directory of files
converter.convert(["file1.h5ad", "file2.h5ad"], "output.slaf") # List of files
# Append to existing dataset
converter.append("new_data.h5ad", "existing.slaf")
converter.append("new_data_folder/", "existing.slaf")
# Convert existing AnnData object
import scanpy as sc
adata = sc.read_h5ad("data.h5ad")
converter.convert_anndata(adata, "output.slaf")
Large Datasets¶
For large datasets (>100k cells), you can optimize performance:
# Use larger chunks for speed (if you have enough RAM)
slaf convert large_data.h5ad output.slaf --chunk-size 100000
# Create indices for faster queries
slaf convert large_data.h5ad output.slaf --create-indices
# Python API for large datasets
converter = SLAFConverter(chunk_size=100000, create_indices=True)
converter.convert("large_data.h5ad", "output.slaf")
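When picking a chunk size, a back-of-envelope memory estimate helps. The sketch below assumes sparse COO storage (one value plus two indices per nonzero); the constants and the function itself are illustrative, not SLAF's actual memory model:

```python
def chunk_memory_mb(chunk_size: int, n_genes: int,
                    bytes_per_value: int = 4, density: float = 0.1) -> float:
    """Rough peak-RAM estimate (MB) for one chunk of cells."""
    nonzeros = chunk_size * n_genes * density     # expected nonzero count
    bytes_total = nonzeros * 3 * bytes_per_value  # value + 2 indices each
    return bytes_total / 1e6

# ~2.4 GB for a 100k-cell chunk x 20k genes at 10% density
print(chunk_memory_mb(100_000, 20_000))
```

If that estimate exceeds your available RAM, drop `chunk_size` until it fits comfortably.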
Validation and Quality Control¶
The `slaf validate-input-files` command helps you catch compatibility issues before conversion:
Basic Validation¶
# Validate a single file
slaf validate-input-files data.h5ad
# Validate multiple files from a directory
slaf validate-input-files data_folder/
# Validate specific files
slaf validate-input-files file1.h5ad file2.h5ad file3.h5ad
Verbose Output¶
# Get detailed information about files being validated
slaf validate-input-files data_folder/ --verbose
# Output shows:
# 📁 Found 3 h5ad files
# 1. batch_001.h5ad
# 2. batch_002.h5ad
# 3. batch_003.h5ad
# ✅ All files are compatible for conversion
Format-Specific Validation¶
# Validate 10x MTX directories
slaf validate-input-files filtered_feature_bc_matrix/ --format 10x_mtx
# Validate 10x H5 files
slaf validate-input-files data.h5 --format 10x_h5
# Validate TileDB SOMA files
slaf validate-input-files experiment.tiledb --format tiledb
What Validation Checks¶
The validation command performs comprehensive compatibility checks:
- ✅ File Integrity: Files exist, are readable, and not empty
- ✅ Format Consistency: All files use the same format (h5ad, 10x_mtx, etc.)
- ✅ Gene Compatibility: All files have identical gene sets
- ✅ Metadata Schema: Cell metadata columns are compatible across files
- ✅ Value Types: Expression data types are consistent (uint16, float32, etc.)
- ✅ File Sizes: Ensures files contain actual data (not empty)
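The gene-compatibility check, for instance, amounts to a set comparison against the first (reference) file. A minimal sketch (the function name is assumed for illustration, not SLAF's API):

```python
def check_gene_compatibility(reference_genes: list, other_genes: list) -> tuple:
    """Compare a file's gene set against the reference file's gene set.

    Returns (missing, extra): genes the other file lacks, and genes
    only the other file has. Both empty means the files are compatible.
    """
    ref, other = set(reference_genes), set(other_genes)
    missing = sorted(ref - other)  # in reference but not in other file
    extra = sorted(other - ref)    # in other file but not in reference
    return missing, extra
```

Non-empty `missing` or `extra` lists are exactly what surfaces in the gene-mismatch error messages shown below.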
Common Validation Scenarios¶
# Validate before multi-file conversion
slaf validate-input-files batch1/ batch2/ batch3/
slaf convert batch1/ output.slaf # Safe to proceed
# Validate before appending
slaf validate-input-files new_batch/
slaf append new_batch/ existing.slaf  # Safe to proceed
# Check specific format compatibility
slaf validate-input-files 10x_data/ --format 10x_mtx
Error Examples¶
When validation fails, you get clear error messages:
# Gene mismatch error
❌ Validation failed: File batch_002.h5ad is incompatible:
Missing genes: GENE_001, GENE_002, GENE_003
Extra genes: GENE_999, GENE_1000
# Schema mismatch error
❌ Validation failed: File batch_003.h5ad has incompatible cell metadata schema:
Missing columns: ['cell_type', 'batch']
Extra columns: ['cluster_id']
# Format mismatch error
❌ Validation failed: Multiple formats detected in directory
Found: h5ad, 10x_mtx
All files must use the same format
Integration with Conversion¶
Validation runs automatically during conversion, but you can skip it for performance:
# Automatic validation (default)
slaf convert data_folder/ output.slaf
# Skip validation (faster, but less safe)
slaf convert data_folder/ output.slaf --skip-validation
slaf append new_data.h5ad existing.slaf --skip-validation
Advanced Options¶
Most users won't need these options, but they're available when the defaults aren't enough:
CLI Options¶
# Specify format explicitly (if auto-detection fails)
slaf convert data.h5 output.slaf --format 10x_h5
slaf convert data.tiledb output.slaf --format tiledb
# Use non-chunked processing (not recommended for large datasets)
slaf convert small_data.h5ad output.slaf --no-chunked
# Disable storage optimization (larger files but includes string IDs)
slaf convert data.h5ad output.slaf --no-optimize-storage
# Verbose output
slaf convert data.h5ad output.slaf --verbose
# Skip validation (if already validated)
slaf convert data_folder/ output.slaf --skip-validation
slaf append new_data.h5ad existing.slaf --skip-validation
# TileDB-specific options
slaf convert data.tiledb output.slaf --tiledb-collection RNA
Python API Options¶
# Custom settings
converter = SLAFConverter(
chunk_size=50000, # Cells per chunk
create_indices=True, # Faster queries
optimize_storage=True, # Smaller files (default)
use_optimized_dtypes=True, # Better compression (default)
tiledb_collection_name="RNA", # TileDB collection name (default: "RNA")
)
converter.convert("data.h5ad", "output.slaf")
converter.convert("data.tiledb", "output.slaf")
TileDB SOMA Conversion¶
SLAF provides excellent support for TileDB SOMA format, which is increasingly popular for large-scale single-cell datasets:
Basic TileDB Conversion¶
# Auto-detect TileDB format
slaf convert experiment.tiledb output.slaf
# Specify collection name (default: "RNA")
slaf convert experiment.tiledb output.slaf --tiledb-collection RNA
Python API for TileDB¶
from slaf.data import SLAFConverter
# Basic TileDB conversion
converter = SLAFConverter()
converter.convert("experiment.tiledb", "output.slaf")
# With custom collection name
converter = SLAFConverter(tiledb_collection_name="RNA")
converter.convert("experiment.tiledb", "output.slaf")
TileDB Benefits¶
- Memory Efficient: TileDB's chunked storage works seamlessly with SLAF's chunked processing
- Large Datasets: Optimized for datasets with millions of cells
- Data Preservation: Maintains exact floating-point precision from TileDB
- Fast Conversion: Leverages TileDB's efficient data access patterns
What SLAF Does¶
SLAF converts your data to an optimized format that:
- Enables fast SQL queries on your data
- Works with any size dataset (memory-efficient processing)
- Preserves all metadata including:
  - Cell and gene annotations (`obs` and `var` columns)
  - Alternative expression matrices (`layers` like `spliced`, `unspliced`, `counts`)
  - Multi-dimensional arrays (`obsm` like UMAP coordinates, PCA embeddings)
  - Gene-level embeddings (`varm` like PCA loadings)
  - Unstructured metadata (`uns` like analysis parameters)
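In AnnData terms, the preserved slots are `obs`, `var`, `layers`, `obsm`, `varm`, and `uns`. The selection can be pictured over an AnnData-like mapping (the dict representation and function are illustrative, not SLAF internals):

```python
def collect_preserved_metadata(adata_like: dict) -> dict:
    """Sketch of gathering the metadata slots listed above.

    `adata_like` stands in for an AnnData object; the slot names follow
    AnnData's attributes, but the copying logic here is a sketch only.
    """
    slots = ("obs", "var", "layers", "obsm", "varm", "uns")
    return {slot: adata_like[slot] for slot in slots if slot in adata_like}
```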
Workflow Examples¶
Multi-File Workflow¶
# 1. Validate files first (recommended)
slaf validate-input-files batch1/ batch2/ batch3/
# 2. Convert all batches to single SLAF
slaf convert batch1/ initial.slaf
# 3. Append additional batches
slaf append batch2/ initial.slaf
slaf append batch3/ initial.slaf
# 4. Explore the combined dataset
slaf info initial.slaf
Incremental Data Collection¶
# Start with first batch
slaf convert batch_001/ dataset.slaf
# Add new batches as they arrive
slaf append batch_002/ dataset.slaf
slaf append batch_003/ dataset.slaf
slaf append batch_004/ dataset.slaf
# Each append maintains data integrity and source tracking
Quality Control Workflow¶
# 1. Validate all files before conversion
slaf validate-input-files all_batches/
# 2. Convert with validation (automatic)
slaf convert all_batches/ combined.slaf
# 3. Check source file tracking
slaf query combined.slaf "SELECT source_file, COUNT(*) FROM cells GROUP BY source_file"
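For intuition, that GROUP BY query is equivalent to counting a per-cell `source_file` column, which can be sketched in plain Python (the sample data is hypothetical):

```python
from collections import Counter

# Stand-in for the cells table's source_file column (hypothetical values)
cell_sources = ["batch_001.h5ad"] * 3 + ["batch_002.h5ad"] * 2

# Equivalent of: SELECT source_file, COUNT(*) FROM cells GROUP BY source_file
counts = Counter(cell_sources)
print(dict(counts))
```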
Next Steps¶
After converting your data:
- Explore: `slaf info output.slaf`
- Query: `slaf query output.slaf "SELECT * FROM expression LIMIT 10"`
- Use in Python: `import slaf; data = slaf.SLAFArray("output.slaf")`
- Check Source Files: `slaf query output.slaf "SELECT DISTINCT source_file FROM cells"`
See the Getting Started guide for more examples.