Migrating to SLAF

Quick Start

Convert your single-cell data to SLAF format with just one command:

# Convert any supported format (auto-detection)
slaf convert data.h5ad output.slaf
slaf convert filtered_feature_bc_matrix/ output.slaf
slaf convert data.h5 output.slaf
slaf convert data.tiledb output.slaf

That's it! SLAF automatically detects your file format and converts it with optimized settings.

Multi-File Conversion

Convert multiple files to a single SLAF dataset:

# Convert multiple files from a directory
slaf convert data_folder/ output.slaf

# Convert specific files
slaf convert file1.h5ad file2.h5ad file3.h5ad output.slaf

# Auto-detection works for all formats
slaf convert 10x_data_folder/ output.slaf

SLAF automatically:

  • ✅ Validates all files are compatible
  • ✅ Assigns unique cell IDs across all files
  • ✅ Tracks which file each cell came from
  • ✅ Combines metadata intelligently
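
After conversion, you can verify this bookkeeping yourself, e.g. by counting cells per source file (using the query command shown later in this guide):

# Each cell records which input file it came from
slaf query output.slaf "SELECT source_file, COUNT(*) FROM cells GROUP BY source_file"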

Harmonized vs. non-harmonized .h5ad files

When converting multiple .h5ad files, SLAF expects harmonized data: all files should share the same schema (same .obs, .obsm, and .var columns). If your files are harmonized, you can use slaf convert directly on a directory or list of files.

If your files have different schemas (e.g. some have extra or missing .obs/.obsm/.var columns), conversion may fail or behave unexpectedly. In that case, merge your files into a single .h5ad first using AnnData’s concatenation, then convert the result with SLAF:

  • In memory: use anndata.concat to merge AnnData objects. You can control how missing columns are handled (e.g. join="outer" to fill with missing values).
  • On disk (recommended for large data): use anndata.concat_on_disk to merge .h5ad files without loading everything into memory.
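
For illustration, a minimal sketch of both approaches (file names are placeholders; in some anndata versions, concat_on_disk lives under anndata.experimental):

import anndata as ad

# In-memory merge: join="outer" keeps the union of columns and
# fills in missing values where a file lacks a column
adatas = [ad.read_h5ad(p) for p in ["file1.h5ad", "file2.h5ad"]]
merged = ad.concat(adatas, join="outer")
merged.write_h5ad("merged.h5ad")

# On-disk merge: combines .h5ad files without loading them into memory
from anndata.experimental import concat_on_disk
concat_on_disk(["file1.h5ad", "file2.h5ad"], "merged.h5ad", join="outer")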

After you have a single merged .h5ad, run:

slaf convert merged.h5ad output.slaf

Summary: Harmonized files (exact same schema) → use slaf convert on multiple files. Non-harmonized files → merge with anndata.concat or anndata.concat_on_disk, then slaf convert the merged file.

Appending to Existing Datasets

Add new data to an existing SLAF dataset:

# Append a single file
slaf append new_data.h5ad existing.slaf

# Append multiple files from a directory
slaf append new_data_folder/ existing.slaf

# Skip validation if already validated (faster)
slaf append new_data.h5ad existing.slaf --skip-validation

Perfect for:

  • Incremental data collection - Add new batches as they arrive
  • Data updates - Append new samples to existing datasets
  • Combining datasets - Merge related datasets over time

Supported Formats

SLAF supports conversion from these common single-cell formats:

  • AnnData (.h5ad files) - The standard single-cell format
  • 10x MTX (filtered_feature_bc_matrix directories) - Cell Ranger output
  • 10x H5 (.h5 files) - Cell Ranger H5 output format
  • TileDB SOMA (.tiledb directories) - High-performance single-cell format

Python API

from slaf.data import SLAFConverter

# Basic conversion (auto-detection)
converter = SLAFConverter()
converter.convert("data.h5ad", "output.slaf")
converter.convert("filtered_feature_bc_matrix/", "output.slaf")
converter.convert("data.h5", "output.slaf")
converter.convert("data.tiledb", "output.slaf")

# Multi-file conversion
converter.convert("data_folder/", "output.slaf")  # Directory of files
converter.convert(["file1.h5ad", "file2.h5ad"], "output.slaf")  # List of files

# Append to existing dataset
converter.append("new_data.h5ad", "existing.slaf")
converter.append("new_data_folder/", "existing.slaf")

# Convert existing AnnData object
import scanpy as sc
adata = sc.read_h5ad("data.h5ad")
converter.convert_anndata(adata, "output.slaf")

Large Datasets

For large datasets (>100k cells), you can optimize performance:

# Use larger chunks for speed (if you have enough RAM)
slaf convert large_data.h5ad output.slaf --chunk-size 100000

# Create indices for faster queries
slaf convert large_data.h5ad output.slaf --create-indices

# Control fragment size (important for HuggingFace uploads)
# Default: 100M rows per fragment (stays under HuggingFace's 10k file limit)
slaf convert large_data.h5ad output.slaf --max-rows-per-file 100000000

# Python API for large datasets
converter = SLAFConverter(
    chunk_size=100000,
    create_indices=True,
    max_rows_per_file=100_000_000,  # Default: 100M (stays under HuggingFace's 10k limit)
)
converter.convert("large_data.h5ad", "output.slaf")

Validation and Quality Control

The slaf validate-input-files command helps you catch compatibility issues before conversion:

Basic Validation

# Validate a single file
slaf validate-input-files data.h5ad

# Validate multiple files from a directory
slaf validate-input-files data_folder/

# Validate specific files
slaf validate-input-files file1.h5ad file2.h5ad file3.h5ad

Verbose Output

# Get detailed information about files being validated
slaf validate-input-files data_folder/ --verbose

# Output shows:
# 📁 Found 3 h5ad files
#   1. batch_001.h5ad
#   2. batch_002.h5ad
#   3. batch_003.h5ad
# ✅ All files are compatible for conversion

Format-Specific Validation

# Validate 10x MTX directories
slaf validate-input-files filtered_feature_bc_matrix/ --format 10x_mtx

# Validate 10x H5 files
slaf validate-input-files data.h5 --format 10x_h5

# Validate TileDB SOMA files
slaf validate-input-files experiment.tiledb --format tiledb

What Validation Checks

The validation command performs comprehensive compatibility checks:

  • File Integrity: Files exist, are readable, and not empty
  • Format Consistency: All files use the same format (h5ad, 10x_mtx, etc.)
  • Gene Compatibility: All files have identical gene sets
  • Metadata Schema: Cell metadata columns are compatible across files (same .obs/.obsm/.var schema). If your files have different schemas, merge them first with anndata.concat_on_disk() or anndata.concat(), then convert the single .h5ad—see Harmonized vs. non-harmonized .h5ad files.
  • Value Types: Expression data types are consistent (uint16, float32, etc.)
  • File Sizes: Ensures files contain actual data (not empty)
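
If you want to inspect mismatches yourself before running the CLI, here is a minimal do-it-yourself sketch in Python (using anndata's backed mode so full matrices are never loaded; it approximates only the gene and metadata-schema checks above):

import anndata as ad

paths = ["batch_001.h5ad", "batch_002.h5ad", "batch_003.h5ad"]

# Treat the first file as the reference schema
ref = ad.read_h5ad(paths[0], backed="r")
ref_genes = set(ref.var_names)
ref_obs_cols = set(ref.obs.columns)

for path in paths[1:]:
    adata = ad.read_h5ad(path, backed="r")
    missing = ref_genes - set(adata.var_names)
    extra = set(adata.var_names) - ref_genes
    if missing or extra:
        print(f"{path}: {len(missing)} missing genes, {len(extra)} extra genes")
    schema_diff = ref_obs_cols.symmetric_difference(adata.obs.columns)
    if schema_diff:
        print(f"{path}: obs schema differs: {sorted(schema_diff)}")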

Common Validation Scenarios

# Validate before multi-file conversion
slaf validate-input-files batch1/ batch2/ batch3/
slaf convert batch1/ output.slaf  # Safe to proceed

# Validate before appending
slaf validate-input-files new_batch/
slaf append new_batch/ existing.slaf  # Safe to proceed

# Check specific format compatibility
slaf validate-input-files 10x_data/ --format 10x_mtx

Error Examples

When validation fails, you get clear error messages:

# Gene mismatch error
❌ Validation failed: File batch_002.h5ad is incompatible:
  Missing genes: GENE_001, GENE_002, GENE_003
  Extra genes: GENE_999, GENE_1000

# Schema mismatch error
❌ Validation failed: File batch_003.h5ad has incompatible cell metadata schema:
  Missing columns: ['cell_type', 'batch']
  Extra columns: ['cluster_id']
# Tip: Merge non-harmonized files first with anndata.concat_on_disk() or anndata.concat(), then convert the single .h5ad.

# Format mismatch error
❌ Validation failed: Multiple formats detected in directory
  Found: h5ad, 10x_mtx
  All files must use the same format

Integration with Conversion

Validation runs automatically during conversion, but you can skip it for performance:

# Automatic validation (default)
slaf convert data_folder/ output.slaf

# Skip validation (faster, but less safe)
slaf convert data_folder/ output.slaf --skip-validation
slaf append new_data.h5ad existing.slaf --skip-validation

Advanced Options

Most users won't need these options, but they're available when you do:

CLI Options

# Specify format explicitly (if auto-detection fails)
slaf convert data.h5 output.slaf --format 10x_h5
slaf convert data.tiledb output.slaf --format tiledb

# Use non-chunked processing (not recommended for large datasets)
slaf convert small_data.h5ad output.slaf --no-chunked

# Disable storage optimization (larger files but includes string IDs)
slaf convert data.h5ad output.slaf --no-optimize-storage

# Verbose output
slaf convert data.h5ad output.slaf --verbose

# Skip validation (if already validated)
slaf convert data_folder/ output.slaf --skip-validation
slaf append new_data.h5ad existing.slaf --skip-validation

# Control fragment size (for HuggingFace uploads or large datasets)
# Default: 100M rows per fragment (stays under HuggingFace's 10k file limit)
slaf convert large_data.h5ad output.slaf --max-rows-per-file 200000000

# TileDB-specific options
slaf convert data.tiledb output.slaf --tiledb-collection RNA

Python API Options

# Custom settings
converter = SLAFConverter(
    chunk_size=50000,           # Cells per chunk
    create_indices=True,        # Faster queries
    optimize_storage=True,      # Smaller files (default)
    use_optimized_dtypes=True,  # Better compression (default)
    tiledb_collection_name="RNA",  # TileDB collection name (default: "RNA")
    max_rows_per_file=100_000_000,  # Max rows per fragment (default: 100M)
                                    # Increase for fewer fragments (e.g., for HuggingFace)
)

converter.convert("data.h5ad", "output.slaf")
converter.convert("data.tiledb", "output.slaf")

TileDB SOMA Conversion

SLAF provides full support for the TileDB SOMA format, which is increasingly popular for large-scale single-cell datasets:

Basic TileDB Conversion

# Auto-detect TileDB format
slaf convert experiment.tiledb output.slaf

# Specify collection name (default: "RNA")
slaf convert experiment.tiledb output.slaf --tiledb-collection RNA

Python API for TileDB

from slaf.data import SLAFConverter

# Basic TileDB conversion
converter = SLAFConverter()
converter.convert("experiment.tiledb", "output.slaf")

# With custom collection name
converter = SLAFConverter(tiledb_collection_name="RNA")
converter.convert("experiment.tiledb", "output.slaf")

TileDB Benefits

  • Memory Efficient: TileDB's chunked storage works seamlessly with SLAF's chunked processing
  • Large Datasets: Optimized for datasets with millions of cells
  • Data Preservation: Maintains exact floating-point precision from TileDB
  • Fast Conversion: Leverages TileDB's efficient data access patterns

What SLAF Does

SLAF converts your data to an optimized format that:

  • Enables fast SQL queries on your data
  • Works with any size dataset (memory-efficient processing)
  • Preserves all metadata including:
      • Cell and gene annotations (obs and var columns)
      • Alternative expression matrices (layers like spliced, unspliced, counts)
      • Multi-dimensional arrays (obsm like UMAP coordinates, PCA embeddings)
      • Gene-level embeddings (varm like PCA loadings)
      • Unstructured metadata (uns like analysis parameters)
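
Preserved cell metadata is then directly queryable. For example, assuming your original data had a cell_type column in obs:

# Assumes obs columns (like cell_type) are exposed on the cells table,
# as source_file is in the examples above
slaf query output.slaf "SELECT cell_type, COUNT(*) FROM cells GROUP BY cell_type"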

Workflow Examples

Multi-File Workflow

# 1. Validate files first (recommended)
slaf validate-input-files batch1/ batch2/ batch3/

# 2. Convert all batches to single SLAF
slaf convert batch1/ initial.slaf

# 3. Append additional batches
slaf append batch2/ initial.slaf
slaf append batch3/ initial.slaf

# 4. Explore the combined dataset
slaf info initial.slaf

Incremental Data Collection

# Start with first batch
slaf convert batch_001/ dataset.slaf

# Add new batches as they arrive
slaf append batch_002/ dataset.slaf
slaf append batch_003/ dataset.slaf
slaf append batch_004/ dataset.slaf

# Each append maintains data integrity and source tracking

Quality Control Workflow

# 1. Validate all files before conversion
slaf validate-input-files all_batches/

# 2. Convert with validation (automatic)
slaf convert all_batches/ combined.slaf

# 3. Check source file tracking
slaf query combined.slaf "SELECT source_file, COUNT(*) FROM cells GROUP BY source_file"

Next Steps

After converting your data:

  1. Explore: slaf info output.slaf
  2. Query: slaf query output.slaf "SELECT * FROM expression LIMIT 10"
  3. Use in Python: import slaf; data = slaf.SLAFArray("output.slaf")
  4. Check Source Files: slaf query output.slaf "SELECT DISTINCT source_file FROM cells"

See the Getting Started guide for more examples.