Skip to content

Core API

Core SLAF functionality including SLAFArray and query optimization.

SLAFArray

slaf.core.slaf.SLAFArray

High-performance single-cell data storage and querying format.

SLAFArray provides SQL-native access to single-cell data with lazy evaluation. Data is stored in a relational format with three main tables: cells, genes, and expression. The class enables direct SQL queries, efficient filtering, and seamless integration with the single-cell analysis ecosystem.

Key Features
  • SQL-native querying with Polars integration
  • Lazy evaluation for memory efficiency
  • Direct access to cell and gene metadata
  • High-performance storage with Lance format
  • Scanpy/AnnData compatibility

Examples:

>>> # Load a SLAF dataset
>>> slaf_array = SLAFArray("path/to/data.slaf")
>>> print(f"Dataset shape: {slaf_array.shape}")
Dataset shape: (1000, 20000)
>>> # Access metadata
>>> print(f"Cell metadata columns: {list(slaf_array.obs.columns)}")
Cell metadata columns: ['cell_type', 'total_counts', 'batch']
>>> print(f"Gene metadata columns: {list(slaf_array.var.columns)}")
Gene metadata columns: ['gene_type', 'chromosome']
>>> # Filter cells by metadata
>>> t_cells = slaf_array.filter_cells(cell_type="T cells")
>>> print(f"Found {len(t_cells)} T cells")
Found 250 T cells
>>> # Execute SQL query
>>> results = slaf_array.query("
...     SELECT cell_type, AVG(total_counts) as avg_counts
...     FROM cells
...     GROUP BY cell_type
...     ORDER BY avg_counts DESC
... ")
>>> print(results)
   cell_type  avg_counts
0  T cells      1250.5
1  B cells      1100.2
2  Monocytes     950.8
>>> # Get expression data
>>> expression = slaf_array.get_cell_expression(["cell_001", "cell_002"])
>>> print(f"Expression matrix shape: {expression.shape}")
Expression matrix shape: (2, 20000)
Source code in slaf/core/slaf.py
  23
  24
  25
  26
  27
  28
  29
  30
  31
  32
  33
  34
  35
  36
  37
  38
  39
  40
  41
  42
  43
  44
  45
  46
  47
  48
  49
  50
  51
  52
  53
  54
  55
  56
  57
  58
  59
  60
  61
  62
  63
  64
  65
  66
  67
  68
  69
  70
  71
  72
  73
  74
  75
  76
  77
  78
  79
  80
  81
  82
  83
  84
  85
  86
  87
  88
  89
  90
  91
  92
  93
  94
  95
  96
  97
  98
  99
 100
 101
 102
 103
 104
 105
 106
 107
 108
 109
 110
 111
 112
 113
 114
 115
 116
 117
 118
 119
 120
 121
 122
 123
 124
 125
 126
 127
 128
 129
 130
 131
 132
 133
 134
 135
 136
 137
 138
 139
 140
 141
 142
 143
 144
 145
 146
 147
 148
 149
 150
 151
 152
 153
 154
 155
 156
 157
 158
 159
 160
 161
 162
 163
 164
 165
 166
 167
 168
 169
 170
 171
 172
 173
 174
 175
 176
 177
 178
 179
 180
 181
 182
 183
 184
 185
 186
 187
 188
 189
 190
 191
 192
 193
 194
 195
 196
 197
 198
 199
 200
 201
 202
 203
 204
 205
 206
 207
 208
 209
 210
 211
 212
 213
 214
 215
 216
 217
 218
 219
 220
 221
 222
 223
 224
 225
 226
 227
 228
 229
 230
 231
 232
 233
 234
 235
 236
 237
 238
 239
 240
 241
 242
 243
 244
 245
 246
 247
 248
 249
 250
 251
 252
 253
 254
 255
 256
 257
 258
 259
 260
 261
 262
 263
 264
 265
 266
 267
 268
 269
 270
 271
 272
 273
 274
 275
 276
 277
 278
 279
 280
 281
 282
 283
 284
 285
 286
 287
 288
 289
 290
 291
 292
 293
 294
 295
 296
 297
 298
 299
 300
 301
 302
 303
 304
 305
 306
 307
 308
 309
 310
 311
 312
 313
 314
 315
 316
 317
 318
 319
 320
 321
 322
 323
 324
 325
 326
 327
 328
 329
 330
 331
 332
 333
 334
 335
 336
 337
 338
 339
 340
 341
 342
 343
 344
 345
 346
 347
 348
 349
 350
 351
 352
 353
 354
 355
 356
 357
 358
 359
 360
 361
 362
 363
 364
 365
 366
 367
 368
 369
 370
 371
 372
 373
 374
 375
 376
 377
 378
 379
 380
 381
 382
 383
 384
 385
 386
 387
 388
 389
 390
 391
 392
 393
 394
 395
 396
 397
 398
 399
 400
 401
 402
 403
 404
 405
 406
 407
 408
 409
 410
 411
 412
 413
 414
 415
 416
 417
 418
 419
 420
 421
 422
 423
 424
 425
 426
 427
 428
 429
 430
 431
 432
 433
 434
 435
 436
 437
 438
 439
 440
 441
 442
 443
 444
 445
 446
 447
 448
 449
 450
 451
 452
 453
 454
 455
 456
 457
 458
 459
 460
 461
 462
 463
 464
 465
 466
 467
 468
 469
 470
 471
 472
 473
 474
 475
 476
 477
 478
 479
 480
 481
 482
 483
 484
 485
 486
 487
 488
 489
 490
 491
 492
 493
 494
 495
 496
 497
 498
 499
 500
 501
 502
 503
 504
 505
 506
 507
 508
 509
 510
 511
 512
 513
 514
 515
 516
 517
 518
 519
 520
 521
 522
 523
 524
 525
 526
 527
 528
 529
 530
 531
 532
 533
 534
 535
 536
 537
 538
 539
 540
 541
 542
 543
 544
 545
 546
 547
 548
 549
 550
 551
 552
 553
 554
 555
 556
 557
 558
 559
 560
 561
 562
 563
 564
 565
 566
 567
 568
 569
 570
 571
 572
 573
 574
 575
 576
 577
 578
 579
 580
 581
 582
 583
 584
 585
 586
 587
 588
 589
 590
 591
 592
 593
 594
 595
 596
 597
 598
 599
 600
 601
 602
 603
 604
 605
 606
 607
 608
 609
 610
 611
 612
 613
 614
 615
 616
 617
 618
 619
 620
 621
 622
 623
 624
 625
 626
 627
 628
 629
 630
 631
 632
 633
 634
 635
 636
 637
 638
 639
 640
 641
 642
 643
 644
 645
 646
 647
 648
 649
 650
 651
 652
 653
 654
 655
 656
 657
 658
 659
 660
 661
 662
 663
 664
 665
 666
 667
 668
 669
 670
 671
 672
 673
 674
 675
 676
 677
 678
 679
 680
 681
 682
 683
 684
 685
 686
 687
 688
 689
 690
 691
 692
 693
 694
 695
 696
 697
 698
 699
 700
 701
 702
 703
 704
 705
 706
 707
 708
 709
 710
 711
 712
 713
 714
 715
 716
 717
 718
 719
 720
 721
 722
 723
 724
 725
 726
 727
 728
 729
 730
 731
 732
 733
 734
 735
 736
 737
 738
 739
 740
 741
 742
 743
 744
 745
 746
 747
 748
 749
 750
 751
 752
 753
 754
 755
 756
 757
 758
 759
 760
 761
 762
 763
 764
 765
 766
 767
 768
 769
 770
 771
 772
 773
 774
 775
 776
 777
 778
 779
 780
 781
 782
 783
 784
 785
 786
 787
 788
 789
 790
 791
 792
 793
 794
 795
 796
 797
 798
 799
 800
 801
 802
 803
 804
 805
 806
 807
 808
 809
 810
 811
 812
 813
 814
 815
 816
 817
 818
 819
 820
 821
 822
 823
 824
 825
 826
 827
 828
 829
 830
 831
 832
 833
 834
 835
 836
 837
 838
 839
 840
 841
 842
 843
 844
 845
 846
 847
 848
 849
 850
 851
 852
 853
 854
 855
 856
 857
 858
 859
 860
 861
 862
 863
 864
 865
 866
 867
 868
 869
 870
 871
 872
 873
 874
 875
 876
 877
 878
 879
 880
 881
 882
 883
 884
 885
 886
 887
 888
 889
 890
 891
 892
 893
 894
 895
 896
 897
 898
 899
 900
 901
 902
 903
 904
 905
 906
 907
 908
 909
 910
 911
 912
 913
 914
 915
 916
 917
 918
 919
 920
 921
 922
 923
 924
 925
 926
 927
 928
 929
 930
 931
 932
 933
 934
 935
 936
 937
 938
 939
 940
 941
 942
 943
 944
 945
 946
 947
 948
 949
 950
 951
 952
 953
 954
 955
 956
 957
 958
 959
 960
 961
 962
 963
 964
 965
 966
 967
 968
 969
 970
 971
 972
 973
 974
 975
 976
 977
 978
 979
 980
 981
 982
 983
 984
 985
 986
 987
 988
 989
 990
 991
 992
 993
 994
 995
 996
 997
 998
 999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
class SLAFArray:
    """
    High-performance single-cell data storage and querying format.

    SLAFArray provides SQL-native access to single-cell data with lazy evaluation.
    Data is stored in a relational format with three main tables: cells, genes, and expression.
    The class enables direct SQL queries, efficient filtering, and seamless integration
    with the single-cell analysis ecosystem.

    Key Features:
        - SQL-native querying with Polars integration
        - Lazy evaluation for memory efficiency
        - Direct access to cell and gene metadata
        - High-performance storage with Lance format
        - Scanpy/AnnData compatibility

    Examples:
        >>> # Load a SLAF dataset
        >>> slaf_array = SLAFArray("path/to/data.slaf")
        >>> print(f"Dataset shape: {slaf_array.shape}")
        Dataset shape: (1000, 20000)

        >>> # Access metadata
        >>> print(f"Cell metadata columns: {list(slaf_array.obs.columns)}")
        Cell metadata columns: ['cell_type', 'total_counts', 'batch']
        >>> print(f"Gene metadata columns: {list(slaf_array.var.columns)}")
        Gene metadata columns: ['gene_type', 'chromosome']

        >>> # Filter cells by metadata
        >>> t_cells = slaf_array.filter_cells(cell_type="T cells")
        >>> print(f"Found {len(t_cells)} T cells")
        Found 250 T cells

        >>> # Execute SQL query
        >>> results = slaf_array.query("
        ...     SELECT cell_type, AVG(total_counts) as avg_counts
        ...     FROM cells
        ...     GROUP BY cell_type
        ...     ORDER BY avg_counts DESC
        ... ")
        >>> print(results)
           cell_type  avg_counts
        0  T cells      1250.5
        1  B cells      1100.2
        2  Monocytes     950.8

        >>> # Get expression data
        >>> expression = slaf_array.get_cell_expression(["cell_001", "cell_002"])
        >>> print(f"Expression matrix shape: {expression.shape}")
        Expression matrix shape: (2, 20000)
    """

    def __init__(self, slaf_path: str | Path):
        """
        Initialize SLAF array from a SLAF dataset directory.

        Args:
            slaf_path: Path to SLAF directory containing config.json and .lance files.
                       The directory should contain the dataset configuration and Lance tables.

        Raises:
            FileNotFoundError: If the SLAF config file is not found at the specified path.
            ValueError: If the config file is invalid or missing required tables.
            KeyError: If required configuration keys are missing.

        Examples:
            >>> # Load from local directory
            >>> slaf_array = SLAFArray("./data/pbmc3k.slaf")
            >>> print(f"Loaded dataset: {slaf_array.shape}")
            Loaded dataset: (2700, 32738)

            >>> # Load from cloud storage
            >>> slaf_array = SLAFArray("s3://bucket/data.slaf")
            >>> print(f"Cloud dataset: {slaf_array.shape}")
            Cloud dataset: (5000, 25000)

            >>> # Error handling for missing directory
            >>> try:
            ...     slaf_array = SLAFArray("nonexistent/path")
            ... except FileNotFoundError as e:
            ...     print(f"Error: {e}")
            Error: SLAF config not found at nonexistent/path/config.json
        """
        self.slaf_path = Path(slaf_path)

        # Validate dataset exists
        if not self.slaf_path.exists():
            raise FileNotFoundError(f"SLAF dataset not found: {self.slaf_path}")

        # Load configuration
        config_path = self.slaf_path / "config.json"
        if not config_path.exists():
            raise FileNotFoundError(f"SLAF config not found at {config_path}")

        with open(config_path) as f:
            self.config = json.load(f)

        # Initialize shape
        self.shape = tuple(self.config["array_shape"])

        # Setup datasets
        self._setup_datasets()

        # Initialize metadata loading
        self._obs = None
        self._var = None
        self._metadata_loaded = False
        self._metadata_loading = False
        self._metadata_thread = None
        self._metadata_error = None

        # Initialize row mapper for cell-based queries
        from slaf.core.query_optimizer import RowIndexMapper

        self.row_mapper = RowIndexMapper(self)

        # Start async metadata loading in background
        self._start_async_metadata_loading()

        # Display helpful initialization message
        self._display_initialization_message()

    def _display_initialization_message(self):
        """Display helpful initialization message with basic metadata"""
        # Display ASCII art
        display_ascii_art()
        logger.info("")

        n_cells, n_genes = self.shape

        # Format large numbers with commas
        cells_str = f"{n_cells:,}" if n_cells >= 1000 else str(n_cells)
        genes_str = f"{n_genes:,}" if n_genes >= 1000 else str(n_genes)

        # Get format version
        format_version = self.config.get("format_version", "unknown")

        # Get dataset name from path
        dataset_name = self.slaf_path.name

        # Display message
        logger.info(f"📊 SLAF Dataset Loaded: {dataset_name}")
        logger.info(f"   • Shape: {cells_str} cells × {genes_str} genes")
        logger.info(f"   • Format: SLAF v{format_version}")

        # Add metadata info if available
        if "metadata" in self.config:
            metadata = self.config["metadata"]
            if "expression_count" in metadata:
                expr_count = metadata["expression_count"]
                expr_str = f"{expr_count:,}" if expr_count >= 1000 else str(expr_count)
                logger.info(f"   • Expression records: {expr_str}")

            if "sparsity" in metadata:
                sparsity = metadata["sparsity"]
                logger.info(f"   • Sparsity: {sparsity:.1%}")

        # Add optimization info if available
        optimizations = self.config.get("optimizations", {})
        if optimizations:
            opt_info = ", ".join([f"{k}: {v}" for k, v in optimizations.items()])
            logger.info(f"   • Optimizations: {opt_info}")

        # Status message
        logger.info("   • Status: Ready for queries (metadata loading in background)")
        logger.info("")

    def _start_async_metadata_loading(self):
        """Start async metadata loading in background thread"""
        if self._metadata_thread and self._metadata_thread.is_alive():
            return  # Already loading

        self._metadata_loading = True
        self._metadata_thread = threading.Thread(
            target=self._load_metadata_async, daemon=True
        )
        self._metadata_thread.start()

    def _load_metadata_async(self):
        """Load metadata in background thread"""
        try:
            self._load_metadata()
            self._metadata_loaded = True
        except Exception as e:
            self._metadata_error = e
            logger.error(f"Error loading metadata: {e}")
        finally:
            self._metadata_loading = False

    def _setup_datasets(self):
        """Setup Lance datasets for the new table structure"""
        # Load all Lance datasets
        self.expression = lance.dataset(
            str(self.slaf_path / self.config["tables"]["expression"])
        )
        self.cells = lance.dataset(str(self.slaf_path / self.config["tables"]["cells"]))
        self.genes = lance.dataset(str(self.slaf_path / self.config["tables"]["genes"]))

    def _ensure_metadata_loaded(self):
        """Ensure metadata is loaded (lazy loading with async support)"""
        if self._metadata_loaded:
            return

        if self._metadata_loading:
            # Wait for async loading to complete
            if self._metadata_thread and self._metadata_thread.is_alive():
                logger.info(
                    "Loading metadata in background... (this may take a few seconds)"
                )
                self._metadata_thread.join()

        if not self._metadata_loaded:
            # Fallback to synchronous loading if async failed
            self._load_metadata()
            self._metadata_loaded = True

    @property
    def obs(self):
        """Cell metadata (lazy loaded)"""
        self._ensure_metadata_loaded()
        return self._obs

    @property
    def var(self):
        """Gene metadata (lazy loaded)"""
        self._ensure_metadata_loaded()
        return self._var

    def is_metadata_ready(self) -> bool:
        """Check if metadata is ready for use"""
        return self._metadata_loaded

    def is_metadata_loading(self) -> bool:
        """Check if metadata is currently loading"""
        return self._metadata_loading

    def wait_for_metadata(self, timeout: float = None):
        """Wait for metadata to be loaded (with optional timeout)"""
        if self._metadata_loaded:
            return

        if self._metadata_loading and self._metadata_thread:
            if timeout:
                self._metadata_thread.join(timeout=timeout)
            else:
                self._metadata_thread.join()

        if not self._metadata_loaded:
            # Fallback to synchronous loading
            self._load_metadata()
            self._metadata_loaded = True

    def _load_metadata(self):
        """Load cell and gene metadata into polars DataFrames for fast operations"""
        # Load cell metadata using optimized Lance operations
        # Use more efficient loading strategy for large datasets
        # Load in chunks if dataset is very large
        if self.shape[0] > 1000000:  # 1M+ cells
            # Use scan() for large datasets to avoid loading everything at once
            cells_df = pl.from_arrow(self.cells.scanner().to_table())
        else:
            # Use direct to_table() for smaller datasets
            cells_df = pl.from_arrow(self.cells.to_table())

        self._obs = cells_df.sort("cell_integer_id")

        # Load gene metadata using optimized Lance operations
        # Genes are typically smaller, so use direct loading
        genes_df = pl.from_arrow(self.genes.to_table())
        self._var = genes_df.sort("gene_integer_id")

        # Cache column information for faster access
        self._obs_columns = list(self._obs.columns)
        self._var_columns = list(self._var.columns)

        # Build cell start indices for efficient row access (zero-copy Polars)
        if "cell_start_index" in self._obs.columns:
            # Use precomputed cell_start_index column
            logger.info("Using precomputed cell_start_index from cells table")
            # The cell_start_index column contains n_cells values (cumulative sums with leading 0)
            # We need to append the total expression count for boundary checking
            total_expression_count = self.expression.count_rows()
            self._cell_start_index = pl.concat(
                [self._obs["cell_start_index"], pl.Series([total_expression_count])]
            )
        elif "n_genes" in self._obs.columns:
            # Use existing n_genes column
            cumsum = self._obs["n_genes"].cum_sum()
            self._cell_start_index = pl.concat([pl.Series([0]), cumsum])
        elif "gene_count" in self._obs.columns:
            # Use existing gene_count column
            cumsum = self._obs["gene_count"].cum_sum()
            self._cell_start_index = pl.concat([pl.Series([0]), cumsum])
        else:
            # Calculate n_genes per cell from expression data
            logger.info(
                "No precomputed cell_start_index found, calculating from expression data..."
            )

            # Query expression data to count genes per cell
            gene_counts_query = """
            SELECT cell_integer_id, COUNT(*) as n_genes
            FROM expression
            GROUP BY cell_integer_id
            ORDER BY cell_integer_id
            """
            gene_counts_df = self.query(gene_counts_query)

            # Convert to polars if needed and join with obs
            if hasattr(gene_counts_df, "to_pandas"):
                gene_counts_df = gene_counts_df.to_pandas()
            gene_counts_pl = pl.from_pandas(gene_counts_df)

            # Join with obs to get n_genes for each cell
            obs_with_counts = self._obs.join(
                gene_counts_pl.select(["cell_integer_id", "n_genes"]),
                on="cell_integer_id",
                how="left",
            )

            # Fill missing values with 0 (cells with no expression)
            obs_with_counts = obs_with_counts.with_columns(
                pl.col("n_genes").fill_null(0)
            )

            cumsum = obs_with_counts["n_genes"].cum_sum()
            self._cell_start_index = pl.concat([pl.Series([0]), cumsum])

        # Restore dtypes for obs using polars
        obs_dtypes = self.config.get("obs_dtypes", {})
        for col, dtype_info in obs_dtypes.items():
            if col in self._obs.columns:
                if dtype_info["dtype"] == "category":
                    # Convert to polars categorical
                    self._obs = self._obs.with_columns(
                        pl.col(col).cast(pl.Categorical).cast(pl.Utf8)
                    )
                else:
                    # Map pandas dtypes to polars dtypes
                    polars_dtype = self._map_pandas_to_polars_dtype(dtype_info["dtype"])
                    self._obs = self._obs.with_columns(pl.col(col).cast(polars_dtype))

        # Restore dtypes for var using polars
        var_dtypes = self.config.get("var_dtypes", {})
        for col, dtype_info in var_dtypes.items():
            if col in self._var.columns:
                if dtype_info["dtype"] == "category":
                    # Convert to polars categorical
                    self._var = self._var.with_columns(
                        pl.col(col).cast(pl.Categorical).cast(pl.Utf8)
                    )
                else:
                    # Map pandas dtypes to polars dtypes
                    polars_dtype = self._map_pandas_to_polars_dtype(dtype_info["dtype"])
                    self._var = self._var.with_columns(pl.col(col).cast(polars_dtype))

        # Infer categorical columns if not in config
        self._infer_categorical_columns()

    def _map_pandas_to_polars_dtype(self, pandas_dtype: str) -> type[pl.DataType]:
        """Map pandas dtype to polars dtype"""
        dtype_mapping = {
            "int64": pl.Int64,
            "int32": pl.Int32,
            "int16": pl.Int16,
            "int8": pl.Int8,
            "uint64": pl.UInt64,
            "uint32": pl.UInt32,
            "uint16": pl.UInt16,
            "uint8": pl.UInt8,
            "float64": pl.Float64,
            "float32": pl.Float32,
            "bool": pl.Boolean,
            "object": pl.Utf8,
            "string": pl.Utf8,
        }
        return dtype_mapping.get(pandas_dtype, pl.Utf8)

    def _infer_categorical_columns(self):
        """Infer categorical columns based on data characteristics using polars"""
        # For obs: infer categoricals for string columns with few unique values
        for col in self._obs.columns:
            if col not in self.config.get("obs_dtypes", {}):
                if self._obs[col].dtype == pl.Utf8:
                    unique_count = self._obs[col].n_unique()
                    total_count = len(self._obs)
                    unique_ratio = unique_count / total_count
                    # If less than 20% unique values, likely categorical
                    if unique_ratio < 0.2 and unique_count < 50:
                        self._obs = self._obs.with_columns(
                            pl.col(col).cast(pl.Categorical).cast(pl.Utf8)
                        )

        # For var: infer categoricals for string columns with few unique values
        for col in self._var.columns:
            if col not in self.config.get("var_dtypes", {}):
                if self._var[col].dtype == pl.Utf8:
                    unique_count = self._var[col].n_unique()
                    total_count = len(self._var)
                    unique_ratio = unique_count / total_count
                    # If less than 20% unique values, likely categorical
                    if unique_ratio < 0.2 and unique_count < 50:
                        self._var = self._var.with_columns(
                            pl.col(col).cast(pl.Categorical).cast(pl.Utf8)
                        )

    def query(self, sql: str) -> pl.DataFrame:
        """
        Execute SQL query on the SLAF dataset.

        Executes SQL queries directly on the underlying Lance tables using Polars.
        The query can reference three tables: 'cells', 'genes', and 'expression'.
        This enables complex aggregations, joins, and filtering operations.

        Args:
            sql: SQL query string to execute. Can reference tables: cells, genes, expression.
                 Supports standard SQL operations including WHERE, GROUP BY, ORDER BY, etc.

        Returns:
            Polars DataFrame containing the query results.

        Raises:
            ValueError: If the SQL query is malformed or references non-existent tables.
            RuntimeError: If the query execution fails.

        Examples:
            >>> # Basic query to count cells
            >>> slaf_array = SLAFArray("path/to/data.slaf")
            >>> result = slaf_array.query("SELECT COUNT(*) as total_cells FROM cells")
            >>> print(f"Total cells: {result['total_cells'][0]}")
            Total cells: 1000

            >>> # Complex aggregation query
            >>> result = slaf_array.query("
            ...     SELECT cell_type,
            ...            COUNT(*) as cell_count,
            ...            AVG(total_counts) as avg_counts
            ...     FROM cells
            ...     WHERE total_counts > 500
            ...     GROUP BY cell_type
            ...     ORDER BY avg_counts DESC
            ... ")
            >>> print(result)
            shape: (3, 3)
            ┌────────────┬────────────┬────────────┐
            │ cell_type  ┆ cell_count ┆ avg_counts │
            │ ---        ┆ ---        ┆ ---        │
            │ str        ┆ i64        ┆ f64        │
            ╞════════════╪════════════╪════════════╡
            │ T cells    ┆ 250        ┆ 1250.5     │
            │ B cells    ┆ 200        ┆ 1100.2     │
            │ Monocytes  ┆ 150        ┆ 950.8      │
            └────────────┴────────────┴────────────┘

            >>> # Join query across tables
            >>> result = slaf_array.query("
            ...     SELECT c.cell_type, g.gene_type, AVG(e.value) as avg_expression
            ...     FROM cells c
            ...     JOIN expression e ON c.cell_integer_id = e.cell_integer_id
            ...     JOIN genes g ON e.gene_integer_id = g.gene_integer_id
            ...     WHERE c.cell_type = 'T cells'
            ...     GROUP BY c.cell_type, g.gene_type
            ... ")
            >>> print(f"Found {len(result)} expression patterns")
            Found 5 expression patterns
        """
        # Create Polars context with all three Lance datasets
        ctx = pl.SQLContext()

        # Register Lance datasets with Polars context
        ctx.register("expression", pl.scan_pyarrow_dataset(self.expression))
        ctx.register("cells", pl.scan_pyarrow_dataset(self.cells))
        ctx.register("genes", pl.scan_pyarrow_dataset(self.genes))

        # Execute the query using Polars SQL
        result = ctx.execute(sql).collect()

        return result

    def filter_cells(self, **filters: Any) -> pl.DataFrame:
        """
        Filter cells based on metadata columns.

        Provides a convenient interface for filtering cells using metadata columns.
        Supports exact matches, list values, and range queries with operators.
        Uses in-memory polars filtering when metadata is loaded, falls back to SQL otherwise.

        Args:
            **filters: Column name and filter value pairs. Supports:
                - Exact matches: cell_type="T cells"
                - List values: cell_type=["T cells", "B cells"]
                - Range queries: total_counts=">1000", total_counts="<=2000"
                - Multiple conditions: cell_type="T cells", total_counts=">500"

        Returns:
            Polars DataFrame containing filtered cell metadata.

        Raises:
            ValueError: If a specified column is not found in cell metadata.
            TypeError: If filter values are of unsupported types.

        Examples:
            >>> # Filter by cell type
            >>> slaf_array = SLAFArray("path/to/data.slaf")
            >>> t_cells = slaf_array.filter_cells(cell_type="T cells")
            >>> print(f"Found {len(t_cells)} T cells")
            Found 250 T cells

            >>> # Filter by multiple criteria
            >>> high_quality_t_cells = slaf_array.filter_cells(
            ...     cell_type="T cells",
            ...     total_counts=">1000",
            ...     batch=["batch1", "batch2"]
            ... )
            >>> print(f"Found {len(high_quality_t_cells)} high-quality T cells")
            Found 180 high-quality T cells

            >>> # Range query
            >>> medium_counts = slaf_array.filter_cells(
            ...     total_counts=">=500",
            ...     total_counts="<=2000"
            ... )
            >>> print(f"Found {len(medium_counts)} cells with medium counts")
            Found 450 cells with medium counts

            >>> # Error handling for invalid column
            >>> try:
            ...     result = slaf_array.filter_cells(invalid_column="value")
            ... except ValueError as e:
            ...     print(f"Error: {e}")
            Error: Column 'invalid_column' not found in cell metadata
        """
        return self._filter("cells", **filters)

    def filter_genes(self, **filters: Any) -> pl.DataFrame:
        """
        Filter genes based on metadata columns.

        Provides a convenient interface for filtering genes using metadata columns.
        Supports exact matches, list values, and range queries with operators.
        Uses in-memory polars filtering when metadata is loaded, falls back to SQL otherwise.

        Args:
            **filters: Column name and filter value pairs. Supports:
                - Exact matches: gene_type="protein_coding"
                - List values: gene_type=["protein_coding", "lncRNA"]
                - Range queries: expression_mean=">5.0", expression_mean="<=10.0"
                - Multiple conditions: gene_type="protein_coding", chromosome="chr1"

        Returns:
            Polars DataFrame containing filtered gene metadata.

        Raises:
            ValueError: If a specified column is not found in gene metadata.
            TypeError: If filter values are of unsupported types.

        Examples:
            >>> # Filter by gene type
            >>> slaf_array = SLAFArray("path/to/data.slaf")
            >>> protein_coding = slaf_array.filter_genes(gene_type="protein_coding")
            >>> print(f"Found {len(protein_coding)} protein-coding genes")
            Found 15000 protein-coding genes

            >>> # Filter by multiple criteria
            >>> high_expr_proteins = slaf_array.filter_genes(
            ...     gene_type="protein_coding",
            ...     expression_mean=">5.0",
            ...     chromosome=["chr1", "chr2"]
            ... )
            >>> print(f"Found {len(high_expr_proteins)} high-expression protein genes")
            Found 2500 high-expression protein genes

            >>> # Range query for expression
            >>> medium_expr = slaf_array.filter_genes(
            ...     expression_mean=">=2.0",
            ...     expression_mean="<=8.0"
            ... )
            >>> print(f"Found {len(medium_expr)} genes with medium expression")
            Found 8000 genes with medium expression

            >>> # Error handling for invalid column
            >>> try:
            ...     result = slaf_array.filter_genes(invalid_column="value")
            ... except ValueError as e:
            ...     print(f"Error: {e}")
            Error: Column 'invalid_column' not found in gene metadata
        """
        return self._filter("genes", **filters)

    def _filter(self, table_name: str, **filters: Any) -> pl.DataFrame:
        """
        Generic filtering method that uses polars for metadata operations.

        Args:
            table_name: Either "cells" or "genes"
            **filters: Filter conditions

        Returns:
            Filtered polars DataFrame
        """
        if not filters:
            # Return all metadata if no filters
            if table_name == "cells":
                return (
                    self.obs
                    if self._metadata_loaded
                    else self.query("SELECT * FROM cells")
                )
            else:
                return (
                    self.var
                    if self._metadata_loaded
                    else self.query("SELECT * FROM genes")
                )

        # For filtering, we need metadata to be loaded to validate columns
        # If metadata is not loaded, load it first
        if not self._metadata_loaded:
            self._ensure_metadata_loaded()

        # Use polars filtering for metadata operations
        if table_name == "cells":
            return self._filter_with_polars(self.obs, **filters)
        elif table_name == "genes":
            return self._filter_with_polars(self.var, **filters)
        else:
            raise ValueError(f"Invalid table name: {table_name}")

    def _filter_with_polars(
        self, metadata_df: pl.DataFrame, **filters: Any
    ) -> pl.DataFrame:
        """
        Filter metadata using polars operations for high performance.

        Args:
            metadata_df: Polars DataFrame to filter (self.obs or self.var)
            **filters: Filter conditions

        Returns:
            Filtered polars DataFrame
        """
        # Start with all rows
        result = metadata_df

        for column, value in filters.items():
            if column not in metadata_df.columns:
                raise ValueError(f"Column '{column}' not found in metadata")

            if isinstance(value, str) and value.startswith((">", "<", ">=", "<=")):
                # Handle range queries
                operator = value[:2] if value.startswith((">=", "<=")) else value[0]
                filter_value = (
                    value[2:] if value.startswith((">=", "<=")) else value[1:]
                )

                # Convert to numeric for comparison
                try:
                    numeric_filter_value = float(filter_value)

                    # Create boolean mask for the comparison using polars expressions
                    if operator == ">":
                        result = result.filter(pl.col(column) > numeric_filter_value)
                    elif operator == "<":
                        result = result.filter(pl.col(column) < numeric_filter_value)
                    elif operator == ">=":
                        result = result.filter(pl.col(column) >= numeric_filter_value)
                    elif operator == "<=":
                        result = result.filter(pl.col(column) <= numeric_filter_value)
                    else:
                        raise ValueError(f"Unsupported operator: {operator}")

                except Exception as err:
                    raise ValueError(
                        f"Cannot perform numeric comparison on non-numeric column '{column}'"
                    ) from err

            elif isinstance(value, list):
                # Handle list values using polars is_in
                result = result.filter(pl.col(column).is_in(value))
            else:
                # Handle exact matches
                result = result.filter(pl.col(column) == value)

        return result

    def _normalize_entity_ids(
        self, entity_ids: str | list[str], entity_type: str
    ) -> list[int]:
        """Convert string entity IDs to integer IDs using metadata mappings"""
        if entity_type == "cell":
            metadata = self.obs
            id_col = "cell_integer_id"
            index_col = "cell_id"
        else:
            metadata = self.var
            id_col = "gene_integer_id"
            index_col = "gene_id"

        # Convert to list if single ID
        if isinstance(entity_ids, str):
            entity_ids = [entity_ids]

        # Map string IDs to integer IDs using polars
        # Get the index values and corresponding integer IDs
        index_values = metadata[index_col].to_list()
        integer_id_values = metadata[id_col].to_list()

        # Create a mapping from index to integer ID
        id_mapping = dict(zip(index_values, integer_id_values, strict=False))

        # Map string IDs to integer IDs
        integer_ids = []
        for entity_id in entity_ids:
            if entity_id in id_mapping:
                integer_ids.append(id_mapping[entity_id])
            else:
                raise ValueError(f"{entity_type} ID '{entity_id}' not found")

        return integer_ids

    def get_cell_expression(self, cell_ids: str | list[str]) -> pl.DataFrame:
        """
        Get expression data for specific cells using Lance take() and Polars.

        Retrieves expression data for specified cells using efficient Lance row access
        and Polars for in-memory operations. This method provides significant
        performance improvements over SQL-based queries.

        Args:
            cell_ids: Single cell ID (string) or list of cell IDs to retrieve.
                     Can be string identifiers or integer IDs.

        Returns:
            Polars DataFrame containing expression data for the specified cells.
            Columns include cell_id, gene_id, and expression values.

        Raises:
            ValueError: If any cell ID is not found in the dataset.
            RuntimeError: If the query execution fails.

        Examples:
            >>> # Get expression for a single cell
            >>> slaf_array = SLAFArray("path/to/data.slaf")
            >>> expression = slaf_array.get_cell_expression("cell_001")
            >>> print(f"Expression data shape: {expression.shape}")
            Expression data shape: (15000, 3)

            >>> # Get expression for multiple cells
            >>> expression = slaf_array.get_cell_expression(["cell_001", "cell_002", "cell_003"])
            >>> print(f"Expression data shape: {expression.shape}")
            Expression data shape: (45000, 3)

            >>> # Error handling for invalid cell ID
            >>> try:
            ...     expression = slaf_array.get_cell_expression("invalid_cell")
            ... except ValueError as e:
            ...     print(f"Error: {e}")
            Error: cell ID 'invalid_cell' not found
        """
        # Convert to integer IDs
        integer_ids = self._normalize_entity_ids(cell_ids, "cell")

        if not integer_ids:
            return pl.DataFrame({"cell_id": [], "gene_id": [], "value": []})

        # Get row indices using RowIndexMapper
        row_indices = self.row_mapper.get_cell_row_ranges(integer_ids)

        # Load data with Lance take()
        expression_data = self.expression.take(row_indices)

        # Convert PyArrow Table to Polars DataFrame and join with metadata
        expression_df = pl.from_arrow(expression_data)
        return self._join_with_metadata(expression_df)

    def get_gene_expression(self, gene_ids: str | list[str]) -> pl.DataFrame:
        """
        Get gene expression data for specified genes.

        This method uses QueryOptimizer to generate optimal SQL queries based on the
        gene ID distribution (consecutive ranges use BETWEEN, scattered values use IN).

        Args:
            gene_ids: Gene ID(s) to query. Can be a single gene ID string or list of gene IDs.

        Returns:
            Polars DataFrame containing expression data for the specified genes.
            Columns include cell_id, gene_id, and expression values.

        Raises:
            ValueError: If any gene ID is not found in the dataset.
            RuntimeError: If the query execution fails.

        Examples:
            >>> # Get expression for a single gene
            >>> slaf_array = SLAFArray("path/to/data.slaf")
            >>> expression = slaf_array.get_gene_expression("GENE1")
        """

        # Convert to list if single gene ID
        if isinstance(gene_ids, str):
            gene_ids = [gene_ids]

        # Convert string gene IDs to integer IDs
        integer_ids = self._normalize_entity_ids(gene_ids, "gene")

        if not integer_ids:
            return pl.DataFrame({"cell_id": [], "gene_id": [], "value": []})

        # Use QueryOptimizer for optimal performance
        from slaf.core.query_optimizer import QueryOptimizer

        # Build optimized SQL query using QueryOptimizer
        sql_query = QueryOptimizer.build_optimized_query(
            entity_ids=integer_ids,
            entity_type="gene",
            use_adaptive_batching=True,
            max_batch_size=100,
        )

        # Fix the table name for Polars scan (use 'self' instead of 'expression')
        sql_query = sql_query.replace("expression", "self")

        # Execute the optimized query
        ldf = pl.scan_pyarrow_dataset(self.expression)
        expression_df = ldf.sql(sql_query).collect()

        # Join with metadata using Polars
        return self._join_with_metadata(expression_df)

    def get_submatrix(
        self, cell_selector: Any | None = None, gene_selector: Any | None = None
    ) -> pl.DataFrame:
        """
        Get expression data using cell/gene selectors with Lance take() and Polars.

        Retrieves a subset of expression data based on cell and gene selectors.
        The selectors can be slices, lists, boolean masks, or None for all cells/genes.
        This method provides a flexible interface for subsetting expression data with
        significant performance improvements over SQL-based queries.

        Args:
            cell_selector: Cell selector for subsetting. Can be:
                - None: Include all cells
                - slice: e.g., slice(0, 100) for first 100 cells
                - list: e.g., [0, 5, 10] for specific cell indices
                - boolean mask: e.g., [True, False, True, ...] for boolean selection
            gene_selector: Gene selector for subsetting. Can be:
                - None: Include all genes
                - slice: e.g., slice(0, 5000) for first 5000 genes
                - list: e.g., [0, 100, 200] for specific gene indices
                - boolean mask: e.g., [True, False, True, ...] for boolean selection

        Returns:
            Polars DataFrame containing expression data for the selected subset.
            Columns include cell_id, gene_id, and expression values.

        Raises:
            ValueError: If selectors are invalid or out of bounds.
            RuntimeError: If the query execution fails.

        Examples:
            >>> # Get first 100 cells and first 5000 genes
            >>> slaf_array = SLAFArray("path/to/data.slaf")
            >>> submatrix = slaf_array.get_submatrix(
            ...     cell_selector=slice(0, 100),
            ...     gene_selector=slice(0, 5000)
            ... )
            >>> print(f"Submatrix shape: {submatrix.shape}")
            Submatrix shape: (500000, 3)

            >>> # Get specific cells and genes
            >>> submatrix = slaf_array.get_submatrix(
            ...     cell_selector=[0, 5, 10, 15],
            ...     gene_selector=[100, 200, 300]
            ... )
            >>> print(f"Submatrix shape: {submatrix.shape}")
            Submatrix shape: (12, 3)

            >>> # Get all cells for specific genes
            >>> submatrix = slaf_array.get_submatrix(
            ...     gene_selector=[0, 100, 200, 300, 400]
            ... )
            >>> print(f"Submatrix shape: {submatrix.shape}")
            Submatrix shape: (5000, 3)

            >>> # Error handling for invalid selector
            >>> try:
            ...     submatrix = slaf_array.get_submatrix(
            ...         cell_selector=slice(0, 1000000)  # Out of bounds
            ...     )
            ... except ValueError as e:
            ...     print(f"Error: {e}")
            Error: Cell selector out of bounds
        """
        # Get row indices for cells using RowIndexMapper
        cell_indices = self.row_mapper.get_cell_row_ranges_by_selector(cell_selector)

        # Load data with Lance take()
        expression_data = self.expression.take(cell_indices)
        expression_df = pl.from_arrow(expression_data)

        # Filter by genes if specified
        if gene_selector is not None:
            # Convert gene selector to integer IDs
            if isinstance(gene_selector, int):
                if gene_selector < 0:
                    gene_selector = self.shape[1] + gene_selector
                if 0 <= gene_selector < self.shape[1]:
                    gene_integer_id = self.var["gene_integer_id"][gene_selector]
                    expression_df = expression_df.filter(
                        pl.col("gene_integer_id") == gene_integer_id  # type: ignore[arg-type]
                    )
                else:
                    raise ValueError(f"Gene index {gene_selector} out of bounds")
            elif isinstance(gene_selector, slice):
                start = gene_selector.start or 0
                stop = gene_selector.stop or self.shape[1]
                step = gene_selector.step or 1

                # Handle negative indices
                if start < 0:
                    start = self.shape[1] + start
                if stop < 0:
                    stop = self.shape[1] + stop

                # Clamp bounds
                start = max(0, min(start, self.shape[1]))
                stop = max(0, min(stop, self.shape[1]))

                gene_integer_ids = self.var["gene_integer_id"][
                    start:stop:step
                ].to_list()
                expression_df = expression_df.filter(
                    pl.col("gene_integer_id").is_in(gene_integer_ids)  # type: ignore[arg-type]
                )
            elif isinstance(gene_selector, list):
                gene_integer_ids = []
                for idx in gene_selector:
                    if isinstance(idx, int):
                        if idx < 0:
                            idx = self.shape[1] + idx
                        if 0 <= idx < self.shape[1]:
                            gene_integer_id = self.var["gene_integer_id"][idx]
                            gene_integer_ids.append(gene_integer_id)
                        else:
                            raise ValueError(f"Gene index {idx} out of bounds")
                    else:
                        raise ValueError(f"Invalid gene index type: {type(idx)}")
                expression_df = expression_df.filter(
                    pl.col("gene_integer_id").is_in(gene_integer_ids)  # type: ignore[arg-type]
                )
            elif isinstance(gene_selector, np.ndarray) and gene_selector.dtype == bool:
                if len(gene_selector) != self.shape[1]:
                    raise ValueError(
                        f"Boolean mask length {len(gene_selector)} doesn't match gene count {self.shape[1]}"
                    )
                gene_integer_ids = self.var["gene_integer_id"][gene_selector].to_list()
                expression_df = expression_df.filter(
                    pl.col("gene_integer_id").is_in(gene_integer_ids)  # type: ignore[arg-type]
                )
            else:
                raise ValueError(
                    f"Unsupported gene selector type: {type(gene_selector)}"
                )

        # Join with metadata using Polars
        return self._join_with_metadata(expression_df)

    def _join_with_metadata(
        self, expression_data: pl.DataFrame | pl.Series
    ) -> pl.DataFrame:
        """
        Join expression data with cell/gene metadata using Polars.

        Args:
            expression_data: Polars DataFrame or Series containing expression data

        Returns:
            Polars DataFrame with cell_id, gene_id, and value columns
        """
        # Convert Series to DataFrame if needed
        if isinstance(expression_data, pl.Series):
            expression_data = expression_data.to_frame()

        # Join with cell metadata
        result = expression_data.join(
            self.obs.select(["cell_integer_id", "cell_id"]),
            on="cell_integer_id",
            how="left",
        )

        # Join with gene metadata
        result = result.join(
            self.var.select(["gene_integer_id", "gene_id"]),
            on="gene_integer_id",
            how="left",
        )

        # Select final columns
        return result.select(["cell_id", "gene_id", "value"])

    def info(self):
        """Print information about the SLAF dataset"""
        # Build output string for both logging and printing
        output_lines = []

        output_lines.append("SLAF Dataset")
        output_lines.append(f"  Shape: {self.shape[0]} cells × {self.shape[1]} genes")
        output_lines.append(
            f"  Format version: {self.config.get('format_version', 'unknown')}"
        )

        # Cell metadata columns
        cell_cols = self.obs.columns
        output_lines.append(f"  Cell metadata columns: {len(cell_cols)}")
        if cell_cols:
            output_lines.append(
                f"    {', '.join(cell_cols[:5])}{'...' if len(cell_cols) > 5 else ''}"
            )

        # Gene metadata columns
        gene_cols = self.var.columns
        output_lines.append(f"  Gene metadata columns: {len(gene_cols)}")
        if gene_cols:
            output_lines.append(
                f"    {', '.join(gene_cols[:5])}{'...' if len(gene_cols) > 5 else ''}"
            )

        # Record counts
        output_lines.append("  Record counts:")
        output_lines.append(f"    Cells: {len(self.obs):,}")
        output_lines.append(f"    Genes: {len(self.var):,}")

        # Expression metadata - use pre-computed if available, otherwise query
        format_version = self.config.get("format_version", "0.1")
        if format_version >= "0.2" and "metadata" in self.config:
            # Use pre-computed metadata from config
            metadata = self.config["metadata"]
            expression_count = metadata["expression_count"]
            sparsity = metadata["sparsity"]
            density = metadata["density"]

            output_lines.append(f"    Expression records: {expression_count:,}")
            output_lines.append(f"    Sparsity: {sparsity:.1%}")
            output_lines.append(f"    Density: {density:.1%}")

            # Show expression statistics if available
            if "expression_stats" in metadata:
                stats = metadata["expression_stats"]
                output_lines.append("  Expression statistics:")
                output_lines.append(f"    Min value: {stats['min_value']:.3f}")
                output_lines.append(f"    Max value: {stats['max_value']:.3f}")
                output_lines.append(f"    Mean value: {stats['mean_value']:.3f}")
                output_lines.append(f"    Std value: {stats['std_value']:.3f}")
        else:
            # Backward compatibility: query expression count for older format versions
            output_lines.append("    Expression records: computing...")
            expression_count = self.query("SELECT COUNT(*) as count FROM expression")
            output_lines.append(
                f"    Expression records: {expression_count.item(0, 0):,}"
            )

        # Optimization info
        optimizations = self.config.get("optimizations", {})
        if optimizations:
            output_lines.append("  Optimizations:")
            for opt, value in optimizations.items():
                output_lines.append(f"    {opt}: {value}")

        # Print to stdout for backward compatibility
        output = "\n".join(output_lines)
        print(output)

        # Also log for consistency with other methods
        logger.info("SLAF Dataset info displayed")

Attributes

obs property

Cell metadata (lazy loaded)

var property

Gene metadata (lazy loaded)

Functions

__init__(slaf_path: str | Path)

Initialize SLAF array from a SLAF dataset directory.

Parameters:

Name Type Description Default
slaf_path str | Path

Path to SLAF directory containing config.json and .lance files. The directory should contain the dataset configuration and Lance tables.

required

Raises:

Type Description
FileNotFoundError

If the SLAF config file is not found at the specified path.

ValueError

If the config file is invalid or missing required tables.

KeyError

If required configuration keys are missing.

Examples:

>>> # Load from local directory
>>> slaf_array = SLAFArray("./data/pbmc3k.slaf")
>>> print(f"Loaded dataset: {slaf_array.shape}")
Loaded dataset: (2700, 32738)
>>> # Load from cloud storage
>>> slaf_array = SLAFArray("s3://bucket/data.slaf")
>>> print(f"Cloud dataset: {slaf_array.shape}")
Cloud dataset: (5000, 25000)
>>> # Error handling for missing directory
>>> try:
...     slaf_array = SLAFArray("nonexistent/path")
... except FileNotFoundError as e:
...     print(f"Error: {e}")
Error: SLAF config not found at nonexistent/path/config.json
Source code in slaf/core/slaf.py
def __init__(self, slaf_path: str | Path):
    """
    Initialize SLAF array from a SLAF dataset directory.

    Args:
        slaf_path: Path to SLAF directory containing config.json and .lance files.
                   The directory should contain the dataset configuration and Lance tables.

    Raises:
        FileNotFoundError: If the SLAF config file is not found at the specified path.
        ValueError: If the config file is invalid or missing required tables.
        KeyError: If required configuration keys are missing.

    Examples:
        >>> # Load from local directory
        >>> slaf_array = SLAFArray("./data/pbmc3k.slaf")
        >>> print(f"Loaded dataset: {slaf_array.shape}")
        Loaded dataset: (2700, 32738)

        >>> # Load from cloud storage
        >>> slaf_array = SLAFArray("s3://bucket/data.slaf")
        >>> print(f"Cloud dataset: {slaf_array.shape}")
        Cloud dataset: (5000, 25000)

        >>> # Error handling for missing directory
        >>> try:
        ...     slaf_array = SLAFArray("nonexistent/path")
        ... except FileNotFoundError as e:
        ...     print(f"Error: {e}")
        Error: SLAF config not found at nonexistent/path/config.json
    """
    self.slaf_path = Path(slaf_path)

    # Validate dataset exists
    if not self.slaf_path.exists():
        raise FileNotFoundError(f"SLAF dataset not found: {self.slaf_path}")

    # Load configuration
    config_path = self.slaf_path / "config.json"
    if not config_path.exists():
        raise FileNotFoundError(f"SLAF config not found at {config_path}")

    with open(config_path) as f:
        self.config = json.load(f)

    # Initialize shape
    self.shape = tuple(self.config["array_shape"])

    # Setup datasets
    self._setup_datasets()

    # Initialize metadata loading
    self._obs = None
    self._var = None
    self._metadata_loaded = False
    self._metadata_loading = False
    self._metadata_thread = None
    self._metadata_error = None

    # Initialize row mapper for cell-based queries
    from slaf.core.query_optimizer import RowIndexMapper

    self.row_mapper = RowIndexMapper(self)

    # Start async metadata loading in background
    self._start_async_metadata_loading()

    # Display helpful initialization message
    self._display_initialization_message()

is_metadata_ready() -> bool

Check if metadata is ready for use

Source code in slaf/core/slaf.py
def is_metadata_ready(self) -> bool:
    """Check if metadata is ready for use"""
    return self._metadata_loaded

is_metadata_loading() -> bool

Check if metadata is currently loading

Source code in slaf/core/slaf.py
def is_metadata_loading(self) -> bool:
    """Check if metadata is currently loading"""
    return self._metadata_loading

wait_for_metadata(timeout: float = None)

Wait for metadata to be loaded (with optional timeout)

Source code in slaf/core/slaf.py
def wait_for_metadata(self, timeout: float = None):
    """Wait for metadata to be loaded (with optional timeout)"""
    if self._metadata_loaded:
        return

    if self._metadata_loading and self._metadata_thread:
        if timeout:
            self._metadata_thread.join(timeout=timeout)
        else:
            self._metadata_thread.join()

    if not self._metadata_loaded:
        # Fallback to synchronous loading
        self._load_metadata()
        self._metadata_loaded = True

query(sql: str) -> pl.DataFrame

Execute SQL query on the SLAF dataset.

Executes SQL queries directly on the underlying Lance tables using Polars. The query can reference three tables: 'cells', 'genes', and 'expression'. This enables complex aggregations, joins, and filtering operations.

Parameters:

Name Type Description Default
sql str

SQL query string to execute. Can reference tables: cells, genes, expression. Supports standard SQL operations including WHERE, GROUP BY, ORDER BY, etc.

required

Returns:

Type Description
DataFrame

Polars DataFrame containing the query results.

Raises:

Type Description
ValueError

If the SQL query is malformed or references non-existent tables.

RuntimeError

If the query execution fails.

Examples:

>>> # Basic query to count cells
>>> slaf_array = SLAFArray("path/to/data.slaf")
>>> result = slaf_array.query("SELECT COUNT(*) as total_cells FROM cells")
>>> print(f"Total cells: {result['total_cells'][0]}")
Total cells: 1000
>>> # Complex aggregation query
>>> result = slaf_array.query("
...     SELECT cell_type,
...            COUNT(*) as cell_count,
...            AVG(total_counts) as avg_counts
...     FROM cells
...     WHERE total_counts > 500
...     GROUP BY cell_type
...     ORDER BY avg_counts DESC
... ")
>>> print(result)
shape: (3, 3)
┌────────────┬────────────┬────────────┐
│ cell_type  ┆ cell_count ┆ avg_counts │
│ ---        ┆ ---        ┆ ---        │
│ str        ┆ i64        ┆ f64        │
╞════════════╪════════════╪════════════╡
│ T cells    ┆ 250        ┆ 1250.5     │
│ B cells    ┆ 200        ┆ 1100.2     │
│ Monocytes  ┆ 150        ┆ 950.8      │
└────────────┴────────────┴────────────┘
>>> # Join query across tables
>>> result = slaf_array.query("
...     SELECT c.cell_type, g.gene_type, AVG(e.value) as avg_expression
...     FROM cells c
...     JOIN expression e ON c.cell_integer_id = e.cell_integer_id
...     JOIN genes g ON e.gene_integer_id = g.gene_integer_id
...     WHERE c.cell_type = 'T cells'
...     GROUP BY c.cell_type, g.gene_type
... ")
>>> print(f"Found {len(result)} expression patterns")
Found 5 expression patterns
Source code in slaf/core/slaf.py
def query(self, sql: str) -> pl.DataFrame:
    """
    Execute SQL query on the SLAF dataset.

    Executes SQL queries directly on the underlying Lance tables using Polars.
    The query can reference three tables: 'cells', 'genes', and 'expression'.
    This enables complex aggregations, joins, and filtering operations.

    Args:
        sql: SQL query string to execute. Can reference tables: cells, genes, expression.
             Supports standard SQL operations including WHERE, GROUP BY, ORDER BY, etc.

    Returns:
        Polars DataFrame containing the query results.

    Raises:
        ValueError: If the SQL query is malformed or references non-existent tables.
        RuntimeError: If the query execution fails.

    Examples:
        >>> # Basic query to count cells
        >>> slaf_array = SLAFArray("path/to/data.slaf")
        >>> result = slaf_array.query("SELECT COUNT(*) as total_cells FROM cells")
        >>> print(f"Total cells: {result['total_cells'][0]}")
        Total cells: 1000

        >>> # Complex aggregation query
        >>> result = slaf_array.query("
        ...     SELECT cell_type,
        ...            COUNT(*) as cell_count,
        ...            AVG(total_counts) as avg_counts
        ...     FROM cells
        ...     WHERE total_counts > 500
        ...     GROUP BY cell_type
        ...     ORDER BY avg_counts DESC
        ... ")
        >>> print(result)
        shape: (3, 3)
        ┌────────────┬────────────┬────────────┐
        │ cell_type  ┆ cell_count ┆ avg_counts │
        │ ---        ┆ ---        ┆ ---        │
        │ str        ┆ i64        ┆ f64        │
        ╞════════════╪════════════╪════════════╡
        │ T cells    ┆ 250        ┆ 1250.5     │
        │ B cells    ┆ 200        ┆ 1100.2     │
        │ Monocytes  ┆ 150        ┆ 950.8      │
        └────────────┴────────────┴────────────┘

        >>> # Join query across tables
        >>> result = slaf_array.query("
        ...     SELECT c.cell_type, g.gene_type, AVG(e.value) as avg_expression
        ...     FROM cells c
        ...     JOIN expression e ON c.cell_integer_id = e.cell_integer_id
        ...     JOIN genes g ON e.gene_integer_id = g.gene_integer_id
        ...     WHERE c.cell_type = 'T cells'
        ...     GROUP BY c.cell_type, g.gene_type
        ... ")
        >>> print(f"Found {len(result)} expression patterns")
        Found 5 expression patterns
    """
    # Create Polars context with all three Lance datasets
    ctx = pl.SQLContext()

    # Register Lance datasets with Polars context
    ctx.register("expression", pl.scan_pyarrow_dataset(self.expression))
    ctx.register("cells", pl.scan_pyarrow_dataset(self.cells))
    ctx.register("genes", pl.scan_pyarrow_dataset(self.genes))

    # Execute the query using Polars SQL
    result = ctx.execute(sql).collect()

    return result

filter_cells(**filters: Any) -> pl.DataFrame

Filter cells based on metadata columns.

Provides a convenient interface for filtering cells using metadata columns. Supports exact matches, list values, and range queries with operators. Uses in-memory polars filtering when metadata is loaded, falls back to SQL otherwise.

Parameters:

Name Type Description Default
**filters Any

Column name and filter value pairs. Supports: - Exact matches: cell_type="T cells" - List values: cell_type=["T cells", "B cells"] - Range queries: total_counts=">1000", total_counts="<=2000" - Multiple conditions: cell_type="T cells", total_counts=">500"

{}

Returns:

Type Description
DataFrame

Polars DataFrame containing filtered cell metadata.

Raises:

Type Description
ValueError

If a specified column is not found in cell metadata.

TypeError

If filter values are of unsupported types.

Examples:

>>> # Filter by cell type
>>> slaf_array = SLAFArray("path/to/data.slaf")
>>> t_cells = slaf_array.filter_cells(cell_type="T cells")
>>> print(f"Found {len(t_cells)} T cells")
Found 250 T cells
>>> # Filter by multiple criteria
>>> high_quality_t_cells = slaf_array.filter_cells(
...     cell_type="T cells",
...     total_counts=">1000",
...     batch=["batch1", "batch2"]
... )
>>> print(f"Found {len(high_quality_t_cells)} high-quality T cells")
Found 180 high-quality T cells
>>> # Range query
>>> medium_counts = slaf_array.filter_cells(
...     total_counts=">=500",
...     total_counts="<=2000"
... )
>>> print(f"Found {len(medium_counts)} cells with medium counts")
Found 450 cells with medium counts
>>> # Error handling for invalid column
>>> try:
...     result = slaf_array.filter_cells(invalid_column="value")
... except ValueError as e:
...     print(f"Error: {e}")
Error: Column 'invalid_column' not found in cell metadata
Source code in slaf/core/slaf.py
def filter_cells(self, **filters: Any) -> pl.DataFrame:
    """
    Filter cells based on metadata columns.

    Provides a convenient interface for filtering cells using metadata columns.
    Supports exact matches, list values, and range queries with operators.
    Uses in-memory polars filtering when metadata is loaded, falls back to SQL otherwise.

    Args:
        **filters: Column name and filter value pairs. Supports:
            - Exact matches: cell_type="T cells"
            - List values: cell_type=["T cells", "B cells"]
            - Range queries: total_counts=">1000", total_counts="<=2000"
            - Multiple conditions: cell_type="T cells", total_counts=">500"

    Returns:
        Polars DataFrame containing filtered cell metadata.

    Raises:
        ValueError: If a specified column is not found in cell metadata.
        TypeError: If filter values are of unsupported types.

    Examples:
        >>> # Filter by cell type
        >>> slaf_array = SLAFArray("path/to/data.slaf")
        >>> t_cells = slaf_array.filter_cells(cell_type="T cells")
        >>> print(f"Found {len(t_cells)} T cells")
        Found 250 T cells

        >>> # Filter by multiple criteria
        >>> high_quality_t_cells = slaf_array.filter_cells(
        ...     cell_type="T cells",
        ...     total_counts=">1000",
        ...     batch=["batch1", "batch2"]
        ... )
        >>> print(f"Found {len(high_quality_t_cells)} high-quality T cells")
        Found 180 high-quality T cells

        >>> # Range query
        >>> medium_counts = slaf_array.filter_cells(
        ...     total_counts=">=500",
        ...     total_counts="<=2000"
        ... )
        >>> print(f"Found {len(medium_counts)} cells with medium counts")
        Found 450 cells with medium counts

        >>> # Error handling for invalid column
        >>> try:
        ...     result = slaf_array.filter_cells(invalid_column="value")
        ... except ValueError as e:
        ...     print(f"Error: {e}")
        Error: Column 'invalid_column' not found in cell metadata
    """
    return self._filter("cells", **filters)

filter_genes(**filters: Any) -> pl.DataFrame

Filter genes based on metadata columns.

Provides a convenient interface for filtering genes using metadata columns. Supports exact matches, list values, and range queries with operators. Uses in-memory polars filtering when metadata is loaded, falls back to SQL otherwise.

Parameters:

Name Type Description Default
**filters Any

Column name and filter value pairs. Supports: - Exact matches: gene_type="protein_coding" - List values: gene_type=["protein_coding", "lncRNA"] - Range queries: expression_mean=">5.0", expression_mean="<=10.0" - Multiple conditions: gene_type="protein_coding", chromosome="chr1"

{}

Returns:

Type Description
DataFrame

Polars DataFrame containing filtered gene metadata.

Raises:

Type Description
ValueError

If a specified column is not found in gene metadata.

TypeError

If filter values are of unsupported types.

Examples:

>>> # Filter by gene type
>>> slaf_array = SLAFArray("path/to/data.slaf")
>>> protein_coding = slaf_array.filter_genes(gene_type="protein_coding")
>>> print(f"Found {len(protein_coding)} protein-coding genes")
Found 15000 protein-coding genes
>>> # Filter by multiple criteria
>>> high_expr_proteins = slaf_array.filter_genes(
...     gene_type="protein_coding",
...     expression_mean=">5.0",
...     chromosome=["chr1", "chr2"]
... )
>>> print(f"Found {len(high_expr_proteins)} high-expression protein genes")
Found 2500 high-expression protein genes
>>> # Range query for expression
>>> medium_expr = slaf_array.filter_genes(
...     expression_mean=">=2.0",
...     expression_mean="<=8.0"
... )
>>> print(f"Found {len(medium_expr)} genes with medium expression")
Found 8000 genes with medium expression
>>> # Error handling for invalid column
>>> try:
...     result = slaf_array.filter_genes(invalid_column="value")
... except ValueError as e:
...     print(f"Error: {e}")
Error: Column 'invalid_column' not found in gene metadata
Source code in slaf/core/slaf.py
def filter_genes(self, **filters: Any) -> pl.DataFrame:
    """
    Filter genes based on metadata columns.

    Provides a convenient interface for filtering genes using metadata columns.
    Supports exact matches, list values, and range queries with operators.
    Uses in-memory polars filtering when metadata is loaded, falls back to SQL otherwise.

    Args:
        **filters: Column name and filter value pairs. Supports:
            - Exact matches: gene_type="protein_coding"
            - List values: gene_type=["protein_coding", "lncRNA"]
            - Range queries: expression_mean=">5.0", expression_mean="<=10.0"
            - Multiple conditions: gene_type="protein_coding", chromosome="chr1"

    Returns:
        Polars DataFrame containing filtered gene metadata.

    Raises:
        ValueError: If a specified column is not found in gene metadata.
        TypeError: If filter values are of unsupported types.

    Examples:
        >>> # Filter by gene type
        >>> slaf_array = SLAFArray("path/to/data.slaf")
        >>> protein_coding = slaf_array.filter_genes(gene_type="protein_coding")
        >>> print(f"Found {len(protein_coding)} protein-coding genes")
        Found 15000 protein-coding genes

        >>> # Filter by multiple criteria
        >>> high_expr_proteins = slaf_array.filter_genes(
        ...     gene_type="protein_coding",
        ...     expression_mean=">5.0",
        ...     chromosome=["chr1", "chr2"]
        ... )
        >>> print(f"Found {len(high_expr_proteins)} high-expression protein genes")
        Found 2500 high-expression protein genes

        >>> # Range query for expression
        >>> medium_expr = slaf_array.filter_genes(
        ...     expression_mean=">=2.0",
        ...     expression_mean="<=8.0"
        ... )
        >>> print(f"Found {len(medium_expr)} genes with medium expression")
        Found 8000 genes with medium expression

        >>> # Error handling for invalid column
        >>> try:
        ...     result = slaf_array.filter_genes(invalid_column="value")
        ... except ValueError as e:
        ...     print(f"Error: {e}")
        Error: Column 'invalid_column' not found in gene metadata
    """
    return self._filter("genes", **filters)

get_cell_expression(cell_ids: str | list[str]) -> pl.DataFrame

Get expression data for specific cells using Lance take() and Polars.

Retrieves expression data for specified cells using efficient Lance row access and Polars for in-memory operations. This method provides significant performance improvements over SQL-based queries.

Parameters:

Name Type Description Default
cell_ids str | list[str]

Single cell ID (string) or list of cell IDs to retrieve. Can be string identifiers or integer IDs.

required

Returns:

Type Description
DataFrame

Polars DataFrame containing expression data for the specified cells.

DataFrame

Columns include cell_id, gene_id, and expression values.

Raises:

Type Description
ValueError

If any cell ID is not found in the dataset.

RuntimeError

If the query execution fails.

Examples:

>>> # Get expression for a single cell
>>> slaf_array = SLAFArray("path/to/data.slaf")
>>> expression = slaf_array.get_cell_expression("cell_001")
>>> print(f"Expression data shape: {expression.shape}")
Expression data shape: (15000, 3)
>>> # Get expression for multiple cells
>>> expression = slaf_array.get_cell_expression(["cell_001", "cell_002", "cell_003"])
>>> print(f"Expression data shape: {expression.shape}")
Expression data shape: (45000, 3)
>>> # Error handling for invalid cell ID
>>> try:
...     expression = slaf_array.get_cell_expression("invalid_cell")
... except ValueError as e:
...     print(f"Error: {e}")
Error: cell ID 'invalid_cell' not found
Source code in slaf/core/slaf.py
def get_cell_expression(self, cell_ids: str | list[str]) -> pl.DataFrame:
    """
    Get expression data for specific cells using Lance take() and Polars.

    Retrieves expression data for specified cells using efficient Lance row access
    and Polars for in-memory operations. This method provides significant
    performance improvements over SQL-based queries.

    Args:
        cell_ids: Single cell ID (string) or list of cell IDs to retrieve.
                 Can be string identifiers or integer IDs.

    Returns:
        Polars DataFrame containing expression data for the specified cells.
        Columns include cell_id, gene_id, and expression values.

    Raises:
        ValueError: If any cell ID is not found in the dataset.
        RuntimeError: If the query execution fails.

    Examples:
        >>> # Get expression for a single cell
        >>> slaf_array = SLAFArray("path/to/data.slaf")
        >>> expression = slaf_array.get_cell_expression("cell_001")
        >>> print(f"Expression data shape: {expression.shape}")
        Expression data shape: (15000, 3)

        >>> # Get expression for multiple cells
        >>> expression = slaf_array.get_cell_expression(["cell_001", "cell_002", "cell_003"])
        >>> print(f"Expression data shape: {expression.shape}")
        Expression data shape: (45000, 3)

        >>> # Error handling for invalid cell ID
        >>> try:
        ...     expression = slaf_array.get_cell_expression("invalid_cell")
        ... except ValueError as e:
        ...     print(f"Error: {e}")
        Error: cell ID 'invalid_cell' not found
    """
    # Convert to integer IDs
    integer_ids = self._normalize_entity_ids(cell_ids, "cell")

    if not integer_ids:
        return pl.DataFrame({"cell_id": [], "gene_id": [], "value": []})

    # Get row indices using RowIndexMapper
    row_indices = self.row_mapper.get_cell_row_ranges(integer_ids)

    # Load data with Lance take()
    expression_data = self.expression.take(row_indices)

    # Convert PyArrow Table to Polars DataFrame and join with metadata
    expression_df = pl.from_arrow(expression_data)
    return self._join_with_metadata(expression_df)

get_gene_expression(gene_ids: str | list[str]) -> pl.DataFrame

Get gene expression data for specified genes.

This method uses QueryOptimizer to generate optimal SQL queries based on the gene ID distribution (consecutive ranges use BETWEEN, scattered values use IN).

Parameters:

Name Type Description Default
gene_ids str | list[str]

Gene ID(s) to query. Can be a single gene ID string or list of gene IDs.

required

Returns:

Type Description
DataFrame

Polars DataFrame containing expression data for the specified genes.

DataFrame

Columns include cell_id, gene_id, and expression values.

Raises:

Type Description
ValueError

If any gene ID is not found in the dataset.

RuntimeError

If the query execution fails.

Examples:

>>> # Get expression for a single gene
>>> slaf_array = SLAFArray("path/to/data.slaf")
>>> expression = slaf_array.get_gene_expression("GENE1")
Source code in slaf/core/slaf.py
def get_gene_expression(self, gene_ids: str | list[str]) -> pl.DataFrame:
    """
    Get gene expression data for specified genes.

    This method uses QueryOptimizer to generate optimal SQL queries based on the
    gene ID distribution (consecutive ranges use BETWEEN, scattered values use IN).

    Args:
        gene_ids: Gene ID(s) to query. Can be a single gene ID string or list of gene IDs.

    Returns:
        Polars DataFrame containing expression data for the specified genes.
        Columns include cell_id, gene_id, and expression values.

    Raises:
        ValueError: If any gene ID is not found in the dataset.
        RuntimeError: If the query execution fails.

    Examples:
        >>> # Get expression for a single gene
        >>> slaf_array = SLAFArray("path/to/data.slaf")
        >>> expression = slaf_array.get_gene_expression("GENE1")
    """

    # Convert to list if single gene ID
    if isinstance(gene_ids, str):
        gene_ids = [gene_ids]

    # Convert string gene IDs to integer IDs
    integer_ids = self._normalize_entity_ids(gene_ids, "gene")

    if not integer_ids:
        return pl.DataFrame({"cell_id": [], "gene_id": [], "value": []})

    # Use QueryOptimizer for optimal performance
    from slaf.core.query_optimizer import QueryOptimizer

    # Build optimized SQL query using QueryOptimizer
    sql_query = QueryOptimizer.build_optimized_query(
        entity_ids=integer_ids,
        entity_type="gene",
        use_adaptive_batching=True,
        max_batch_size=100,
    )

    # Fix the table name for Polars scan (use 'self' instead of 'expression')
    sql_query = sql_query.replace("expression", "self")

    # Execute the optimized query
    ldf = pl.scan_pyarrow_dataset(self.expression)
    expression_df = ldf.sql(sql_query).collect()

    # Join with metadata using Polars
    return self._join_with_metadata(expression_df)

get_submatrix(cell_selector: Any | None = None, gene_selector: Any | None = None) -> pl.DataFrame

Get expression data using cell/gene selectors with Lance take() and Polars.

Retrieves a subset of expression data based on cell and gene selectors. The selectors can be slices, lists, boolean masks, or None for all cells/genes. This method provides a flexible interface for subsetting expression data with significant performance improvements over SQL-based queries.

Parameters:

Name Type Description Default
cell_selector Any | None

Cell selector for subsetting. Can be: - None: Include all cells - slice: e.g., slice(0, 100) for first 100 cells - list: e.g., [0, 5, 10] for specific cell indices - boolean mask: e.g., [True, False, True, ...] for boolean selection

None
gene_selector Any | None

Gene selector for subsetting. Can be: - None: Include all genes - slice: e.g., slice(0, 5000) for first 5000 genes - list: e.g., [0, 100, 200] for specific gene indices - boolean mask: e.g., [True, False, True, ...] for boolean selection

None

Returns:

Type Description
DataFrame

Polars DataFrame containing expression data for the selected subset.

DataFrame

Columns include cell_id, gene_id, and expression values.

Raises:

Type Description
ValueError

If selectors are invalid or out of bounds.

RuntimeError

If the query execution fails.

Examples:

>>> # Get first 100 cells and first 5000 genes
>>> slaf_array = SLAFArray("path/to/data.slaf")
>>> submatrix = slaf_array.get_submatrix(
...     cell_selector=slice(0, 100),
...     gene_selector=slice(0, 5000)
... )
>>> print(f"Submatrix shape: {submatrix.shape}")
Submatrix shape: (500000, 3)
>>> # Get specific cells and genes
>>> submatrix = slaf_array.get_submatrix(
...     cell_selector=[0, 5, 10, 15],
...     gene_selector=[100, 200, 300]
... )
>>> print(f"Submatrix shape: {submatrix.shape}")
Submatrix shape: (12, 3)
>>> # Get all cells for specific genes
>>> submatrix = slaf_array.get_submatrix(
...     gene_selector=[0, 100, 200, 300, 400]
... )
>>> print(f"Submatrix shape: {submatrix.shape}")
Submatrix shape: (5000, 3)
>>> # Error handling for invalid selector
>>> try:
...     submatrix = slaf_array.get_submatrix(
...         cell_selector=slice(0, 1000000)  # Out of bounds
...     )
... except ValueError as e:
...     print(f"Error: {e}")
Error: Cell selector out of bounds
Source code in slaf/core/slaf.py
def get_submatrix(
    self, cell_selector: Any | None = None, gene_selector: Any | None = None
) -> pl.DataFrame:
    """
    Get expression data using cell/gene selectors with Lance take() and Polars.

    Retrieves a subset of expression data based on cell and gene selectors.
    The selectors can be slices, lists, boolean masks, or None for all cells/genes.
    This method provides a flexible interface for subsetting expression data with
    significant performance improvements over SQL-based queries.

    Args:
        cell_selector: Cell selector for subsetting. Can be:
            - None: Include all cells
            - slice: e.g., slice(0, 100) for first 100 cells
            - list: e.g., [0, 5, 10] for specific cell indices
            - boolean mask: e.g., [True, False, True, ...] for boolean selection
        gene_selector: Gene selector for subsetting. Can be:
            - None: Include all genes
            - slice: e.g., slice(0, 5000) for first 5000 genes
            - list: e.g., [0, 100, 200] for specific gene indices
            - boolean mask: e.g., [True, False, True, ...] for boolean selection

    Returns:
        Polars DataFrame containing expression data for the selected subset.
        Columns include cell_id, gene_id, and expression values.

    Raises:
        ValueError: If selectors are invalid or out of bounds.
        RuntimeError: If the query execution fails.

    Examples:
        >>> # Get first 100 cells and first 5000 genes
        >>> slaf_array = SLAFArray("path/to/data.slaf")
        >>> submatrix = slaf_array.get_submatrix(
        ...     cell_selector=slice(0, 100),
        ...     gene_selector=slice(0, 5000)
        ... )
        >>> print(f"Submatrix shape: {submatrix.shape}")
        Submatrix shape: (500000, 3)

        >>> # Get specific cells and genes
        >>> submatrix = slaf_array.get_submatrix(
        ...     cell_selector=[0, 5, 10, 15],
        ...     gene_selector=[100, 200, 300]
        ... )
        >>> print(f"Submatrix shape: {submatrix.shape}")
        Submatrix shape: (12, 3)

        >>> # Get all cells for specific genes
        >>> submatrix = slaf_array.get_submatrix(
        ...     gene_selector=[0, 100, 200, 300, 400]
        ... )
        >>> print(f"Submatrix shape: {submatrix.shape}")
        Submatrix shape: (5000, 3)

        >>> # Error handling for invalid selector
        >>> try:
        ...     submatrix = slaf_array.get_submatrix(
        ...         cell_selector=slice(0, 1000000)  # Out of bounds
        ...     )
        ... except ValueError as e:
        ...     print(f"Error: {e}")
        Error: Cell selector out of bounds
    """
    # Get row indices for cells using RowIndexMapper
    cell_indices = self.row_mapper.get_cell_row_ranges_by_selector(cell_selector)

    # Load data with Lance take()
    expression_data = self.expression.take(cell_indices)
    expression_df = pl.from_arrow(expression_data)

    # Filter by genes if specified
    if gene_selector is not None:
        # Convert gene selector to integer IDs
        if isinstance(gene_selector, int):
            if gene_selector < 0:
                gene_selector = self.shape[1] + gene_selector
            if 0 <= gene_selector < self.shape[1]:
                gene_integer_id = self.var["gene_integer_id"][gene_selector]
                expression_df = expression_df.filter(
                    pl.col("gene_integer_id") == gene_integer_id  # type: ignore[arg-type]
                )
            else:
                raise ValueError(f"Gene index {gene_selector} out of bounds")
        elif isinstance(gene_selector, slice):
            start = gene_selector.start or 0
            stop = gene_selector.stop or self.shape[1]
            step = gene_selector.step or 1

            # Handle negative indices
            if start < 0:
                start = self.shape[1] + start
            if stop < 0:
                stop = self.shape[1] + stop

            # Clamp bounds
            start = max(0, min(start, self.shape[1]))
            stop = max(0, min(stop, self.shape[1]))

            gene_integer_ids = self.var["gene_integer_id"][
                start:stop:step
            ].to_list()
            expression_df = expression_df.filter(
                pl.col("gene_integer_id").is_in(gene_integer_ids)  # type: ignore[arg-type]
            )
        elif isinstance(gene_selector, list):
            gene_integer_ids = []
            for idx in gene_selector:
                if isinstance(idx, int):
                    if idx < 0:
                        idx = self.shape[1] + idx
                    if 0 <= idx < self.shape[1]:
                        gene_integer_id = self.var["gene_integer_id"][idx]
                        gene_integer_ids.append(gene_integer_id)
                    else:
                        raise ValueError(f"Gene index {idx} out of bounds")
                else:
                    raise ValueError(f"Invalid gene index type: {type(idx)}")
            expression_df = expression_df.filter(
                pl.col("gene_integer_id").is_in(gene_integer_ids)  # type: ignore[arg-type]
            )
        elif isinstance(gene_selector, np.ndarray) and gene_selector.dtype == bool:
            if len(gene_selector) != self.shape[1]:
                raise ValueError(
                    f"Boolean mask length {len(gene_selector)} doesn't match gene count {self.shape[1]}"
                )
            gene_integer_ids = self.var["gene_integer_id"][gene_selector].to_list()
            expression_df = expression_df.filter(
                pl.col("gene_integer_id").is_in(gene_integer_ids)  # type: ignore[arg-type]
            )
        else:
            raise ValueError(
                f"Unsupported gene selector type: {type(gene_selector)}"
            )

    # Join with metadata using Polars
    return self._join_with_metadata(expression_df)

info()

Print information about the SLAF dataset

Source code in slaf/core/slaf.py
def info(self):
    """Print information about the SLAF dataset"""
    # Build output string for both logging and printing
    output_lines = []

    output_lines.append("SLAF Dataset")
    output_lines.append(f"  Shape: {self.shape[0]} cells × {self.shape[1]} genes")
    output_lines.append(
        f"  Format version: {self.config.get('format_version', 'unknown')}"
    )

    # Cell metadata columns
    cell_cols = self.obs.columns
    output_lines.append(f"  Cell metadata columns: {len(cell_cols)}")
    if cell_cols:
        output_lines.append(
            f"    {', '.join(cell_cols[:5])}{'...' if len(cell_cols) > 5 else ''}"
        )

    # Gene metadata columns
    gene_cols = self.var.columns
    output_lines.append(f"  Gene metadata columns: {len(gene_cols)}")
    if gene_cols:
        output_lines.append(
            f"    {', '.join(gene_cols[:5])}{'...' if len(gene_cols) > 5 else ''}"
        )

    # Record counts
    output_lines.append("  Record counts:")
    output_lines.append(f"    Cells: {len(self.obs):,}")
    output_lines.append(f"    Genes: {len(self.var):,}")

    # Expression metadata - use pre-computed if available, otherwise query
    format_version = self.config.get("format_version", "0.1")
    if format_version >= "0.2" and "metadata" in self.config:
        # Use pre-computed metadata from config
        metadata = self.config["metadata"]
        expression_count = metadata["expression_count"]
        sparsity = metadata["sparsity"]
        density = metadata["density"]

        output_lines.append(f"    Expression records: {expression_count:,}")
        output_lines.append(f"    Sparsity: {sparsity:.1%}")
        output_lines.append(f"    Density: {density:.1%}")

        # Show expression statistics if available
        if "expression_stats" in metadata:
            stats = metadata["expression_stats"]
            output_lines.append("  Expression statistics:")
            output_lines.append(f"    Min value: {stats['min_value']:.3f}")
            output_lines.append(f"    Max value: {stats['max_value']:.3f}")
            output_lines.append(f"    Mean value: {stats['mean_value']:.3f}")
            output_lines.append(f"    Std value: {stats['std_value']:.3f}")
    else:
        # Backward compatibility: query expression count for older format versions
        output_lines.append("    Expression records: computing...")
        expression_count = self.query("SELECT COUNT(*) as count FROM expression")
        output_lines.append(
            f"    Expression records: {expression_count.item(0, 0):,}"
        )

    # Optimization info
    optimizations = self.config.get("optimizations", {})
    if optimizations:
        output_lines.append("  Optimizations:")
        for opt, value in optimizations.items():
            output_lines.append(f"    {opt}: {value}")

    # Print to stdout for backward compatibility
    output = "\n".join(output_lines)
    print(output)

    # Also log for consistency with other methods
    logger.info("SLAF Dataset info displayed")