Concepts

This page explains the current abstractions and on-disk layout used by spora_io.

Dataset Format

The current dataset format separates modality-specific image data from shared segmentations and shared tiling:

dataset/
├── metadata/
│   ├── tissues.parquet
│   └── cells.parquet                # optional
├── he/
│   └── 1_0mpp/
│       └── images/
├── imc/
│   ├── channels.parquet
│   ├── channels_per_tissue.parquet
│   └── 1_0mpp/
│       ├── images/
│       └── standardization/
│           └── quantile_clipping/
│               └── uq_0.99_image/
├── ihc/
│   └── ihc_CD3/
│       └── 1_0mpp/
│           └── images/
├── segmentations/
│   └── 1_0mpp/
│       ├── tissue_masks/
│       └── cell_masks/
└── tiling/
    └── 1_0mpp/
        └── default/
            ├── 224_tile_coordinates.parquet
            └── 224_tile_stats.parquet

Dataset Hierarchy

All unimodal dataset classes inherit from BaseImagingDataset, which handles metadata, shared tissue masks, shared cell masks, and tile coordinate loading.

Modality Types

Each imaging modality is represented by a dataclass such as HEModality or CycIFModality. These store the modality name and canonical directory used on disk.
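
As a rough illustration of that idea, a modality descriptor could look like the sketch below. The class name ModalitySpec, its fields, the images_dir helper, and the two instances are hypothetical; the real HEModality and related classes may be defined differently.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ModalitySpec:
    """Hypothetical stand-in for classes such as HEModality."""

    name: str       # modality name used in code
    directory: str  # canonical directory, relative to the dataset root

    def images_dir(self, root: str, resolution_dir: str) -> PurePosixPath:
        # Follows the <modality>/<resolution>/images/ layout of the dataset tree.
        return PurePosixPath(root) / self.directory / resolution_dir / "images"


# Illustrative instances; the names and directories are assumptions.
HE = ModalitySpec(name="he", directory="he")
IHC_CD3 = ModalitySpec(name="ihc_CD3", directory="ihc/ihc_CD3")
```

Keeping the directory on the descriptor means path construction lives in one place instead of being repeated in every loader.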

Tissue and Data Types

Loading a tissue returns a typed dataclass:

  • HETissue – contains a torch.Tensor of shape (3, H, W) or (H, W, 3) depending on image_mode.

  • MultiplexTissue – tensor of shape (C, H, W) plus channel_names, measured_mask, image_loading_mask, and optional uniprot_ids.

  • IHCTissue – RGB IHC tissue image plus the marker name.

  • ComposedTissue – a dictionary mapping modality names to their respective tissue objects.

  • TissueMask / CellMask – binary or integer segmentation masks.

Channel Filtering Pipeline

For multiplex datasets, channels are progressively filtered:

  1. complete – all channels measured for the tissue (from channels_per_tissue.parquet).

  2. qc_filtered – only channels that pass quality control (qc_pass column in channels.parquet).

  3. uniprot_filtered – further restricted to channels with valid UniProt IDs, enabling aligned downstream protein-centric analysis.

Pass kind="complete", kind="qc_filtered", or kind="uniprot_filtered" to get_tissue().
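
The three stages can be sketched in plain Python as follows. This is not the spora_io implementation: filter_channels is a hypothetical helper, the channel table is toy data, and the uniprot_id key is an assumed column name (only qc_pass is named in the text above).

```python
def filter_channels(measured, channel_table, kind="complete"):
    """Return the channel names kept at the requested filtering stage.

    measured: channel names measured for one tissue.
    channel_table: dict mapping name -> {"qc_pass": bool, "uniprot_id": str or None}.
    """
    if kind not in ("complete", "qc_filtered", "uniprot_filtered"):
        raise ValueError(f"unknown kind: {kind!r}")
    kept = list(measured)
    if kind in ("qc_filtered", "uniprot_filtered"):
        # Stage 2: drop channels that failed quality control.
        kept = [c for c in kept if channel_table[c]["qc_pass"]]
    if kind == "uniprot_filtered":
        # Stage 3: additionally require a UniProt ID.
        kept = [c for c in kept if channel_table[c]["uniprot_id"]]
    return kept


# Toy channel table for illustration only.
channel_table = {
    "CD3": {"qc_pass": True, "uniprot_id": "P07766"},
    "DNA1": {"qc_pass": True, "uniprot_id": None},
    "CD20": {"qc_pass": False, "uniprot_id": "P11836"},
}
measured = ["CD3", "DNA1", "CD20"]
```

Each stage is a subset of the previous one, so uniprot_filtered never returns a channel that failed QC.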

Standardization

Multiplex preprocessing is configured via the standardization argument when constructing MultiplexImagingDataset.

The active implementation is in spora_io.utils.dataset.standardize. The standardizer reads parquet stats from:

<modality>/<resolution>/standardization/<spec>/

Typical specs are:

  • identity – no transform other than tensor conversion

  • quantile_clipping/uq_0.99_image

  • quantile_clipping_log1p/uq_0.99_image

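The final suffix, such as _image or _global, records the scope over which the clipping thresholds are computed (for example per image or across the whole dataset); the quantile level itself is the number in the name, such as 0.99. The suffix is part of the standardization spec because the subsequent means and standard deviations depend on that choice.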

The factory function build_standardizer() resolves the requested spec to the appropriate standardizer class.
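
The directory convention and the general clip-then-standardize idea can be sketched as below. All three helpers are hypothetical and are not part of spora_io. In particular, clipping only at an upper threshold, applying log1p after clipping, and finishing with a mean/std shift are assumptions inferred from the spec names, not a description of the actual standardizer classes.

```python
import math
from pathlib import PurePosixPath


def standardization_dir(root, modality, resolution_dir, spec):
    """Build <root>/<modality>/<resolution>/standardization/<spec>/."""
    return PurePosixPath(root) / modality / resolution_dir / "standardization" / spec


def split_spec(spec):
    """Split 'quantile_clipping/uq_0.99_image' into (method, variant)."""
    method, _, variant = spec.partition("/")
    return method, variant or None


def clip_and_standardize(values, threshold, mean, std, log1p=False):
    """Clip at an upper threshold, optionally log1p, then shift and scale.

    The threshold, mean, and std would come from the precomputed stats;
    here they are passed in directly.
    """
    out = []
    for v in values:
        v = min(v, threshold)      # assumed: upper-quantile clipping only
        if log1p:
            v = math.log1p(v)      # assumed order: clip first, then log1p
        out.append((v - mean) / std)
    return out
```

For example, standardization_dir("dataset", "imc", "1_0mpp", "quantile_clipping/uq_0.99_image") yields the directory shown in the layout tree at the top of this page.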

For H&E images, normalization uses ImageNet or HIBOU mean/std presets (controlled by the mean_std_type argument).
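
For reference, per-channel mean/std normalization with the widely used ImageNet statistics looks like the sketch below. The helper normalize_rgb is hypothetical and works on a single pixel for clarity; the HIBOU preset values are not reproduced here, and how mean_std_type selects a preset inside spora_io is not shown.

```python
# Standard ImageNet channel statistics for RGB values scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_rgb(pixel, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Normalize one RGB pixel channel by channel: (value - mean) / std."""
    return tuple((v - m) / s for v, m, s in zip(pixel, mean, std))
```

A pixel equal to the channel means maps to zero in every channel, which is a quick sanity check for any preset.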

Tiling

The function best_mask_tiling_try_to_stop() computes an optimised set of tile coordinates from a binary tissue mask. The result is typically persisted as a parquet file with one row per tile:

  • tissue_id

  • tile_id

  • row

  • col

Two criteria must both be met to stop early:

  1. Coverage has reached coverage_goal.

  2. The best remaining tile's share of newly covered pixels falls below min_gain_ratio.
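
A minimal greedy sketch of this kind of tiling with early stopping is shown below. It is not the code behind best_mask_tiling_try_to_stop(): the function greedy_tiling, the stride parameter, the candidate grid, and measuring the gain relative to the tile area are all assumptions made for illustration. The stop rule follows the two criteria as stated above.

```python
def greedy_tiling(mask, tile_size, stride, coverage_goal=0.95, min_gain_ratio=0.05):
    """Greedily pick tiles that cover the most still-uncovered tissue pixels.

    mask: list of rows holding 0/1 values (1 = tissue).
    Returns a list of (row, col) top-left tile coordinates.
    """
    height, width = len(mask), len(mask[0])
    uncovered = {(r, c) for r in range(height) for c in range(width) if mask[r][c]}
    total = len(uncovered)
    if total == 0:
        return []
    candidates = [
        (r, c)
        for r in range(0, height - tile_size + 1, stride)
        for c in range(0, width - tile_size + 1, stride)
    ]
    tile_area = tile_size * tile_size

    def tile_pixels(tile):
        r0, c0 = tile
        return {(r, c) for r in range(r0, r0 + tile_size) for c in range(c0, c0 + tile_size)}

    def gain(tile):
        # Number of tissue pixels this tile would newly cover.
        return len(tile_pixels(tile) & uncovered)

    chosen = []
    while candidates:
        best = max(candidates, key=gain)
        best_gain = gain(best)
        if best_gain == 0:
            break  # nothing left that any candidate can cover
        coverage = (total - len(uncovered)) / total
        # Early stop: coverage goal reached and the best gain is too small.
        if coverage >= coverage_goal and best_gain / tile_area < min_gain_ratio:
            break
        chosen.append(best)
        candidates.remove(best)
        uncovered -= tile_pixels(best)
    return chosen
```

Each returned pair corresponds to the row and col columns of the persisted table; tissue_id and tile_id would be attached when the rows are written out.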

For baseline or ablation workflows, get_grid_tile() computes fixed-grid tiles. It pads the mask extent with background so that edge windows are also considered, then keeps windows whose tissue fraction is at least 1 - tolerance. These edge windows keep their original top-left row and col coordinates even when row + tile_size or col + tile_size falls outside the image. Dataset loaders therefore pad tiles only for non-default tile strategies, and only when an edge tile crosses the image boundary; the default strategy keeps its direct-slice fast path.
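
The fixed-grid behaviour, including overhanging edge windows, can be sketched as follows. Both helpers are hypothetical and are not the get_grid_tile() implementation; in particular, computing the tissue fraction over the full tile area (so that out-of-image pixels count as background) is an assumption that follows from the padding described above.

```python
def grid_tiles(mask, tile_size, tolerance=0.5):
    """Fixed-grid tiling that also considers windows overhanging the image edge.

    mask: list of rows holding 0/1 values (1 = tissue).
    Returns (row, col) top-left coordinates of the windows that are kept.
    """
    height, width = len(mask), len(mask[0])
    tile_area = tile_size * tile_size
    kept = []
    for row in range(0, height, tile_size):
        for col in range(0, width, tile_size):
            # Only in-image pixels can be tissue; the rest is implicit background.
            tissue = sum(
                mask[r][c]
                for r in range(row, min(row + tile_size, height))
                for c in range(col, min(col + tile_size, width))
            )
            if tissue / tile_area >= 1 - tolerance:
                kept.append((row, col))
    return kept


def read_tile(image, row, col, tile_size, fill=0):
    """Slice a tile, padding with `fill` where it crosses the image boundary."""
    height, width = len(image), len(image[0])
    return [
        [image[r][c] if r < height and c < width else fill
         for c in range(col, col + tile_size)]
        for r in range(row, row + tile_size)
    ]
```

Because edge windows keep their original coordinates, a reader such as read_tile has to pad at load time rather than shifting the window back inside the image.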

Resolution Convention

Resolutions are expressed in microns per pixel (MPP) and encoded as directory names with underscores replacing dots: 1_0mpp for 1.0 MPP, 0_65mpp for 0.65 MPP. Pass either a float (e.g. 1.0) or the directory-name string (e.g. 1_0mpp).
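
The encoding can be reproduced with a small helper like the one below. The function name resolution_dir is hypothetical, and spora_io may normalize its inputs differently; this is only a sketch of the naming rule.

```python
def resolution_dir(resolution):
    """Return the directory name for a resolution given as a float or a string."""
    if isinstance(resolution, str):
        return resolution  # already encoded, e.g. "1_0mpp"
    # float() ensures that an integer such as 1 is rendered as "1.0".
    return f"{float(resolution)}".replace(".", "_") + "mpp"
```

With this rule, 1.0 and 1 both map to 1_0mpp, and 0.65 maps to 0_65mpp.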