Adding a Dataset

This page describes the expected SPORA dataset layout, the commands needed to generate tile coordinates and multiplex standardization statistics, and the extension point for custom standardization classes.

Dataset Layout

Each dataset lives under SPORA_DATASETS_DIR:

<datasets_dir>/
  <dataset_name>/
    metadata/
      tissues.parquet
      cells.parquet                  # optional
    segmentations/
      1_0mpp/
        tissue_masks/
          <tissue_id>.npz
        cell_masks/
          instances/
            <tissue_id>.npz          # optional
    tiling/
      1_0mpp/
        default/
          224_tile_coordinates.parquet
          224_tile_stats.parquet
          128_tile_coordinates.parquet
          128_tile_stats.parquet
    he/
      1_0mpp/
        images/
          <tissue_id>.ome.zarr
    imc/
      channels.parquet
      channels_per_tissue.parquet
      1_0mpp/
        images/
          <tissue_id>.ome.zarr
        standardization/
          quantile_clipping/
            uq_0.99_image/
              image_level_upper_quantiles.parquet
              global_level_means.parquet
              global_level_stds.parquet
    ihc/
      ihc_<marker>/
        1_0mpp/
          images/
            <tissue_id>.ome.zarr

The multiplex modality directory can be imc, codex, cycif, or mibi. IHC is marker-specific, so each marker is stored as ihc/ihc_<marker>.

Metadata

metadata/tissues.parquet is the source of truth for image-level metadata. It should contain one row per image or tissue-modality entry. The required identifier is tissue_id. If the table has tissue_id as its index, the loaders use that index directly.

Recommended columns are:

Column

Meaning

tissue_id

Unique image identifier, either as a column or index.

dataset

Dataset or cohort name.

patient_id

Harmonized patient identifier.

specimen_id

Harmonized specimen or slide identifier.

modality

Image modality, such as he, imc, codex, cycif, mibi, or ihc_<marker>.

split

Optional split label used by dataset constructors that accept split.

For multiplex modalities, add channel metadata:

File

Required content

<modality>/channels.parquet

Cohort-level channel table with marker names, channel indices, optional UniProt IDs, QC flags, and nuclear-marker flags.

<modality>/channels_per_tissue.parquet

Tissue-by-channel availability table. tissue_id should be the index.

Useful channel columns are channel_name, index, uniprot_id, qc_pass, and is_nuclear_marker. Nuclear stains such as DAPI may have missing UniProt IDs, but should be flagged with is_nuclear_marker when available. If you want nuclear channels to remain available in kind="uniprot_filtered" outputs, instantiate MultiplexImagingDataset with replace_nuclear_uniprot_ids=True. This replaces UniProt IDs for channels where is_nuclear_marker is true with Histone H3’s UniProt ID P68431 at load time.

Images And Masks

Images are stored as OME-Zarr directories under the modality’s <resolution>/images directory. The current loaders read the image scale from the OME-Zarr store and return channel-first tensors.

Tissue masks are shared across modalities:

<dataset_name>/segmentations/1_0mpp/tissue_masks/<tissue_id>.npz

The .npz file must contain a mask array. Its height and width should match the image at the same resolution. Cell instance masks are optional and use the same shared segmentations tree.

Validate Basic Loading

Before computing derived files, validate that the dataset can be opened:

from spora_io.datasets import HEImagingDataset, MultiplexImagingDataset

he = HEImagingDataset(
    name="my_dataset",
    resolution=1.0,
    tile_size=None,
)
print(len(he.get_tissue_ids()))

imc = MultiplexImagingDataset(
    name="my_dataset",
    modality="imc",
    resolution=1.0,
    tile_size=None,
    standardization="identity",
)
tissue = imc.get_tissue(imc.get_tissue_ids()[0], preprocess=False)
print(tissue.image.shape)

Generate Tiling

Tile coordinates are generated from the shared tissue masks. They are saved as parquet files with one row per tile:

tissue_id | tile_id | row | col

Run tiling once per tile size:

python -m scripts.compute_tiling \
 --dataset-name my_dataset \
 --resolution 1.0 \
 --tiling-method default \
 --tile-size 224 \
 --stride 112 \
 --tolerance 0.85 \
 --coverage-goal 1.0 \
 --min-gain-ratio 0.05

python -m scripts.compute_tiling \
 --dataset-name my_dataset \
 --resolution 1.0 \
 --tiling-method default \
 --tile-size 128 \
 --stride 64 \
 --tolerance 0.85 \
 --coverage-goal 1.0 \
 --min-gain-ratio 0.05

Outputs are written to:

<dataset_name>/tiling/1_0mpp/default/224_tile_coordinates.parquet
<dataset_name>/tiling/1_0mpp/default/224_tile_stats.parquet

If both output files already exist, the script skips computation unless --overwrite is passed.

To generate a fixed-grid baseline instead of adaptive greedy tiles, pass --grid and use a non-default strategy name:

python -m scripts.compute_tiling \
 --dataset-name my_dataset \
 --resolution 1.0 \
 --tiling-method grid_stride224 \
 --grid \
 --tile-size 224 \
 --stride 224 \
 --tolerance 0.85 \
 --overwrite

For grid tiling, --tolerance 0.85 keeps tiles with at least 15% tissue. Grid tiles at image edges are allowed to extend beyond the image; dataset loaders pad those edge tiles with zeros when tile_strategy is not default. Do not save grid outputs under tiling/.../default.

Generate Standardization Statistics

Standardization statistics are computed per multiplex modality. The current supported methods are quantile_clipping and quantile_clipping_log1p.

Typical commands:

python -m scripts.compute_standardization_stats \
   --dataset-name my_dataset \
   --modality imc \
   --method quantile_clipping \
   --quantile-level image \
   --stats-level global \
   --upper-quantile 0.99 \
   --resolution 1.0

python -m scripts.compute_standardization_stats \
   --dataset-name my_dataset \
   --modality imc \
   --method quantile_clipping_log1p \
   --quantile-level image \
   --stats-level global \
   --upper-quantile 0.99 \
   --resolution 1.0

The first command writes files under:

<dataset_name>/imc/1_0mpp/standardization/quantile_clipping/uq_0.99_image/

The _image suffix comes from --quantile-level image. If --quantile-level global is used, the directory is uq_0.99_global. This suffix is required because means and standard deviations are computed after clipping and therefore depend on the quantile level.

Load the generated statistics by passing the same spec to the dataset:

ds = MultiplexImagingDataset(
    name="my_dataset",
    modality="imc",
    resolution=1.0,
    tile_size=224,
    standardization="quantile_clipping/uq_0.99_image",
    quantile_level="image",
    stats_level="global",
)

Implement A Custom Standardizer

Custom standardization classes live in spora_io/utils/dataset/standardize.py.

Use BaseStandardizer if the transform does not need saved parquet statistics. Use StatsBackedStandardizer if the transform should load quantiles, means, or standard deviations from <modality>/<resolution>/standardization/<spec>.

Example stats-backed standardizer:

class SqrtQuantileClippingStandardizer(StatsBackedStandardizer):
    def _transform(
        self,
        x_t: torch.Tensor,
        upper_t: torch.Tensor,
        lower_t: torch.Tensor,
    ) -> torch.Tensor:
        x_t = torch.clamp(x_t, min=lower_t, max=upper_t)
        x_t = (x_t - lower_t) / (upper_t - lower_t + 1e-8)
        return torch.sqrt(torch.clamp(x_t, min=0.0))

Register the class in build_standardizer:

if method == "quantile_clipping_sqrt":
    return SqrtQuantileClippingStandardizer(
        spec=spec,
        use_mean_std=use_mean_std,
        **kwargs_common,
    )

If the method needs the standard statistics script to generate files, also add the method name to VALID_METHODS in scripts/compute_standardization_stats.py. The directory name written by the script must match the method name used in build_standardizer.

After registering the class, test it by loading one tissue manually:

ds = MultiplexImagingDataset(
    name="my_dataset",
    modality="imc",
    resolution=1.0,
    tile_size=None,
    standardization="quantile_clipping_sqrt/uq_0.99_image",
)
tissue = ds.get_tissue(ds.get_tissue_ids()[0])
print(tissue.image.shape)

Final Checklist

For a new dataset, verify the following before using it in training:

  • metadata/tissues.parquet loads and exposes the expected tissue IDs.

  • Every image has a matching tissue mask at the same resolution.

  • Multiplex modalities have channels.parquet and channels_per_tissue.parquet.

  • 224_tile_coordinates.parquet and 128_tile_coordinates.parquet exist under tiling/1_0mpp/default.

  • quantile_clipping/uq_0.99_image and quantile_clipping_log1p/uq_0.99_image exist for every multiplex modality that will be standardized.

  • A small HEImagingDataset, MultiplexImagingDataset, or SporaDataset instance can load one tissue or tile without errors.