Concepts
This page explains the current abstractions and on-disk layout used by
spora_io.
Dataset Format
The current dataset format separates modality-specific image data from shared segmentations and shared tiling:
dataset/
├── metadata/
│   ├── tissues.parquet
│   └── cells.parquet          # optional
├── he/
│   └── 1_0mpp/
│       └── images/
├── imc/
│   ├── channels.parquet
│   ├── channels_per_tissue.parquet
│   └── 1_0mpp/
│       ├── images/
│       └── standardization/
│           └── quantile_clipping/
│               └── uq_0.99_image/
├── ihc/
│   └── ihc_CD3/
│       └── 1_0mpp/
│           └── images/
├── segmentations/
│   └── 1_0mpp/
│       ├── tissue_masks/
│       └── cell_masks/
└── tiling/
    └── 1_0mpp/
        └── default/
            ├── 224_tile_coordinates.parquet
            └── 224_tile_stats.parquet
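As a rough illustration, the layout maps onto paths as sketched below. The helpers images_dir and tile_coordinates_path are hypothetical and not part of spora_io; only the directory and file names come from the tree. Reading the 224 prefix as the tile size is also an assumption.

```python
from pathlib import PurePosixPath


def images_dir(root, modality, resolution="1_0mpp"):
    """Image directory for one modality (hypothetical helper, not spora_io API)."""
    return PurePosixPath(root) / modality / resolution / "images"


def tile_coordinates_path(root, resolution="1_0mpp", strategy="default", tile_size=224):
    """Shared tile-coordinate file; the tile_size prefix is an assumed naming rule."""
    return (
        PurePosixPath(root) / "tiling" / resolution / strategy
        / f"{tile_size}_tile_coordinates.parquet"
    )


print(images_dir("dataset", "he"))
print(images_dir("dataset", "ihc/ihc_CD3"))
print(tile_coordinates_path("dataset"))
```

Image data lives under each modality directory, whereas segmentations and tiling are looked up once per resolution and shared by all modalities.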
Dataset Hierarchy
All unimodal dataset classes inherit from
BaseImagingDataset, which handles metadata,
shared tissue masks, shared cell masks, and tile coordinate loading.
- HEImagingDataset loads RGB H&E images.
- MultiplexImagingDataset loads multiplex images and aligned channel metadata.
- SingleIHCImagingDataset loads single-marker RGB IHC images.
- ComposedImagingDataset wraps several unimodal datasets into a single multi-modal handle.
- SporaDataset wraps multiple composed datasets and samples tissues or tiles across cohorts.
Modality Types
Each imaging modality is represented by a dataclass such as
HEModality or
CycIFModality. These store the modality name
and canonical directory used on disk.
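A minimal sketch of what such a dataclass might look like is given here. ExampleModality and its field names are assumptions made for illustration; the actual definitions of HEModality and CycIFModality may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleModality:
    """Hypothetical stand-in for a modality dataclass (not the spora_io class)."""

    name: str       # modality name used in code
    directory: str  # canonical directory on disk


# Directory names taken from the dataset layout on this page.
he = ExampleModality(name="he", directory="he")
imc = ExampleModality(name="imc", directory="imc")
print(he, imc)
```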
Tissue and Data Types
Loading a tissue returns a typed dataclass:
- HETissue – contains a torch.Tensor of shape (3, H, W) or (H, W, 3) depending on image_mode.
- MultiplexTissue – tensor of shape (C, H, W) plus channel_names, measured_mask, image_loading_mask, and optional uniprot_ids.
- IHCTissue – RGB IHC tissue image plus the marker name.
- ComposedTissue – a dictionary mapping modality names to their respective tissue objects.
- TissueMask / CellMask – binary or integer segmentation masks.
Channel Filtering Pipeline
For multiplex datasets, channels are progressively filtered:
1. complete – all channels measured for the tissue (from channels_per_tissue.parquet).
2. qc_filtered – only channels that pass quality control (qc_pass column in channels.parquet).
3. uniprot_filtered – further restricted to channels with valid UniProt IDs, enabling aligned downstream protein-centric analysis.
Pass kind="complete", kind="qc_filtered", or
kind="uniprot_filtered" to
get_tissue().
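The progressive filtering can be pictured with the following sketch. It uses plain dictionaries instead of parquet tables; filter_channels, the uniprot_id key, and the example channel entries are hypothetical, and only the kind values and the qc_pass column name come from this page.

```python
def filter_channels(measured, channel_table, kind="complete"):
    """Return the channel names of one tissue at the requested filtering stage.

    measured: channel names measured for the tissue.
    channel_table: maps channel name -> {"qc_pass": bool, "uniprot_id": str or None}.
    """
    if kind not in ("complete", "qc_filtered", "uniprot_filtered"):
        raise ValueError(f"unknown kind: {kind!r}")
    channels = list(measured)
    if kind in ("qc_filtered", "uniprot_filtered"):
        # Stage 2: drop channels that failed quality control.
        channels = [c for c in channels if channel_table[c]["qc_pass"]]
    if kind == "uniprot_filtered":
        # Stage 3: additionally require a UniProt ID.
        channels = [c for c in channels if channel_table[c]["uniprot_id"]]
    return channels


# Made-up channel metadata with placeholder IDs.
table = {
    "CD3": {"qc_pass": True, "uniprot_id": "ID_A"},
    "CD20": {"qc_pass": False, "uniprot_id": "ID_B"},
    "DNA1": {"qc_pass": True, "uniprot_id": None},
}
measured = ["CD3", "CD20", "DNA1"]
for kind in ("complete", "qc_filtered", "uniprot_filtered"):
    print(kind, filter_channels(measured, table, kind))
```

Each stage is a subset of the previous one, so uniprot_filtered never returns a channel that qc_filtered would reject.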
Standardization
Multiplex preprocessing is configured via the standardization argument when
constructing MultiplexImagingDataset.
The active implementation is in
spora_io.utils.dataset.standardize. The standardizer reads parquet stats
from:
<modality>/<resolution>/standardization/<spec>/
Typical specs are:
- identity – no transform other than tensor conversion
- quantile_clipping/uq_0.99_image
- quantile_clipping_log1p/uq_0.99_image
The final suffix, such as _image or _global, records the scope over which
the clipping thresholds are computed; the quantile level itself appears in
the preceding uq_0.99 part. The suffix is part of the standardization spec
because the subsequent means and standard deviations depend on that choice.
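To make the clipping step concrete, here is a small pure-Python sketch. It is not the code in spora_io.utils.dataset.standardize: the function names are hypothetical, the threshold is computed on the fly rather than read from parquet stats, and reading uq_0.99 as "upper quantile at 0.99" is an assumption.

```python
import math


def upper_quantile(values, q):
    """Nearest-rank quantile of a flat list of intensities."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def clip_channel(values, q=0.99, log1p=False):
    """Clip intensities at the q-th quantile, optionally followed by log1p."""
    threshold = upper_quantile(values, q)
    clipped = [min(v, threshold) for v in values]
    if log1p:
        clipped = [math.log1p(v) for v in clipped]
    return clipped


# One bright outlier is pulled down to the clipping threshold.
pixels = [1, 2, 3, 4, 5, 6, 7, 100]
print(clip_channel(pixels, q=0.75))
print(clip_channel(pixels, q=0.75, log1p=True))
```

Because the mean and standard deviation are computed after clipping, any change to how the threshold is derived changes the downstream statistics too.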
The factory function
build_standardizer() resolves the
requested spec to the appropriate standardizer class.
For H&E images, normalization uses ImageNet or HIBOU mean/std presets
(controlled by the mean_std_type argument).
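For reference, mean/std normalization of an RGB pixel looks like the sketch below. The values shown are the widely used ImageNet statistics; the HIBOU preset is not reproduced here, and normalize_pixel is a hypothetical helper rather than the library's implementation.

```python
# Standard ImageNet channel statistics for RGB values scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Apply (value - mean) / std per channel to one RGB pixel."""
    return tuple((c - m) / s for c, m, s in zip(rgb, mean, std))


print(normalize_pixel((1.0, 1.0, 1.0)))  # a white pixel
print(normalize_pixel(IMAGENET_MEAN))    # maps to zero in every channel
```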
Tiling
The function
best_mask_tiling_try_to_stop()
computes an optimised set of tile coordinates from a binary tissue mask. The
result is typically persisted as a parquet file with one row per tile:
tissue_id, tile_id, row, col
Two criteria must both be met to stop early:
- Coverage has reached coverage_goal.
- The best remaining tile contributes fewer than min_gain_ratio new pixels.
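The stopping rule can be illustrated with a small greedy cover. This is a sketch under stated assumptions, not the implementation of best_mask_tiling_try_to_stop(): greedy_tiles is a hypothetical name, every in-bounds top-left position is treated as a candidate, and min_gain_ratio is interpreted as newly covered tissue pixels divided by the tile area.

```python
def greedy_tiles(mask, tile, coverage_goal=0.95, min_gain_ratio=0.05):
    """Greedily choose tile top-left corners that cover a binary mask."""
    h, w = len(mask), len(mask[0])
    tissue = {(r, c) for r in range(h) for c in range(w) if mask[r][c]}
    if not tissue:
        return []
    candidates = [(r, c) for r in range(h - tile + 1) for c in range(w - tile + 1)]

    def pixels(r0, c0):
        return {(r, c) for r in range(r0, r0 + tile) for c in range(c0, c0 + tile)}

    covered, chosen = set(), []
    while True:
        # Find the candidate that adds the most uncovered tissue pixels.
        best, best_gain = None, 0
        for r0, c0 in candidates:
            gain = len((pixels(r0, c0) & tissue) - covered)
            if gain > best_gain:
                best, best_gain = (r0, c0), gain
        if best is None:
            break  # no candidate adds any new tissue
        coverage = len(covered) / len(tissue)
        low_gain = best_gain / (tile * tile) < min_gain_ratio
        if coverage >= coverage_goal and low_gain:
            break  # both early-stop criteria are met
        chosen.append(best)
        covered |= pixels(*best) & tissue
    return chosen


full = [[1] * 4 for _ in range(4)]
print(greedy_tiles(full, tile=2))
```

Requiring both conditions means the loop keeps adding tiles past coverage_goal as long as a tile still contributes a meaningful amount of new tissue.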
For baseline or ablation workflows,
get_grid_tile() computes fixed-grid tiles.
It pads the mask extent with background so that edge windows are also
considered, then keeps windows whose tissue fraction is at least
1 - tolerance. These edge windows keep their original top-left row and
col coordinates even when row + tile_size or col + tile_size is
outside the image. Dataset loaders therefore pad edge tiles that cross the
image boundary only for non-default tile strategies; the default strategy
remains a direct-slice fast path.
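The fixed-grid behaviour and the loader-side padding can be sketched as follows. Both functions are hypothetical illustrations rather than the code of get_grid_tile(); in particular, a stride equal to the tile size and zero as the padding value are assumptions.

```python
def grid_tiles(mask, tile, tolerance=0.5):
    """Top-left corners of grid windows with tissue fraction >= 1 - tolerance."""
    h, w = len(mask), len(mask[0])
    kept = []
    # Iterating up to h and w keeps edge windows; pixels beyond the mask
    # are simply not counted, which is equivalent to background padding.
    for r0 in range(0, h, tile):
        for c0 in range(0, w, tile):
            tissue = sum(
                mask[r][c]
                for r in range(r0, min(r0 + tile, h))
                for c in range(c0, min(c0 + tile, w))
            )
            if tissue / (tile * tile) >= 1 - tolerance:
                kept.append((r0, c0))
    return kept


def read_tile(image, r0, c0, tile, pad_value=0):
    """Slice a tile, filling positions outside the image with pad_value."""
    h, w = len(image), len(image[0])
    return [
        [image[r][c] if r < h and c < w else pad_value for c in range(c0, c0 + tile)]
        for r in range(r0, r0 + tile)
    ]


mask = [[1] * 3 for _ in range(3)]
print(grid_tiles(mask, tile=2, tolerance=0.5))
print(read_tile(mask, 0, 2, tile=2))
```

The kept edge window at (0, 2) extends one column past the 3x3 image, which is why the loader has to pad when reading it.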
Resolution Convention
Resolutions are expressed in microns per pixel (MPP) and encoded as directory
names with underscores replacing dots: 1_0mpp for 1.0 MPP, 0_65mpp for
0.65 MPP. Pass either a float (e.g. 1.0) or the directory-name string (e.g. 1_0mpp) directly.
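The naming rule is easy to reproduce. The helper below is a hypothetical sketch of the conversion, not the function spora_io uses internally.

```python
def resolution_dirname(resolution):
    """Return the directory name for a resolution given as a number or string."""
    if isinstance(resolution, str):
        return resolution  # already a directory name such as "1_0mpp"
    # Format as a float so that 1 and 1.0 both become "1_0mpp".
    return str(float(resolution)).replace(".", "_") + "mpp"


print(resolution_dirname(1.0))
print(resolution_dirname(0.65))
print(resolution_dirname("1_0mpp"))
```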