Quick Start
Loading an H&E dataset
from spora_io import HEImagingDataset
dataset = HEImagingDataset(
    name="my_dataset",
    resolution=1.0,
    tile_size=224,
)
tissue_ids = dataset.get_tissue_ids()
tissue = dataset.get_tissue(tissue_ids[0])
# tissue.image is a torch.Tensor of shape (3, H, W)
Loading a multiplex dataset
from spora_io import MultiplexImagingDataset
dataset = MultiplexImagingDataset(
    name="my_dataset",
    modality="cycif",  # or "imc", "codex", "mibi", etc.
    standardization="identity",  # or "quantile_clipping/uq_0.99_image"
    resolution=1.0,
    tile_size=224,
    replace_nuclear_uniprot_ids=True,  # optional; maps nuclear markers to P68431
)
tissue_ids = dataset.get_tissue_ids()
tissue = dataset.get_tissue(tissue_ids[0], kind="uniprot_filtered")
# tissue.image is a torch.Tensor of shape (C, H, W)
# tissue.channel_names contains the marker names
# tissue.uniprot_ids contains the aligned UniProt IDs when available
Composing multiple modalities
from spora_io import ComposedImagingDataset
dataset = ComposedImagingDataset(
    name="my_dataset",
    modalities=["he", "cycif"],
    resolution=1.0,
    tile_size=224,
    split="train",  # optional; filters tissues.parquet["split"]
    modality_kwargs={"cycif": {"standardization": "identity"}},
)
# tissue_ids is reused from the examples above
composed = dataset.get_composed_tissue(tissue_ids[0])
# composed.modalities["he"] -> HETissue
# composed.modalities["cycif"] -> MultiplexTissue
Loading across cohorts
Use SporaDataset when training over multiple datasets (cohorts). It builds one
composed dataset per cohort and samples either tissues or tiles, as selected by
sampling_unit, from a global index.
from spora_io import SporaDataset
dataset = SporaDataset(
    ["dataset_a", "dataset_b"],
    modalities=["he", "codex"],
    resolution=1.0,
    tile_size=224,
    sampling_unit="tiles",
    split="train",  # optional; forwarded to each composed dataset
    modality_kwargs={"codex": {"standardization": "identity"}},
)
sample = dataset.sample_random_tile(preprocess=False)
# sample["dataset_name"], sample["tissue_id"], sample["tile_id"]
# sample["modalities"]["he"] -> HETissue tile
# sample["modalities"]["codex"] -> MultiplexTissue tile
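The idea of a global index over cohorts can be illustrated with a minimal, standard-library sketch. It does not use spora_io: the tiles_per_tissue table, the tissue names, and the uniform sampling are all assumptions made for illustration, and the library's actual index layout and sampling weights may differ.

```python
import random

# Hypothetical tile counts per tissue in each cohort (illustrative values only).
tiles_per_tissue = {
    "dataset_a": {"tissue_1": 3, "tissue_2": 2},
    "dataset_b": {"tissue_9": 4},
}

# Flatten every (dataset, tissue, tile) triple into one global index.
global_index = [
    (dataset_name, tissue_id, tile_id)
    for dataset_name, tissues in tiles_per_tissue.items()
    for tissue_id, n_tiles in tissues.items()
    for tile_id in range(n_tiles)
]

# Assumption: draw uniformly over tiles, so larger cohorts are sampled more often.
rng = random.Random(0)
dataset_name, tissue_id, tile_id = rng.choice(global_index)
print(len(global_index), dataset_name, tissue_id, tile_id)
```

Sampling over tissues instead would use an index of (dataset, tissue) pairs, which gives every tissue the same weight regardless of how many tiles it contains.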
Retrieving tiles
Tiles are fixed-size image regions precomputed from shared tissue masks and stored in
tiling/<resolution>/<strategy>/<size>_tile_coordinates.parquet.
The built-in default tile strategy assumes that tile coordinates stay within the
image bounds and loads each tile by slicing the image directly. If you generate
padded fixed-grid tiles with scripts.compute_tiling --grid, save them under a
non-default strategy such as grid_stride224 and pass that name as tile_strategy.
Out-of-bounds edge tiles are then loaded as partial image regions and zero-padded
back to the requested tile size.
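The zero padding of out-of-bounds edge tiles can be sketched with NumPy as follows. This is an illustration of the idea, not the library's loading path: load_tile_zero_padded and its (y, x) top-left coordinate convention are hypothetical.

```python
import numpy as np


def load_tile_zero_padded(image, y, x, tile_size):
    """Crop a (C, H, W) array at top-left (y, x), zero-filling parts outside the image."""
    channels, height, width = image.shape
    tile = np.zeros((channels, tile_size, tile_size), dtype=image.dtype)
    # Clamp the requested window to the image bounds.
    y0, x0 = max(y, 0), max(x, 0)
    y1, x1 = min(y + tile_size, height), min(x + tile_size, width)
    if y1 > y0 and x1 > x0:
        # Copy the in-bounds part to the matching offset inside the tile.
        tile[:, y0 - y:y1 - y, x0 - x:x1 - x] = image[:, y0:y1, x0:x1]
    return tile


image = np.ones((3, 300, 300), dtype=np.float32)
edge_tile = load_tile_zero_padded(image, y=224, x=224, tile_size=224)
print(edge_tile.shape)  # (3, 224, 224)
```

Here only the top-left 76 x 76 pixels of the edge tile hold image data; the rest stays zero. A tile that lies fully inside the image reduces to a plain slice, which matches the direct-slice behavior described for the default strategy.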
Pass a tissue_id and tile_id to retrieve a single tile:
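# he_dataset and mx_dataset are the HEImagingDataset and MultiplexImagingDataset
# instances created in the sections above.
# H&E tile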
he_tile = he_dataset.get_tile(tissue_ids[0], tile_id=0)
# he_tile.image shape: (3, 224, 224)
# Multiplex tile
mx_tile = mx_dataset.get_tile(tissue_ids[0], tile_id=0, kind="complete")
# mx_tile.image shape: (C, 224, 224)
Inspecting channel metadata
The multiplex dataset exposes channel-level metadata:
# Full channel list (DataFrame)
dataset.channel_list[["channel_name", "qc_pass", "uniprot_id", "is_nuclear_marker"]]
# Optional constructor behavior:
# replace_nuclear_uniprot_ids=True sets uniprot_id="P68431" for channels
# where is_nuclear_marker is true.
# Per-tissue channel availability matrix
dataset.image_channel_map.head()
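The effect of replace_nuclear_uniprot_ids=True on the channel table can be pictured with a small pandas sketch. The toy table below is made up for illustration and only reuses the column names shown above; how the library applies the replacement internally is not shown in this guide.

```python
import pandas as pd

NUCLEAR_UNIPROT_ID = "P68431"

# Toy channel table with the columns shown above (illustrative values only).
channel_list = pd.DataFrame(
    {
        "channel_name": ["DAPI", "CD3", "Hoechst"],
        "uniprot_id": [None, "P07766", None],
        "is_nuclear_marker": [True, False, True],
    }
)

# Work on a copy so the original table stays untouched.
replaced = channel_list.copy()
replaced.loc[replaced["is_nuclear_marker"], "uniprot_id"] = NUCLEAR_UNIPROT_ID
print(replaced[["channel_name", "uniprot_id"]])
```

Giving every nuclear stain the same identifier lets channels such as DAPI and Hoechst, which have no protein target of their own, line up across datasets under one shared ID.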
Working with masks
# Tissue mask (background vs tissue)
mask = dataset.get_tissue_mask(tissue_ids[0])
# mask.mask is a boolean NDArray of shape (H, W)
# Cell instance mask (integer labels per cell)
cell_mask = dataset.get_cell_instance_mask(tissue_ids[0])
# cell_mask.mask is an integer NDArray of shape (H, W)
# Cell task masks (e.g. cell type annotations)
mask_types = dataset.get_cell_task_mask_types()
task_mask = dataset.get_cell_task_mask(tissue_ids[0], mask_types[0])
# task_mask.mapping maps integer IDs to label strings
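As a sketch of how such a mapping might be used, the snippet below counts pixels per label in a small integer mask. The mask values, the label names, and the convention that 0 marks background are assumptions for illustration; check the actual task_mask.mapping for the IDs your dataset uses.

```python
import numpy as np

# Hypothetical stand-ins for task_mask.mask and task_mask.mapping.
mask = np.array([[0, 1, 1], [2, 2, 0]])
mapping = {1: "tumor", 2: "stroma"}

# Assumption: 0 is background and has no entry in the mapping.
ids, counts = np.unique(mask[mask > 0], return_counts=True)
pixel_counts = {mapping[int(i)]: int(c) for i, c in zip(ids, counts)}
print(pixel_counts)  # {'tumor': 2, 'stroma': 2}
```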
Label filtering
Filter tissues by metadata columns at dataset construction time:
dataset = HEImagingDataset(
    name="my_dataset",
    resolution=1.0,
    tile_size=224,
    label="Histology",
    labels_to_keep=["Adenocarcinoma", "Mucinous"],
    label_type="classification",
)
# Only tissues matching the filter are loaded
print(dataset.unique_labels) # array of kept label values
print(dataset.label_encoder) # {label: int} mapping
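What the filter and the {label: int} mapping amount to can be sketched with NumPy. The metadata values are made up, and the sorted ordering of the encoder is an assumption; the library may assign integers in a different order.

```python
import numpy as np

# Hypothetical "Histology" values, one per tissue.
histology = np.array(["Adenocarcinoma", "Squamous", "Mucinous", "Adenocarcinoma"])
labels_to_keep = ["Adenocarcinoma", "Mucinous"]

# Keep only tissues whose label is in labels_to_keep.
keep = np.isin(histology, labels_to_keep)
unique_labels = np.unique(histology[keep])  # sorted unique label values

# Assumption: integers are assigned in sorted label order.
label_encoder = {str(label): index for index, label in enumerate(unique_labels)}
encoded = [label_encoder[str(label)] for label in histology[keep]]
print(label_encoder)  # {'Adenocarcinoma': 0, 'Mucinous': 1}
print(encoded)  # [0, 1, 0]
```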