spateo.tools.cluster

Classes

pySTAGATE

Class representing the object of pySTAGATE.

Functions

spagcn_vanilla(→ Optional[anndata.AnnData])

Integrating gene expression and spatial location to identify spatial domains via SpaGCN.

CAST(adata[, sample_key, basis, layer, n_components, ...])

CAST is a Python library for physically aligning different spatial transcriptomics samples regardless of technologies, magnification, individual variation, and experimental batch effects. CAST is composed of three modules: CAST Mark, CAST Stack, and CAST Projection.

kmeans_clustering(adata[, n_clusters, use_rep, ...])

KMeans clustering for spatial transcriptomics data.

mclust_py(adata[, n_components, use_rep, modelNames, ...])

Clustering using Gaussian Mixture Model (GMM), similar to mclust in R.

scc(→ Optional[anndata.AnnData])

Spatially constrained clustering (scc) to identify continuous tissue domains.

smooth(→ list)

Optimize the label by majority voting in the neighborhood.

spagcn_pyg(→ Optional[anndata.AnnData])

Function to find clusters with spagcn.

compute_pca_components(→ Tuple[Any, int, float])

Calculate the inflection point of the PCA curve to obtain the number of principal components that the PCA should retain.

ecp_silhouette(→ float)

Here we evaluate the clustering performance by calculating the Silhouette Coefficient.

integrate(→ anndata.AnnData)

Concatenating all anndata objects.

pca_spateo(adata[, X_data, n_pca_components, pca_key, ...])

Do PCA for dimensional reduction.

pearson_residuals(adata[, n_top_genes, subset, theta, ...])

Preprocess UMI count data with analytic Pearson residuals.

Package Contents

class spateo.tools.cluster.pySTAGATE(adata: anndata.AnnData, num_batch_x, num_batch_y, basis='spatial', spatial_key: list = ['X', 'Y'], batch_size: int = 1, rad_cutoff: int = 200, num_epoch: int = 1000, lr: float = 0.001, weight_decay: float = 0.0001, hidden_dims: list = [512, 30], device: str = 'cuda:0')[source]

Class representing the object of pySTAGATE.

device
loader
num_epoch = 1000
lr = 0.001
weight_decay = 0.0001
hidden_dims = [512, 30]
adata
data
model
optimizer
train()[source]

Train the STAGATE model.

predicted()[source]

Predict the STAGATE representation and ReX values for all cells.

cal_pSM(n_neighbors: int = 20, resolution: int = 1, max_cell_for_subsampling: int = 5000, psm_key='pSM_STAGATE')[source]

Calculate the pseudo-spatial map using the diffusion pseudotime (DPT) algorithm.

Parameters:
n_neighbors int

Number of neighbors for constructing the kNN graph.

resolution float

Resolution for clustering.

max_cell_for_subsampling int

Maximum number of cells for subsampling. If the number of cells is larger than this value, the subsampling will be performed.

Returns:

pSM_values – The pseudo-spatial map values.

Return type:

numpy.ndarray

spateo.tools.cluster.spagcn_vanilla(adata: anndata.AnnData, spatial_key: str = 'spatial', key_added: str | None = 'spagcn_pred', n_pca_components: int | None = None, e_neigh: int = 10, resolution: float = 0.4, n_clusters: int | None = None, refine_shape: Literal['hexagon', 'square'] = 'hexagon', p: float = 0.5, seed: int = 100, numIterMaxSpa: int = 2000, copy: bool = False) anndata.AnnData | None[source]

Integrating gene expression and spatial location to identify spatial domains via SpaGCN. Original Code Repository: https://github.com/jianhuupenn/SpaGCN

Reference:

Jian Hu, Xiangjie Li, Kyle Coleman, Amelia Schroeder, Nan Ma, David J. Irwin, Edward B. Lee, Russell T. Shinohara & Mingyao Li. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nature Methods volume 18, pages 1342–1351 (2021)

Parameters:
adata

An Anndata object after normalization.

spatial_key

the key in .obsm that corresponds to the spatial coordinate of each bucket.

key_added

adata.obs key under which to add the cluster labels. The initial clustering results of SpaGCN are under key_added, and the refined clustering results are under f'{key_added}_refined'.

n_pca_components

Number of principal components to compute. If n_pca_components == None, the value at the inflection point of the PCA curve is automatically calculated as n_comps.

e_neigh

Number of nearest neighbors in gene expression space. Used in dyn.pp.neighbors(adata, n_neighbors=e_neigh).

resolution

Resolution in the Louvain clustering method. Used when `n_clusters`==None.

n_clusters

Number of spatial domains wanted. If n_clusters != None, the suitable resolution in the initial Louvain clustering method will be automatically searched based on n_clusters.

refine_shape

Smooth the spatial domains with the given spatial topology: “hexagon” for Visium data, “square” for ST data. Defaults to “hexagon”.

p

Percentage of total expression contributed by neighborhoods.

seed

Global seed for random, torch, numpy. Defaults to 100.

numIterMaxSpa

SpaGCN maximum number of training iterations.

copy

Whether to copy adata or modify it inplace.

Returns:

Depending on the parameter copy: when True, return an updated adata with the fields adata.obs[key_added] and adata.obs[f'{key_added}_refined'], containing the cluster result based on SpaGCN; otherwise update the adata object in place.
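
To make the role of p more concrete: in the original SpaGCN code, spot-to-spot weights follow a Gaussian kernel exp(-d²/(2l²)), and the length scale l is tuned until the average weight a spot receives from all other spots equals p. The sketch below reproduces that idea with NumPy only. The helper names neighbor_contribution and search_length_scale are hypothetical (they are not part of spateo's API), and the bisection bounds and tolerance are assumptions chosen for illustration.

```python
import numpy as np


def neighbor_contribution(dist, length_scale):
    """Average total kernel weight a spot receives from all *other* spots."""
    weights = np.exp(-(dist ** 2) / (2 * length_scale ** 2))
    # Every row also contains the spot itself (weight 1), so subtract it.
    return weights.sum(axis=1).mean() - 1.0


def search_length_scale(dist, p, lo=1e-3, hi=1e3, tol=1e-3, max_iter=200):
    """Bisection on the length scale; the contribution grows with it."""
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        value = neighbor_contribution(dist, mid)
        if abs(value - p) <= tol:
            return mid
        if value < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Toy spatial coordinates (assumed data, for illustration only).
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(50, 2))
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
length_scale = search_length_scale(dist, p=0.5)
```

A larger p therefore means a wider kernel, so each spot borrows more expression from its surroundings.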

spateo.tools.cluster.CAST(adata, sample_key=None, basis='spatial', layer='norm_1e4', n_components=10, output_path='output/CAST_Mark', gpu_t=0, device='cuda:0', **kwargs)[source]

CAST is a Python library for physically aligning different spatial transcriptomics samples regardless of technologies, magnification, individual variation, and experimental batch effects. CAST is composed of three modules: CAST Mark, CAST Stack, and CAST Projection.

Parameters:
adata

an Anndata object, after normalization.

sample_key

str, optional, default: None The key in .obs that corresponds to the sample labels.

basis

str, optional, default: ‘spatial’ The basis used for CAST.

layer

str, optional, default: ‘norm_1e4’ The layer used for CAST.

output_path

str, optional, default: ‘output/CAST_Mark’ The path to save the CAST results.

gpu_t

int, optional, default: 0 The GPU index to be used.

device

str, optional, default: ‘cuda:0’ The device to be used.

kwargs

additional parameters for CAST.

spateo.tools.cluster.kmeans_clustering(adata, n_clusters=10, use_rep='X_cast', random_state=42, cluster_key='kmeans_clusters')[source]

KMeans clustering for spatial transcriptomics data.

Parameters:
adata

an Anndata object, after normalization.

n_clusters

int, optional, default: 10 The number of clusters.

use_rep

str, optional, default: ‘X_cast’ The representation to be used for clustering.

random_state

int, optional, default: 42 Random seed for reproducibility.

cluster_key

str, optional, default: ‘kmeans_clusters’ The key in .obs that corresponds to the cluster labels
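
As an illustration of what k-means does with an embedding such as the one named by use_rep, here is a minimal NumPy sketch of Lloyd's algorithm. It is not spateo's implementation (which is not shown in this page); lloyd_kmeans is a hypothetical helper, and the toy embedding stands in for adata.obsm[use_rep].

```python
import numpy as np


def lloyd_kmeans(X, n_clusters, random_state=42, n_iter=100):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(random_state)
    # Initialise centroids from distinct random observations.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign every observation to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Two well-separated blobs as a stand-in for an embedding.
rng = np.random.default_rng(0)
embedding = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(5.0, 0.1, size=(20, 2)),
])
labels, centers = lloyd_kmeans(embedding, n_clusters=2)
```

In the function documented above, the resulting labels would be written to adata.obs[cluster_key].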

spateo.tools.cluster.mclust_py(adata, n_components=None, use_rep: str = 'X_pca', modelNames='EEE', random_seed=42)[source]

Clustering using Gaussian Mixture Model (GMM), similar to mclust in R.

Parameters:
adata

an Anndata object, after normalization.

n_components

int, optional, default: None The number of mixture components.

use_rep

str, optional, default: ‘X_pca’ The representation to be used for clustering.

modelNames

str, optional, default: ‘EEE’ The model name to be used for clustering.

  • EEE: represents Equal volume, shape, and orientation (spherical).

  • VVV: represents Variable volume, shape, and orientation.

  • EEV: represents Equal volume and shape, variable orientation (tied).

  • VVI: represents Variable volume and shape, equal orientation (diag).

random_seed

int, optional, default: 42 Random seed for reproducibility.

spateo.tools.cluster.scc(adata: anndata.AnnData, spatial_key: str = 'spatial', key_added: str | None = 'scc', pca_key: str = 'pca', e_neigh: int = 30, s_neigh: int = 6, resolution: float | None = None, cluster_method: str = 'louvain') anndata.AnnData | None[source]

Spatially constrained clustering (scc) to identify continuous tissue domains.

Reference:

Ao Chen, Sha Liao, Mengnan Cheng, Kailong Ma, Liang Wu, Yiwei Lai, Xiaojie Qiu, Jin Yang, Wenjiao Li, Jiangshan Xu, Shijie Hao, Xin Wang, Huifang Lu, Xi Chen, Xing Liu, Xin Huang, Feng Lin, Zhao Li, Yan Hong, Defeng Fu, Yujia Jiang, Jian Peng, Shuai Liu, Mengzhe Shen, Chuanyu Liu, Quanshui Li, Yue Yuan, Huiwen Zheng, Zhifeng Wang, H Xiang, L Han, B Qin, P Guo, PM Cánoves, JP Thiery, Q Wu, F Zhao, M Li, H Kuang, J Hui, O Wang, B Wang, M Ni, W Zhang, F Mu, Y Yin, H Yang, M Lisby, RJ Cornall, J Mulder, M Uhlen, MA Esteban, Y Li, L Liu, X Xu, J Wang. Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball-patterned arrays. Cell, 2022.

Parameters:
adata

an Anndata object, after normalization.

spatial_key

the key in .obsm that corresponds to the spatial coordinate of each bucket.

key_added

adata.obs key under which to add the cluster labels.

pca_key

label for the .obsm key containing PCA information (without the potential prefix “X_”)

e_neigh

the number of nearest neighbors in gene expression space.

s_neigh

the number of nearest neighbors in physical space.

resolution

the resolution parameter of the clustering algorithm selected by cluster_method.

Returns:

adata – An anndata.AnnData object with cluster info in .obs.

Return type:

anndata.AnnData | None
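
The roles of e_neigh and s_neigh can be illustrated by building one k-nearest-neighbor graph in expression (PCA) space and another in physical space, then merging the two before community detection. How scc merges the graphs internally is not stated on this page, so the element-wise sum below is an assumption, and knn_adjacency is a hypothetical helper rather than a spateo function.

```python
import numpy as np


def knn_adjacency(points, k):
    """Symmetric binary kNN adjacency matrix without self-loops."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is never its own neighbor
    idx = np.argsort(d, axis=1)[:, :k]
    adj = np.zeros(d.shape, dtype=bool)
    rows = np.repeat(np.arange(len(points)), k)
    adj[rows, idx.ravel()] = True
    # Symmetrise: keep an edge if either endpoint lists the other.
    return adj | adj.T


# Toy data standing in for adata.obsm["X_pca"] and adata.obsm["spatial"].
rng = np.random.default_rng(0)
pca = rng.normal(size=(100, 10))
spatial = rng.uniform(0, 50, size=(100, 2))

expr_graph = knn_adjacency(pca, k=30)       # e_neigh
spatial_graph = knn_adjacency(spatial, k=6)  # s_neigh

# Edges supported by both graphs get weight 2, by one graph weight 1.
combined = expr_graph.astype(float) + spatial_graph.astype(float)
```

A community-detection method such as Louvain or Leiden would then run on the combined graph, which is what favors spatially contiguous domains.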

spateo.tools.cluster.smooth(adata: anndata.AnnData, radius: int = 50, key: str = 'label') list[source]

Optimize the label by majority voting in the neighborhood.

Parameters:
adata

an Anndata object, after normalization.

radius

the radius of the neighborhood.

key

the key in .obs that corresponds to the cluster labels.
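
A minimal sketch of majority voting within a radius, assuming Euclidean distance on the spatial coordinates and assuming that each spot counts itself as part of its own neighborhood. majority_vote_smooth is a hypothetical helper written for illustration, not the spateo implementation.

```python
from collections import Counter

import numpy as np


def majority_vote_smooth(coords, labels, radius):
    """Replace each label by the most common label within `radius`."""
    coords = np.asarray(coords, dtype=float)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    smoothed = []
    for i in range(len(coords)):
        # Neighborhood includes the spot itself (distance 0).
        neighborhood = [labels[j] for j in np.flatnonzero(d[i] <= radius)]
        smoothed.append(Counter(neighborhood).most_common(1)[0][0])
    return smoothed


coords = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (10, 11)]
labels = ["a", "a", "b", "a", "b", "b", "a"]
smoothed = majority_vote_smooth(coords, labels, radius=2)
# smoothed == ["a", "a", "a", "a", "b", "b", "b"]
```

The isolated "b" in the first group and the isolated "a" in the second are overruled by their neighbors.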

spateo.tools.cluster.spagcn_pyg(adata: anndata.AnnData, n_clusters: int, p: float = 0.5, s: int = 1, b: int = 49, refine_shape: str | None = None, his_img_path: str | None = None, total_umi: str | None = None, x_pixel: str = None, y_pixel: str = None, x_array: str = None, y_array: str = None, seed: int = 100, copy: bool = False) anndata.AnnData | None[source]

Function to find clusters with spagcn.

Reference:

Jian Hu, Xiangjie Li, Kyle Coleman, Amelia Schroeder, Nan Ma, David J. Irwin, Edward B. Lee, Russell T. Shinohara & Mingyao Li. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nature Methods volume 18, pages 1342–1351 (2021)

Parameters:
adata

an Anndata object, after normalization.

n_clusters

Desired number of clusters.

p

parameter p in spagcn algorithm. See SpaGCN for details. Defaults to 0.5.

s

alpha to control the color scale when calculating the adjacency matrix. Defaults to 1.

b

beta to control the range of the neighbourhood used to calculate the grey value for one spot when calculating the adjacency matrix. Defaults to 49.

refine_shape

Smooth the spatial domains with given spatial topology, “hexagon” for Visium data, “square” for ST data. Defaults to None.

his_img_path

The file path of the histology image used to calculate the adjacency matrix in the SpaGCN algorithm. Defaults to None.

total_umi

The key (column name) in adata.obs that contains the total UMI counts for each spot. When no histology image is provided, the function uses these total counts as a grayscale image. Ignored if his_img_path is not None. Defaults to None.

x_pixel

The key(colname) in adata.obs which contains corresponding x-pixels in histology image. Defaults to None.

y_pixel

The key(colname) in adata.obs which contains corresponding y-pixels in histology image. Defaults to None.

x_array

The key(colname) in adata.obs which contains corresponding x-coordinates. Defaults to None.

y_array

The key(colname) in adata.obs which contains corresponding y-coordinates. Defaults to None.

seed

Global seed for random, torch, numpy. Defaults to 100.

copy

Whether to return a new deep copy of adata instead of updating adata object passed in arguments. Defaults to False.

Returns:

An anndata.AnnData object with cluster info in “spagcn_pred”, and in “spagcn_pred_refined” if refine_shape is set.

The adjacency matrix used in the SpaGCN algorithm is saved in adata.uns["adj_spagcn"].

Return type:

anndata.AnnData | None

spateo.tools.cluster.compute_pca_components(matrix: numpy.ndarray | scipy.sparse.spmatrix, random_state: int | None = 1, save_curve_img: str | None = None) Tuple[Any, int, float][source]

Calculate the inflection point of the PCA curve to obtain the number of principal components that the PCA should retain.

Parameters:
matrix

A dense or sparse matrix.

save_curve_img

If save_curve_img != None, save the image of the PCA curve and inflection points.

Returns:

new_n_components – The number of principal components that PCA should retain.

new_components_stored – Percentage of variance explained by the retained principal components.

Return type:

Tuple[Any, int, float]
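
One common way to locate such a point on the cumulative explained-variance curve is to pick the component with the largest vertical gap from the chord joining the curve's first and last values. Whether compute_pca_components uses this particular heuristic is not stated here, so treat the sketch as an illustration under that assumption; explained_variance_ratio and elbow_index are hypothetical helpers.

```python
import numpy as np


def explained_variance_ratio(matrix):
    """Fraction of total variance carried by each principal component."""
    centered = matrix - matrix.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    return variance / variance.sum()


def elbow_index(curve):
    """Index with the largest vertical gap from the end-to-end chord."""
    n = len(curve)
    x = np.arange(n, dtype=float) / (n - 1)
    y = (curve - curve.min()) / (curve.max() - curve.min())
    chord = y[0] + (y[-1] - y[0]) * x
    return int(np.argmax(np.abs(y - chord)))


# Toy matrix: three high-variance features, seventeen near-constant ones.
rng = np.random.default_rng(0)
scales = np.array([10.0, 8.0, 6.0] + [0.1] * 17)
data = rng.normal(size=(500, 20)) * scales

cumulative = np.cumsum(explained_variance_ratio(data))
n_components = elbow_index(cumulative) + 1
retained = cumulative[n_components - 1]
```

Here the curve flattens after the third component, so three components are kept and they explain nearly all of the variance.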

spateo.tools.cluster.ecp_silhouette(matrix: numpy.ndarray | scipy.sparse.spmatrix, cluster_labels: numpy.ndarray) float[source]

Here we evaluate the clustering performance by calculating the Silhouette Coefficient. The silhouette analysis is used to choose an optimal value for clustering resolution.

The Silhouette Coefficient is a widely used method for evaluating clustering performance, where a higher Silhouette Coefficient score relates to a model with better defined clusters and indicates a good separation between the cell types.

Advantages of the Silhouette Coefficient:
  • The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.

  • The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.

Original Code Repository: https://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient

Parameters:
matrix

A dense or sparse matrix of features.

cluster_labels

An array of cluster labels, one per observation.

Returns:

Mean Silhouette Coefficient for all clusters.
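
For reference, the Silhouette Coefficient of one sample is (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to any other cluster. The NumPy sketch below computes the mean over all samples from that definition; mean_silhouette is a hypothetical helper, it assumes Euclidean distance and at least two clusters, and it is not the code used by ecp_silhouette.

```python
import numpy as np


def mean_silhouette(X, labels):
    """Mean Silhouette Coefficient computed directly from its definition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: singleton clusters score 0
        # a: mean distance to the other members of the own cluster.
        a = d[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(d[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())


points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = mean_silhouette(points, [0, 0, 1, 1])  # compact, well separated
bad = mean_silhouette(points, [0, 1, 0, 1])   # clusters straddle the gap
```

The first labeling scores close to +1 and the second scores below zero, matching the interpretation given above.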

Examples

>>> ecp_silhouette(matrix=adata.obsm["X_pca"], cluster_labels=adata.obs["leiden"].values)

spateo.tools.cluster.integrate(adatas: List[anndata.AnnData], batch_key: str = 'slices', fill_value: int | float = 0) anndata.AnnData[source]

Concatenating all anndata objects.

Parameters:
adatas

AnnData matrices to concatenate with.

batch_key

Add the batch annotation to obs using this key.

fill_value

Scalar value to fill newly missing values in arrays with.

Returns:

integrated_adata – The concatenated AnnData, where adata.obs[batch_key] stores a categorical variable labeling the batch.

Return type:

anndata.AnnData
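
The effect of batch_key and fill_value can be pictured with plain NumPy arrays. The sketch assumes the concatenation keeps the union of genes across slices (an outer join), which is what "newly missing values" suggests but is not confirmed on this page; concat_outer is a hypothetical helper, not the spateo function.

```python
import numpy as np


def concat_outer(matrices, gene_lists, fill_value=0):
    """Stack matrices row-wise over the union of genes and label batches."""
    all_genes = sorted(set().union(*gene_lists))
    column = {gene: j for j, gene in enumerate(all_genes)}
    blocks, batch = [], []
    for b, (matrix, genes) in enumerate(zip(matrices, gene_lists)):
        # Genes absent from this slice keep the fill value.
        block = np.full((matrix.shape[0], len(all_genes)), fill_value, dtype=float)
        block[:, [column[g] for g in genes]] = matrix
        blocks.append(block)
        batch.extend([str(b)] * matrix.shape[0])
    return np.vstack(blocks), all_genes, batch


slice0 = np.array([[1.0, 2.0], [3.0, 4.0]])  # genes A, B
slice1 = np.array([[5.0, 6.0]])              # genes B, C
X, genes, batch = concat_outer(
    [slice0, slice1], [["A", "B"], ["B", "C"]], fill_value=0
)
# genes == ["A", "B", "C"]; batch == ["0", "0", "1"]
```

The batch list plays the role of adata.obs[batch_key] in the returned object.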

spateo.tools.cluster.pca_spateo(adata: anndata.AnnData, X_data: numpy.ndarray | None = None, n_pca_components: int | None = None, pca_key: str | None = 'X_pca', genes: list | None = None, layer: str | None = None, random_state: int | None = 1)[source]

Do PCA for dimensional reduction.

Parameters:
adata

An Anndata object.

X_data

The user supplied data that will be used for dimension reduction directly.

n_pca_components

The number of principal components that PCA will retain. If None, the inflection point of the PCA curve is calculated to obtain the number of principal components that the PCA should retain.

pca_key

Add the PCA result to obsm using this key.

genes

The list of genes that will be used to subset the data for dimension reduction and clustering. If None, all genes will be used.

layer

The layer that will be used to retrieve data for dimension reduction and clustering. If None, will use adata.X.

Returns:

adata_after_pca – The processed AnnData, where adata.obsm[pca_key] stores the PCA result.

Return type:

anndata.AnnData
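
As a rough picture of the result stored under pca_key, the following sketch projects a matrix onto its leading principal axes with a singular value decomposition. The dictionary obsm merely stands in for adata.obsm, and pca_project is a hypothetical helper; spateo's own solver and options may differ.

```python
import numpy as np


def pca_project(X, n_components):
    """Project centered data onto its first n_components principal axes."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))

obsm = {}  # stand-in for adata.obsm
obsm["X_pca"] = pca_project(X, n_components=3)
```

Each row of obsm["X_pca"] is the low-dimensional representation of one observation, which downstream functions such as scc or mclust_py can read through their pca_key or use_rep arguments.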

spateo.tools.cluster.pearson_residuals(adata: anndata.AnnData, n_top_genes: int | None = 3000, subset: bool = False, theta: float = 100, clip: float | None = None, check_values: bool = True)[source]

Preprocess UMI count data with analytic Pearson residuals.

Pearson residuals transform raw UMI counts into a representation where three aims are achieved:

1. Remove the technical variation that comes from differences in total counts between cells;

2. Stabilize the mean-variance relationship across genes, i.e. ensure that biological signal from both low and high expression genes can contribute similarly to downstream processing;

3. Genes that are homogeneously expressed (like housekeeping genes) have small variance, while genes that are differentially expressed (like marker genes) have high variance.

Parameters:
adata

An anndata object.

n_top_genes

Number of highly-variable genes to keep.

subset

Inplace subset to highly-variable genes if True otherwise merely indicate highly variable genes.

theta

The negative binomial overdispersion parameter theta for Pearson residuals. Higher values correspond to less overdispersion (var = mean + mean^2/theta), and theta=np.inf corresponds to a Poisson model.

clip

Determines if and how residuals are clipped:

  • If None, residuals are clipped to the interval [-sqrt(n), sqrt(n)], where n is the number of cells in the dataset (default behavior).

  • If any scalar c, residuals are clipped to the interval [-c, c]. Set clip=np.inf for no clipping.

check_values

Check if counts in selected layer are integers. A Warning is returned if set to True.

Returns:

Updates adata with the field adata.obsm["pearson_residuals"], containing pearson_residuals.
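
The transformation itself is compact enough to write out. The sketch below follows the formulas given above (negative binomial variance mean + mean^2/theta, clipping to [-sqrt(n), sqrt(n)] by default) and assumes cells are rows and genes are columns. analytic_pearson_residuals is a hypothetical stand-alone helper for illustration; it skips gene selection and the integer check performed by the documented function.

```python
import numpy as np


def analytic_pearson_residuals(counts, theta=100.0, clip=None):
    """Pearson residuals of a cells-by-genes count matrix."""
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]
    # Expected counts under the null model: cell total * gene total / grand total.
    mu = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / counts.sum()
    # Negative binomial variance: mean + mean^2 / theta.
    residuals = (counts - mu) / np.sqrt(mu + mu ** 2 / theta)
    if clip is None:
        clip = np.sqrt(n_cells)
    return np.clip(residuals, -clip, clip)


counts = np.array([
    [10, 0, 5],
    [8, 1, 4],
    [0, 9, 1],
    [1, 12, 0],
])
residuals = analytic_pearson_residuals(counts)
```

Genes whose counts simply scale with the cell totals end up with residuals near zero, while genes concentrated in a subset of cells receive large positive residuals there.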