`spateo.tools.cluster.utils`

Module Contents

Functions

`compute_pca_components`(→ Tuple[Any, int, float])	Calculate the inflection point of the PCA curve to obtain the number of principal components that the PCA should retain.
`pca_spateo`(adata[, X_data, n_pca_components, pca_key, ...])	Perform PCA for dimensionality reduction.
`sctransform`(adata, rlib_path[, n_top_genes, ...])	Use sctransform with an additional flag vst.flavor="v2" to perform normalization and dimensionality reduction.
`pearson_residuals`(adata[, n_top_genes, subset, theta, ...])	Preprocess UMI count data with analytic Pearson residuals.
`integrate`(→ anndata.AnnData)	Concatenate all AnnData objects.
`harmony_debatch`(→ Optional[anndata.AnnData])	Use harmonypy [Korsunsky19] to remove batch effects.
`ecp_silhouette`(→ float)	Evaluate the clustering performance by calculating the Silhouette Coefficient.
`spatial_adj_dyn`(adata[, spatial_key, pca_key, ...])	Calculate the adjacency matrix based on a neighborhood graph of gene expression space and a neighborhood graph of physical space.

Attributes

to_dense_matrix

spateo.tools.cluster.utils.to_dense_matrix

spateo.tools.cluster.utils.compute_pca_components(matrix: Union[numpy.ndarray, scipy.sparse.spmatrix], random_state: Optional[int] = 1, save_curve_img: Optional[str] = None) → Tuple[Any, int, float]

Calculate the inflection point of the PCA curve to obtain the number of principal components that the PCA should retain.

Parameters

matrix: A dense or sparse matrix.
save_curve_img: If save_curve_img != None, save the image of the PCA curve and inflection points.

Returns

new_n_components: The number of principal components that PCA should retain.
new_components_stored: Percentage of variance explained by the retained principal components.
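A minimal sketch of what locating such an inflection point can look like, assuming the per-component explained-variance ratios are already available. It picks the point on the cumulative curve that lies farthest above the straight line joining the curve's endpoints. This is one common knee heuristic and not necessarily the method spateo uses; `find_knee` and the example ratios are hypothetical.

```python
from itertools import accumulate


def find_knee(explained_variance_ratio):
    """Return (n_components, variance_retained) at the knee of the cumulative curve.

    Hypothetical helper: the knee is the point with the largest vertical gap
    above the chord joining the first and last points of the curve.
    """
    cumulative = list(accumulate(explained_variance_ratio))
    n = len(cumulative)
    if n == 0:
        raise ValueError("at least one component is required")
    if n < 3:
        return n, cumulative[-1]  # too few points to define a knee
    first, last = cumulative[0], cumulative[-1]
    best_index, best_gap = 0, float("-inf")
    for i, value in enumerate(cumulative):
        chord = first + (last - first) * i / (n - 1)  # height of the chord at i
        gap = value - chord
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index + 1, cumulative[best_index]


# Made-up ratios for illustration: variance drops off sharply after 3 components.
ratios = [0.40, 0.25, 0.15, 0.02, 0.02, 0.02, 0.02, 0.02]
n_components, retained = find_knee(ratios)
```

Here `n_components` plays the role of new_n_components and `retained` the role of new_components_stored.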

spateo.tools.cluster.utils.pca_spateo(adata: anndata.AnnData, X_data=None, n_pca_components: Optional[int] = None, pca_key: Optional[str] = 'X_pca', genes: Union[list, None] = None, layer: Union[str, None] = None, random_state: Optional[int] = 1)

Perform PCA for dimensionality reduction.

Parameters

adata: An AnnData object.
X_data: The user-supplied data that will be used for dimension reduction directly.
n_pca_components: The number of principal components that PCA will retain. If None, the inflection point of the PCA curve is calculated to determine the number of principal components to retain.
pca_key: Add the PCA result to obsm using this key.
genes: The list of genes that will be used to subset the data for dimension reduction and clustering. If None, all genes will be used.
layer: The layer that will be used to retrieve data for dimension reduction and clustering. If None, will use adata.X.

Returns

adata_after_pca: The processed AnnData, where adata.obsm[pca_key] stores the PCA result.

spateo.tools.cluster.utils.sctransform(adata: anndata.AnnData, rlib_path: str, n_top_genes: Optional[int] = 3000, save_sct_img_1: Optional[str] = None, save_sct_img_2: Optional[str] = None, **kwargs)

Use sctransform with an additional flag vst.flavor="v2" to perform normalization and dimensionality reduction.

Original Code Repository: https://github.com/saketkc/pySCTransform

Installation:

Conda:

`conda install R`

R:

`if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")`

`BiocManager::install(version = "3.14")`

`BiocManager::install("glmGamPoi")`

Python:

`pip install rpy2`

`pip install git+https://github.com/saketkc/pysctransform`

Examples

>>> sctransform(adata=adata, rlib_path="/Users/jingzehua/opt/anaconda3/envs/spateo/lib/R")

Parameters

adata: An AnnData object.
rlib_path: Library path for the R environment.
n_top_genes: Number of highly-variable genes to keep.
save_sct_img_1: If save_sct_img_1 != None, save the image of the GLM model parameters.
save_sct_img_2: If save_sct_img_2 != None, save the image of the final residual variances.
**kwargs: Additional keyword arguments to pysctransform.SCTransform.

Returns

Updates adata with the field adata.obsm["pearson_residuals"], containing pearson_residuals.

spateo.tools.cluster.utils.pearson_residuals(adata: anndata.AnnData, n_top_genes: Optional[int] = 3000, subset: bool = False, theta: float = 100, clip: Optional[float] = None, check_values: bool = True)

Preprocess UMI count data with analytic Pearson residuals.

Pearson residuals transform raw UMI counts into a representation where three aims are achieved:

1. Remove the technical variation that comes from differences in total counts between cells.
2. Stabilize the mean-variance relationship across genes, i.e. ensure that biological signal from both low and high expression genes can contribute similarly to downstream processing.
3. Genes that are homogeneously expressed (like housekeeping genes) have small variance, while genes that are differentially expressed (like marker genes) have high variance.

Parameters

adata: An AnnData object.
n_top_genes: Number of highly-variable genes to keep.
subset: If True, subset in place to highly-variable genes; otherwise merely indicate highly-variable genes.
theta: The negative binomial overdispersion parameter theta for Pearson residuals. Higher values correspond to less overdispersion (var = mean + mean^2/theta), and theta=np.inf corresponds to a Poisson model.
clip: Determines if and how residuals are clipped. If None, residuals are clipped to the interval [-sqrt(n), sqrt(n)], where n is the number of cells in the dataset (default behavior). If any scalar c, residuals are clipped to the interval [-c, c]. Set clip=np.inf for no clipping.
check_values: If True, check whether counts in the selected layer are integers and issue a warning if they are not.

Returns

Updates adata with the field adata.obsm["pearson_residuals"], containing pearson_residuals.
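The transformation can be sketched in plain Python to make the roles of theta and clip concrete. The expected count used here (cell total times gene total divided by the grand total) follows the usual analytic Pearson residuals formulation and is an assumption about what the function computes internally; `pearson_residuals_sketch` is a hypothetical name, and the sketch assumes every cell and every gene has at least one count.

```python
import math


def pearson_residuals_sketch(counts, theta=100.0, clip=None):
    """Analytic Pearson residuals for a cells x genes table of raw counts."""
    n_cells = len(counts)
    cell_totals = [sum(row) for row in counts]
    gene_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(cell_totals)
    if clip is None:
        clip = math.sqrt(n_cells)  # default interval [-sqrt(n), sqrt(n)]

    residuals = []
    for row, cell_total in zip(counts, cell_totals):
        out = []
        for x, gene_total in zip(row, gene_totals):
            mu = cell_total * gene_total / grand_total  # expected count
            variance = mu + mu * mu / theta  # var = mean + mean^2/theta
            r = (x - mu) / math.sqrt(variance)
            out.append(max(-clip, min(clip, r)))  # clip to [-clip, clip]
        residuals.append(out)
    return residuals


# Two cells with the same expression profile but a twofold difference in depth.
counts = [[2, 4], [4, 8]]
residuals = pearson_residuals_sketch(counts)
```

Because the two cells differ only in sequencing depth, every residual in this toy table is zero, which illustrates the first aim listed above.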

spateo.tools.cluster.utils.integrate(adatas: List[anndata.AnnData], batch_key: str = 'slices', fill_value: Union[int, float] = 0) → anndata.AnnData

Concatenate all AnnData objects.

Parameters

adatas: AnnData objects to concatenate.
batch_key: Add the batch annotation to obs using this key.
fill_value: Scalar value to fill newly missing values in arrays with.

Returns

integrated_adata: The concatenated AnnData, where adata.obs[batch_key] stores a categorical variable labeling the batch.

spateo.tools.cluster.utils.harmony_debatch(adata: anndata.AnnData, key: str, basis: str = 'X_pca', adjusted_basis: str = 'X_pca_harmony', max_iter_harmony: int = 10, copy: bool = False) → Optional[anndata.AnnData]

Use harmonypy [Korsunsky19] to remove batch effects. This function should be run after performing PCA but before computing the neighbor graph.

Original Code Repository: https://github.com/slowkow/harmonypy

Interesting example: https://slowkow.com/notes/harmony-animation/

Parameters

adata: An AnnData object.
key: The name of the column in adata.obs that differentiates among experiments/batches.
basis: The name of the field in adata.obsm where the PCA table is stored.
adjusted_basis: The name of the field in adata.obsm where the adjusted PCA table will be stored after running this function.
max_iter_harmony: Maximum number of rounds to run Harmony. One round of Harmony involves one clustering and one correction step.
copy: Whether to copy adata or modify it in place.

Returns

Updates adata with the field adata.obsm[adjusted_basis], containing principal components adjusted by Harmony.

spateo.tools.cluster.utils.ecp_silhouette(matrix: Union[numpy.ndarray, scipy.sparse.spmatrix], cluster_labels: numpy.ndarray) → float

Evaluate the clustering performance by calculating the Silhouette Coefficient. Silhouette analysis is used to choose an optimal value for the clustering resolution.

The Silhouette Coefficient is a widely used method for evaluating clustering performance, where a higher Silhouette Coefficient score relates to a model with better defined clusters and indicates a good separation between the cell types.

Advantages of the Silhouette Coefficient:

The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.
The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.

Original Code Repository: https://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient

Parameters

matrix: A dense or sparse feature matrix.
cluster_labels: An array of cluster labels, one for each sample in matrix.

Returns

Mean Silhouette Coefficient for all clusters.

Examples

>>> ecp_silhouette(matrix=adata.obsm["X_pca"], cluster_labels=adata.obs["leiden"].values)
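To show what the score measures, the following is a small pure-Python sketch of the mean Silhouette Coefficient with Euclidean distance. It is an illustration only, not the implementation behind `ecp_silhouette`; `silhouette_sketch` and the toy points are hypothetical.

```python
import math


def silhouette_sketch(points, labels):
    """Mean silhouette coefficient over all samples, using Euclidean distance."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    def mean_distance(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        if len(clusters[label]) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = mean_distance(i, clusters[label])  # cohesion: own cluster
        b = min(  # separation: nearest other cluster
            mean_distance(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Two tight groups far apart, labelled correctly and then incorrectly.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
well_separated = silhouette_sketch(points, [0, 0, 1, 1])
mixed_up = silhouette_sketch(points, [0, 1, 0, 1])
```

With these toy points, the correct labelling scores close to +1 and the mixed-up labelling scores below zero, matching the bounds described above.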

spateo.tools.cluster.utils.spatial_adj_dyn(adata: anndata.AnnData, spatial_key: str = 'spatial', pca_key: str = 'pca', e_neigh: int = 30, s_neigh: int = 6, n_pca_components: int = 30)

Calculate the adjacency matrix based on a neighborhood graph of gene expression space and a neighborhood graph of physical space.
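As a rough illustration of combining two neighborhood graphs, the sketch below builds a k-nearest-neighbor graph in an expression embedding and another in physical coordinates, then merges them. How spateo actually combines or weights the two graphs is not stated here; taking the union of edges is purely an assumption, and `knn_adjacency`, `combine_union`, and the toy coordinates are hypothetical. The two values of `k` stand in for e_neigh and s_neigh.

```python
import math


def knn_adjacency(coords, k):
    """Symmetric 0/1 adjacency matrix linking each point to its k nearest neighbours."""
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(coords[i], coords[j]),
        )
        for j in order[:k]:
            adj[i][j] = adj[j][i] = 1
    return adj


def combine_union(adj_a, adj_b):
    """Keep an edge if it is present in either graph (assumed combination rule)."""
    return [
        [max(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(adj_a, adj_b)
    ]


# Four spots: a 1-D expression embedding and 2-D physical coordinates.
expression = [(0.0,), (5.0,), (5.1,), (0.1,)]
spatial = [(0, 0), (3, 0), (4, 0), (8, 0)]

expr_adj = knn_adjacency(expression, k=1)   # k plays the role of e_neigh
spatial_adj = knn_adjacency(spatial, k=1)   # k plays the role of s_neigh
combined = combine_union(expr_adj, spatial_adj)
```

In this toy case spots 0 and 3 are linked only through expression similarity, while spots 0 and 1 are linked only through physical proximity; both edges appear in `combined`.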
