spateo.tools.dimensionality_reduction¶
Tools for dimensionality reduction, adapted from Dynamo (https://github.com/aristoteleo/dynamo-release), file dynamo/tools/dimension_reduction.py.
Functions¶
- perform_dimensionality_reduction: Low dimension reduction projection of an AnnData object, first with PCA, followed by non-linear dimension reduction methods.
- umap_conn_indices_dist_embedding: Compute the connectivity graph, kNN neighbor indices, distance matrix, and low dimension embedding with UMAP.
- find_optimal_n_umap_components: Determine the optimal number of UMAP components by maximizing the silhouette score for the Leiden partitioning.
- pca: Perform PCA reduction.
- pca_fit: Apply PCA to the input data array X using the specified PCA function.
- truncated_SVD_with_center: Apply truncated SVD to the input data array X with centering.
- find_optimal_pca_components: Find the optimal number of PCA components using the elbow or eigenvalue method.
Module Contents¶
- spateo.tools.dimensionality_reduction.perform_dimensionality_reduction(adata: anndata.AnnData, X_data: numpy.ndarray = None, genes: List[str] | None = None, layer: str | None = None, basis: str | None = 'pca', dims: List[int] | None = None, n_pca_components: int = 30, n_components: int = 2, n_neighbors: int = 30, reduction_method: Literal['pca', 'tsne', 'umap'] = 'umap', embedding_key: str | None = None, enforce: bool = False, cores: int = 1, copy: bool = False, **kwargs) anndata.AnnData | None [source]¶
Low dimension reduction projection of an AnnData object first with PCA, followed by non-linear dimension reduction methods.
- Parameters:
- adata
AnnData object
- X_data
The user-supplied non-AnnData data array that will be used for dimension reduction directly. Defaults to None.
- genes
The list of genes that will be used to subset the data for dimension reduction and clustering. If None, all genes will be used. Defaults to None.
- layer
The layer that will be used to retrieve data for dimension reduction and clustering. If None, .X is used. Defaults to None.
- basis
The space that will be used for clustering. If None, the data itself is used without any other processing. Can be None, “pca”, or any other key based on PCA. Defaults to “pca”.
- dims
The list of dimensions that will be selected for clustering. If None, all dimensions will be used. Defaults to None.
- n_pca_components
Number of input PCs (principal components) that will be used for further non-linear dimension reduction. If n_pca_components is larger than the number of existing PCs in adata.obsm[‘X_pca’] (or the input layer’s corresponding PCA space, layer_pca), PCA will be rerun with the requested number of components. Defaults to 30.
- n_components
The dimension of the space to embed into. Defaults to 2.
- n_neighbors
The number of nearest neighbors when constructing adjacency matrix. Defaults to 30.
- reduction_method
Non-linear dimension reduction method to further reduce dimension based on the top n_pca_components PCA components. Currently “tsne” (fitsne is used instead of traditional tSNE), “umap”, and “pca” are supported. If “pca”, the function will search for or compute the PCA representation and then stop. If “tsne” or “umap”, the PCA representation is computed first (unless ‘basis’ is None) and then used to compute the tSNE or UMAP embedding. Defaults to “umap”.
- embedding_key
The str in .obsm that will be used as the key to save the reduced embedding space. By default it is None and embedding key is set as layer + reduction_method. If layer is None, it will be “X_neighbors”. Defaults to None.
- enforce
Whether to re-perform dimension reduction even if there is reduced basis in the AnnData object. Defaults to False.
- cores
The number of cores used for calculation. Only used when reduction_method is “tsne”. Defaults to 1.
- copy
Whether to return a copy of the AnnData object or update the object in place. Defaults to False.
- kwargs
Other kwargs that will be passed to umap.UMAP. One notable variable is “densmap”, for a density-preserving dimensionality reduction. There is also “min_dist”, which provides the minimum distance apart that points are allowed to be in the low dimensional representation.
- Returns:
- An updated AnnData object containing the reduced-dimension data for the requested layers.
Returned only if copy is True.
- Return type:
adata
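A minimal usage sketch based only on the signature documented above; the toy AnnData object and parameter values are illustrative assumptions, not recommended settings.

import anndata
import numpy as np
import spateo.tools.dimensionality_reduction as dr

# Toy expression matrix: 100 cells x 50 genes.
rng = np.random.default_rng(0)
adata = anndata.AnnData(X=rng.poisson(1.0, size=(100, 50)).astype(np.float32))

# PCA to 30 components, then a 2D UMAP embedding; with copy=True the
# updated AnnData object is returned rather than modified in place.
adata = dr.perform_dimensionality_reduction(
    adata,
    n_pca_components=30,
    n_components=2,
    n_neighbors=15,
    reduction_method="umap",
    copy=True,
)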
- spateo.tools.dimensionality_reduction.umap_conn_indices_dist_embedding(X: numpy.ndarray, n_neighbors: int = 30, n_components: int = 2, metric: str | Callable = 'euclidean', min_dist: float = 0.1, spread: float = 1.0, max_iter: int | None = None, alpha: float = 1.0, gamma: float = 1.0, negative_sample_rate: float = 5, init_pos: Literal['spectral', 'random'] | numpy.ndarray = 'spectral', random_state: int | numpy.random.RandomState | None = 0, densmap: bool = False, dens_lambda: float = 2.0, dens_frac: float = 0.3, dens_var_shift: float = 0.1, output_dens: bool = False, return_mapper: bool = True, **umap_kwargs) Tuple[umap.UMAP, scipy.sparse.coo_matrix, numpy.ndarray, numpy.ndarray, numpy.ndarray] | Tuple[scipy.sparse.coo_matrix, numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Compute connectivity graph, matrices for kNN neighbor indices, distance matrix and low dimension embedding with UMAP.
From Dynamo (https://github.com/aristoteleo/dynamo-release), which in turn derives this code from umap-learn (https://github.com/lmcinnes/umap/blob/97d33f57459de796774ab2d7fcf73c639835676d/umap/umap_.py).
- Parameters:
- X
The input array for which to perform UMAP
- n_neighbors
The number of nearest neighbors to compute for each sample in X. Defaults to 30.
- n_components
The dimension of the space to embed into. Defaults to 2.
- metric
The distance metric to use to find neighbors. Defaults to “euclidean”.
- min_dist
The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out. Defaults to 0.1.
- spread
The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are. Defaults to 1.0.
- max_iter
The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small). This argument was refactored from n_epochs from UMAP-learn to account for recent API changes in UMAP-learn 0.5.2. Defaults to None.
- alpha
Initial learning rate for the SGD. Defaults to 1.0.
- gamma
Weight to apply to negative samples. Values higher than one will result in greater weight being given to negative samples. Defaults to 1.0.
- negative_sample_rate
The number of negative samples (negative edge/1-simplex samples) to select per positive sample (positive edge/1-simplex) when optimizing the low dimensional embedding. Increasing this value results in a greater repulsive force and greater optimization cost, but slightly better accuracy. Defaults to 5.
- init_pos
The method to initialize the low dimensional embedding. Where: “spectral”: use a spectral embedding of the fuzzy 1-skeleton. “random”: assign initial embedding positions at random. Or an np.ndarray to define the initial position. Defaults to “spectral”.
- random_state
The method to generate random numbers. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by numpy.random. Defaults to 0.
- densmap
Whether to use the density-augmented objective function to optimize the embedding according to the densMAP algorithm. Defaults to False.
- dens_lambda
Controls the regularization weight of the density correlation term in densMAP. Higher values prioritize density preservation over the UMAP objective, and vice versa for values closer to zero. Setting this parameter to zero is equivalent to running the original UMAP algorithm. Defaults to 2.0.
- dens_frac
Controls the fraction of epochs (between 0 and 1) where the density-augmented objective is used in densMAP. The first (1 - dens_frac) fraction of epochs optimize the original UMAP objective before introducing the density correlation term. Defaults to 0.3.
- dens_var_shift
A small constant added to the variance of local radii in the embedding when calculating the density correlation objective to prevent numerical instability from dividing by a small number. Defaults to 0.1.
- output_dens
Whether the local radii of the final embedding (an inverse measure of local density) are computed and returned in addition to the embedding. If set to True, local radii of the original data are also included in the output for comparison; the output is a tuple (embedding, original local radii, embedding local radii). This option can also be used when densmap=False to calculate the densities for UMAP embeddings. Defaults to False.
- return_mapper
Whether to also return the fitted umap.UMAP object (the mapper) for the data mapped onto UMAP space. Defaults to True.
- Returns:
mapper: Data mapped onto UMAP space (the fitted umap.UMAP object); returned only if return_mapper is True.
graph: Sparse matrix representing the nearest neighbor graph.
knn_indices: The indices of the nearest neighbors for each sample.
knn_dists: The distances to the nearest neighbors for each sample.
embedding: The embedding of the data in low-dimensional space.
- Return type:
mapper
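A short sketch of the tuple unpacking implied by the return annotation above; the random input stands in for a matrix of PCA coordinates.

import numpy as np
from spateo.tools.dimensionality_reduction import umap_conn_indices_dist_embedding

X = np.random.default_rng(0).normal(size=(200, 30))  # e.g. 200 cells x 30 PCs

# With return_mapper=True (the default), the fitted umap.UMAP object comes
# first, followed by the kNN graph, neighbor indices, distances, and embedding.
mapper, graph, knn_indices, knn_dists, embedding = umap_conn_indices_dist_embedding(
    X,
    n_neighbors=30,
    n_components=2,
    min_dist=0.1,
    random_state=0,
)
print(embedding.shape)  # (200, 2)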
- spateo.tools.dimensionality_reduction.find_optimal_n_umap_components(X_data: numpy.ndarray, max_n_components: int | None = None, **umap_params)[source]¶
Determine the optimal number of UMAP components by maximizing the silhouette score for the Leiden partitioning.
- Parameters:
- X_data
Input data to UMAP
- max_n_components
Maximum number of UMAP components to test. If not given, will use half the number of features (half the number of columns of the input array).
- **umap_params
Parameters to pass to the UMAP function. Should not include ‘n_components’, which will be added by this function.
- Returns:
Number of components resulting in the highest silhouette score for the Leiden partitioning
- Return type:
best_n_components
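A sketch under the assumption that the extra keyword arguments shown here (n_neighbors, min_dist) are forwarded unchanged to the underlying UMAP call, as the **umap_params description suggests.

import numpy as np
from spateo.tools.dimensionality_reduction import find_optimal_n_umap_components

X_pca = np.random.default_rng(0).normal(size=(300, 30))

# Scan candidate embedding dimensionalities up to 10; do not pass
# n_components, which the function adds itself.
best_n_components = find_optimal_n_umap_components(
    X_pca,
    max_n_components=10,
    n_neighbors=30,
    min_dist=0.1,
)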
- spateo.tools.dimensionality_reduction.pca(adata: anndata.AnnData, X_data: numpy.ndarray = None, n_pca_components: int = 30, pca_key: str = 'X_pca', pcs_key: str = 'PCs', layer: List[str] | str | None = None, svd_solver: Literal['randomized', 'arpack'] = 'randomized', random_state: int = 0, use_truncated_SVD_threshold: int = 500000, use_incremental_PCA: bool = False, incremental_batch_size: int | None = None, return_all: bool = False) anndata.AnnData | Tuple[anndata.AnnData, sklearn.decomposition.PCA | sklearn.decomposition.TruncatedSVD, numpy.ndarray] [source]¶
Perform PCA reduction.
For large datasets (more than 1 million samples), incremental PCA is recommended to avoid memory issues. For datasets with more than use_truncated_SVD_threshold samples (500,000 by default), truncated SVD without centering is used for efficiency; otherwise, truncated SVD with centering is used.
- Parameters:
- adata
AnnData object to store results in
- X_data
Optional data array to perform dimension reduction on
- n_pca_components
Number of PCA components to reduce to. Defaults to 30.
- pca_key
The key to store the reduced data. Defaults to “X_pca”.
- pcs_key
The key to store the principal axes in feature space. Defaults to “PCs”.
- layer
The layer(s) to perform dimension reduction on. Only used if ‘X_data’ is not provided. Defaults to None to use “.X”.
- svd_solver
The svd_solver to solve svd decomposition in PCA.
- random_state
The seed used to initialize the random state for PCA.
- use_truncated_SVD_threshold
The threshold of observations to use truncated SVD instead of standard PCA for efficiency.
- use_incremental_PCA
Whether to use incremental PCA. Recommended to set to True when the dataset is too large to fit in memory. Defaults to False.
- incremental_batch_size
The number of samples to use for each batch when performing incremental PCA. If None, the batch size is inferred from the data and set to 5 * n_features. Defaults to None.
- return_all
Whether to return the PCA fit model and the reduced array together with the updated AnnData object. Defaults to False.
- Returns:
adata: Updated AnnData object.
fit: The PCA fit model; returned only if ‘return_all’ is True.
X_pca: The reduced data array; returned only if ‘return_all’ is True.
- Return type:
adata
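A minimal sketch of the return_all=True path, assuming only the signature above; where exactly pca_key and pcs_key land inside the AnnData object is left to the function.

import anndata
import numpy as np
from spateo.tools.dimensionality_reduction import pca

adata = anndata.AnnData(X=np.random.default_rng(0).normal(size=(500, 100)).astype(np.float32))

# Reduce to 30 PCs; return_all=True also yields the fitted model and scores.
adata, fit, X_pca = pca(
    adata,
    n_pca_components=30,
    pca_key="X_pca",
    pcs_key="PCs",
    return_all=True,
)
print(X_pca.shape)  # (500, 30)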
- spateo.tools.dimensionality_reduction.pca_fit(X: numpy.ndarray, pca_func: Callable, n_components: int = 30, **kwargs) Tuple[sklearn.decomposition.PCA, numpy.ndarray] [source]¶
Apply PCA to the input data array X using the specified PCA function.
- Parameters:
- X
The input data array of shape (n_samples, n_features).
- pca_func
The PCA function to use, which should have a ‘fit’ and ‘transform’ method, such as the PCA class or the IncrementalPCA class from sklearn.decomposition.
- n_components
The number of principal components to compute
- **kwargs
Any additional keyword arguments that will be passed to the PCA function
- Returns:
fit: The fitted PCA object.
X_pca: The reduced data array of shape (n_samples, n_components).
- Return type:
fit
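A sketch assuming the extra keyword arguments are forwarded to the estimator’s constructor; both sklearn classes named in the parameter description expose the required fit/transform interface.

import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA
from spateo.tools.dimensionality_reduction import pca_fit

X = np.random.default_rng(0).normal(size=(1000, 200))

# Standard in-memory PCA.
fit, X_pca = pca_fit(X, pca_func=PCA, n_components=30)

# Out-of-core alternative for data that does not fit in memory at once
# (batch_size is an sklearn IncrementalPCA argument, assumed forwarded via **kwargs).
fit_inc, X_inc = pca_fit(X, pca_func=IncrementalPCA, n_components=30, batch_size=250)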
- spateo.tools.dimensionality_reduction.truncated_SVD_with_center(X: numpy.ndarray, n_components: int = 30, random_state: int | numpy.random.RandomState | None = 0) Tuple[sklearn.decomposition.PCA, numpy.ndarray] [source]¶
Apply truncated SVD to the input data array X with centering.
- Parameters:
- X
The input data array of shape (n_samples, n_features).
- n_components
The number of principal components to compute
- random_state
The seed used to initialize the random state for PCA.
- Returns:
fit: The fitted truncated SVD object.
X_pca: The reduced data array of shape (n_samples, n_components).
- Return type:
fit
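A usage sketch; a dense array is used per the documented signature, though centering combined with truncated SVD is most useful when explicit mean-subtraction is undesirable (e.g. when it would densify a sparse matrix).

import numpy as np
from spateo.tools.dimensionality_reduction import truncated_SVD_with_center

X = np.random.default_rng(0).normal(size=(1000, 200))

# Centered truncated SVD: equivalent in spirit to PCA scores on X - X.mean(0).
fit, X_pca = truncated_SVD_with_center(X, n_components=30, random_state=0)
print(X_pca.shape)  # (1000, 30)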
- spateo.tools.dimensionality_reduction.find_optimal_pca_components(X: numpy.ndarray, pca_func: Callable, method: Literal['elbow', 'eigen'] = 'elbow', max_components: int | None = None, drop_ratio: float = 0.33, **kwargs) int [source]¶
Find the optimal number of PCA components using the elbow or eigenvalue (permutation) method.
- Parameters:
- X
The input data array of shape (n_samples, n_features)
- pca_func
The PCA function to use, which should have a ‘fit’ and ‘transform’ method, such as the PCA class or the IncrementalPCA class from sklearn.decomposition.
- method
Method to use to find the optimal number of components. Either ‘elbow’ or ‘eigen’. ‘elbow’ uses the elbow method; ‘eigen’ uses a permutation procedure to find the maximum eigenvalue to use as a threshold for the explained variance. Defaults to ‘elbow’.
- max_components
The maximum number of principal components to test. If not given, will use half the number of features (half the number of columns of the input array).
- drop_ratio
The ratio of the change in explained variance at which a drop is considered significant. Defaults to 0.33.
- **kwargs
Any additional keyword arguments that will be passed to the PCA function
- Returns:
Optimal number of components
- Return type:
n
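A closing sketch; the exact elbow and eigen criteria are internal to the function, so the max_components and drop_ratio values below are simply the documented knobs with illustrative settings.

import numpy as np
from sklearn.decomposition import PCA
from spateo.tools.dimensionality_reduction import find_optimal_pca_components

X = np.random.default_rng(0).normal(size=(500, 100))

# Elbow method over at most 50 candidate components.
n = find_optimal_pca_components(
    X,
    pca_func=PCA,
    method="elbow",
    max_components=50,
    drop_ratio=0.33,
)
print(n)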