spateo.tools.find_neighbors
Functions for finding nearest neighbors, the distances between them and the spatial weighting between points in spatial transcriptomics data.
Classes
Kernel | Spatial weights for regression models are learned using kernel functions.
Functions
calculate_distance | Given array of x- and y-coordinates, compute pairwise distances between all samples using Euclidean distance.
local_dist | For single sample, compute distance between that sample and each other sample in the data.
jaccard_index | Compute the Jaccard index between a row of a binary array and all other rows.
normalize_adj | Symmetrically normalize adjacency matrix, set diagonal to 1 and return processed adjacency array.
adj_to_knn | Given an adjacency matrix, convert to KNN graph.
knn_to_adj | Given the indices and weights of a KNN graph, convert to adjacency matrix.
compute_distances_and_connectivities | Computes connectivity and sparse distance matrices.
calculate_distances_chunk | Pairwise distance computation, coupled with :func:`find_bw`.
find_bw_for_n_neighbors | Finds the bandwidth such that on average, cells in the sample have n neighbors.
find_threshold_distance | Finds threshold distance beyond which there is a dramatic increase in the average distance to remaining nearest neighbors.
get_wi | Get spatial weights for an individual sample, given the coordinates of all samples in space.
construct_nn_graph | Constructing bucket-to-bucket nearest neighbors graph.
neighbors | Given an AnnData object, compute pairwise connectivity matrix in transcriptomic or physical space.
calculate_affinity | Given array of x- and y-coordinates, compute affinity matrix between all samples using Euclidean distance.
Module Contents
- spateo.tools.find_neighbors.calculate_distance(position: numpy.ndarray, dist_metric: str = 'euclidean') -> numpy.ndarray
Given array of x- and y-coordinates, compute pairwise distances between all samples using Euclidean distance.
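To make the output of a pairwise distance computation concrete, here is a minimal NumPy sketch that builds an (n, n) Euclidean distance matrix by broadcasting. It is an illustration only, not spateo's implementation; `pairwise_euclidean` is a hypothetical helper name.

```python
import numpy as np

def pairwise_euclidean(position: np.ndarray) -> np.ndarray:
    """Hypothetical helper: (n, d) coordinates -> (n, n) Euclidean distances."""
    # Broadcasting yields every coordinate difference, shape (n, n, d).
    diff = position[:, None, :] - position[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

position = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
dist = pairwise_euclidean(position)
print(dist)
```

The result is symmetric with zeros on the diagonal, which is what downstream neighbor-finding steps typically rely on.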
- spateo.tools.find_neighbors.local_dist(coords_i: numpy.ndarray, coords: numpy.ndarray)
For single sample, compute distance between that sample and each other sample in the data.
- Parameters:
- coords_i
Array of shape (n, ), where n is the dimensionality of the data; the coordinates of a single point
- coords
Array of shape (m, n), where n is the dimensionality of the data and m is an arbitrary number of samples; the points whose distances from coords_i are computed.
- Returns:
- array-like, shape (m, ), where m is an arbitrary number of samples. The pairwise distances
between coords_i and each point in coords.
- Return type:
distances
- spateo.tools.find_neighbors.jaccard_index(row_i: numpy.ndarray, array: numpy.ndarray)
Compute the Jaccard index between a row of a binary array and all other rows.
- Parameters:
- row_i
1D binary array representing the row for which to compute the Jaccard index.
- array
2D binary array containing the rows to compare against.
- Returns:
1D array of Jaccard indices between row_i and each row in array.
- Return type:
jaccard_indices
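As a sketch of the quantity described here, the Jaccard index of two binary rows is the size of their intersection divided by the size of their union. The helper name `jaccard_to_all` and the convention of returning 0 for an empty union are assumptions made for this example, not details taken from spateo.

```python
import numpy as np

def jaccard_to_all(row_i: np.ndarray, array: np.ndarray) -> np.ndarray:
    """Hypothetical helper: Jaccard index between row_i and every row of array."""
    row_i = row_i.astype(bool)
    array = array.astype(bool)
    intersection = np.logical_and(row_i, array).sum(axis=1)
    union = np.logical_or(row_i, array).sum(axis=1)
    # Assumption: rows with an empty union get an index of 0 (avoids 0/0).
    return np.divide(intersection, union, out=np.zeros(len(array)), where=union > 0)

array = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 0, 1, 1]])
jaccard = jaccard_to_all(array[0], array)
print(jaccard)
```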
- spateo.tools.find_neighbors.normalize_adj(adj: numpy.ndarray, exclude_self: bool = True) -> numpy.ndarray
Symmetrically normalize adjacency matrix, set diagonal to 1 and return processed adjacency array.
- Parameters:
- adj
Pairwise distance matrix of shape [n_samples, n_samples].
- exclude_self
Set True to set diagonal of adjacency matrix to 1.
- Returns:
The normalized adjacency matrix.
- Return type:
adj_proc
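Symmetric normalization is commonly written as D^{-1/2} A D^{-1/2}, where D is the diagonal matrix of row sums. The sketch below applies that form and then sets the diagonal to 1, as the description states. It assumes this standard definition and is not taken from the spateo source; `symmetric_normalize` is a hypothetical name.

```python
import numpy as np

def symmetric_normalize(adj: np.ndarray) -> np.ndarray:
    """Hypothetical helper: D^{-1/2} A D^{-1/2}, then diagonal set to 1."""
    adj = np.asarray(adj, dtype=float)
    degree = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(degree)
    nonzero = degree > 0  # isolated nodes keep a scaling factor of 0
    d_inv_sqrt[nonzero] = degree[nonzero] ** -0.5
    normalized = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    np.fill_diagonal(normalized, 1.0)
    return normalized

adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]])
adj_proc = symmetric_normalize(adj)
print(adj_proc)
```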
- spateo.tools.find_neighbors.adj_to_knn(adj: numpy.ndarray, n_neighbors: int = 15) -> Tuple[numpy.ndarray, numpy.ndarray]
Given an adjacency matrix, convert to KNN graph.
- Parameters:
- adj
Adjacency matrix of shape (n_samples, n_samples)
- n_neighbors
Number of nearest neighbors to include in the KNN graph
- Returns:
- indices: Array (n_samples x n_neighbors) storing the indices for each node’s nearest neighbors in the knn graph.
- weights: Array (n_samples x n_neighbors) storing the edge weights for each node’s nearest neighbors in the knn graph.
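One way to picture the conversion is to keep, for each row of the adjacency matrix, the n_neighbors entries with the largest weights. The sketch below does that with NumPy. Treating larger weights as closer neighbors is an assumption of this example, and `top_k_from_adjacency` is a hypothetical name; the library function may select neighbors differently.

```python
import numpy as np

def top_k_from_adjacency(adj: np.ndarray, n_neighbors: int):
    """Hypothetical helper: per-row indices and weights of the strongest edges."""
    # Sort each row by descending weight and keep the first n_neighbors columns.
    indices = np.argsort(-adj, axis=1)[:, :n_neighbors]
    weights = np.take_along_axis(adj, indices, axis=1)
    return indices, weights

adj = np.array([[0.0, 0.9, 0.1, 0.5],
                [0.9, 0.0, 0.4, 0.2],
                [0.1, 0.4, 0.0, 0.8],
                [0.5, 0.2, 0.8, 0.0]])
indices, weights = top_k_from_adjacency(adj, n_neighbors=2)
print(indices)
print(weights)
```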
- spateo.tools.find_neighbors.knn_to_adj(knn_indices: numpy.ndarray, knn_weights: numpy.ndarray) -> scipy.sparse.csr_matrix
Given the indices and weights of a KNN graph, convert to adjacency matrix.
- Parameters:
- knn_indices
Array (n_samples x n_neighbors) storing the indices for each node’s nearest neighbors in the knn graph.
- knn_weights
Array (n_samples x n_neighbors) storing the edge weights for each node’s nearest neighbors in the knn graph.
- Returns:
The adjacency matrix corresponding to the KNN graph
- Return type:
adj
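The reverse conversion scatters each weight into position (row, neighbor index) of an n_samples x n_samples matrix. The sketch below builds a dense NumPy array for readability, whereas the documented function returns a scipy.sparse.csr_matrix; `dense_adj_from_knn` is a hypothetical name.

```python
import numpy as np

def dense_adj_from_knn(knn_indices: np.ndarray, knn_weights: np.ndarray) -> np.ndarray:
    """Hypothetical helper: dense adjacency from KNN indices and weights."""
    n_samples, n_neighbors = knn_indices.shape
    adj = np.zeros((n_samples, n_samples))
    # Row index repeated once per neighbor, aligned with the flattened columns.
    rows = np.repeat(np.arange(n_samples), n_neighbors)
    adj[rows, knn_indices.ravel()] = knn_weights.ravel()
    return adj

knn_indices = np.array([[1, 2], [0, 2], [1, 0]])
knn_weights = np.array([[0.5, 0.2], [0.5, 0.3], [0.3, 0.2]])
adj = dense_adj_from_knn(knn_indices, knn_weights)
print(adj)
```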
- spateo.tools.find_neighbors.compute_distances_and_connectivities(knn_indices: numpy.ndarray, distances: numpy.ndarray) -> Tuple[scipy.sparse.csr_matrix, scipy.sparse.csr_matrix]
Computes connectivity and sparse distance matrices
- Parameters:
- knn_indices
Array of shape (n_samples, n_neighbors) containing the indices of the nearest neighbors for each sample.
- distances
The distances to the n_neighbors closest points in the knn graph
- Returns:
- distances: Sparse distance matrix
- connectivities: Sparse connectivity matrix
- spateo.tools.find_neighbors.calculate_distances_chunk(coords_chunk: numpy.ndarray, chunk_start_idx: int, coords: numpy.ndarray, n_nonzeros: dict | None = None, metric: str = 'euclidean') -> numpy.ndarray
Pairwise distance computation, coupled with :func:`find_bw`.
- Parameters:
- coords_chunk
Array of shape (n_samples_chunk, n_features) containing coordinates of the chunk of interest.
- chunk_start_idx
Index of the first sample in the chunk. Required if n_nonzeros is not None.
- coords
Array of shape (n_samples, n_features) containing the coordinates of all points.
- n_nonzeros
Optional dictionary containing the number of non-zero columns for each row in the distance matrix.
- metric
Distance metric to use for pairwise distance computation, can be any of the metrics supported by :func:`sklearn.metrics.pairwise_distances`.
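Computing distances one chunk of rows at a time keeps peak memory at chunk_size x n_samples rather than n_samples x n_samples. The generator below is a minimal sketch of that idea for the Euclidean case only. It ignores the n_nonzeros normalization, and `iter_distance_chunks` is a hypothetical name rather than a spateo function.

```python
import numpy as np

def iter_distance_chunks(coords: np.ndarray, chunk_size: int):
    """Hypothetical helper: yield (start index, distance block) per chunk of rows."""
    n_samples = coords.shape[0]
    for start in range(0, n_samples, chunk_size):
        chunk = coords[start:start + chunk_size]
        # Distances from every row in this chunk to all samples.
        diff = chunk[:, None, :] - coords[None, :, :]
        yield start, np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
coords = rng.random((10, 2))
blocks = list(iter_distance_chunks(coords, chunk_size=4))
starts = [start for start, _ in blocks]
full = np.vstack([block for _, block in blocks])
print(starts, full.shape)
```

Stacking the blocks reproduces the full matrix, so a caller can either accumulate statistics per chunk or assemble the whole result when memory allows.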
- spateo.tools.find_neighbors.find_bw_for_n_neighbors(adata: anndata.AnnData, coords_key: str = 'spatial', n_anchors: int | None = None, target_n_neighbors: int = 6, initial_bw: float | None = None, chunk_size: int = 1000, exclude_self: bool = False, normalize_distances: bool = False, verbose: bool = True, max_iterations: int = 100, alpha: float = 0.5) -> float
Finds the bandwidth such that on average, cells in the sample have n neighbors.
- Parameters:
- adata
AnnData object containing coordinates for all cells
- coords_key
Key in adata.obsm where the spatial coordinates are stored
- target_n_neighbors
Target average number of neighbors per cell
- initial_bw
Can optionally be used to set the starting distance for the bandwidth search
- chunk_size
Number of cells to compute pairwise distance for at once
- exclude_self
Whether to exclude self from the list of neighbors
- normalize_distances
Whether to normalize the distances by the number of nonzero columns (should be used only if the entry in .obsm[coords_key] contains something other than x-, y-, z-coordinates).
- verbose
Whether to print the bandwidth at each iteration. If False, will only print the final bandwidth.
- max_iterations
Will stop the process and return the bandwidth that results in the closest number of neighbors to the specified target if it takes more than this number of iterations.
- alpha
Factor used in determining the new bandwidth: the ratio of found neighbors to target neighbors is raised to this power.
- Returns:
Bandwidth in distance units
- Return type:
bandwidth
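The parameters above suggest an iterative search: count the average number of neighbors within the current bandwidth, compare it with the target, and rescale the bandwidth by the ratio raised to the power alpha. The sketch below is one possible reading of that procedure on a plain coordinate array. The direction of the update, the stopping tolerance `tol`, the default starting bandwidth, and both helper names are assumptions made for illustration, not spateo's implementation.

```python
import numpy as np

def average_neighbors(coords: np.ndarray, bw: float) -> float:
    """Hypothetical helper: mean number of other points within distance bw."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor here
    return float((dist <= bw).sum(axis=1).mean())

def search_bandwidth(coords, target_n_neighbors=6, initial_bw=None,
                     alpha=0.5, max_iterations=100, tol=0.25):
    """Hypothetical helper: bandwidth giving roughly the target neighbor count."""
    if initial_bw is None:
        # Assumed starting point: median nearest-neighbor distance.
        diff = coords[:, None, :] - coords[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(dist, np.inf)
        initial_bw = np.median(dist.min(axis=1))
    bw = float(initial_bw)
    best_bw, best_gap = bw, np.inf
    for _ in range(max_iterations):
        found = average_neighbors(coords, bw)
        gap = abs(found - target_n_neighbors)
        if gap < best_gap:
            best_bw, best_gap = bw, gap  # remember the closest bandwidth so far
        if gap <= tol:
            break
        if found == 0:
            bw *= 2.0  # nothing in range yet, so widen the search
        else:
            # Assumed update: shrink if too many neighbors, grow if too few.
            bw /= (found / target_n_neighbors) ** alpha
    return best_bw

rng = np.random.default_rng(0)
coords = rng.random((500, 2))
bw = search_bandwidth(coords, target_n_neighbors=6)
avg = average_neighbors(coords, bw)
print(bw, avg)
```

With alpha = 0.5 the update matches the fact that, in two dimensions, the neighbor count grows roughly with the square of the bandwidth, so the search tends to settle within a few iterations.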
- spateo.tools.find_neighbors.find_threshold_distance(adata: anndata.AnnData, coords_key: str = 'X_pca', n_neighbors: int = 10, chunk_size: int = 1000, normalize_distances: bool = False) -> float
Finds threshold distance beyond which there is a dramatic increase in the average distance to remaining nearest neighbors.
- Parameters:
- adata
AnnData object containing coordinates for all cells
- coords_key
Key in adata.obsm where the spatial coordinates are stored
- n_neighbors
Will first compute the number of nearest neighbors as a comparison for querying additional distance values.
- chunk_size
Number of cells to compute pairwise distance for at once
- normalize_distances
Whether to normalize the distances by the number of nonzero columns (should be used only if the entry in .obsm[coords_key] contains something other than x-, y-, z-coordinates).
- Returns:
Bandwidth in distance units
- Return type:
bandwidth
- class spateo.tools.find_neighbors.Kernel(i: int, data: numpy.ndarray | scipy.sparse.spmatrix, bw: int | float, cov: numpy.ndarray | None = None, ct: numpy.ndarray | None = None, expr_mat: numpy.ndarray | None = None, fixed: bool = True, exclude_self: bool = False, function: str = 'triangular', threshold: float = 1e-05, eps: float = 1.0000001, sparse_array: bool = False, normalize_weights: bool = False, use_expression_neighbors: bool = False)
Bases: object
Spatial weights for regression models are learned using kernel functions.
- Args:
i: Index of the point for which to estimate the density.
data: Array of shape (n_samples, n_features) representing the data. If aiming to derive weights from spatial distance, this should be the array of spatial positions.
bw: Bandwidth parameter for the kernel density estimation.
cov: Optional array of shape (n_samples, ). Can be used to adjust the distance calculation to look only at samples of interest vs. samples not of interest, which is determined from nonzero values in this vector. This can be used to modify the modeling process based on factors thought to reflect biological differences, for example, to condition on histological classification, passing a distance threshold, etc. If ‘ct’ is also given, will look for samples of interest that are also of the same cell type.
ct: Optional array of shape (n_samples, ), containing vector where cell types are encoded as integers. Can be used to condition nearest neighbor finding on cell type or other category.
expr_mat: Can be used together with ‘cov’ (so will only be used if ‘cov’ is not None): if the spatial neighbors are not consistent with the sample in question (determined by assessing similarity by “cov”), there may be different mechanisms at play. In this case, will instead search for nearest neighbors in the gene expression space if given.
fixed: If True, bw is treated as a fixed bandwidth parameter. Otherwise, it is treated as the number of nearest neighbors to include in the bandwidth estimation.
exclude_self: If True, ignore each sample itself when computing the kernel density estimation.
function: The name of the kernel function to use. Valid options are as follows (note that in equations, any “1” can be substituted for any other value(s)):
- ‘triangular’: linearly decaying kernel,
- ‘quadratic’: quadratically decaying kernel, :math:`K(u) = \dfrac{3}{4}(1-u^2)`,
- ‘gaussian’: decays following normal distribution, :math:`K(u) = \dfrac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}u^2}`,
- ‘uniform’: AKA the tophat kernel, sets weight of all observations within the bandwidth to the same value, :math:`K(u) = \begin{cases} 1 & \text{if } |u| \leq 1 \\ 0 & \text{otherwise} \end{cases}`,
- ‘exponential’: exponentially decaying kernel, :math:`K(u) = e^{-|u|}`,
- ‘bisquare’: assigns a weight of zero to observations outside of the bandwidth, and decays within the bandwidth following equation :math:`K(u) = \begin{cases} \dfrac{15}{16}(1-u^2)^2 & \text{if } |u| \leq 1 \\ 0 & \text{otherwise} \end{cases}`.
threshold: Threshold for the kernel density estimation. If the density is below this threshold, the density will be set to zero.
eps: Error-correcting factor to avoid division by zero.
sparse_array: If True, the kernel will be converted to sparse array. Recommended for large datasets.
normalize_weights: If True, the weights will be normalized to sum to 1.
use_expression_neighbors: If True, will only use the expression matrix to find nearest neighbors.
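The kernel formulas listed for the `function` argument can be evaluated directly on a scaled distance u. The sketch below transcribes them into NumPy. Two points are assumptions rather than statements from the text: the triangular kernel uses the standard form 1 - |u|, since no equation is given for it, and the quadratic kernel is truncated to zero outside the bandwidth. `kernel_weight` is a hypothetical helper, not part of the `Kernel` class.

```python
import numpy as np

def kernel_weight(u, function="gaussian"):
    """Hypothetical helper: evaluate a named kernel at scaled distance(s) u."""
    u = np.asarray(u, dtype=float)
    inside = np.abs(u) <= 1  # observations within the bandwidth
    if function == "triangular":
        # Assumed standard form; the text gives no equation for this kernel.
        return np.where(inside, 1.0 - np.abs(u), 0.0)
    if function == "quadratic":
        # Assumption: truncated to zero outside the bandwidth.
        return np.where(inside, 0.75 * (1.0 - u ** 2), 0.0)
    if function == "gaussian":
        return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    if function == "uniform":
        return np.where(inside, 1.0, 0.0)
    if function == "exponential":
        return np.exp(-np.abs(u))
    if function == "bisquare":
        return np.where(inside, (15.0 / 16.0) * (1.0 - u ** 2) ** 2, 0.0)
    raise ValueError(f"Unknown kernel function: {function}")

u = np.array([0.0, 0.5, 1.0, 2.0])
for name in ("triangular", "quadratic", "gaussian", "uniform", "exponential", "bisquare"):
    print(name, kernel_weight(u, name))
```

The compactly supported kernels (triangular, quadratic, uniform, bisquare) return exactly zero beyond the bandwidth, while the gaussian and exponential kernels only decay, which is where a cutoff such as `threshold` becomes relevant.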
- spateo.tools.find_neighbors.get_wi(i: int, n_samples: int, coords: numpy.ndarray, cov: numpy.ndarray | None = None, ct: numpy.ndarray | None = None, expr_mat: numpy.ndarray | None = None, fixed_bw: bool = True, exclude_self: bool = False, kernel: str = 'gaussian', bw: float | int = 100, threshold: float = 1e-05, sparse_array: bool = False, normalize_weights: bool = False, use_expression_neighbors: bool = False) -> scipy.sparse.csr_matrix
Get spatial weights for an individual sample, given the coordinates of all samples in space.
- Parameters:
- i
Index of sample for which weights are to be calculated to all other samples in the dataset
- n_samples
Total number of samples in the dataset
- coords
Array of shape (n_samples, 2) or (n_samples, 3) representing the spatial coordinates of each sample
- cov
Optional array of shape (n_samples, ). Can be used to adjust the distance calculation to look only at samples of interest vs. samples not of interest, which is determined from nonzero values in this vector. This can be used to modify the modeling process based on factors thought to reflect biological differences, for example, to condition on histological classification, passing a distance threshold, etc. If ‘ct’ is also given, will look for samples of interest that are also of the same cell type.
- ct
Optional array of shape (n_samples, ), containing vector where cell types are encoded as integers. Can be used to condition nearest neighbor finding on cell type or other category.
- expr_mat
Can be used together with ‘cov’: if the spatial neighbors are not consistent with the sample in question (determined by assessing similarity by ‘cov’), there may be different mechanisms at play. In this case, will instead search for nearest neighbors in the gene expression space if given.
- fixed_bw
If True, bw is treated as a spatial distance for computing spatial weights. Otherwise, it is treated as the number of neighbors.
- exclude_self
If True, ignore each sample itself when computing the kernel density estimation
- kernel
The name of the kernel function to use. Valid options: “triangular”, “uniform”, “quadratic”, “bisquare”, “gaussian” or “exponential”
- bw
Bandwidth for the spatial kernel
- threshold
Threshold for the kernel density estimation. If the density is below this threshold, the density will be set to zero.
- sparse_array
If True, the kernel will be converted to sparse array. Recommended for large datasets.
- normalize_weights
If True, the weights will be normalized to sum to 1.
- use_expression_neighbors
If True, will only use expression neighbors to determine the bandwidth.
- Returns:
Array of weights for sample of interest
- Return type:
wi
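To illustrate how a fixed bandwidth, a threshold, and optional normalization might combine for a single sample, the sketch below computes Gaussian-kernel weights from sample i to all samples. It assumes the scaled distance is u = distance / bw and returns a dense 1D array instead of a sparse matrix. `gaussian_weights` is a hypothetical name, and the sketch omits the cov, ct, and expr_mat options.

```python
import numpy as np

def gaussian_weights(i, coords, bw, threshold=1e-5,
                     exclude_self=False, normalize_weights=False):
    """Hypothetical helper: kernel weights from sample i to every sample."""
    dist = np.sqrt(((coords - coords[i]) ** 2).sum(axis=1))
    u = dist / bw  # assumed scaling of distance by the bandwidth
    w = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    w[w < threshold] = 0.0  # weights below the threshold are set to zero
    if exclude_self:
        w[i] = 0.0
    if normalize_weights and w.sum() > 0:
        w = w / w.sum()
    return w

coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [50.0, 50.0]])
wi = gaussian_weights(0, coords, bw=1.0, normalize_weights=True)
print(wi)
```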
- spateo.tools.find_neighbors.construct_nn_graph(adata: anndata.AnnData, spatial_key: str = 'spatial', dist_metric: str = 'euclidean', n_neighbors: int = 8, exclude_self: bool = True, make_symmetrical: bool = False, save_id: None | str = None) -> None
Constructing bucket-to-bucket nearest neighbors graph.
- Parameters:
- adata
An anndata object.
- spatial_key
Key in .obsm in which x- and y-coordinates are stored.
- dist_metric
Distance metric to use. Options: ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulczynski1’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’.
- n_neighbors
Number of nearest neighbors to compute for each bucket.
- exclude_self
Set True to set elements along the diagonal to zero.
- make_symmetrical
Set True to make sure adjacency matrix is symmetrical (i.e. ensure that if A is a neighbor of B, B is also included among the neighbors of A)
- save_id
Optional string; if not None, will save the distance matrix and neighbors matrix to './neighbors/{save_id}_distance.csv' and './neighbors/{save_id}_neighbors.csv', respectively.
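A k-nearest-neighbors graph is directed by construction: A can be among B's nearest neighbors without B being among A's. The sketch below shows this on four points and one common way to symmetrize, taking the element-wise maximum of the adjacency matrix and its transpose. This is an illustration of the make_symmetrical idea under that assumption, not the code path spateo uses; `knn_adjacency` is a hypothetical name.

```python
import numpy as np

def knn_adjacency(coords, n_neighbors, make_symmetrical=False):
    """Hypothetical helper: binary KNN adjacency, optionally symmetrized."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # exclude self from the neighbor search
    nearest = np.argsort(dist, axis=1)[:, :n_neighbors]
    adj = np.zeros(dist.shape, dtype=int)
    rows = np.repeat(np.arange(len(coords)), n_neighbors)
    adj[rows, nearest.ravel()] = 1
    if make_symmetrical:
        # If A lists B as a neighbor, also list A as a neighbor of B.
        adj = np.maximum(adj, adj.T)
    return adj

coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.5, 0.0], [10.0, 0.0]])
directed = knn_adjacency(coords, n_neighbors=1)
symmetric = knn_adjacency(coords, n_neighbors=1, make_symmetrical=True)
print(directed)
print(symmetric)
```

Note that after symmetrization some rows contain more than n_neighbors entries, which is the price of guaranteeing mutual edges.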
- spateo.tools.find_neighbors.neighbors(adata: anndata.AnnData, nbr_object: sklearn.neighbors.NearestNeighbors = None, basis: str = 'pca', spatial_key: str = 'spatial', n_neighbors_method: str = 'ball_tree', n_pca_components: int = 30, n_neighbors: int = 10) -> Tuple[sklearn.neighbors.NearestNeighbors, anndata.AnnData]
Given an AnnData object, compute pairwise connectivity matrix in transcriptomic or physical space
- Parameters:
- adata
an anndata object.
- nbr_object
An optional sklearn.neighbors.NearestNeighbors object. Can optionally create a nearest neighbor object with custom functionality.
- basis
The space that will be used for nearest neighbor search (default ‘pca’). Valid names include, for example, ‘pca’, ‘umap’, or ‘X’ for gene expression neighbors, and ‘spatial’ for neighbors in physical space.
- spatial_key
Optional, can be used to specify .obsm entry in adata that contains spatial coordinates. Only used if basis is ‘spatial’.
- n_neighbors_method
Specifies the algorithm to use in computing neighbors using sklearn’s implementation (default ‘ball_tree’). Options: “ball_tree” and “kd_tree”.
- n_pca_components
Only used if ‘basis’ is ‘pca’. Sets number of principal components to compute (if PCA has not already been computed for this dataset).
- n_neighbors
Number of neighbors for kneighbors queries.
- Returns:
- nbrs: Object of class sklearn.neighbors.NearestNeighbors
- adata: Modified AnnData object
- spateo.tools.find_neighbors.calculate_affinity(position: numpy.ndarray, dist_metric: str = 'euclidean', n_neighbors: int = 10) -> numpy.ndarray
Given array of x- and y-coordinates, compute affinity matrix between all samples using Euclidean distance. Math from: Zelnik-Manor, L., & Perona, P. (2004). Self-tuning spectral clustering. Advances in neural information processing systems, 17. https://proceedings.neurips.cc/paper/2004/file/40173ea48d9567f1f393b20c855bb40b-Paper.pdf
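In the cited self-tuning formulation, each point gets a local scale sigma_i equal to the distance to its K-th nearest neighbor, and the affinity between points i and j is exp(-d_ij^2 / (sigma_i * sigma_j)). The sketch below implements that formula with NumPy, using n_neighbors as K. Zeroing the diagonal and the helper name `self_tuning_affinity` are choices made for this example, which may differ from what the library function does.

```python
import numpy as np

def self_tuning_affinity(position: np.ndarray, n_neighbors: int = 10) -> np.ndarray:
    """Hypothetical helper: locally scaled affinity matrix (needs n > n_neighbors)."""
    diff = position[:, None, :] - position[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Column 0 of each sorted row is the point itself (distance 0), so column
    # n_neighbors holds the distance to the n_neighbors-th nearest neighbor.
    sigma = np.sort(dist, axis=1)[:, n_neighbors]
    affinity = np.exp(-dist ** 2 / (sigma[:, None] * sigma[None, :]))
    np.fill_diagonal(affinity, 0.0)  # choice made in this sketch
    return affinity

rng = np.random.default_rng(0)
position = rng.random((30, 2))
affinity = self_tuning_affinity(position, n_neighbors=5)
print(affinity.shape)
```

Because each pair is scaled by both points' local scales, dense and sparse regions of the tissue receive comparable affinities without choosing a single global bandwidth.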