spateo.tools.find_neighbors

Functions for finding nearest neighbors, the distances between them and the spatial weighting between points in spatial transcriptomics data.

Classes

Kernel

Spatial weights for regression models are learned using kernel functions.

Functions

`calculate_distance`(→ numpy.ndarray)	Given array of x- and y-coordinates, compute pairwise distances between all samples using Euclidean distance.
`local_dist`(coords_i, coords)	For single sample, compute distance between that sample and each other sample in the data.
`jaccard_index`(row_i, array)	Compute the Jaccard index between a row of a binary array and all other rows.
`normalize_adj`(→ numpy.ndarray)	Symmetrically normalize adjacency matrix, set diagonal to 1 and return processed adjacency array.
`adj_to_knn`(→ Tuple[numpy.ndarray, numpy.ndarray])	Given an adjacency matrix, convert to KNN graph.
`knn_to_adj`(→ scipy.sparse.csr_matrix)	Given the indices and weights of a KNN graph, convert to adjacency matrix.
`compute_distances_and_connectivities`(...)	Computes connectivity and sparse distance matrices
`calculate_distances_chunk`(→ numpy.ndarray)	Pairwise distance computation, coupled with `find_bw`.
`find_bw_for_n_neighbors`(→ float)	Finds the bandwidth such that on average, cells in the sample have n neighbors.
`find_threshold_distance`(→ float)	Finds threshold distance beyond which there is a dramatic increase in the average distance to remaining nearest neighbors.
`get_wi`(→ scipy.sparse.csr_matrix)	Get spatial weights for an individual sample, given the coordinates of all samples in space.
`construct_nn_graph`(→ None)	Constructing bucket-to-bucket nearest neighbors graph.
`neighbors`(→ Tuple[sklearn.neighbors.NearestNeighbors, ...)	Given an AnnData object, compute pairwise connectivity matrix in transcriptomic or physical space
`calculate_affinity`(→ numpy.ndarray)	Given array of x- and y-coordinates, compute affinity matrix between all samples using Euclidean distance.

Module Contents

spateo.tools.find_neighbors.calculate_distance(position: numpy.ndarray, dist_metric: str = 'euclidean') → numpy.ndarray

Given array of x- and y-coordinates, compute pairwise distances between all samples using the metric given by dist_metric (Euclidean by default).
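As a rough illustration of what a dense pairwise Euclidean distance matrix looks like, here is a minimal NumPy sketch. The helper name `pairwise_euclidean` is hypothetical and this is not spateo's implementation; it only mirrors the behavior described above for the default metric.

```python
import numpy as np


def pairwise_euclidean(position: np.ndarray) -> np.ndarray:
    """Hypothetical helper: dense (n_samples, n_samples) Euclidean distance matrix."""
    # Broadcast to shape (n, n, d), then reduce over the coordinate axis.
    diff = position[:, None, :] - position[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


position = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
dist = pairwise_euclidean(position)
```

The result is symmetric with a zero diagonal. Memory grows quadratically with the number of samples, which is presumably why the module also offers a chunked variant.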

spateo.tools.find_neighbors.local_dist(coords_i: numpy.ndarray, coords: numpy.ndarray)

For single sample, compute distance between that sample and each other sample in the data.

Parameters:

coords_i: Array of shape (n, ), where n is the dimensionality of the data; the coordinates of a single point
coords: Array of shape (m, n), where n is the dimensionality of the data and m is an arbitrary number of samples; the points to which distances from coords_i are computed.

Returns:

distances: Array-like of shape (m, ), where m is an arbitrary number of samples. The pairwise distances between coords_i and each point in coords.

spateo.tools.find_neighbors.jaccard_index(row_i: numpy.ndarray, array: numpy.ndarray)

Compute the Jaccard index between a row of a binary array and all other rows.

Parameters:

row_i: 1D binary array representing the row for which to compute the Jaccard index.
array: 2D binary array containing the rows to compare against.

Returns:

jaccard_indices: 1D array of Jaccard indices between row_i and each row in array.
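The Jaccard index of two binary vectors is the size of their intersection divided by the size of their union. The sketch below shows that computation for one row against every row of a binary array. The helper `jaccard_rows` is a hypothetical name, and defining the index as 0 when the union is empty is an assumption of this sketch, not documented spateo behavior.

```python
import numpy as np


def jaccard_rows(row_i: np.ndarray, array: np.ndarray) -> np.ndarray:
    """Hypothetical helper: Jaccard index between row_i and each row of array."""
    row_i = row_i.astype(bool)
    array = array.astype(bool)
    intersection = np.logical_and(array, row_i).sum(axis=1)
    union = np.logical_or(array, row_i).sum(axis=1)
    # Assumption: return 0 where both rows are all-zero (empty union).
    out = np.zeros(array.shape[0], dtype=float)
    return np.divide(intersection, union, out=out, where=union > 0)


row_i = np.array([1, 1, 0, 0])
array = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]])
scores = jaccard_rows(row_i, array)
```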

spateo.tools.find_neighbors.normalize_adj(adj: numpy.ndarray, exclude_self: bool = True) → numpy.ndarray

Symmetrically normalize adjacency matrix, set diagonal to 1 and return processed adjacency array.

Parameters:

adj: Pairwise distance matrix of shape [n_samples, n_samples].
exclude_self: Set True to set diagonal of adjacency matrix to 1.

Returns:

adj_proc: The normalized adjacency matrix.
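Symmetric normalization usually means scaling the adjacency matrix as D^(-1/2) A D^(-1/2), where D is the diagonal matrix of row sums. The following sketch shows only that scaling step under this assumption. The helper `sym_normalize` is hypothetical, and it deliberately leaves out the diagonal handling controlled by exclude_self.

```python
import numpy as np


def sym_normalize(adj: np.ndarray) -> np.ndarray:
    """Hypothetical helper: D^(-1/2) A D^(-1/2) with D the diagonal of row sums."""
    deg = adj.sum(axis=1).astype(float)
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.power(deg, -0.5)
    # Isolated nodes have zero degree; keep their rows and columns at zero.
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


adj = np.array([[0.0, 1.0, 1.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
norm = sym_normalize(adj)
```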

spateo.tools.find_neighbors.adj_to_knn(adj: numpy.ndarray, n_neighbors: int = 15) → Tuple[numpy.ndarray, numpy.ndarray]

Given an adjacency matrix, convert to KNN graph.

Parameters:

adj: Adjacency matrix of shape (n_samples, n_samples)
n_neighbors: Number of nearest neighbors to include in the KNN graph

Returns:

indices: Array (n_samples x n_neighbors) storing the indices for each node’s nearest neighbors in the knn graph.
weights: Array (n_samples x n_neighbors) storing the edge weights for each node’s nearest neighbors in the knn graph.
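One way to picture the conversion is to keep, for each row, the n_neighbors entries with the largest weights. The sketch below does this with NumPy. The helper `adj_to_knn_sketch` is hypothetical, and treating a larger weight as a closer neighbor is an assumption of this example rather than a statement about how spateo selects neighbors.

```python
import numpy as np


def adj_to_knn_sketch(adj: np.ndarray, n_neighbors: int):
    """Hypothetical helper: per-row top-k indices and weights of a dense adjacency."""
    # Assumption: larger weight means stronger connection, so sort descending.
    indices = np.argsort(-adj, axis=1)[:, :n_neighbors]
    weights = np.take_along_axis(adj, indices, axis=1)
    return indices, weights


adj = np.array([[0.0, 0.9, 0.1], [0.9, 0.0, 0.5], [0.1, 0.5, 0.0]])
indices, weights = adj_to_knn_sketch(adj, n_neighbors=2)
```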

spateo.tools.find_neighbors.knn_to_adj(knn_indices: numpy.ndarray, knn_weights: numpy.ndarray) → scipy.sparse.csr_matrix

Given the indices and weights of a KNN graph, convert to adjacency matrix.

Parameters:

knn_indices: Array (n_samples x n_neighbors) storing the indices for each node’s nearest neighbors in the knn graph.
knn_weights: Array (n_samples x n_neighbors) storing the edge weights for each node’s nearest neighbors in the knn graph.

Returns:

adj: The adjacency matrix corresponding to the KNN graph
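The reverse conversion can be sketched by placing each (row, neighbor index) pair into a sparse matrix with its weight. This is a minimal illustration using `scipy.sparse.csr_matrix`. The helper `knn_to_adj_sketch` is a hypothetical name and may differ from spateo's implementation in details such as symmetrization.

```python
import numpy as np
from scipy.sparse import csr_matrix


def knn_to_adj_sketch(knn_indices: np.ndarray, knn_weights: np.ndarray) -> csr_matrix:
    """Hypothetical helper: build a sparse adjacency from KNN indices and weights."""
    n_samples, n_neighbors = knn_indices.shape
    # Row index of every edge: sample 0 repeated k times, then sample 1, and so on.
    rows = np.repeat(np.arange(n_samples), n_neighbors)
    cols = knn_indices.ravel()
    return csr_matrix((knn_weights.ravel(), (rows, cols)), shape=(n_samples, n_samples))


knn_indices = np.array([[1, 2], [0, 2], [1, 0]])
knn_weights = np.array([[0.9, 0.1], [0.9, 0.5], [0.5, 0.1]])
adj = knn_to_adj_sketch(knn_indices, knn_weights)
```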

spateo.tools.find_neighbors.compute_distances_and_connectivities(knn_indices: numpy.ndarray, distances: numpy.ndarray) → Tuple[scipy.sparse.csr_matrix, scipy.sparse.csr_matrix]

Computes connectivity and sparse distance matrices

Parameters:

knn_indices: Array of shape (n_samples, n_samples) containing the indices of the nearest neighbors for each sample.
distances: The distances to the n_neighbors closest points in the knn graph

Returns:

distances: Sparse distance matrix
connectivities: Sparse connectivity matrix

spateo.tools.find_neighbors.calculate_distances_chunk(coords_chunk: numpy.ndarray, chunk_start_idx: int, coords: numpy.ndarray, n_nonzeros: dict | None = None, metric: str = 'euclidean') → numpy.ndarray

Pairwise distance computation, coupled with `find_bw`.

Parameters:

coords_chunk: Array of shape (n_samples_chunk, n_features) containing coordinates of the chunk of interest.
chunk_start_idx: Index of the first sample in the chunk. Required if n_nonzeros is not None.
coords: Array of shape (n_samples, n_features) containing the coordinates of all points.
n_nonzeros: Optional dictionary containing the number of non-zero columns for each row in the distance matrix.
metric: Distance metric to use for pairwise distance computation, can be any of the metrics supported by `sklearn.metrics.pairwise_distances`.

spateo.tools.find_neighbors.find_bw_for_n_neighbors(adata: anndata.AnnData, coords_key: str = 'spatial', n_anchors: int | None = None, target_n_neighbors: int = 6, initial_bw: float | None = None, chunk_size: int = 1000, exclude_self: bool = False, normalize_distances: bool = False, verbose: bool = True, max_iterations: int = 100, alpha: float = 0.5) → float

Finds the bandwidth such that on average, cells in the sample have n neighbors.

Parameters:

adata: AnnData object containing coordinates for all cells
coords_key: Key in adata.obsm where the spatial coordinates are stored
target_n_neighbors: Target average number of neighbors per cell
initial_bw: Can optionally be used to set the starting distance for the bandwidth search
chunk_size: Number of cells to compute pairwise distance for at once
exclude_self: Whether to exclude self from the list of neighbors
normalize_distances: Whether to normalize the distances by the number of nonzero columns (should be used only if the entry in .obsm[coords_key] contains something other than x-, y-, z-coordinates).
verbose: Whether to print the bandwidth at each iteration. If False, will only print the final bandwidth.
max_iterations: Will stop the process and return the bandwidth that results in the closest number of neighbors to the specified target if it takes more than this number of iterations.
alpha: Factor used in determining the new bandwidth: the ratio of found neighbors to target neighbors will be raised to this power.

Returns:

bandwidth: Bandwidth in distance units
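The parameter descriptions above suggest an iterative search: count the average number of neighbors at the current bandwidth, compare it with the target, and rescale the bandwidth by the found-to-target ratio raised to alpha. The sketch below is one possible reading of that loop, not spateo's actual code. The names `mean_neighbors`, `search_bandwidth` and `rel_tol`, the division-based update rule, the stopping tolerance, and excluding each point from its own count are all assumptions made for illustration.

```python
import numpy as np


def mean_neighbors(dist: np.ndarray, bw: float) -> float:
    """Average number of other points within distance bw of each point."""
    # Subtract 1 to drop each point's zero distance to itself.
    return float(((dist <= bw).sum(axis=1) - 1).mean())


def search_bandwidth(coords, target_n_neighbors=6, initial_bw=1.0,
                     alpha=0.5, max_iterations=100, rel_tol=0.05):
    """Hypothetical sketch of an iterative bandwidth search."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    bw = float(initial_bw)
    best_bw, best_gap = bw, np.inf
    for _ in range(max_iterations):
        found = mean_neighbors(dist, bw)
        gap = abs(found - target_n_neighbors)
        if gap < best_gap:
            best_bw, best_gap = bw, gap  # remember the closest bandwidth so far
        if gap <= rel_tol * target_n_neighbors:
            break
        # Assumed update: shrink when too many neighbors are found, grow when too few.
        ratio = max(found, 1e-6) / target_n_neighbors
        bw = bw / ratio ** alpha
    return best_bw


rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 10.0, size=(200, 2))
bw = search_bandwidth(coords, target_n_neighbors=6)
```

For roughly uniform 2D data the neighbor count scales with the square of the bandwidth, so an exponent of 0.5 moves the bandwidth close to the target in very few steps, which may be why 0.5 is the documented default.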

spateo.tools.find_neighbors.find_threshold_distance(adata: anndata.AnnData, coords_key: str = 'X_pca', n_neighbors: int = 10, chunk_size: int = 1000, normalize_distances: bool = False) → float

Finds threshold distance beyond which there is a dramatic increase in the average distance to remaining nearest neighbors.

Parameters:

adata: AnnData object containing coordinates for all cells
coords_key: Key in adata.obsm where the spatial coordinates are stored
n_neighbors: Will first compute the number of nearest neighbors as a comparison for querying additional distance values.
chunk_size: Number of cells to compute pairwise distance for at once
normalize_distances: Whether to normalize the distances by the number of nonzero columns (should be used only if the entry in .obsm[coords_key] contains something other than x-, y-, z-coordinates).

Returns:

bandwidth: Bandwidth in distance units

class spateo.tools.find_neighbors.Kernel(i: int, data: numpy.ndarray | scipy.sparse.spmatrix, bw: int | float, cov: numpy.ndarray | None = None, ct: numpy.ndarray | None = None, expr_mat: numpy.ndarray | None = None, fixed: bool = True, exclude_self: bool = False, function: str = 'triangular', threshold: float = 1e-05, eps: float = 1.0000001, sparse_array: bool = False, normalize_weights: bool = False, use_expression_neighbors: bool = False)

Bases: object

Spatial weights for regression models are learned using kernel functions.

Parameters:

i: Index of the point for which to estimate the density
data: Array of shape (n_samples, n_features) representing the data. If aiming to derive weights from spatial distance, this should be the array of spatial positions.
bw: Bandwidth parameter for the kernel density estimation
cov: Optional array of shape (n_samples, ). Can be used to adjust the distance calculation to look only at samples of interest vs. samples not of interest, which is determined from nonzero values in this vector. This can be used to modify the modeling process based on factors thought to reflect biological differences, for example, to condition on histological classification, passing a distance threshold, etc. If ‘ct’ is also given, will look for samples of interest that are also of the same cell type.
ct: Optional array of shape (n_samples, ), containing vector where cell types are encoded as integers. Can be used to condition nearest neighbor finding on cell type or other category.
expr_mat: Can be used together with ‘cov’ (so will only be used if ‘cov’ is not None); if the spatial neighbors are not consistent with the sample in question (determined by assessing similarity by “cov”), there may be different mechanisms at play. In this case, will instead search for nearest neighbors in the gene expression space if given.
fixed: If True, bw is treated as a fixed bandwidth parameter. Otherwise, it is treated as the number of nearest neighbors to include in the bandwidth estimation.
exclude_self: If True, ignore each sample itself when computing the kernel density estimation
function: The name of the kernel function to use. Valid options are as follows (note that in equations, any “1” can be substituted for any other value(s)):
- ‘triangular’: linearly decaying kernel, :math:`K(u) = \begin{cases} 1-|u| & \text{if } |u| \leq 1 \\ 0 & \text{otherwise} \end{cases}`
- ‘quadratic’: quadratically decaying kernel, :math:`K(u) = \dfrac{3}{4}(1-u^2)`
- ‘gaussian’: decays following normal distribution, :math:`K(u) = \dfrac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}u^2}`
- ‘uniform’: AKA the tophat kernel, sets weight of all observations within the bandwidth to the same value, :math:`K(u) = \begin{cases} 1 & \text{if } |u| \leq 1 \\ 0 & \text{otherwise} \end{cases}`
- ‘exponential’: exponentially decaying kernel, :math:`K(u) = e^{-|u|}`
- ‘bisquare’: assigns a weight of zero to observations outside of the bandwidth, and decays within the bandwidth following equation :math:`K(u) = \begin{cases} \dfrac{15}{16}(1-u^2)^2 & \text{if } |u| \leq 1 \\ 0 & \text{otherwise} \end{cases}`
threshold: Threshold for the kernel density estimation. If the density is below this threshold, the density will be set to zero.
eps: Error-correcting factor to avoid division by zero
sparse_array: If True, the kernel will be converted to sparse array. Recommended for large datasets.
normalize_weights: If True, the weights will be normalized to sum to 1.
use_expression_neighbors: If True, will only use the expression matrix to find nearest neighbors.

kernel

_kernel_functions(x)
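The kernel equations listed for the function parameter can be written directly in NumPy. The sketch below evaluates them on scaled distances u = distance / bw and zeroes any weight below the threshold. The names `KERNELS` and `kernel_weights` are hypothetical, and scaling distances by dividing by bw is an assumption; this is an illustration of the formulas, not the class's internal code.

```python
import numpy as np

# Kernel formulas as listed in the parameter description, applied to u = dist / bw.
KERNELS = {
    "triangular": lambda u: np.where(np.abs(u) <= 1, 1 - np.abs(u), 0.0),
    # Written exactly as documented; it goes negative for |u| > 1,
    # and those values are zeroed by the threshold step below.
    "quadratic": lambda u: 0.75 * (1 - u ** 2),
    "gaussian": lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi),
    "uniform": lambda u: np.where(np.abs(u) <= 1, 1.0, 0.0),
    "exponential": lambda u: np.exp(-np.abs(u)),
    "bisquare": lambda u: np.where(np.abs(u) <= 1, (15 / 16) * (1 - u ** 2) ** 2, 0.0),
}


def kernel_weights(dist, bw, function="triangular", threshold=1e-5):
    """Hypothetical helper: kernel weights for distances at bandwidth bw."""
    u = np.asarray(dist, dtype=float) / bw  # assumption: distances are scaled by bw
    w = KERNELS[function](u)
    return np.where(w < threshold, 0.0, w)


dist = np.array([0.0, 50.0, 100.0, 150.0])
w_tri = kernel_weights(dist, bw=100, function="triangular")
w_bisq = kernel_weights(dist, bw=100, function="bisquare")
w_gauss = kernel_weights(dist, bw=100, function="gaussian")
```

Compactly supported kernels (triangular, uniform, bisquare) give exact zeros beyond the bandwidth, which pairs naturally with the sparse_array option, while the Gaussian and exponential kernels only reach zero through the threshold.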

spateo.tools.find_neighbors.get_wi(i: int, n_samples: int, coords: numpy.ndarray, cov: numpy.ndarray | None = None, ct: numpy.ndarray | None = None, expr_mat: numpy.ndarray | None = None, fixed_bw: bool = True, exclude_self: bool = False, kernel: str = 'gaussian', bw: float | int = 100, threshold: float = 1e-05, sparse_array: bool = False, normalize_weights: bool = False, use_expression_neighbors: bool = False) → scipy.sparse.csr_matrix

Get spatial weights for an individual sample, given the coordinates of all samples in space.

Parameters:

i: Index of sample for which weights are to be calculated to all other samples in the dataset
n_samples: Total number of samples in the dataset
coords: Array of shape (n_samples, 2) or (n_samples, 3) representing the spatial coordinates of each sample
cov: Optional array of shape (n_samples, ). Can be used to adjust the distance calculation to look only at samples of interest vs. samples not of interest, which is determined from nonzero values in this vector. This can be used to modify the modeling process based on factors thought to reflect biological differences, for example, to condition on histological classification, passing a distance threshold, etc. If ‘ct’ is also given, will look for samples of interest that are also of the same cell type.
ct: Optional array of shape (n_samples, ), containing vector where cell types are encoded as integers. Can be used to condition nearest neighbor finding on cell type or other category.
expr_mat: Can be used together with ‘cov’; if the spatial neighbors are not consistent with the sample in question (determined by assessing similarity by “cov”), there may be different mechanisms at play. In this case, will instead search for nearest neighbors in the gene expression space if given.
fixed_bw: If True, bw is treated as a spatial distance for computing spatial weights. Otherwise, it is treated as the number of neighbors.
exclude_self: If True, ignore each sample itself when computing the kernel density estimation
kernel: The name of the kernel function to use. Valid options: “triangular”, “uniform”, “quadratic”, “bisquare”, “gaussian” or “exponential”
bw: Bandwidth for the spatial kernel
threshold: Threshold for the kernel density estimation. If the density is below this threshold, the density will be set to zero.
sparse_array: If True, the kernel will be converted to sparse array. Recommended for large datasets.
normalize_weights: If True, the weights will be normalized to sum to 1.
use_expression_neighbors: If True, will only use expression neighbors to determine the bandwidth.

Returns:

Array of weights for sample of interest

spateo.tools.find_neighbors.construct_nn_graph(adata: anndata.AnnData, spatial_key: str = 'spatial', dist_metric: str = 'euclidean', n_neighbors: int = 8, exclude_self: bool = True, make_symmetrical: bool = False, save_id: None | str = None) → None

Constructing bucket-to-bucket nearest neighbors graph.

Parameters:

adata: An anndata object.
spatial_key: Key in .obsm in which x- and y-coordinates are stored.
dist_metric: Distance metric to use. Options: ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulczynski1’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’.
n_neighbors: Number of nearest neighbors to compute for each bucket.
exclude_self: Set True to set elements along the diagonal to zero.
make_symmetrical: Set True to make sure adjacency matrix is symmetrical (i.e. ensure that if A is a neighbor of B, B is also included among the neighbors of A)
save_id: Optional string; if not None, will save distance matrix and neighbors matrix to './neighbors/{save_id}_distance.csv' and './neighbors/{save_id}_neighbors.csv', respectively.

spateo.tools.find_neighbors.neighbors(adata: anndata.AnnData, nbr_object: sklearn.neighbors.NearestNeighbors = None, basis: str = 'pca', spatial_key: str = 'spatial', n_neighbors_method: str = 'ball_tree', n_pca_components: int = 30, n_neighbors: int = 10) → Tuple[sklearn.neighbors.NearestNeighbors, anndata.AnnData]

Given an AnnData object, compute pairwise connectivity matrix in transcriptomic or physical space

Parameters:

adata: an anndata object.
nbr_object: An optional sklearn.neighbors.NearestNeighbors object. Can optionally create a nearest neighbor object with custom functionality.
basis: The space that will be used for nearest neighbor search (str, default ‘pca’). Valid names include, for example, ‘pca’, ‘umap’, or ‘X’ for gene expression neighbors, and ‘spatial’ for neighbors in the physical space.
spatial_key: Optional, can be used to specify .obsm entry in adata that contains spatial coordinates. Only used if basis is ‘spatial’.
n_neighbors_method: Specifies the algorithm to use in computing neighbors using sklearn’s implementation (str, default ‘ball_tree’). Options: “ball_tree” and “kd_tree”.
n_pca_components: Only used if ‘basis’ is ‘pca’. Sets number of principal components to compute (if PCA has not already been computed for this dataset).
n_neighbors: Number of neighbors for kneighbors queries.

Returns:

nbrs: Object of class sklearn.neighbors.NearestNeighbors
adata: Modified AnnData object
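Since the function returns a fitted `sklearn.neighbors.NearestNeighbors` object, it may help to see what such an object provides on its own. The sketch below uses scikit-learn directly on a small toy coordinate array. It does not call spateo, and the assumption that n_neighbors_method maps onto scikit-learn's `algorithm` argument is inferred from the parameter description rather than confirmed.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy coordinates standing in for an embedding or spatial array.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [10.0, 0.0]])

# Fit a ball-tree index and query the two nearest neighbors of every point.
nbrs = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(coords)
distances, indices = nbrs.kneighbors(coords)
```

When the query points are the fitted points themselves, the first neighbor of each point is the point itself at distance zero, so requesting k neighbors yields k - 1 distinct neighbors per point.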

spateo.tools.find_neighbors.calculate_affinity(position: numpy.ndarray, dist_metric: str = 'euclidean', n_neighbors: int = 10) → numpy.ndarray

Given array of x- and y-coordinates, compute affinity matrix between all samples using Euclidean distance. Math from: Zelnik-Manor, L., & Perona, P. (2004). Self-tuning spectral clustering. Advances in neural information processing systems, 17. https://proceedings.neurips.cc/paper/2004/file/40173ea48d9567f1f393b20c855bb40b-Paper.pdf
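The cited paper defines a locally scaled affinity A_ij = exp(-d_ij^2 / (sigma_i * sigma_j)), where sigma_i is the distance from point i to its K-th nearest neighbor. The sketch below implements that formula in NumPy. The helper `self_tuning_affinity` is hypothetical, using n_neighbors as K is an assumption, and the diagonal is left as computed; spateo's function may differ in these details.

```python
import numpy as np


def self_tuning_affinity(position: np.ndarray, n_neighbors: int) -> np.ndarray:
    """Hypothetical helper: locally scaled affinity after Zelnik-Manor & Perona (2004)."""
    diff = position[:, None, :] - position[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Column 0 of each sorted row is the point itself (distance 0), so column k
    # holds the distance to the k-th nearest other point. Duplicate points could
    # make sigma zero; this sketch does not guard against that.
    sigma = np.sort(dist, axis=1)[:, n_neighbors]
    return np.exp(-(dist ** 2) / (sigma[:, None] * sigma[None, :]))


position = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
affinity = self_tuning_affinity(position, n_neighbors=1)
```

Because each pair is scaled by both points' local distances, dense and sparse regions of the tissue get comparable affinity ranges without a single global bandwidth.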