spateo.tools.find_neighbors

Functions for finding nearest neighbors, the distances between them and the spatial weighting between points in spatial transcriptomics data.

Module Contents

Classes

Kernel

Spatial weights for regression models are learned using kernel functions.

Functions

calculate_distance(→ numpy.ndarray)

Given array of x- and y-coordinates, compute pairwise distances between all samples using Euclidean distance.

local_dist(coords_i, coords)

For single sample, compute distance between that sample and each other sample in the data.

jaccard_index(row_i, array)

Compute the Jaccard index between a row of a binary array and all other rows.

normalize_adj(→ numpy.ndarray)

Symmetrically normalize adjacency matrix, set diagonal to 1 and return processed adjacency array.

adj_to_knn(→ Tuple[numpy.ndarray, numpy.ndarray])

Given an adjacency matrix, convert to KNN graph.

knn_to_adj(→ scipy.sparse.csr_matrix)

Given the indices and weights of a KNN graph, convert to adjacency matrix.

compute_distances_and_connectivities(...)

Computes connectivity and sparse distance matrices

calculate_distances_chunk(→ numpy.ndarray)

Pairwise distance computation for one chunk of samples, coupled with find_bw_for_n_neighbors.

find_bw_for_n_neighbors(→ float)

Finds the bandwidth such that on average, cells in the sample have n neighbors.

find_threshold_distance(→ float)

Finds threshold distance beyond which there is a dramatic increase in the average distance to remaining nearest neighbors.

get_wi(→ scipy.sparse.csr_matrix)

Get spatial weights for an individual sample, given the coordinates of all samples in space.

construct_nn_graph(→ None)

Constructs a bucket-to-bucket nearest neighbors graph.

neighbors(→ Tuple[sklearn.neighbors.NearestNeighbors, ...)

Given an AnnData object, compute pairwise connectivity matrix in transcriptomic or physical space

calculate_affinity(→ numpy.ndarray)

Given array of x- and y-coordinates, compute affinity matrix between all samples using Euclidean distance.

spateo.tools.find_neighbors.calculate_distance(position: numpy.ndarray, dist_metric: str = 'euclidean') → numpy.ndarray

Given array of x- and y-coordinates, compute pairwise distances between all samples using Euclidean distance.
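The computation described above can be sketched with NumPy broadcasting. This is an illustration of pairwise Euclidean distance only, not spateo's implementation; the helper name `pairwise_euclidean` is hypothetical and the `dist_metric` argument is not modeled.

```python
import numpy as np


def pairwise_euclidean(position: np.ndarray) -> np.ndarray:
    """Hypothetical helper: dense (n_samples, n_samples) Euclidean distance matrix."""
    # Broadcast to shape (n_samples, n_samples, n_dims), then reduce over the last axis.
    diff = position[:, None, :] - position[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


position = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
dist = pairwise_euclidean(position)
```

The result is symmetric with a zero diagonal. Memory grows quadratically with the number of samples, which is one reason a chunked variant is useful for large datasets.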

spateo.tools.find_neighbors.local_dist(coords_i: numpy.ndarray, coords: numpy.ndarray)

For single sample, compute distance between that sample and each other sample in the data.

Parameters:
coords_i

Array of shape (n, ), where n is the dimensionality of the data; the coordinates of a single point

coords

Array of shape (m, n), where n is the dimensionality of the data and m is an arbitrary number of samples; the points whose distances from coords_i are computed.

Returns:

distances: array-like, shape (m, ), where m is an arbitrary number of samples. The pairwise distances between coords_i and each point in coords.

spateo.tools.find_neighbors.jaccard_index(row_i: numpy.ndarray, array: numpy.ndarray)

Compute the Jaccard index between a row of a binary array and all other rows.

Parameters:
row_i

1D binary array representing the row for which to compute the Jaccard index.

array

2D binary array containing the rows to compare against.

Returns:

jaccard_indices: 1D array of Jaccard indices between row_i and each row in array.
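For reference, the Jaccard index of two binary vectors is the size of their intersection divided by the size of their union. A minimal NumPy sketch follows; the helper name `jaccard_to_rows` is hypothetical, and returning 0 when both rows are empty is an assumption rather than documented behavior.

```python
import numpy as np


def jaccard_to_rows(row_i: np.ndarray, array: np.ndarray) -> np.ndarray:
    """Hypothetical helper: Jaccard index between row_i and every row of array."""
    row_i = row_i.astype(bool)
    array = array.astype(bool)
    intersection = np.logical_and(row_i, array).sum(axis=1)
    union = np.logical_or(row_i, array).sum(axis=1)
    # Assumption: an empty union (both rows all zero) yields an index of 0.
    out = np.zeros(len(array), dtype=float)
    return np.divide(intersection, union, out=out, where=union > 0)


array = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
])
jaccard_indices = jaccard_to_rows(array[0], array)
```

Identical rows score 1, disjoint rows score 0, and partially overlapping rows fall in between.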

spateo.tools.find_neighbors.normalize_adj(adj: numpy.ndarray, exclude_self: bool = True) → numpy.ndarray

Symmetrically normalize adjacency matrix, set diagonal to 1 and return processed adjacency array.

Parameters:
adj

Pairwise distance matrix of shape [n_samples, n_samples].

exclude_self

Set True to set diagonal of adjacency matrix to 1.

Returns:

adj_proc: The normalized adjacency matrix.
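Symmetric normalization is commonly written as D^(-1/2) A D^(-1/2), where D is the diagonal degree matrix. The sketch below assumes that form and then sets the diagonal to 1, as the summary describes. The helper name `symmetric_normalize` is hypothetical and the `exclude_self` flag is not modeled.

```python
import numpy as np


def symmetric_normalize(adj: np.ndarray) -> np.ndarray:
    """Hypothetical helper: D^(-1/2) A D^(-1/2), then diagonal set to 1."""
    adj = np.asarray(adj, dtype=float)
    degree = adj.sum(axis=1)
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.power(degree, -0.5)
    # Isolated nodes have degree 0; map the resulting inf to 0.
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0
    adj_norm = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    np.fill_diagonal(adj_norm, 1.0)
    return adj_norm


adj = np.array([
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
])
adj_proc = symmetric_normalize(adj)
```

Each off-diagonal entry is scaled by the square roots of both endpoint degrees, so a symmetric input stays symmetric.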

spateo.tools.find_neighbors.adj_to_knn(adj: numpy.ndarray, n_neighbors: int = 15) → Tuple[numpy.ndarray, numpy.ndarray]

Given an adjacency matrix, convert to KNN graph.

Parameters:
adj

Adjacency matrix of shape (n_samples, n_samples)

n_neighbors

Number of nearest neighbors to include in the KNN graph

Returns:

indices: Array (n_samples x n_neighbors) storing the indices for each node’s nearest neighbors in the knn graph.

weights: Array (n_samples x n_neighbors) storing the edge weights for each node’s nearest neighbors in the knn graph.
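One way to picture the conversion is as a per-row top-k selection. The sketch below assumes that adjacency entries are similarity weights (larger means closer) and that self-edges are ignored. Both are assumptions; if the matrix holds distances, the sort order would be ascending instead. The helper name `top_k_neighbors` is hypothetical.

```python
import numpy as np


def top_k_neighbors(adj: np.ndarray, n_neighbors: int):
    """Hypothetical helper: keep the n_neighbors largest weights in each row."""
    adj = np.asarray(adj, dtype=float).copy()
    # Assumption: a node is never its own neighbor.
    np.fill_diagonal(adj, -np.inf)
    indices = np.argsort(-adj, axis=1)[:, :n_neighbors]
    weights = np.take_along_axis(adj, indices, axis=1)
    return indices, weights


adj = np.array([
    [0.0, 0.9, 0.1, 0.5],
    [0.9, 0.0, 0.4, 0.2],
    [0.1, 0.4, 0.0, 0.8],
    [0.5, 0.2, 0.8, 0.0],
])
indices, weights = top_k_neighbors(adj, n_neighbors=2)
```

Both outputs have shape (n_samples, n_neighbors), matching the return description above.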

spateo.tools.find_neighbors.knn_to_adj(knn_indices: numpy.ndarray, knn_weights: numpy.ndarray) → scipy.sparse.csr_matrix

Given the indices and weights of a KNN graph, convert to adjacency matrix.

Parameters:
knn_indices

Array (n_samples x n_neighbors) storing the indices for each node’s nearest neighbors in the knn graph.

knn_weights

Array (n_samples x n_neighbors) storing the edge weights for each node’s nearest neighbors in the knn graph.

Returns:

adj: The adjacency matrix corresponding to the KNN graph
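The reverse conversion scatters each row's neighbor weights into an (n_samples, n_samples) matrix. The documented function returns a scipy.sparse.csr_matrix; the sketch below builds a dense NumPy array instead to stay dependency-free, and the helper name `knn_to_dense_adj` is hypothetical.

```python
import numpy as np


def knn_to_dense_adj(knn_indices: np.ndarray, knn_weights: np.ndarray) -> np.ndarray:
    """Hypothetical helper: dense adjacency matrix from KNN indices and weights."""
    n_samples, n_neighbors = knn_indices.shape
    adj = np.zeros((n_samples, n_samples), dtype=float)
    # Row index of every (sample, neighbor) pair, flattened to match ravel().
    rows = np.repeat(np.arange(n_samples), n_neighbors)
    adj[rows, knn_indices.ravel()] = knn_weights.ravel()
    return adj


knn_indices = np.array([[1, 2], [0, 2], [1, 0]])
knn_weights = np.array([[0.5, 0.2], [0.5, 0.3], [0.3, 0.2]])
adj = knn_to_dense_adj(knn_indices, knn_weights)
```

Row i of the result is nonzero only at the columns listed in knn_indices[i].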

spateo.tools.find_neighbors.compute_distances_and_connectivities(knn_indices: numpy.ndarray, distances: numpy.ndarray) → Tuple[scipy.sparse.csr_matrix, scipy.sparse.csr_matrix]

Computes connectivity and sparse distance matrices

Parameters:
knn_indices

Array of shape (n_samples, n_samples) containing the indices of the nearest neighbors for each sample.

distances

The distances to the n_neighbors closest points in the KNN graph

Returns:

distances: Sparse distance matrix

connectivities: Sparse connectivity matrix

spateo.tools.find_neighbors.calculate_distances_chunk(coords_chunk: numpy.ndarray, chunk_start_idx: int, coords: numpy.ndarray, n_nonzeros: dict | None = None, metric: str = 'euclidean') → numpy.ndarray

Pairwise distance computation for one chunk of samples, coupled with find_bw_for_n_neighbors.

Parameters:
coords_chunk

Array of shape (n_samples_chunk, n_features) containing coordinates of the chunk of interest.

chunk_start_idx

Index of the first sample in the chunk. Required if n_nonzeros is not None.

coords

Array of shape (n_samples, n_features) containing the coordinates of all points.

n_nonzeros

Optional dictionary containing the number of non-zero columns for each row in the distance matrix.

metric

Distance metric to use for pairwise distance computation, can be any of the metrics supported by sklearn.metrics.pairwise_distances.
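The chunked pattern can be sketched as follows: distances are computed for a block of rows against all coordinates, so only a (chunk_size, n_samples) array is held in memory at a time. This is a plain-NumPy illustration using the Euclidean metric only; the helper name `distances_for_chunk` is hypothetical, and the `n_nonzeros` normalization is not modeled.

```python
import numpy as np


def distances_for_chunk(coords_chunk: np.ndarray, coords: np.ndarray) -> np.ndarray:
    """Hypothetical helper: Euclidean distances from a chunk of rows to all rows."""
    diff = coords_chunk[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


coords = np.arange(20, dtype=float).reshape(10, 2)
chunk_size = 4

# Process rows 0-3, 4-7, 8-9 separately; stacking is only done here to check the result.
blocks = [
    distances_for_chunk(coords[start:start + chunk_size], coords)
    for start in range(0, len(coords), chunk_size)
]
full = np.vstack(blocks)
```

In practice each block would be reduced (for example, to neighbor counts) before moving on, rather than stacked into the full matrix.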

spateo.tools.find_neighbors.find_bw_for_n_neighbors(adata: anndata.AnnData, coords_key: str = 'spatial', n_anchors: int | None = None, target_n_neighbors: int = 6, initial_bw: float | None = None, chunk_size: int = 1000, exclude_self: bool = False, normalize_distances: bool = False, verbose: bool = True, max_iterations: int = 100, alpha: float = 0.5) → float

Finds the bandwidth such that on average, cells in the sample have n neighbors.

Parameters:
adata

AnnData object containing coordinates for all cells

coords_key

Key in adata.obsm where the spatial coordinates are stored

target_n_neighbors

Target average number of neighbors per cell

initial_bw

Can optionally be used to set the starting distance for the bandwidth search

chunk_size

Number of cells to compute pairwise distance for at once

exclude_self

Whether to exclude self from the list of neighbors

normalize_distances

Whether to normalize the distances by the number of nonzero columns (should be used only if the entry in .obsm[coords_key] contains something other than x-, y-, z-coordinates).

verbose

Whether to print the bandwidth at each iteration. If False, will only print the final bandwidth.

max_iterations

Will stop the process and return the bandwidth that results in the closest number of neighbors to the specified target if it takes more than this number of iterations.

alpha

Factor used in determining the new bandwidth: the ratio of found neighbors to target neighbors is raised to this power.

Returns:

bandwidth: Bandwidth in distance units
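The parameters above suggest an iterative search in which the bandwidth is rescaled by the ratio of found to target neighbors, raised to the power alpha. The sketch below is one plausible reading of that description, not spateo's implementation: it works on a plain coordinate array rather than an AnnData object, always excludes self, and the exact update rule (dividing the bandwidth by ratio ** alpha) is an assumption. The helper name `search_bandwidth` is hypothetical.

```python
import numpy as np


def search_bandwidth(coords, target_n_neighbors=6, initial_bw=1.0,
                     alpha=0.5, max_iterations=100):
    """Hypothetical helper: find a radius giving about target_n_neighbors on average."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # never count a cell as its own neighbor

    bw = float(initial_bw)
    best_bw, best_gap = bw, np.inf
    for _ in range(max_iterations):
        avg_neighbors = (dist <= bw).sum(axis=1).mean()
        gap = abs(avg_neighbors - target_n_neighbors)
        if gap < best_gap:
            best_bw, best_gap = bw, gap
        if gap == 0:
            break
        # Assumed update: too many neighbors (ratio > 1) shrinks the bandwidth,
        # too few (ratio < 1) grows it; alpha damps the step.
        ratio = max(avg_neighbors, 1e-3) / target_n_neighbors
        bw = bw / ratio ** alpha
    # If the target is never hit exactly, return the closest bandwidth seen.
    return best_bw


# A 10 x 10 unit grid: neighbor counts change only at radii 1, sqrt(2), 2, ...
xx, yy = np.meshgrid(np.arange(10.0), np.arange(10.0))
coords = np.column_stack([xx.ravel(), yy.ravel()])
bw = search_bandwidth(coords, target_n_neighbors=6, initial_bw=1.0)
```

On a regular grid the average neighbor count is a step function of the radius, so the target may be unreachable; returning the closest bandwidth seen mirrors the max_iterations behavior described above.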

spateo.tools.find_neighbors.find_threshold_distance(adata: anndata.AnnData, coords_key: str = 'X_pca', n_neighbors: int = 10, chunk_size: int = 1000, normalize_distances: bool = False) → float

Finds threshold distance beyond which there is a dramatic increase in the average distance to remaining nearest neighbors.

Parameters:
adata

AnnData object containing coordinates for all cells

coords_key

Key in adata.obsm where the spatial coordinates are stored

n_neighbors

Will first compute the number of nearest neighbors as a comparison for querying additional distance values.

chunk_size

Number of cells to compute pairwise distance for at once

normalize_distances

Whether to normalize the distances by the number of nonzero columns (should be used only if the entry in .obsm[coords_key] contains something other than x-, y-, z-coordinates).

Returns:

bandwidth: Bandwidth in distance units

class spateo.tools.find_neighbors.Kernel(i: int, data: numpy.ndarray | scipy.sparse.spmatrix, bw: int | float, cov: numpy.ndarray | None = None, ct: numpy.ndarray | None = None, expr_mat: numpy.ndarray | None = None, fixed: bool = True, exclude_self: bool = False, function: str = 'triangular', threshold: float = 1e-05, eps: float = 1.0000001, sparse_array: bool = False, normalize_weights: bool = False, use_expression_neighbors: bool = False)

Bases: object

Spatial weights for regression models are learned using kernel functions.

Args:

i: Index of the point for which to estimate the density

data: Array of shape (n_samples, n_features) representing the data. If aiming to derive weights from spatial distance, this should be the array of spatial positions.

bw: Bandwidth parameter for the kernel density estimation

cov: Optional array of shape (n_samples, ). Can be used to adjust the distance calculation to look only at samples of interest vs. samples not of interest, which is determined from nonzero values in this vector. This can be used to modify the modeling process based on factors thought to reflect biological differences, for example, to condition on histological classification, passing a distance threshold, etc. If ‘ct’ is also given, will look for samples of interest that are also of the same cell type.

ct: Optional array of shape (n_samples, ), containing vector where cell types are encoded as integers. Can be used to condition nearest neighbor finding on cell type or other category.

expr_mat: Can be used together with ‘cov’ (so will only be used if ‘cov’ is not None). If the spatial neighbors are not consistent with the sample in question (determined by assessing similarity by “cov”), there may be different mechanisms at play. In this case, will instead search for nearest neighbors in the gene expression space if given.

fixed: If True, bw is treated as a fixed bandwidth parameter. Otherwise, it is treated as the number of nearest neighbors to include in the bandwidth estimation.

exclude_self: If True, ignore each sample itself when computing the kernel density estimation

function: The name of the kernel function to use. Valid options are as follows (note that in equations, any “1” can be substituted for any other value(s)):

  • ‘triangular’: linearly decaying kernel,
    :math:`K(u) = \begin{cases} 1-|u| & \text{if } |u| \leq 1 \\ 0 & \text{otherwise} \end{cases}`,

  • ‘quadratic’: quadratically decaying kernel,
    :math:`K(u) = \begin{cases} \dfrac{3}{4}(1-u^2) \end{cases}`,

  • ‘gaussian’: decays following normal distribution,
    :math:`K(u) = \dfrac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}u^2}`,

  • ‘uniform’: AKA the tophat kernel, sets weight of all observations within the bandwidth to the same value,
    :math:`K(u) = \begin{cases} 1 & \text{if } |u| \leq 1 \\ 0 & \text{otherwise} \end{cases}`,

  • ‘exponential’: exponentially decaying kernel,
    :math:`K(u) = e^{-|u|}`,

  • ‘bisquare’: assigns a weight of zero to observations outside of the bandwidth, and decays within the bandwidth following equation
    :math:`K(u) = \begin{cases} \dfrac{15}{16}(1-u^2)^2 & \text{if } |u| \leq 1 \\ 0 & \text{otherwise} \end{cases}`.

threshold: Threshold for the kernel density estimation. If the density is below this threshold, the density will be set to zero.

eps: Error-correcting factor to avoid division by zero

sparse_array: If True, the kernel will be converted to sparse array. Recommended for large datasets.

normalize_weights: If True, the weights will be normalized to sum to 1.

use_expression_neighbors: If True, will only use the expression matrix to find nearest neighbors.
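The kernel formulas listed above can be written directly in NumPy. The sketch below is illustrative and is not the Kernel class itself: it assumes the kernel argument is u = distance / bw, and the names `KERNELS` and `kernel_weights` are hypothetical. The quadratic kernel is implemented exactly as listed, without a cutoff branch.

```python
import numpy as np

# Each entry maps a non-negative scaled distance u to a weight, following the listed formulas.
KERNELS = {
    "triangular": lambda u: np.where(u <= 1, 1 - u, 0.0),
    # As listed, the quadratic kernel has no "otherwise" branch.
    "quadratic": lambda u: 0.75 * (1 - u ** 2),
    "gaussian": lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi),
    "uniform": lambda u: np.where(u <= 1, 1.0, 0.0),
    "exponential": lambda u: np.exp(-u),
    "bisquare": lambda u: np.where(u <= 1, (15 / 16) * (1 - u ** 2) ** 2, 0.0),
}


def kernel_weights(distances, bw, function="triangular"):
    """Hypothetical helper: weights for distances scaled by the bandwidth bw."""
    # Assumption: u is the absolute distance divided by the bandwidth.
    u = np.abs(np.asarray(distances, dtype=float)) / bw
    return KERNELS[function](u)


distances = np.array([0.0, 50.0, 100.0, 200.0])
w_triangular = kernel_weights(distances, bw=100.0, function="triangular")
w_bisquare = kernel_weights(distances, bw=100.0, function="bisquare")
w_gaussian = kernel_weights(distances, bw=100.0, function="gaussian")
```

The triangular, uniform and bisquare kernels drop to exactly zero beyond the bandwidth, whereas the gaussian and exponential kernels only decay, which is where a threshold such as the one described above becomes relevant.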

_kernel_functions(x)

spateo.tools.find_neighbors.get_wi(i: int, n_samples: int, coords: numpy.ndarray, cov: numpy.ndarray | None = None, ct: numpy.ndarray | None = None, expr_mat: numpy.ndarray | None = None, fixed_bw: bool = True, exclude_self: bool = False, kernel: str = 'gaussian', bw: float | int = 100, threshold: float = 1e-05, sparse_array: bool = False, normalize_weights: bool = False, use_expression_neighbors: bool = False) → scipy.sparse.csr_matrix

Get spatial weights for an individual sample, given the coordinates of all samples in space.

Parameters:
i

Index of sample for which weights are to be calculated to all other samples in the dataset

n_samples

Total number of samples in the dataset

coords

Array of shape (n_samples, 2) or (n_samples, 3) representing the spatial coordinates of each sample

cov

Optional array of shape (n_samples, ). Can be used to adjust the distance calculation to look only at samples of interest vs. samples not of interest, which is determined from nonzero values in this vector. This can be used to modify the modeling process based on factors thought to reflect biological differences, for example, to condition on histological classification, passing a distance threshold, etc. If ‘ct’ is also given, will look for samples of interest that are also of the same cell type.

ct

Optional array of shape (n_samples, ), containing vector where cell types are encoded as integers. Can be used to condition nearest neighbor finding on cell type or other category.

expr_mat

Can be used together with ‘cov’. If the spatial neighbors are not consistent with the sample in question (determined by assessing similarity by “cov”), there may be different mechanisms at play. In this case, will instead search for nearest neighbors in the gene expression space if given.

fixed_bw

If True, bw is treated as a spatial distance for computing spatial weights. Otherwise, it is treated as the number of neighbors.

exclude_self

If True, ignore each sample itself when computing the kernel density estimation

kernel

The name of the kernel function to use. Valid options: “triangular”, “uniform”, “quadratic”, “bisquare”, “gaussian” or “exponential”

bw

Bandwidth for the spatial kernel

threshold

Threshold for the kernel density estimation. If the density is below this threshold, the density will be set to zero.

sparse_array

If True, the kernel will be converted to sparse array. Recommended for large datasets.

normalize_weights

If True, the weights will be normalized to sum to 1.

use_expression_neighbors

If True, will only use expression neighbors to determine the bandwidth.

Returns:

wi: Array of weights for sample of interest
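To make the roles of bw, threshold and normalize_weights concrete, here is a minimal sketch of the fixed-bandwidth case with a Gaussian kernel. It is an assumption-laden illustration, not the library's code: it returns a dense (1, n_samples) array instead of a sparse matrix, ignores cov, ct and expr_mat, and does not cover the case where bw is a neighbor count. The helper name `weights_for_sample` is hypothetical.

```python
import numpy as np


def weights_for_sample(i, coords, bw=100.0, exclude_self=False,
                       threshold=1e-5, normalize_weights=False):
    """Hypothetical helper: Gaussian spatial weights from sample i to all samples."""
    # Distance from sample i to every sample (including itself).
    dist = np.sqrt(((coords - coords[i]) ** 2).sum(axis=1))
    u = dist / bw  # assumption: distances are scaled by a fixed bandwidth
    w = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    if exclude_self:
        w[i] = 0.0
    # Weights below the threshold are set to zero.
    w[w < threshold] = 0.0
    if normalize_weights and w.sum() > 0:
        w = w / w.sum()
    return w.reshape(1, -1)


coords = np.array([[0.0, 0.0], [100.0, 0.0], [1000.0, 0.0]])
wi = weights_for_sample(0, coords, bw=100.0, normalize_weights=True)
```

The distant third sample falls below the threshold and receives zero weight, which is what makes a sparse representation worthwhile on large datasets.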

spateo.tools.find_neighbors.construct_nn_graph(adata: anndata.AnnData, spatial_key: str = 'spatial', dist_metric: str = 'euclidean', n_neighbors: int = 8, exclude_self: bool = True, make_symmetrical: bool = False, save_id: None | str = None) → None

Constructs a bucket-to-bucket nearest neighbors graph.

Parameters:
adata

An anndata object.

spatial_key

Key in .obsm in which x- and y-coordinates are stored.

dist_metric

Distance metric to use. Options: ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulczynski1’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’.

n_neighbors

Number of nearest neighbors to compute for each bucket.

exclude_self

Set True to set elements along the diagonal to zero.

make_symmetrical

Set True to make sure adjacency matrix is symmetrical (i.e. ensure that if A is a neighbor of B, B is also included among the neighbors of A)

save_id

Optional string; if not None, will save distance matrix and neighbors matrix to './neighbors/{save_id}_distance.csv' and './neighbors/{save_id}_neighbors.csv', respectively.
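The effect of make_symmetrical is easiest to see on a tiny example: a k-nearest-neighbor relation is directed, since an outlier's nearest neighbor does not necessarily list the outlier in return. The sketch below builds a binary adjacency matrix from raw coordinates with NumPy and symmetrizes it by taking the elementwise maximum with its transpose. That symmetrization rule and the helper name `knn_adjacency` are assumptions for illustration, and nothing is written to an AnnData object or to disk.

```python
import numpy as np


def knn_adjacency(coords, n_neighbors=1, make_symmetrical=False):
    """Hypothetical helper: binary KNN adjacency matrix with a zero diagonal."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # exclude self from the neighbor search
    nearest = np.argsort(dist, axis=1)[:, :n_neighbors]

    adj = np.zeros(dist.shape, dtype=int)
    rows = np.repeat(np.arange(len(coords)), n_neighbors)
    adj[rows, nearest.ravel()] = 1
    if make_symmetrical:
        # Assumed rule: if A lists B, make sure B also lists A.
        adj = np.maximum(adj, adj.T)
    return adj


# The last point is an outlier: its nearest neighbor is point 2, but not vice versa.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.5, 0.0], [10.0, 0.0]])
adj_directed = knn_adjacency(coords, n_neighbors=1)
adj_symmetric = knn_adjacency(coords, n_neighbors=1, make_symmetrical=True)
```

After symmetrization some rows contain more than n_neighbors entries, which is the expected trade-off of enforcing mutual edges.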

spateo.tools.find_neighbors.neighbors(adata: anndata.AnnData, nbr_object: sklearn.neighbors.NearestNeighbors = None, basis: str = 'pca', spatial_key: str = 'spatial', n_neighbors_method: str = 'ball_tree', n_pca_components: int = 30, n_neighbors: int = 10) → Tuple[sklearn.neighbors.NearestNeighbors, anndata.AnnData]

Given an AnnData object, compute pairwise connectivity matrix in transcriptomic or physical space

Parameters:
adata

an anndata object.

nbr_object

An optional sklearn.neighbors.NearestNeighbors object. Can optionally create a nearest neighbor object with custom functionality.

basis

str, default ‘pca’. The space that will be used for nearest neighbor search. Valid names include, for example, pca, umap, or X for gene expression neighbors, and ‘spatial’ for neighbors in the physical space.

spatial_key

Optional, can be used to specify .obsm entry in adata that contains spatial coordinates. Only used if basis is ‘spatial’.

n_neighbors_method

str, default ‘ball_tree’. Specifies algorithm to use in computing neighbors using sklearn’s implementation. Options: “ball_tree” and “kd_tree”.

n_pca_components

Only used if ‘basis’ is ‘pca’. Sets number of principal components to compute (if PCA has not already been computed for this dataset).

n_neighbors

Number of neighbors for kneighbors queries.

Returns:

nbrs: Object of class sklearn.neighbors.NearestNeighbors

adata: Modified AnnData object

spateo.tools.find_neighbors.calculate_affinity(position: numpy.ndarray, dist_metric: str = 'euclidean', n_neighbors: int = 10) → numpy.ndarray

Given array of x- and y-coordinates, compute affinity matrix between all samples using Euclidean distance. Math from: Zelnik-Manor, L., & Perona, P. (2004). Self-tuning spectral clustering. Advances in neural information processing systems, 17. https://proceedings.neurips.cc/paper/2004/file/40173ea48d9567f1f393b20c855bb40b-Paper.pdf
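In the cited paper, each point i gets a local scale sigma_i equal to its distance to its K-th nearest neighbor, and the affinity is A_ij = exp(-d_ij^2 / (sigma_i * sigma_j)) with A_ii = 0. The sketch below implements that formula in NumPy. It assumes n_neighbors plays the role of K and that points are distinct (so no sigma is zero); the helper name `self_tuning_affinity` is hypothetical, and this is not necessarily how calculate_affinity is implemented.

```python
import numpy as np


def self_tuning_affinity(position: np.ndarray, n_neighbors: int = 2) -> np.ndarray:
    """Hypothetical helper: locally scaled affinity in the style of Zelnik-Manor & Perona."""
    diff = position[:, None, :] - position[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the self-distance (0) and
    # column k is the distance to the k-th nearest neighbor.
    sigma = np.sort(dist, axis=1)[:, n_neighbors]
    affinity = np.exp(-(dist ** 2) / (sigma[:, None] * sigma[None, :]))
    np.fill_diagonal(affinity, 0.0)
    return affinity


position = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
affinity = self_tuning_affinity(position, n_neighbors=2)
```

Because the scale adapts per point, the isolated fourth sample still has non-negligible affinity to the cluster, which a single global bandwidth would tend to suppress.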