`spateo.tools.ST_regression.spatial_regression`#

Suite of tools for spatially-aware as well as spatially-lagged linear regression

Also performs downstream characterization following spatially-informed regression to characterize niche impact on gene expression

Note: in the current setup, each of the spatial regression classes can be called either through cell_interaction (e.g. st.cell_interaction.NicheModel) or standalone (e.g. st.NicheModel). The same is true for all functions besides the general regression ones (e.g. fit_glm, which must be called with st.fit_glm).

Module Contents#

Classes#

`Base_Model`	Basis class for all spatially-aware and spatially-lagged regression models that can be implemented through this
`Category_Model`	Wraps all necessary methods for data loading and preparation, model initialization, parameterization,
`Niche_Model`	Wraps all necessary methods for data loading and preparation, model initialization, parameterization,
`Lagged_Model`	Wraps all necessary methods for data loading and preparation, model initialization, parameterization,
`Niche_LR_Model`	Wraps all necessary methods for data loading and preparation, model initialization, parameterization,

Functions#

calc_1nd_moment(X, W, normalize_W=True) → Tuple[numpy.ndarray, Optional[numpy.ndarray]]

class spateo.tools.ST_regression.spatial_regression.Base_Model(adata: anndata.AnnData, spatial_key: str = 'spatial', distr: Union[None, Literal[gaussian, poisson, softplus, neg-binomial, gamma]] = None, group_key: Union[None, str] = None, genes: Union[None, List] = None, drop_dummy: Union[None, str] = None, layer: Union[None, str] = None, cci_dir: Union[None, str] = None, normalize: bool = True, smooth: bool = False, log_transform: bool = False, niche_compute_indicator: bool = True, weights_mode: str = 'knn', data_id: Union[None, str] = None, **kwargs)[source]#

Basis class for all spatially-aware and spatially-lagged regression models that can be implemented through this toolkit. Includes necessary methods for data loading and preparation, computation of spatial weights matrices, computation of evaluation metrics and more.

Parameters

adata

object of class anndata.AnnData

group_key

Key in .obs where group (e.g. cell type) information can be found

spatial_key

Key in .obsm where x- and y-coordinates are stored

distr

Optionally provide the distribution family to specify the type of model that should be fit at the time of initializing this class, rather than after calling GLMCV_fit_predict. Can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”. Case sensitive.

genes

Subset to genes of interest: will be used as dependent variables in non-ligand-based regression analyses, will be independent variables in ligand-based regression analyses

drop_dummy

Name of the category to be dropped (the “dummy variable”) in the regression. The dummy category can aid in interpretation as all model coefficients can be taken to be in reference to the dummy category. If None, will randomly select a few samples to constitute the dummy group.

layer

Entry in .layers to use instead of .X when fitting the model. All other operations will use .X.

cci_dir

Full path to the directory containing cell-cell communication databases. Only used in the case of models that use ligands for prediction.

normalize

Perform library size normalization, to set total counts in each cell to the same number (adjust for cell size)

smooth

To correct for dropout effects, leverage gene expression neighborhoods to smooth expression

log_transform

Set True if log-transformation should be applied to expression (otherwise, will assume preprocessing/log-transform was computed beforehand)

niche_compute_indicator

Only used if ‘mod_type’ is “niche” or “niche_lr”. If True, for the “niche” model, for the connections array encoding the cell type-cell type interactions that occur within each niche, threshold all nonzero values to 1, to reflect the presence of a pairwise cell type interaction. Otherwise, will fit on the normalized number of pairwise interactions within each niche. For the “niche_lr” model, for the cell type pair interactions array, threshold all nonzero values to 1 to reflect the presence of an interaction between the two cell types within each niche. Otherwise, will fit on normalized data.

weights_mode

Options “knn”, “kernel”, “band”; sets whether to use K-nearest neighbors, a kernel-based method, or distance band to compute spatial weights, respectively.

data_id

If given, will save pairwise distance arrays & nearest neighbor arrays to folder in the working directory, under ‘./neighbors/{data_id}_distance.csv’ and ‘./neighbors/{data_id}_adj.csv’. Will also check for existing files under these names to avoid re-computing these arrays. If not given, will not save.

kwargs

Provides additional spatial weight-finding arguments. Note that these must specifically match the name that the function will look for (case sensitive). For reference:

n_neighbors: int
Number of nearest neighbors for KNN

p: int
Minkowski p-norm for KNN and distance band methods

distance_metric: str
Pairwise distance metric for KNN

bandwidth: float or array-like of floats
Sets kernel width for kernel method

fixed: bool
Allow bandwidth to vary across observations for kernel method

n_neighbors_bandwidth: int
Number of nearest neighbors for determining bandwidth for kernel method

kernel_function: str
“triangular”, “uniform”, “quadratic”, “quartic” or “gaussian”. Rule for setting how spatial weight decays with distance

threshold: float
Distance for which to consider spots “neighbors” for each spot in distance band method (typically in units of pixels)

alpha: float
Should be less than 0; can be used to set weights to decay with distance for distance band method
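As a rough illustration of how the weights_mode options and their keyword arguments relate to one another, the sketch below builds KNN and distance-band weights in plain Python. This is not Spateo's implementation: the helper names (minkowski, knn_weights, band_weights) are hypothetical, and both the binary KNN weights and the d ** alpha decay are assumptions inferred from the parameter descriptions above.

```python
from typing import List, Sequence, Tuple


def minkowski(a: Sequence[float], b: Sequence[float], p: int = 2) -> float:
    """Minkowski p-norm distance between two coordinate vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_weights(
    coords: Sequence[Tuple[float, float]], n_neighbors: int, p: int = 2
) -> List[List[float]]:
    """Binary weights: 1.0 for each of the n_neighbors closest spots (assumed)."""
    n = len(coords)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Rank every other spot by its distance to spot i.
        ranked = sorted(
            (minkowski(coords[i], coords[j], p), j) for j in range(n) if j != i
        )
        for _, j in ranked[:n_neighbors]:
            weights[i][j] = 1.0
    return weights


def band_weights(
    coords: Sequence[Tuple[float, float]],
    threshold: float,
    alpha: float = -1.0,
    p: int = 2,
) -> List[List[float]]:
    """Weights d ** alpha for spots within `threshold` (assumed decay rule)."""
    n = len(coords)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = minkowski(coords[i], coords[j], p)
            if 0.0 < d <= threshold:
                weights[i][j] = d ** alpha  # alpha < 0, so closer spots weigh more
    return weights


coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
W_knn = knn_weights(coords, n_neighbors=2)
W_band = band_weights(coords, threshold=2.5, alpha=-1.0)
```

In this toy layout the last spot still receives two KNN neighbors but no distance-band neighbors, which is the practical difference between the two modes for isolated spots.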

preprocess_data(normalize: Union[None, bool] = None, smooth: Union[None, bool] = None, log_transform: Union[None, bool] = None)[source]#

Normalization and transformation of input data. Can manually specify whether to normalize, smooth, or log-transform the data. Any arguments not given this way will default to the values passed on instantiation of the model object.

Returns: None, all preprocessing operates inplace on the object’s input AnnData.
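To make the normalize and log_transform options concrete, here is a minimal sketch of library size normalization followed by a log transform. It is only an illustration under stated assumptions: normalize_total and log1p_transform are hypothetical helper names, the fallback target (mean library size) and the use of log(1 + x) are assumptions, and the sketch returns new lists rather than operating in place on an AnnData object.

```python
import math
from typing import List, Optional


def normalize_total(
    counts: List[List[float]], target_sum: Optional[float] = None
) -> List[List[float]]:
    """Scale each cell (row) so that its total counts equal target_sum."""
    totals = [sum(row) for row in counts]
    if target_sum is None:
        # Assumption: fall back to the mean library size across cells.
        target_sum = sum(totals) / len(totals)
    return [
        [value * target_sum / total if total > 0 else 0.0 for value in row]
        for row, total in zip(counts, totals)
    ]


def log1p_transform(matrix: List[List[float]]) -> List[List[float]]:
    """Apply log(1 + x) elementwise (assumed form of the log transform)."""
    return [[math.log1p(value) for value in row] for row in matrix]


# Two cells with the same composition but a tenfold difference in depth.
counts = [[1.0, 3.0], [10.0, 30.0]]
normalized = normalize_total(counts, target_sum=10.0)
logged = log1p_transform(normalized)
```

After normalization the two cells become identical, which is the "adjust for cell size" effect described for the normalize argument.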

prepare_data(mod_type: str = 'category', lig: Union[None, List[str]] = None, rec: Union[None, List[str]] = None, niche_lr_r_lag: bool = True, use_ds: bool = True, rec_ds: Union[None, List[str]] = None, species: Literal[human, mouse, axolotl] = 'human')[source]#

Handles any necessary data preparation, starting from given source AnnData object

Parameters

mod_type

The type of model that will be employed; this dictates how the data will be processed and prepared. Options:

category: spatially-aware; for each sample, computes category prevalence within the spatial neighborhood and uses these as independent variables

niche: spatially-aware; uses spatial connections between samples as independent variables

ligand_lag: spatially-lagged; from database, uses select ligand genes to perform regression on select receptor and/or receptor-downstream genes, and additionally considers neighbor expression of the ligands

niche_lr: spatially-aware; uses a coupling of spatial category connections, ligand expression and receptor expression to perform regression on select receptor-downstream genes

lig

Only used if ‘mod_type’ contains “ligand”. Provides the list of ligands to use as predictors. If not given, will attempt to subset self.genes

rec

Only used if ‘mod_type’ contains “ligand”. Provides the list of receptors to investigate. If not given, will search through database for all genes that correspond to the provided genes from ‘lig’.

niche_lr_r_lag

Only used if ‘mod_type’ is “niche_lr”. Uses the spatial lag of the receptor as the dependent variable rather than each spot’s unique receptor expression. Defaults to True.

use_ds

If True, uses receptor-downstream genes in addition to ligands and receptors.

rec_ds

Only used if ‘mod_type’ is “niche_lr” or “ligand_lag”. Can be used to optionally manually define a list of genes shown to be (or thought to potentially be) downstream of one or more of the provided L:R pairs. If not given, will find receptor-downstream genes from database based on input to ‘lig’ and ‘rec’.

species

Selects the cell-cell communication database the relevant ligands will be drawn from. Options: “human”, “mouse”, “axolotl”.
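The “category” option describes using category prevalence within each spatial neighborhood as the independent variables. A minimal sketch of that idea, assuming a precomputed neighbor list per sample, is given below. The function names and the toy inputs are hypothetical; to_indicator mirrors the thresholding described for niche_compute_indicator but is likewise only an assumption about how such a step could look.

```python
from typing import List, Sequence


def category_prevalence(
    labels: Sequence[str],
    neighbors: Sequence[Sequence[int]],
    categories: Sequence[str],
) -> List[List[float]]:
    """Fraction of each category among the spatial neighbors of every sample."""
    index = {category: k for k, category in enumerate(categories)}
    rows = []
    for nbrs in neighbors:
        counts = [0.0] * len(categories)
        for j in nbrs:
            counts[index[labels[j]]] += 1.0
        total = len(nbrs)
        rows.append([c / total if total else 0.0 for c in counts])
    return rows


def to_indicator(matrix: List[List[float]]) -> List[List[float]]:
    """Threshold all nonzero values to 1 (presence/absence encoding)."""
    return [[1.0 if value != 0 else 0.0 for value in row] for row in matrix]


labels = ["A", "B", "B", "A"]
neighbors = [[1, 2], [0, 2], [0, 1], [0, 1]]  # neighbor indices per sample
prevalence = category_prevalence(labels, neighbors, categories=["A", "B"])
indicator = to_indicator(prevalence)
```

Each row of prevalence is one sample's design-matrix row; the indicator version keeps only whether a category is present in the neighborhood.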

compute_spatial_weights()[source]#: Generates matrix of pairwise spatial distances, used in spatially-lagged models

GLMCV_fit_predict(gs_params: Union[None, dict] = None, n_gs_cv: Union[None, int] = None, n_jobs: int = 30, cat_key: Union[None, str] = None, categories: Union[None, str, List[str]] = None, **kwargs) → Tuple[pandas.DataFrame, pandas.DataFrame][source]#

Wrapper for fitting predictive generalized linear regression model.

Parameters

gs_params: Optional dictionary where keys are variable names for the regressor and values are lists of potential values for which to find the best combination using grid search. Classifier parameters should be given in the following form: ‘classifier__{parameter name}’.
n_gs_cv: Number of folds for grid search cross-validation, will only be used if gs_params is not None. If None, will default to a 5-fold cross-validation.
n_jobs: For parallel processing, number of tasks to run at once
cat_key: Optional, name of key in .obs containing categorical (e.g. cell type) information
categories: Optional, names of categories to subset to for the regression. In cases where the exogenous block is exceptionally heterogeneous, can be used to narrow down the search space.
kwargs: Additional named arguments that will be provided to GLMCV.

Returns

coeffs: Contains fitted parameters for each feature
reconst: Contains predicted expression for each feature
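The gs_params argument maps ‘classifier__{parameter name}’ keys to lists of candidate values. The sketch below shows how such a dictionary expands into the individual combinations a grid search would evaluate. The expand_grid helper and the two parameter names (alpha, reg_lambda) are hypothetical and used for illustration only; check the GLMCV signature for the names it actually accepts.

```python
from itertools import product
from typing import Any, Dict, List


def expand_grid(gs_params: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Enumerate every combination of the candidate values in gs_params."""
    keys = sorted(gs_params)
    return [
        dict(zip(keys, combo))
        for combo in product(*(gs_params[key] for key in keys))
    ]


# Hypothetical parameter names, following the 'classifier__{name}' convention.
gs_params = {
    "classifier__alpha": [0.5, 1.0],
    "classifier__reg_lambda": [0.01, 0.1, 1.0],
}
grid = expand_grid(gs_params)
```

With two values for one parameter and three for the other, the search evaluates six candidate models per cross-validation fold, so grid size grows multiplicatively with each added key.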

visualize_params(coeffs: pandas.DataFrame, subset_cols: Union[None, str, List[str]] = None, cmap: str = 'autumn', zero_center_cmap: bool = False, mask_threshold: Union[None, float] = None, mask_zero: bool = True, transpose: bool = False, title: Union[None, str] = None, xlabel: Union[None, str] = None, ylabel: Union[None, str] = None, figsize: Union[None, Tuple[float, float]] = None, annot_kws: dict = {}, save_show_or_return: Literal[save, show, return, both, all] = 'save', save_kwargs: dict = {})[source]#

Generates heatmap of parameter values for visualization

Parameters

coeffs: Contains coefficients (and any other relevant statistics that were computed) from regression for each variable
subset_cols: String or list of strings that can be used to subset coeffs DataFrame such that only columns with names containing the provided key strings are displayed on heatmap. For example, can use “coeff” to plot only the linear regression coefficients, “zstat” for the z-statistic, etc. Or can use the full name of the column to select specific columns.
cmap: Name of the colormap to use
zero_center_cmap: Set True to set colormap intensity midpoint to zero.
mask_threshold: Optional, sets lower absolute value thresholds for parameters to be assigned color in heatmap (will compare absolute value of each element against this threshold)
mask_zero: Set True to not assign color to zeros (representing neither a positive nor a negative interaction)
transpose: Set True to reverse the dataframe’s orientation before plotting
title: Optional, provides title for plot. If not given, will use default “Spatial Parameters”.
xlabel: Optional, provides label for x-axis. If not given, will use default “Predictor Features”.
ylabel: Optional, provides label for y-axis. If not given, will use default “Target Features”.
figsize: Can be used to set width and height of figure window, in inches. If not given, will use Spateo default.
annot_kws: Optional dictionary that can be used to set qualities of the axis/tick labels. For example, can set ‘size’: 9, ‘weight’: ‘bold’, etc.
save_show_or_return: Whether to save, show or return the figure. If “both”, it will save and plot the figure at the same time. If “all”, the figure will be saved, displayed and the associated axis and other objects will be returned.
save_kwargs: A dictionary that will be passed to the save_fig function. By default it is an empty dictionary and the save_fig function will use {“path”: None, “prefix”: ‘scatter’, “dpi”: None, “ext”: ‘pdf’, “transparent”: True, “close”: True, “verbose”: True} as its parameters. Otherwise, you can provide a dictionary that properly modifies those keys according to your needs.
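The interaction between mask_threshold and mask_zero can be sketched as a boolean mask over the coefficient table, where True marks a cell that receives no color. This is an assumption-laden illustration: hidden_cells is a hypothetical helper, and whether the real comparison against the threshold is strict is not stated in the documentation above.

```python
from typing import List, Optional


def hidden_cells(
    values: List[List[float]],
    mask_threshold: Optional[float] = None,
    mask_zero: bool = True,
) -> List[List[bool]]:
    """Return True for every cell that should not be assigned a color."""
    mask = []
    for row in values:
        mask_row = []
        for value in row:
            hide = False
            # Assumption: values strictly below the threshold are hidden.
            if mask_threshold is not None and abs(value) < mask_threshold:
                hide = True
            if mask_zero and value == 0:
                hide = True
            mask_row.append(hide)
        mask.append(mask_row)
    return mask


coeffs = [[0.0, 0.05, -0.8], [1.2, -0.01, 0.0]]
mask_default = hidden_cells(coeffs)
mask_strict = hidden_cells(coeffs, mask_threshold=0.1)
```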

compute_coeff_significance(coeffs: pandas.DataFrame, significance_threshold: float = 0.05, only_positive: bool = False, only_negative: bool = False) → Tuple[pandas.DataFrame, pandas.DataFrame, pandas.DataFrame][source]#

Computes statistical significance for fitted coefficients.

Parameters

coeffs: Contains coefficients from regression for each variable
significance_threshold: p-value needed to call a sender-receiver relationship significant
only_positive: Set True to find significance/pvalues/qvalues only for the subset of coefficients that is positive (representing possible mechanisms of positive regulation).
only_negative: Set True to find significance/pvalues/qvalues only for the subset of coefficients that is negative (representing possible mechanisms of negative regulation).

Returns

is_significant: Dataframe of identical shape to coeffs, where each element is True or False if it meets the threshold for significance
pvalues: Dataframe of identical shape to coeffs, where each element is a p-value for that instance of that feature
qvalues: Dataframe of identical shape to coeffs, where each element is a q-value for that instance of that feature
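The relationship between the three returned tables can be illustrated with a small multiple-testing sketch. The documentation above does not state which correction produces the q-values, so the Benjamini-Hochberg procedure used here is an assumption, and benjamini_hochberg is a hypothetical helper operating on a flat list rather than a DataFrame.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values (assumed correction method)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic q-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


significance_threshold = 0.05
pvalues = [0.01, 0.04, 0.03, 0.5]
qvalues = benjamini_hochberg(pvalues)
is_significant = [q < significance_threshold for q in qvalues]
```

Note that two coefficients with raw p-values below 0.05 are no longer significant after correction, which is why thresholding is typically applied to q-values rather than p-values.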

get_effect_sizes(coeffs: pandas.DataFrame, only_positive: bool = False, only_negative: bool = False, significance_threshold: float = 0.05, lr_pair: Union[None, str] = None, save_prefix: Union[None, str] = None)[source]#

For each predictor and each feature, determine if the influence of said predictor in predicting said feature is significant.

Additionally, for each feature and each sender-receiver category pair, determines the effect size that the sender induces in the feature for the receiver.

Only valid if the model specified uses the connections between categories as variables for the regression. It can therefore be applied to ‘mod_type’ “niche” or “niche_lr”.

Parameters

coeffs: Contains coefficients from regression for each variable
only_positive: Set True to find significance/pvalues/qvalues only for the subset of coefficients that is positive (representing possible mechanisms of positive regulation).
only_negative: Set True to find significance/pvalues/qvalues only for the subset of coefficients that is negative (representing possible mechanisms of negative regulation).
significance_threshold: p-value needed to call a sender-receiver relationship significant
lr_pair: Required if (and used only in the case that) coefficients came from a Niche-LR model; used to subset the coefficients array to the specific ligand-receptor pair of interest. Takes the form “{ligand}-{receptor}” and should match one of the keys in the dictionary self.niche_mats. If not given, will default to the first key in this dictionary.
save_prefix: If provided, saves all relevant dataframes to the path ./regression_outputs under the name {prefix}_{coeffs/pvalues, etc.}.csv. If not provided, will not save.
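The lr_pair lookup described above (a “{ligand}-{receptor}” key, defaulting to the first key of the dictionary) might be sketched as follows. The select_niche_matrix helper and the placeholder names LIG1, REC1, LIG2 and REC2 are hypothetical; only the key format and the default-to-first-key behavior come from the parameter description.

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]


def select_niche_matrix(
    niche_mats: Dict[str, Matrix], lr_pair: Optional[str] = None
) -> Tuple[str, Matrix]:
    """Pick the array for one ligand-receptor pair, defaulting to the first key."""
    if not niche_mats:
        raise ValueError("niche_mats is empty")
    if lr_pair is None:
        # Dictionaries preserve insertion order, so this is the first key added.
        lr_pair = next(iter(niche_mats))
    if lr_pair not in niche_mats:
        raise KeyError(f"'{lr_pair}' is not one of {list(niche_mats)}")
    return lr_pair, niche_mats[lr_pair]


# Placeholder ligand and receptor names, for illustration only.
niche_mats = {"LIG1-REC1": [[0.0, 1.0]], "LIG2-REC2": [[1.0, 0.0]]}
default_key, default_mat = select_niche_matrix(niche_mats)
chosen_key, chosen_mat = select_niche_matrix(niche_mats, "LIG2-REC2")
```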

type_coupling(cmap: str = 'Reds', fontsize: Union[None, int] = None, figsize: Union[None, Tuple[float, float]] = None, ignore_self: bool = True, save_show_or_return: Literal[save, show, return, both, all] = 'save', save_kwargs: dict = {})[source]#

Generates heatmap of spatially differentially-expressed features for each pair of sender and receiver categories. Only valid if the model specified uses the connections between categories as variables for the regression.

A high number of differentially-expressed genes between a given sender-receiver pair means that the sender being in the neighborhood of the receiver tends to correlate with differential expression levels of many of the genes within the selection. In that case, much of the cellular variation in the receiver cell type can be attributed to being in proximity with the sender.

Parameters

cmap: Name of Matplotlib color map to use
fontsize: Size of figure title and axis labels
figsize: Width and height of plotting window
ignore_self: If True, will ignore the effect of cell type in proximity to other cells of the same type; the number of DEGs will be recorded only if the two cell types are different.
save_show_or_return: Whether to save, show or return the figure. Options: “save”, “show”, “return”, “both”, “all”. If “both”, it will save and plot the figure at the same time. If “all”, the figure will be saved, displayed and the associated axis and other objects will be returned.
save_kwargs: A dictionary that will be passed to the save_fig function. By default it is an empty dictionary and the save_fig function will use {“path”: None, “prefix”: ‘scatter’, “dpi”: None, “ext”: ‘pdf’, “transparent”: True, “close”: True, “verbose”: True} as its parameters. Otherwise, you can provide a dictionary that properly modifies those keys according to your needs.
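The quantity shown in the type coupling heatmap, the number of differentially-expressed genes per sender-receiver pair, can be sketched as a simple count over significance flags. The count_degs helper and the dictionary layout keyed by (sender, receiver) are hypothetical and chosen only to illustrate the effect of ignore_self.

```python
from typing import Dict, List, Sequence, Tuple


def count_degs(
    is_significant: Dict[Tuple[str, str], List[bool]],
    categories: Sequence[str],
    ignore_self: bool = True,
) -> Dict[str, Dict[str, int]]:
    """Count significant genes for every (sender, receiver) category pair."""
    counts = {s: {r: 0 for r in categories} for s in categories}
    for (sender, receiver), flags in is_significant.items():
        if ignore_self and sender == receiver:
            continue  # leave same-type pairs at zero
        counts[sender][receiver] = sum(1 for flag in flags if flag)
    return counts


# One list of per-gene significance flags per (sender, receiver) pair.
is_significant = {
    ("A", "A"): [True, True, True],
    ("A", "B"): [True, False, True],
    ("B", "A"): [False, False, False],
    ("B", "B"): [True, False, False],
}
deg_counts = count_degs(is_significant, categories=["A", "B"])
deg_counts_with_self = count_degs(is_significant, ["A", "B"], ignore_self=False)
```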

sender_effect_on_all_receivers(sender: str, plot_mode: str = 'effect_size', gene_subset: Union[None, List[str]] = None, significance_threshold: float = 0.05, cut_pvals: float = -5, fontsize: Union[None, int] = None, figsize: Union[None, Tuple[float, float]] = None, cmap: str = 'seismic', save_show_or_return: Literal[save, show, return, both, all] = 'show', save_kwargs: Optional[dict] = {})[source]#

Evaluates and visualizes the effect that the given sender cell type has on expression/abundance in each possible receiver cell type.

Parameters

sender

sender cell type label

plot_mode

specifies what gets plotted. Options:

“qvals”: elements of the plot represent statistical significance of the interaction

“effect_size”: elements of the plot represent numerical expression change induced in the receiver by the sender

gene_subset

Names of genes to subset for plot. If None, will use all genes that were used in the regression.

significance_threshold

Set non-significant effect sizes to zero, where the threshold is given here

cut_pvals

Minimum allowable log10(pval); anything below this will be clipped to this value

fontsize

Size of figure title and axis labels

figsize

Width and height of plotting window

cmap

Name of the Matplotlib colormap to use

save_show_or_return

Whether to save, show or return the figure. If “both”, it will save and plot the figure at the same time. If “all”, the figure will be saved, displayed and the associated axis and other objects will be returned.

save_kwargs

A dictionary that will be passed to the save_fig function. By default it is an empty dictionary and the save_fig function will use {“path”: None, “prefix”: ‘scatter’, “dpi”: None, “ext”: ‘pdf’, “transparent”: True, “close”: True, “verbose”: True} as its parameters. Otherwise, you can provide a dictionary that properly modifies those keys according to your needs.
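The significance_threshold and cut_pvals arguments describe two small transformations applied before plotting: zeroing out non-significant effect sizes and clipping log10 p-values from below. A minimal sketch follows; both helper names are hypothetical, and the use of a strict comparison against the threshold is an assumption.

```python
import math
from typing import List


def mask_effect_sizes(
    effects: List[float], qvalues: List[float], significance_threshold: float = 0.05
) -> List[float]:
    """Set effect sizes to zero where the q-value misses the threshold."""
    return [
        effect if q < significance_threshold else 0.0
        for effect, q in zip(effects, qvalues)
    ]


def clip_log10_pvals(pvalues: List[float], cut_pvals: float = -5.0) -> List[float]:
    """Compute log10(p) and clip anything below cut_pvals to cut_pvals."""
    clipped = []
    for p in pvalues:
        # log10 is undefined at zero, so treat p == 0 as the floor value.
        value = math.log10(p) if p > 0 else cut_pvals
        clipped.append(max(value, cut_pvals))
    return clipped


masked = mask_effect_sizes([1.5, -0.7, 0.2], [0.01, 0.2, 0.04])
log_p = clip_log10_pvals([1e-10, 0.01, 1.0, 0.0])
```

Clipping keeps a handful of extremely small p-values from dominating the color scale of the resulting plot.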

all_senders_effect_on_receiver(receiver: str, plot_mode: str = 'effect_size', gene_subset: Union[None, List[str]] = None, significance_threshold: float = 0.05, cut_pvals: float = -5, fontsize: Union[None, int] = None, figsize: Union[None, Tuple[float, float]] = None, cmap: str = 'seismic', save_show_or_return: Literal[save, show, return, both, all] = 'show', save_kwargs: Optional[dict] = {})[source]#

Evaluates and visualizes the effect that each possible sender cell type has on expression/abundance in a selected receiver cell type.

Parameters

receiver

Receiver cell type label

plot_mode

specifies what gets plotted. Options:

“qvals”: elements of the plot represent statistical significance of the interaction

“effect_size”: elements of the plot represent effect size induced in the receiver by the sender

gene_subset

Names of genes to subset for plot. If None, will use all genes that were used in the regression.

significance_threshold

Set non-significant effect sizes to zero, where the threshold is given here

cut_pvals

Minimum allowable log10(pval); anything below this will be clipped to this value

fontsize

Size of figure title and axis labels

figsize

Width and height of plotting window

cmap

Name of the Matplotlib colormap to use

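save_show_or_return

Whether to save, show or return the figure. If “both”, it will save and plot the figure at the same time. If “all”, the figure will be saved, displayed and the associated axis and other objects will be returned.

save_kwargs

A dictionary that will be passed to the save_fig function. By default it is an empty dictionary and the save_fig function will use {“path”: None, “prefix”: ‘scatter’, “dpi”: None, “ext”: ‘pdf’, “transparent”: True, “close”: True, “verbose”: True} as its parameters. Otherwise, you can provide a dictionary that properly modifies those keys according to your needs.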

sender_receiver_effect_volcanoplot(receiver: str, sender: str, significance_threshold: float = 0.05, effect_size_threshold: Union[None, float] = None, fontsize: Union[None, int] = None, figsize: Union[None, Tuple[float, float]] = (4.5, 7.0), save_show_or_return: Literal[save, show, return, both, all] = 'show', save_kwargs: Optional[dict] = {})[source]#

Volcano plot to identify differentially expressed genes of a given receiver cell type in the presence of a given sender cell type.

Parameters

receiver: Receiver cell type label
sender: Sender cell type label
significance_threshold: Set non-significant effect sizes (given by q-values) to zero, where the threshold is given here
effect_size_threshold: Set absolute value effect-size threshold beyond which observations are marked as interesting. If not given, will take the 95th percentile fold-change as the cutoff.
fontsize: Size of figure title and axis labels
figsize: Width and height of plotting window
save_show_or_return: Whether to save, show or return the figure. If “both”, it will save and plot the figure at the same time. If “all”, the figure will be saved, displayed and the associated axis and other objects will be returned.
save_kwargs: A dictionary that will be passed to the save_fig function. By default it is an empty dictionary and the save_fig function will use {“path”: None, “prefix”: ‘scatter’, “dpi”: None, “ext”: ‘pdf’, “transparent”: True, “close”: True, “verbose”: True} as its parameters. Otherwise, you can provide a dictionary that properly modifies those keys according to your needs.
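The default behavior of effect_size_threshold (falling back to the 95th percentile) combined with significance_threshold can be sketched as below. The helper names are hypothetical, and the nearest-rank percentile definition is an assumption; the actual method may interpolate differently.

```python
from typing import List, Optional


def percentile_cutoff(values: List[float], percent: int = 95) -> float:
    """Nearest-rank percentile of the absolute values (assumed definition)."""
    ordered = sorted(abs(value) for value in values)
    rank = -(-percent * len(ordered) // 100)  # integer ceiling of percent * n / 100
    return ordered[max(rank, 1) - 1]


def flag_genes(
    effects: List[float],
    qvalues: List[float],
    significance_threshold: float = 0.05,
    effect_size_threshold: Optional[float] = None,
) -> List[bool]:
    """Mark genes that are both significant and beyond the effect-size cutoff."""
    if effect_size_threshold is None:
        effect_size_threshold = percentile_cutoff(effects)
    return [
        q < significance_threshold and abs(effect) >= effect_size_threshold
        for effect, q in zip(effects, qvalues)
    ]


effects = [0.1, -0.2, 0.3, -2.5]
qvalues = [0.01, 0.5, 0.02, 0.001]
flags_default = flag_genes(effects, qvalues)
flags_manual = flag_genes(effects, qvalues, effect_size_threshold=0.25)
```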

class spateo.tools.ST_regression.spatial_regression.Category_Model(*args, **kwargs)[source]#

Bases: Base_Model

Wraps all necessary methods for data loading and preparation, model initialization, parameterization, evaluation and prediction when instantiating a model for spatially-aware (but not spatially lagged) regression using categorical variables (specifically, the prevalence of categories within spatial neighborhoods) to predict the value of gene expression.

Arguments are passed to Base_Model. The only keyword argument that is used for this class is ‘n_neighbors’.

Parameters

args: Positional arguments to Base_Model
kwargs: Keyword arguments to Base_Model

class spateo.tools.ST_regression.spatial_regression.Niche_Model(*args, **kwargs)[source]#

Bases: Base_Model

Wraps all necessary methods for data loading and preparation, model initialization, parameterization, evaluation and prediction when instantiating a model for spatially-aware regression using both the prevalence of and connections between categories within spatial neighborhoods to predict the value of gene expression.

Arguments are passed to Base_Model.

Parameters

args: Positional arguments to Base_Model
kwargs: Keyword arguments to Base_Model

class spateo.tools.ST_regression.spatial_regression.Lagged_Model(model_type: str = 'ligand', lig: Union[None, str, List[str]] = None, rec: Union[None, str, List[str]] = None, rec_ds: Union[None, str, List[str]] = None, species: Literal[human, mouse, axolotl] = 'human', normalize: bool = True, smooth: bool = False, log_transform: bool = True, *args, **kwargs)[source]#

Bases: Base_Model

Wraps all necessary methods for data loading and preparation, model initialization, parameterization, evaluation and prediction when instantiating a model for spatially-lagged regression.

Can specify one of two models: “ligand”, which uses the spatial lag of ligand genes and the spatial lag of the regression target to predict the regression target, or “niche”, which uses the spatial lag of cell type colocalization and the spatial lag of the regression target to predict the regression target.

If “ligand” is specified, arguments to lig must be given, and it is recommended to provide species as well (the default is human).

Arguments are passed to Base_Model.

Parameters

model_type: Either “ligand” or “niche”, specifies whether to fit a model that incorporates the spatial lag of ligand expression or the spatial lag of cell type colocalization.
lig: Name(s) of ligands to use as predictors
rec: Name(s) of receptors to use as regression targets. If not given, will search through database for all genes that correspond to the provided genes from ‘lig’.
rec_ds: Name(s) of receptor-downstream genes to use as regression targets. If not given, will search through database for all genes that correspond to receptor-downstream genes.
species: Specifies L:R database to use
normalize: Perform library size normalization, to set total counts in each cell to the same number (adjust for cell size)
smooth: To correct for dropout effects, leverage gene expression neighborhoods to smooth expression
log_transform: Set True if log-transformation should be applied to expression (otherwise, will assume preprocessing/log-transform was computed beforehand)
args: Additional positional arguments to Base_Model
kwargs: Additional keyword arguments to Base_Model

run_GM_lag() → Tuple[pandas.DataFrame, pandas.DataFrame, pandas.DataFrame][source]#: Runs spatially lagged two-stage least squares model
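The spatial lag that the lagged models rely on is, in the usual spatial-econometrics sense, a weighted average of a variable over each sample's neighbors. The sketch below computes it with a row-normalized weights matrix. It does not reproduce the two-stage least squares fit performed by run_GM_lag; row_normalize and spatial_lag are hypothetical helpers, and row normalization is an assumption.

```python
from typing import List


def row_normalize(weights: List[List[float]]) -> List[List[float]]:
    """Scale each row of the spatial weights matrix to sum to one."""
    normalized = []
    for row in weights:
        total = sum(row)
        normalized.append([w / total if total else 0.0 for w in row])
    return normalized


def spatial_lag(weights: List[List[float]], values: List[float]) -> List[float]:
    """Weighted sum of neighbor values for every sample (W @ x)."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


# Toy weights: sample 0 neighbors 1 and 2, sample 1 neighbors 0, and so on.
W = [[0.0, 1.0, 1.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
expression = [1.0, 2.0, 4.0]
lagged = spatial_lag(row_normalize(W), expression)
```

In a “ligand” model the lagged values would be computed for ligand expression and for the regression target, and both would enter the regression as predictors.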

single(cur_g: str, X: pandas.DataFrame, X_variable_names: List[str], param_labels: List[str], adata: anndata.AnnData, w: numpy.ndarray, layer: Union[None, str] = None) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray][source]#

Defines the model run process for a single feature. Not intended to be called by the user; all arguments are populated from those passed on instantiation of Base_Model.

Parameters

cur_g: Name of the feature to regress on
X: Values used for the regression
X_variable_names: Names of the variables used for the regression
param_labels: Names of categories; each computed parameter corresponds to a single element in param_labels
adata: AnnData object to store results in
w: Spatial weights array
layer: Specifies layer in AnnData to use; if None, will use .X.

Returns

coeffs: Coefficients for each categorical group for each feature
pred: Predicted values from regression for each feature
resid: Residual values from regression for each feature

class spateo.tools.ST_regression.spatial_regression.Niche_LR_Model(lig: Union[None, str, List[str]], rec: Union[None, str, List[str]] = None, rec_ds: Union[None, str, List[str]] = None, species: Literal[human, mouse, axolotl] = 'human', niche_lr_r_lag: bool = True, *args, **kwargs)[source]#

Bases: Base_Model

Wraps all necessary methods for data loading and preparation, model initialization, parameterization, evaluation and prediction when instantiating a model for spatially-aware regression using the prevalence of and connections between categories within spatial neighborhoods and the cell type-specific expression of ligands and receptors to predict the regression target.

Arguments are passed to Base_Model.

Parameters

lig: Name(s) of ligands to use as predictors
rec: Name(s) of receptors to use as regression targets. If not given, will search through database for all genes that correspond to the provided genes from ‘lig’
rec_ds: Name(s) of receptor-downstream genes to use as regression targets. If not given, will search through database for all genes that correspond to receptors
species: Specifies L:R database to use
niche_lr_r_lag: Only used if ‘mod_type’ is “niche_lr”. Uses the spatial lag of the receptor as the dependent variable rather than each spot’s unique receptor expression. Defaults to True.
args: Additional positional arguments to Base_Model
kwargs: Additional keyword arguments to Base_Model

spateo.tools.ST_regression.spatial_regression.calc_1nd_moment(X, W, normalize_W=True) → Tuple[numpy.ndarray, Optional[numpy.ndarray]][source]#
