spateo.tools.ST_regression.spatial_regression

Suite of tools for spatially-aware as well as spatially-lagged linear regression

Also performs downstream characterization following spatially-informed regression to characterize niche impact on gene expression

Note: each of the spatial regression classes can be called either through cell_interaction (e.g. st.cell_interaction.NicheModel) or standalone (e.g. st.NicheModel). The same is true for all functions except the general regression ones (e.g. fit_glm, which must be called with st.fit_glm).

Module Contents

Classes

Base_Model

Basis class for all spatially-aware and spatially-lagged regression models that can be implemented through this toolkit.

Category_Model

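Wraps all necessary methods for data loading and preparation, model initialization, parameterization, evaluation and prediction for spatially-aware (but not spatially-lagged) regression using category prevalence within spatial neighborhoods.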

Niche_Model

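Wraps all necessary methods for data loading and preparation, model initialization, parameterization, evaluation and prediction for spatially-aware regression using both the prevalence of and connections between categories within spatial neighborhoods.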

Lagged_Model

Wraps all necessary methods for data loading and preparation, model initialization, parameterization, evaluation and prediction for spatially-lagged regression.

Niche_LR_Model

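Wraps all necessary methods for data loading and preparation, model initialization, parameterization, evaluation and prediction for spatially-aware regression using category connections together with cell type-specific ligand and receptor expression.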

Functions

calc_1nd_moment(→ Tuple[numpy.ndarray, ...)

class spateo.tools.ST_regression.spatial_regression.Base_Model(adata: anndata.AnnData, spatial_key: str = 'spatial', distr: Union[None, Literal[gaussian, poisson, softplus, neg-binomial, gamma]] = None, group_key: Union[None, str] = None, genes: Union[None, List] = None, drop_dummy: Union[None, str] = None, layer: Union[None, str] = None, cci_dir: Union[None, str] = None, normalize: bool = True, smooth: bool = False, log_transform: bool = False, niche_compute_indicator: bool = True, weights_mode: str = 'knn', data_id: Union[None, str] = None, **kwargs)[source]#

Basis class for all spatially-aware and spatially-lagged regression models that can be implemented through this toolkit. Includes necessary methods for data loading and preparation, computation of spatial weights matrices, computation of evaluation metrics and more.

Parameters
adata

object of class anndata.AnnData

group_key

Key in .obs where group (e.g. cell type) information can be found

spatial_key

Key in .obsm where x- and y-coordinates are stored

distr

Optional distribution family. Specifies the type of model to fit at the time this class is initialized rather than in the later call to GLMCV_fit_predict. Can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”. Case sensitive.

genes

Subset to genes of interest: will be used as dependent variables in non-ligand-based regression analyses, will be independent variables in ligand-based regression analyses

drop_dummy

Name of the category to be dropped (the “dummy variable”) in the regression. The dummy category can aid in interpretation as all model coefficients can be taken to be in reference to the dummy category. If None, will randomly select a few samples to constitute the dummy group.

layer

Entry in .layers to use instead of .X when fitting the model; all other operations will use .X.

cci_dir

Full path to the directory containing cell-cell communication databases. Only used in the case of models that use ligands for prediction.

normalize

Perform library size normalization, to set total counts in each cell to the same number (adjust for cell size)

smooth

To correct for dropout effects, leverage gene expression neighborhoods to smooth expression

log_transform

Set True if log-transformation should be applied to expression (otherwise, will assume preprocessing/log-transform was computed beforehand)

niche_compute_indicator

Only used if ‘mod_type’ is “niche” or “niche_lr”. If True, for the “niche” model, for the connections array encoding the cell type-cell type interactions that occur within each niche, threshold all nonzero values to 1, to reflect the presence of a pairwise cell type interaction. Otherwise, will fit on the normalized number of pairwise interactions within each niche. For the “niche_lr” model, for the cell type pair interactions array, threshold all nonzero values to 1 to reflect the presence of an interaction between the two cell types within each niche. Otherwise, will fit on normalized data.

weights_mode

Options “knn”, “kernel”, “band”; sets whether to use K-nearest neighbors, a kernel-based method, or distance band to compute spatial weights, respectively.

data_id

If given, will save pairwise distance arrays & nearest neighbor arrays to folder in the working directory, under ‘./neighbors/{data_id}_distance.csv’ and ‘./neighbors/{data_id}_adj.csv’. Will also check for existing files under these names to avoid re-computing these arrays. If not given, will not save.

kwargs

Provides additional spatial weight-finding arguments. Note that these must specifically match the name that the function will look for (case sensitive). For reference:

  • n_neighbors (int)

    Number of nearest neighbors for KNN

  • p (int)

    Minkowski p-norm for KNN and distance band methods

  • distance_metric (str)

    Pairwise distance metric for KNN

  • bandwidth (float or array-like of floats)

    Sets kernel width for kernel method

  • fixed (bool)

    If True, the same bandwidth is used for every observation; if False, bandwidth is allowed to vary across observations for kernel method

  • n_neighbors_bandwidth (int)

    Number of nearest neighbors for determining bandwidth for kernel method

  • kernel_function (str)

    “triangular”, “uniform”, “quadratic”, “quartic” or “gaussian”. Rule for setting how spatial weight decays with distance

  • threshold (float)

    Distance for which to consider spots “neighbors” for each spot in distance band method (typically in units of pixels)

  • alpha (float)

    Should be less than 0; can be used to set weights to decay with distance for distance band method
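The kernel_function names listed above correspond to standard kernel shapes. The sketch below assumes the conventional definitions (as used in PySAL-style kernel weights), with z = distance / bandwidth. Whether Spateo scales or truncates them identically is an assumption, and kernel_weight is a hypothetical helper, not part of the Spateo API.

```python
import math


def kernel_weight(distance: float, bandwidth: float, kernel_function: str = "triangular") -> float:
    """Hypothetical helper: spatial weight for one pair of points."""
    z = distance / bandwidth
    if kernel_function == "gaussian":
        # Left untruncated here for simplicity (assumption)
        return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    if abs(z) > 1:
        # The remaining kernels are zero beyond one bandwidth
        return 0.0
    if kernel_function == "triangular":
        return 1 - abs(z)
    if kernel_function == "uniform":
        return 0.5
    if kernel_function == "quadratic":
        return 0.75 * (1 - z * z)
    if kernel_function == "quartic":
        return (15 / 16) * (1 - z * z) ** 2
    raise ValueError(f"Unknown kernel_function: {kernel_function}")


# Weight decays as distance approaches the bandwidth
weights = [kernel_weight(d, bandwidth=1.0, kernel_function="triangular") for d in (0.0, 0.5, 1.0)]
```

All kernels except “uniform” give the largest weight at zero distance, so the choice mainly controls how quickly influence falls off within the bandwidth.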

preprocess_data(normalize: Union[None, bool] = None, smooth: Union[None, bool] = None, log_transform: Union[None, bool] = None)[source]#

Normalization and transformation of input data. Can manually specify whether to normalize, smooth, and log-transform the data; any arguments not given this way will default to the values passed on instantiation of the model object.

Returns

None; all preprocessing operates in place on the object’s input AnnData.
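To make the normalize and log_transform options concrete, here is a minimal sketch on a plain list-of-lists count matrix (cells in rows, genes in columns). The function names, the explicit target_sum argument, and the use of log(1 + x) are assumptions for illustration only; they are not taken from Spateo's implementation.

```python
import math


def normalize_total(counts, target_sum):
    """Scale each cell (row) so that its counts sum to target_sum."""
    normalized = []
    for row in counts:
        total = sum(row)
        # Leave all-zero cells untouched to avoid dividing by zero
        scale = target_sum / total if total else 1.0
        normalized.append([value * scale for value in row])
    return normalized


def log_transform(matrix):
    """Apply log(1 + x) element-wise (assumed transform)."""
    return [[math.log1p(value) for value in row] for row in matrix]


raw = [[1, 3], [10, 30]]          # second cell has 10x the library size
norm = normalize_total(raw, target_sum=100)
logged = log_transform(norm)
```

After normalization both cells have the same profile, which is the point of adjusting for library size before fitting.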

prepare_data(mod_type: str = 'category', lig: Union[None, List[str]] = None, rec: Union[None, List[str]] = None, niche_lr_r_lag: bool = True, use_ds: bool = True, rec_ds: Union[None, List[str]] = None, species: Literal[human, mouse, axolotl] = 'human')[source]#

Handles any necessary data preparation, starting from given source AnnData object

Parameters
mod_type

The type of model that will be employed; this dictates how the data will be processed and prepared. Options:

  • category: spatially-aware, for each sample, computes category prevalence within the spatial neighborhood and uses these as independent variables

  • niche: spatially-aware, uses spatial connections between samples as independent variables

  • ligand_lag: spatially-lagged, from database uses select ligand genes to perform regression on select receptor and/or receptor-downstream genes, and additionally considers neighbor expression of the ligands

  • niche_lr: spatially-aware, uses a coupling of spatial category connections, ligand expression and receptor expression to perform regression on select receptor-downstream genes

lig

Only used if ‘mod_type’ contains “ligand”. Provides the list of ligands to use as predictors. If not given, will attempt to subset self.genes

rec

Only used if ‘mod_type’ contains “ligand”. Provides the list of receptors to investigate. If not given, will search through database for all receptors that correspond to the ligands provided in ‘lig’.

niche_lr_r_lag

Only used if ‘mod_type’ is “niche_lr”. Uses the spatial lag of the receptor as the dependent variable rather than each spot’s unique receptor expression. Defaults to True.

use_ds

If True, uses receptor-downstream genes in addition to ligands and receptors.

rec_ds

Only used if ‘mod_type’ is “niche_lr” or “ligand_lag”. Can be used to optionally manually define a list of genes shown to be (or thought to potentially be) downstream of one or more of the provided L:R pairs. If not given, will find receptor-downstream genes from database based on input to ‘lig’ and ‘rec’.

species

Selects the cell-cell communication database the relevant ligands will be drawn from. Options: “human”, “mouse”, “axolotl”.
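As an illustration of the “category” mod_type, the sketch below computes, for each sample, the prevalence of each category among its spatial neighbors. It assumes prevalence means the fraction of neighbors per category and that the neighbor lists are already known; category_prevalence is a hypothetical helper, not a Spateo function.

```python
def category_prevalence(adjacency, labels, categories):
    """For each sample, fraction of its neighbors belonging to each category.

    adjacency[i] is the list of neighbor indices of sample i.
    """
    design = []
    for neighbors in adjacency:
        counts = {category: 0 for category in categories}
        for j in neighbors:
            counts[labels[j]] += 1
        n = len(neighbors)
        design.append([counts[c] / n if n else 0.0 for c in categories])
    return design


labels = ["A", "B", "B", "A"]                    # e.g. cell type per sample
adjacency = [[1, 2], [0, 2], [0, 1], [0, 1, 2]]  # neighbor indices per sample
X = category_prevalence(adjacency, labels, categories=["A", "B"])
```

Each row of X would then serve as the independent variables for that sample, with one column per category.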

compute_spatial_weights()[source]#

Generates matrix of pairwise spatial distances, used in spatially-lagged models
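For orientation, the pairwise distances and nearest-neighbor lookup that typically feed a spatial weights matrix can be sketched in plain Python. This is not Spateo's code: both helpers are hypothetical, and p plays the role of the Minkowski p-norm described under kwargs.

```python
def pairwise_minkowski(coords, p=2):
    """Symmetric matrix of Minkowski distances between all coordinate pairs."""
    n = len(coords)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sum(abs(a - b) ** p for a, b in zip(coords[i], coords[j])) ** (1 / p)
            dist[i][j] = dist[j][i] = d
    return dist


def knn_indices(dist, n_neighbors):
    """Indices of the n_neighbors closest other points for each point."""
    result = []
    for i, row in enumerate(dist):
        others = [j for j in range(len(row)) if j != i]
        result.append(sorted(others, key=lambda j: row[j])[:n_neighbors])
    return result


coords = [(0, 0), (3, 4), (0, 1)]
euclidean = pairwise_minkowski(coords, p=2)
manhattan = pairwise_minkowski(coords, p=1)
nearest = knn_indices(euclidean, n_neighbors=1)
```

The quadratic loop is fine for a toy example; real datasets would normally use a spatial index or vectorized routine instead.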

GLMCV_fit_predict(gs_params: Union[None, dict] = None, n_gs_cv: Union[None, int] = None, n_jobs: int = 30, cat_key: Union[None, str] = None, categories: Union[None, str, List[str]] = None, **kwargs) → Tuple[pandas.DataFrame, pandas.DataFrame][source]#

Wrapper for fitting predictive generalized linear regression model.

Parameters
gs_params

Optional dictionary where keys are variable names for the regressor and values are lists of potential values for which to find the best combination using grid search. Classifier parameters should be given in the following form: ‘classifier__{parameter name}’.

n_gs_cv

Number of folds for grid search cross-validation, will only be used if gs_params is not None. If None, will default to a 5-fold cross-validation.

n_jobs

For parallel processing, number of tasks to run at once

cat_key

Optional, name of key in .obs containing categorical (e.g. cell type) information

categories

Optional, names of categories to subset to for the regression. In cases where the exogenous block is exceptionally heterogeneous, can be used to narrow down the search space.

kwargs

Additional named arguments that will be provided to the GLMCV class.

Returns

coeffs: Contains fitted parameters for each feature

reconst: Contains predicted expression for each feature

Return type

Tuple[pandas.DataFrame, pandas.DataFrame]
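The gs_params dictionary follows the ‘classifier__{parameter name}’ key convention described above. The sketch below shows the shape of such a dictionary and how a grid search would enumerate it. The parameter names alpha and reg_lambda are hypothetical placeholders (check the GLMCV signature for the real ones), and expand_grid is an illustrative helper rather than part of Spateo.

```python
from itertools import product

# Keys use the 'classifier__{parameter name}' form; the names after the
# prefix are hypothetical placeholders for illustration.
gs_params = {
    "classifier__alpha": [0.01, 0.1, 1.0],
    "classifier__reg_lambda": [0.1, 0.5],
}


def expand_grid(params):
    """List every combination of candidate values, one dict per combination."""
    keys = list(params)
    return [dict(zip(keys, combo)) for combo in product(*(params[k] for k in keys))]


grid = expand_grid(gs_params)
```

With n_gs_cv folds, each of these combinations would be fit once per fold, so the total number of fits grows as the product of the list lengths times the fold count.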

visualize_params(coeffs: pandas.DataFrame, subset_cols: Union[None, str, List[str]] = None, cmap: str = 'autumn', zero_center_cmap: bool = False, mask_threshold: Union[None, float] = None, mask_zero: bool = True, transpose: bool = False, title: Union[None, str] = None, xlabel: Union[None, str] = None, ylabel: Union[None, str] = None, figsize: Union[None, Tuple[float, float]] = None, annot_kws: dict = {}, save_show_or_return: Literal[save, show, return, both, all] = 'save', save_kwargs: dict = {})[source]#

Generates heatmap of parameter values for visualization

Parameters
coeffs

Contains coefficients (and any other relevant statistics that were computed) from regression for each variable

subset_cols

String or list of strings that can be used to subset coeffs DataFrame such that only columns with names containing the provided key strings are displayed on heatmap. For example, can use “coeff” to plot only the linear regression coefficients, “zstat” for the z-statistic, etc. Or can use the full name of the column to select specific columns.

cmap

Name of the colormap to use

zero_center_cmap

Set True to set colormap intensity midpoint to zero.

mask_threshold

Optional, sets lower absolute value thresholds for parameters to be assigned color in heatmap (will compare absolute value of each element against this threshold)

mask_zero

Set True to not assign color to zeros (representing neither a positive nor a negative interaction)

transpose

Set True to reverse the dataframe’s orientation before plotting

title

Optional, provides title for plot. If not given, will use default “Spatial Parameters”.

xlabel

Optional, provides label for x-axis. If not given, will use default “Predictor Features”.

ylabel

Optional, provides label for y-axis. If not given, will use default “Target Features”.

figsize

Can be used to set width and height of figure window, in inches. If not given, will use Spateo default.

annot_kws

Optional dictionary that can be used to set qualities of the axis/tick labels. For example, can set ‘size’: 9, ‘weight’: ‘bold’, etc.

save_show_or_return

Whether to save, show or return the figure. If “both”, it will save and plot the figure at the same time. If “all”, the figure will be saved, displayed and the associated axis and other objects will be returned.

save_kwargs

A dictionary that will be passed to the save_fig function. By default it is an empty dictionary and the save_fig function will use {“path”: None, “prefix”: ‘scatter’, “dpi”: None, “ext”: ‘pdf’, “transparent”: True, “close”: True, “verbose”: True} as its parameters. Otherwise you can provide a dictionary that properly modifies those keys according to your needs.

compute_coeff_significance(coeffs: pandas.DataFrame, significance_threshold: float = 0.05, only_positive: bool = False, only_negative: bool = False) → Tuple[pandas.DataFrame, pandas.DataFrame, pandas.DataFrame][source]#

Computes statistical significance for fitted coefficients.

Parameters
coeffs

Contains coefficients from regression for each variable

significance_threshold

p-value needed to call a sender-receiver relationship significant

only_positive

Set True to find significance/pvalues/qvalues only for the subset of coefficients that is positive (representing possible mechanisms of positive regulation).

only_negative

Set True to find significance/pvalues/qvalues only for the subset of coefficients that is negative (representing possible mechanisms of negative regulation).

Returns

is_significant: Dataframe of identical shape to coeffs, where each element is True or False depending on whether it meets the threshold for significance

pvalues: Dataframe of identical shape to coeffs, where each element is a p-value for that instance of that feature

qvalues: Dataframe of identical shape to coeffs, where each element is a q-value for that instance of that feature

Return type

Tuple[pandas.DataFrame, pandas.DataFrame, pandas.DataFrame]
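The relationship between p-values, q-values and the significance mask can be sketched as follows. The documentation does not state which multiple-testing correction is used, so the Benjamini-Hochberg procedure here is an assumption, the p-values are made up for illustration, and benjamini_hochberg is a hypothetical helper operating on a flat list rather than a DataFrame.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


pvalues = [0.01, 0.04, 0.03, 0.5]          # illustrative values only
qvalues = benjamini_hochberg(pvalues)
significance_threshold = 0.05
is_significant = [q < significance_threshold for q in qvalues]
```

Note that a coefficient with p = 0.04 passes an uncorrected 0.05 cutoff but not the corrected one, which is why the q-values are reported separately.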

get_effect_sizes(coeffs: pandas.DataFrame, only_positive: bool = False, only_negative: bool = False, significance_threshold: float = 0.05, lr_pair: Union[None, str] = None, save_prefix: Union[None, str] = None)[source]#

For each predictor and each feature, determine if the influence of said predictor in predicting said feature is significant.

Additionally, for each feature and each sender-receiver category pair, determines the effect size that the sender induces in the feature for the receiver.

Only valid if the model specified uses the connections between categories as variables for the regression; thus it can be applied to ‘mod_type’ “niche” or “niche_lr”.

Parameters
coeffs

Contains coefficients from regression for each variable

only_positive

Set True to find significance/pvalues/qvalues only for the subset of coefficients that is positive (representing possible mechanisms of positive regulation).

only_negative

Set True to find significance/pvalues/qvalues only for the subset of coefficients that is negative (representing possible mechanisms of negative regulation).

significance_threshold

p-value needed to call a sender-receiver relationship significant

lr_pair

Required if (and used only in the case that) coefficients came from a Niche-LR model; used to subset the coefficients array to the specific ligand-receptor pair of interest. Takes the form “{ligand}-{receptor}” and should match one of the keys in the dictionary self.niche_mats. If not given, will default to the first key in this dictionary.

save_prefix

If provided, saves all relevant dataframes to the ./regression_outputs directory under the name {prefix}_{coeffs/pvalues, etc.}.csv. If not provided, will not save.

type_coupling(cmap: str = 'Reds', fontsize: Union[None, int] = None, figsize: Union[None, Tuple[float, float]] = None, ignore_self: bool = True, save_show_or_return: Literal[save, show, return, both, all] = 'save', save_kwargs: dict = {})[source]#

Generates heatmap of spatially differentially-expressed features for each pair of sender and receiver categories. Only valid if the model specified uses the connections between categories as variables for the regression.

A high number of differentially-expressed genes between a given sender-receiver pair means that the sender being in the neighborhood of the receiver tends to correlate with differential expression levels of many of the genes within the selection. In other words, much of the cellular variation in the receiver cell type can be attributed to being in proximity with the sender.

Parameters
cmap

Name of Matplotlib color map to use

fontsize

Size of figure title and axis labels

figsize

Width and height of plotting window

ignore_self

If True, will ignore the effect of cell type in proximity to other cells of the same type; will record the number of DEGs only if the two cell types are different.

save_show_or_return

Whether to save, show or return the figure. If “both”, it will save and plot the figure at the same time. If “all”, the figure will be saved, displayed and the associated axis and other objects will be returned.

save_kwargs

A dictionary that will be passed to the save_fig function. By default it is an empty dictionary and the save_fig function will use {“path”: None, “prefix”: ‘scatter’, “dpi”: None, “ext”: ‘pdf’, “transparent”: True, “close”: True, “verbose”: True} as its parameters. Otherwise you can provide a dictionary that properly modifies those keys according to your needs.

sender_effect_on_all_receivers(sender: str, plot_mode: str = 'effect_size', gene_subset: Union[None, List[str]] = None, significance_threshold: float = 0.05, cut_pvals: float = -5, fontsize: Union[None, int] = None, figsize: Union[None, Tuple[float, float]] = None, cmap: str = 'seismic', save_show_or_return: Literal[save, show, return, both, all] = 'show', save_kwargs: Optional[dict] = {})[source]#

Evaluates and visualizes the effect that the given sender cell type has on expression/abundance in each possible receiver cell type.

Parameters
sender

sender cell type label

plot_mode

specifies what gets plotted. Options:

  • “qvals”: elements of the plot represent statistical significance of the interaction

  • “effect_size”: elements of the plot represent numerical expression change induced in the receiver by the sender

gene_subset

Names of genes to subset for plot. If None, will use all genes that were used in the regression.

significance_threshold

Set non-significant effect sizes to zero, where the threshold is given here

cut_pvals

Minimum allowable log10(pval); anything below this will be clipped to this value

fontsize

Size of figure title and axis labels

figsize

Width and height of plotting window

cmap

Name of matplotlib colormap specifying colormap to use

save_show_or_return

Whether to save, show or return the figure. If “both”, it will save and plot the figure at the same time. If “all”, the figure will be saved, displayed and the associated axis and other objects will be returned.

save_kwargs

A dictionary that will be passed to the save_fig function. By default it is an empty dictionary and the save_fig function will use {“path”: None, “prefix”: ‘scatter’, “dpi”: None, “ext”: ‘pdf’, “transparent”: True, “close”: True, “verbose”: True} as its parameters. Otherwise you can provide a dictionary that properly modifies those keys according to your needs.
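The effect of cut_pvals can be sketched as a simple floor on log10-transformed values. This is an illustration rather than Spateo's implementation: clipped_log10 is a hypothetical helper, and mapping a value of exactly zero to the floor is an assumption.

```python
import math


def clipped_log10(values, cut_pvals=-5.0):
    """log10 of each value, with anything below cut_pvals raised to cut_pvals."""
    clipped = []
    for value in values:
        # log10(0) is undefined, so zero is mapped straight to the floor (assumption)
        log_value = math.log10(value) if value > 0 else cut_pvals
        clipped.append(max(log_value, cut_pvals))
    return clipped


log_vals = clipped_log10([1.0, 1e-3, 1e-12, 0.0], cut_pvals=-5.0)
```

Clipping keeps a handful of extremely small values from dominating the color scale of the resulting plot.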

all_senders_effect_on_receiver(receiver: str, plot_mode: str = 'effect_size', gene_subset: Union[None, List[str]] = None, significance_threshold: float = 0.05, cut_pvals: float = -5, fontsize: Union[None, int] = None, figsize: Union[None, Tuple[float, float]] = None, cmap: str = 'seismic', save_show_or_return: Literal[save, show, return, both, all] = 'show', save_kwargs: Optional[dict] = {})[source]#

Evaluates and visualizes the effect that each possible sender cell type has on expression/abundance in a selected receiver cell type.

Parameters
receiver

Receiver cell type label

plot_mode

specifies what gets plotted. Options:

  • “qvals”: elements of the plot represent statistical significance of the interaction

  • “effect_size”: elements of the plot represent effect size induced in the receiver by the sender

gene_subset

Names of genes to subset for plot. If None, will use all genes that were used in the regression.

significance_threshold

Set non-significant effect sizes to zero, where the threshold is given here

cut_pvals

Minimum allowable log10(pval); anything below this will be clipped to this value

fontsize

Size of figure title and axis labels

figsize

Width and height of plotting window

cmap

Name of matplotlib colormap specifying colormap to use

save_show_or_return

Whether to save, show or return the figure. If “both”, it will save and plot the figure at the same time. If “all”, the figure will be saved, displayed and the associated axis and other objects will be returned.

save_kwargs

A dictionary that will be passed to the save_fig function. By default it is an empty dictionary and the save_fig function will use {“path”: None, “prefix”: ‘scatter’, “dpi”: None, “ext”: ‘pdf’, “transparent”: True, “close”: True, “verbose”: True} as its parameters. Otherwise you can provide a dictionary that properly modifies those keys according to your needs.

sender_receiver_effect_volcanoplot(receiver: str, sender: str, significance_threshold: float = 0.05, effect_size_threshold: Union[None, float] = None, fontsize: Union[None, int] = None, figsize: Union[None, Tuple[float, float]] = (4.5, 7.0), save_show_or_return: Literal[save, show, return, both, all] = 'show', save_kwargs: Optional[dict] = {})[source]#

Volcano plot to identify differentially expressed genes of a given receiver cell type in the presence of a given sender cell type.

Parameters
receiver

Receiver cell type label

sender

Sender cell type label

significance_threshold

Set non-significant effect sizes (given by q-values) to zero, where the threshold is given here

effect_size_threshold

Set absolute value effect-size threshold beyond which observations are marked as interesting. If not given, will take the 95th percentile fold-change as the cutoff.

fontsize

Size of figure title and axis labels

figsize

Width and height of plotting window

save_show_or_return

Whether to save, show or return the figure. If “both”, it will save and plot the figure at the same time. If “all”, the figure will be saved, displayed and the associated axis and other objects will be returned.

save_kwargs

A dictionary that will be passed to the save_fig function. By default it is an empty dictionary and the save_fig function will use {“path”: None, “prefix”: ‘scatter’, “dpi”: None, “ext”: ‘pdf’, “transparent”: True, “close”: True, “verbose”: True} as its parameters. Otherwise you can provide a dictionary that properly modifies those keys according to your needs.
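The two thresholds used by the volcano plot can be illustrated with a small selection routine. The sketch assumes that a gene is flagged when its q-value is below significance_threshold and its absolute effect size exceeds effect_size_threshold, with the 95th percentile of absolute effect sizes as the fallback cutoff. The helper names, the linear-interpolation percentile and the example values are all assumptions, not Spateo's implementation.

```python
def percentile(values, q):
    """Percentile with linear interpolation between sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def flag_genes(effect_sizes, qvalues, significance_threshold=0.05, effect_size_threshold=None):
    """Genes that are both significant and beyond the effect-size cutoff."""
    magnitudes = {gene: abs(effect) for gene, effect in effect_sizes.items()}
    if effect_size_threshold is None:
        # Fallback: 95th percentile of absolute effect sizes
        effect_size_threshold = percentile(list(magnitudes.values()), 95)
    return [
        gene
        for gene in effect_sizes
        if qvalues[gene] < significance_threshold and magnitudes[gene] > effect_size_threshold
    ]


effects = {"g1": 2.0, "g2": -3.0, "g3": 0.1}      # illustrative values only
qvals = {"g1": 0.01, "g2": 0.2, "g3": 0.001}
flagged = flag_genes(effects, qvals, effect_size_threshold=1.0)
```

Here g2 has the largest effect but is not significant, and g3 is significant but has a negligible effect, so only g1 is flagged.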

class spateo.tools.ST_regression.spatial_regression.Category_Model(*args, **kwargs)[source]#

Bases: Base_Model

Wraps all necessary methods for data loading and preparation, model initialization, parameterization, evaluation and prediction when instantiating a model for spatially-aware (but not spatially lagged) regression using categorical variables (specifically, the prevalence of categories within spatial neighborhoods) to predict the value of gene expression.

Arguments are passed to the Base_Model class. The only keyword argument that is used for this class is ‘n_neighbors’.

Parameters
args

Positional arguments to the Base_Model class

kwargs

Keyword arguments to the Base_Model class

class spateo.tools.ST_regression.spatial_regression.Niche_Model(*args, **kwargs)[source]#

Bases: Base_Model

Wraps all necessary methods for data loading and preparation, model initialization, parameterization, evaluation and prediction when instantiating a model for spatially-aware regression using both the prevalence of and connections between categories within spatial neighborhoods to predict the value of gene expression.

Arguments are passed to the Base_Model class.

Parameters
args

Positional arguments to the Base_Model class

kwargs

Keyword arguments to the Base_Model class

class spateo.tools.ST_regression.spatial_regression.Lagged_Model(model_type: str = 'ligand', lig: Union[None, str, List[str]] = None, rec: Union[None, str, List[str]] = None, rec_ds: Union[None, str, List[str]] = None, species: Literal[human, mouse, axolotl] = 'human', normalize: bool = True, smooth: bool = False, log_transform: bool = True, *args, **kwargs)[source]#

Bases: Base_Model

Wraps all necessary methods for data loading and preparation, model initialization, parameterization, evaluation and prediction when instantiating a model for spatially-lagged regression.

Can specify one of two models: “ligand”, which uses the spatial lag of ligand genes and the spatial lag of the regression target to predict the regression target, or “niche”, which uses the spatial lag of cell type colocalization and the spatial lag of the regression target to predict the regression target.

If “ligand” is specified, arguments to lig must be given, and it is recommended to provide species as well; the default species is human.

Arguments are passed to the Base_Model class.

Parameters
model_type

Either “ligand” or “niche”, specifies whether to fit a model that incorporates the spatial lag of ligand expression or the spatial lag of cell type colocalization.

lig

Name(s) of ligands to use as predictors

rec

Name(s) of receptors to use as regression targets. If not given, will search through database for all receptors that correspond to the ligands provided in ‘lig’.

rec_ds

Name(s) of receptor-downstream genes to use as regression targets. If not given, will search through database for all genes that correspond to receptor-downstream genes.

species

Specifies L:R database to use

normalize

Perform library size normalization, to set total counts in each cell to the same number (adjust for cell size)

smooth

To correct for dropout effects, leverage gene expression neighborhoods to smooth expression

log_transform

Set True if log-transformation should be applied to expression (otherwise, will assume preprocessing/log-transform was computed beforehand)

args

Additional positional arguments to the Base_Model class

kwargs

Additional keyword arguments to the Base_Model class

run_GM_lag() → Tuple[pandas.DataFrame, pandas.DataFrame, pandas.DataFrame][source]#

Runs spatially lagged two-stage least squares model

single(cur_g: str, X: pandas.DataFrame, X_variable_names: List[str], param_labels: List[str], adata: anndata.AnnData, w: numpy.ndarray, layer: Union[None, str] = None) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray][source]#

Defines model run process for a single feature. Not callable by the user; all arguments are populated from arguments passed on instantiation of the Base_Model class.

Parameters
cur_g

Name of the feature to regress on

X

Values used for the regression

X_variable_names

Names of the variables used for the regression

param_labels

Names of categories; each computed parameter corresponds to a single element in param_labels

adata

AnnData object to store results in

w

Spatial weights array

layer

Specifies layer in AnnData to use; if None, will use .X.

Returns

coeffs: Coefficients for each categorical group for each feature

pred: Predicted values from regression for each feature

resid: Residual values from regression for each feature

Return type

Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

class spateo.tools.ST_regression.spatial_regression.Niche_LR_Model(lig: Union[None, str, List[str]], rec: Union[None, str, List[str]] = None, rec_ds: Union[None, str, List[str]] = None, species: Literal[human, mouse, axolotl] = 'human', niche_lr_r_lag: bool = True, *args, **kwargs)[source]#

Bases: Base_Model

Wraps all necessary methods for data loading and preparation, model initialization, parameterization, evaluation and prediction when instantiating a model for spatially-aware regression using the prevalence of and connections between categories within spatial neighborhoods and the cell type-specific expression of ligands and receptors to predict the regression target.

Arguments are passed to the Base_Model class.

Parameters
lig

Name(s) of ligands to use as predictors

rec

Name(s) of receptors to use as regression targets. If not given, will search through database for all receptors that correspond to the ligands provided in ‘lig’.

rec_ds

Name(s) of receptor-downstream genes to use as regression targets. If not given, will search through database for all genes that correspond to receptors

species

Specifies L:R database to use

niche_lr_r_lag

Only used if ‘mod_type’ is “niche_lr”. Uses the spatial lag of the receptor as the dependent variable rather than each spot’s unique receptor expression. Defaults to True.

args

Additional positional arguments to the Base_Model class

kwargs

Additional keyword arguments to the Base_Model class

spateo.tools.ST_regression.spatial_regression.calc_1nd_moment(X, W, normalize_W=True) → Tuple[numpy.ndarray, Optional[numpy.ndarray]][source]#
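This function is listed without a description. Going by its name and signature alone, one plausible reading is that it computes a first moment (neighborhood-weighted average) of X using the weights W, optionally row-normalizing W first. The sketch below illustrates that reading in plain Python. It is an assumption rather than the library's implementation, and first_moment_sketch is a hypothetical name.

```python
def first_moment_sketch(X, W, normalize_W=True):
    """Weighted neighborhood average of X under weights W (assumed behavior).

    X: samples x features, W: samples x samples, both as lists of lists.
    Returns the moment and, if normalization was applied, the normalized W.
    """
    if normalize_W:
        # Make each row of W sum to 1 so the result is a weighted mean
        W = [
            [w / row_sum if row_sum else 0.0 for w in row]
            for row, row_sum in ((row, sum(row)) for row in W)
        ]
    n_features = len(X[0])
    moment = [
        [sum(w * X[k][j] for k, w in enumerate(row)) for j in range(n_features)]
        for row in W
    ]
    return (moment, W) if normalize_W else (moment, None)


X = [[1.0], [3.0]]
W = [[1, 1], [0, 2]]
smoothed, W_norm = first_moment_sketch(X, W)
unnormalized, no_W = first_moment_sketch(X, W, normalize_W=False)
```

With row-normalized weights, each output row is an average over that sample's neighbors, which is the same operation that underlies a spatial lag.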