spateo.tools.CCI_effects_modeling.MuSIC#

Modeling cell-cell communication using a regression model that is considerate of the spatial heterogeneity of (and thus the context-dependency of the relationships of) the response variable.

Module Contents#

Classes#

MuSIC

Spatially weighted regression on spatial omics data with parallel processing. Runs after being called

class spateo.tools.CCI_effects_modeling.MuSIC.MuSIC(parser: argparse.ArgumentParser, args_list: List[str] | None = None, verbose: bool = True, save_subsampling: bool = True)[source]#

Spatially weighted regression on spatial omics data with parallel processing. Runs after being called from the command line.

Parameters:
comm

MPI communicator object initialized with mpi4py, to control parallel processing operations

parser

ArgumentParser object initialized with argparse, to parse command line arguments for arguments pertinent to modeling.

args_list

If parser is provided by function call, the arguments to parse must be provided as a separate list. It is recommended to use the return from :func define_spateo_argparse() for this.

verbose

Set True to print updates to screen. Will be set False when initializing downstream analysis object, which inherits from this class but for which the information is generally not as useful.

save_subsampling

Set True to save the subsampled data to a .json file. Defaults to True, recommended to set True for ease of access to the subsampling results.

mod_type#

The type of model that will be employed- this dictates how the data will be processed and prepared. Options:

  • “niche”: Spatially-aware, uses categorical cell type labels as independent variables.

  • “lr”: Spatially-aware, essentially uses the combination of receptor expression in the “target” cell

    and spatially lagged ligand expression in the neighboring cells as independent variables.

  • “ligand”: Spatially-aware, essentially uses ligand expression in the neighboring cells as

    independent variables.

  • “receptor”: Uses receptor expression in the “target” cell as independent variables.

adata_path#

Path to the AnnData object from which to extract data for modeling

csv_path#

Can also be used to specify path to non-AnnData .csv object. Assumes the first three columns contain x- and y-coordinates and then dependent variable values, in that order, with all subsequent columns containing independent variable values.

normalize#

Set True to perform library size normalization, to set total counts in each cell to the same number (adjust for cell size).

smooth#

Set True to correct for dropout effects by leveraging gene expression neighborhoods to smooth expression. It is advisable not to do this if performing Poisson or negative binomial regression.

log_transform#

Set True if log-transformation should be applied to expression. It is advisable not to do this if performing Poisson or negative binomial regression.

normalize_signaling#

Set True to minmax scale the final ligand expression array (for :attr mod_type = “ligand”), or the final ligand-receptor array (for :attr mod_type = “lr”). This is recommended to associate downstream expression with rarer/less prevalent signaling mechanisms.

target_expr_threshold#

Only used if :param mod_type is “lr” or “ligand” and :param targets_path is not given. When manually selecting targets, expression above a threshold percentage of cells will be used to filter to a smaller subset of interesting genes. Defaults to 0.2.

multicollinear_threshold#

Variance inflation factor threshold used to filter out multicollinear features. A value of 5 or 10 is recommended.

custom_lig_path#

Optional path to a .txt file containing a list of ligands for the model, separated by newlines. Only used if :attr mod_type is “lr” or “ligand” (and thus uses ligand expression directly in the inference). If not provided, will select ligands using a threshold based on expression levels in the data.

custom_ligands#

Optional list of ligands for the model, can be used as an alternative to :attr custom_lig_path. Only used if :attr mod_type is “lr” or “ligand”.

custom_rec_path#

Optional path to a .txt file containing a list of receptors for the model, separated by newlines. Only used if :attr mod_type is “lr” (and thus uses receptor expression directly in the inference). If not provided, will select receptors using a threshold based on expression levels in the data.

custom_receptors#

Optional list of receptors for the model, can be used as an alternative to :attr custom_rec_path. Only used if :attr mod_type is “lr”.

custom_pathways_path#

Rather than providing a list of receptors, can provide a list of signaling pathways- all receptors with annotations in this pathway will be included in the model. Only used if :attr mod_type is “lr”.

custom_pathways#

Optional list of signaling pathways for the model, can be used as an alternative to :attr custom_pathways_path. Only used if :attr mod_type is “lr”.

targets_path#

Optional path to a .txt file containing a list of prediction target genes for the model, separated by newlines. If not provided, targets will be strategically selected from the given receptors.

custom_targets#

Optional list of prediction target genes for the model, can be used as an alternative to :attr targets_path.

init_betas_path#

Optional path to a .json file or .csv file containing initial coefficient values for the model for each target variable. If encoded in .json, keys should be target gene names, values should be numpy arrays containing coefficients. If encoded in .csv, columns should be target gene names. Initial coefficients should have shape [n_features, ].

cci_dir#

Full path to the directory containing cell-cell communication databases

species#

Selects the cell-cell communication database the relevant ligands will be drawn from. Options: “human”, “mouse”.

output_path#

Full path name for the .csv file in which results will be saved

coords_key#

Key in .obsm of the AnnData object that contains the coordinates of the cells

group_key#

Key in .obs of the AnnData object that contains the category grouping for each cell

group_subset#

Subset of cell types to include in the model (provided as a whitespace-separated list in command line). If given, will consider only cells of these types in modeling. Defaults to all cell types.

covariate_keys#

Can be used to optionally provide any number of keys in .obs or .var containing a continuous covariate (e.g. expression of a particular TF, avg. distance from a perturbed cell, etc.)

total_counts_key#

Entry in adata .obs that contains total counts for each cell. Required if subsetting by total counts.

total_counts_threshold#

Threshold for total counts to subset cells by- cells with total counts greater than this threshold will be retained.

bw#

Used to provide previously obtained bandwidth for the spatial kernel. Consists of either a distance value or N for the number of nearest neighbors. Pass “np.inf” if all other points should have the same spatial weight.

minbw#

For use in automated bandwidth selection- the lower-bound bandwidth to test.

maxbw#

For use in automated bandwidth selection- the upper-bound bandwidth to test.

distr#

Distribution family for the dependent variable; one of “gaussian”, “poisson”, “nb”

kernel#

Type of kernel function used to weight observations; one of “bisquare”, “exponential”, “gaussian”, “quadratic”, “triangular” or “uniform”.

n_neighbors_membrane_bound#

For mod_type “ligand” or “lr”- ligand expression will be taken from the neighboring cells- this defines the number of cells to use for membrane-bound ligands.

n_neighbors_secreted#

For mod_type “ligand” or “lr”- ligand expression will be taken from the neighboring cells- this defines the number of cells to use for secreted or ECM ligands.

use_expression_neighbors#

The default for finding spatial neighborhoods for the modeling process is to use neighbors in physical space. If this argument is provided, expression will instead be used to find neighbors.

bw_fixed#

Set True for distance-based kernel function and False for nearest neighbor-based kernel function

exclude_self#

If True, ignore each sample itself when computing the kernel density estimation

fit_intercept#

Set True to include intercept in the model and False to exclude intercept

_set_up_model(verbose: bool = True, downstream: bool = False)[source]#
parse_stgwr_args()[source]#

Parse command line arguments for arguments pertinent to modeling.

load_and_process(upstream: bool = False, downstream: bool = False)[source]#

Load AnnData object and process it for modeling.

Parameters:
upstream

Set False if performing the actual model fitting process, True to define only the AnnData object for upstream purposes.

downstream

Set True if setting up a downstream model- in this case, ligand/receptor preprocessing will be skipped.

setup_downstream(adata: anndata.AnnData | None = None)[source]#

Setup for downstream tasks- namely, models for inferring signaling-associated differential expression.

define_sig_inputs(adata: anndata.AnnData | None = None, recompute: bool = False)[source]#

For signaling-relevant models, define necessary quantities that will later be used to define the independent variable array- the one-hot cell-type array, the ligand expression array and the receptor expression array.

Parameters:
recompute

Re-calculate all quantities and re-save even if already-existing file can be found in path

run_subsample(verbose: bool = True, y: pandas.DataFrame | None = None)[source]#

To combat computational intensiveness of this regressive protocol, subsampling will be performed in cases where there are >= 5000 cells or in cases where specific cell types are manually selected for fitting- local fit will be performed only on this subset under the assumption that discovered signals will not be significantly different for the subsampled data.

New Attributes:

subsampled_indices: Dictionary containing indices of the subsampled cells for each dependent variable n_samples_subsampled: Dictionary containing number of samples to be fit (not total number of samples) for

each dependent variable

subsampled_sample_names: Dictionary containing lists of names of the subsampled cells for each dependent

variable

neighboring_unsampled: Dictionary containing a mapping between each unsampled point and the closest

sampled point

map_new_cells()[source]#

There may be instances where new cells are added to an AnnData object that has already been fit to- in this instance, accelerate the process by using neighboring results to project model fit to the new cells.

_set_search_range()[source]#

Set the search range for the bandwidth selection procedure.

Parameters:
y

Array of dependent variable values, used to determine the search range for the bandwidth selection

_compute_all_wi(bw: float | int, bw_fixed: bool | None = None, exclude_self: bool | None = None, kernel: str | None = None, verbose: bool = False) scipy.sparse.spmatrix[source]#

Compute spatial weights for all samples in the dataset given a specified bandwidth.

Parameters:
bw

Bandwidth for the spatial kernel

fixed_bw

Whether the bandwidth considers a uniform distance for each sample (True) or a nonconstant distance for each sample that depends on the number of neighbors (False). If not given, will default to self.fixed_bw.

exclude_self

Whether to include each sample itself as one of its nearest neighbors. If not given, will default to self.exclude_self.

kernel

Kernel to use for the spatial weights. If not given, will default to self.kernel.

verbose

Whether to display messages during runtime

Returns:

Array of weights for all samples in the dataset

Return type:

wi

local_fit(i: int, y: numpy.ndarray, X: numpy.ndarray, bw: float | int, y_label: str, coords: numpy.ndarray | None = None, mask_indices: numpy.ndarray | None = None, feature_mask: numpy.ndarray | None = None, final: bool = False, fit_predictor: bool = False) numpy.ndarray | List[float][source]#

Fit a local regression model for each sample.

Parameters:
i

Index of sample for which local regression model is to be fitted

y

Response variable

X

Independent variable array

bw

Bandwidth for the spatial kernel

y_label

Name of the response variable

coords

Can be optionally used to provide coordinates for samples- used if subsampling was performed to maintain all original sample coordinates (to take original neighborhoods into account)

mask_indices

Can be optionally used to provide indices of samples to mask out of the dataset

feature_mask

Can be optionally used to provide a mask for features to mask out of the dataset

final

Set True to indicate that no additional parameter selection needs to be performed; the model can be fit and more stats can be returned.

fit_predictor

Set True to indicate that dependent variable to fit is a linear predictor rather than a true response variable

Returns:

A single output will be given for each case, and can contain either betas or a list w/ combinations of the following:

  • i: Index of sample for which local regression model was fitted

  • diagnostic: Portion of the output to be used for diagnostic purposes- for Gaussian regression,

    this is the residual for the fitted response variable value compared to the observed value. For non-Gaussian generalized linear regression, this is the fitted response variable value (which will be used to compute deviance and log-likelihood later on).

  • hat_i: Row i of the hat matrix, which is the effect of deleting sample i from the dataset on the

    estimated predicted value for sample i

  • bw_diagnostic: Output to be used for diagnostic purposes during bandwidth selection- for Gaussian

    regression, this is the squared residual, for non-Gaussian generalized linear regression, this is the fitted response variable value. One of the returns if :param final is False

  • betas: Estimated coefficients for sample i

  • leverages: Leverages for sample i, representing the influence of each independent variable on the

    predicted values (linear predictor for GLMs, response variable for Gaussian regression).

find_optimal_bw(range_lowest: float, range_highest: float, function: Callable) float[source]#

Perform golden section search to find the optimal bandwidth.

Parameters:
range_lowest

Lower bound of the search range

range_highest

Upper bound of the search range

function

Function to be minimized

Returns:

Optimal bandwidth

Return type:

bw

mpi_fit(y: numpy.ndarray | None, X: numpy.ndarray | None, X_labels: List[str], y_label: str, bw: float | int, coords: numpy.ndarray | None = None, mask_indices: numpy.ndarray | None = None, feature_mask: numpy.ndarray | None = None, final: bool = False, fit_predictor: bool = False) None[source]#

Fit local regression model for each sample in parallel, given a specified bandwidth.

Parameters:
y

Response variable

X

Independent variable array- if not given, will default to :attr X. Note that if object was initialized using an AnnData object, this will be overridden with :attr X even if a different array is given.

X_labels

Optional list of labels for the features in the X array. Needed if :attr X passed to the function is not identical to the dependent variable array compiled in preprocessing.

y_label

Used to provide a unique ID for the dependent variable for saving purposes and to query keys from various dictionaries

bw

Bandwidth for the spatial kernel

coords

Coordinates of each point in the X array

mask_indices

Optional array used to mask out indices in the fitting process

feature_mask

Optional array used to mask out features in the fitting process

final

Set True to indicate that no additional parameter selection needs to be performed; the model can be fit and more stats can be returned.

fit_predictor

Set True to indicate that dependent variable to fit is a linear predictor rather than a true response variable

fit(y: pandas.DataFrame | None = None, X: numpy.ndarray | None = None, fit_predictor: bool = False, verbose: bool = True) Tuple[None | Dict[str, numpy.ndarray], Dict[str, float]] | None[source]#

For each column of the dependent variable array, fit model. If given bandwidth, run :func SWR.mpi_fit() with the given bandwidth. Otherwise, compute optimal bandwidth using :func SWR.find_optimal_bw(), minimizing AICc.

Parameters:
y

Optional dataframe, can be used to provide dependent variable array directly to the fit function. If None, will use :attr targets_expr computed using the given AnnData object to create this (each individual column will serve as an independent variable). Needed to be given as a dataframe so that column(s) are labeled, so each result can be associated with a labeled dependent variable.

X

Optional array, can be used to provide dependent variable array directly to the fit function. If None, will use :attr X computed using the given AnnData object and the type of the model to create.

n_feat

Optional int, can be used to specify one column of the X array to fit to.

init_betas

Optional dictionary containing arrays with initial values for the coefficients. Keys should correspond to target genes and values should be arrays of shape [n_features, 1].

fit_predictor

Set True to indicate that dependent variable to fit is a linear predictor rather than a response variable

verbose

Set True to print out information about the bandwidth selection and/or fitting process.

predict(input: pandas.DataFrame | None = None, coeffs: numpy.ndarray | Dict[str, pandas.DataFrame] | None = None, adjust_for_subsampling: bool = False) pandas.DataFrame[source]#

Given input data and learned coefficients, predict the dependent variables.

Parameters:
input

Input data to be predicted on.

coeffs

Coefficients to be used in the prediction. If None, will attempt to load the coefficients learned in the fitting process from file.

compute_aicc_linear(RSS: float, trace_hat: float, n_samples: int | None = None) float[source]#

Compute the corrected Akaike Information Criterion (AICc) for the linear GWR model.

compute_aicc_glm(ll: float, trace_hat: float, n_samples: int | None = None) float[source]#

Compute the corrected Akaike Information Criterion (AICc) for the generalized linear GWR models. Given by: :math AICc = -2*log-likelihood + 2k + (2k(k+1))/(n_eff-k-1).

Parameters:
ll

Model log-likelihood

trace_hat

Trace of the hat matrix

n_samples

Number of samples model was fitted to

output_diagnostics(aicc: float | None = None, ENP: float | None = None, r_squared: float | None = None, deviance: float | None = None, y_label: str | None = None) None[source]#

Output diagnostic information about the GWR model.

save_results(data: numpy.ndarray, header: str, label: str | None) None[source]#

Save the results of the GWR model to file, and return the coefficients.

Parameters:
data

Elements of data to save to .csv

header

Column names

label

Optional, can be used to provide unique ID to save file- notably used when multiple dependent variables with different names are fit during this process.

Returns:

Model coefficients

Return type:

betas

predict_and_save(input: numpy.ndarray | None = None, coeffs: numpy.ndarray | Dict[str, pandas.DataFrame] | None = None, adjust_for_subsampling: bool = True)[source]#

Given input data and learned coefficients, predict the dependent variables and then save the output.

Parameters:
input

Input data to be predicted on.

coeffs

Coefficients to be used in the prediction. If None, will attempt to load the coefficients learned in the fitting process from file.

adjust_for_subsampling

Set True if subsampling was performed; this indicates that the coefficients for the subsampled points need to be extended to the neighboring non-sampled points.

return_outputs(adjust_for_subsampling: bool = True, load_for_interpreter: bool = False, load_from_downstream: Literal[ligand, receptor, target_gene] | None = None) Tuple[Dict[str, pandas.DataFrame], Dict[str, pandas.DataFrame]][source]#

Return final coefficients for all fitted models.

Parameters:
adjust_for_subsampling

Set True if subsampling was performed; this indicates that the coefficients for the subsampled points need to be extended to the neighboring non-sampled points.

load_for_interpreter

Set True if this is being called from within instance of :class MuSIC_Interpreter.

load_from_downstream

Set to “ligand”, “receptor”, or “target_gene” to load coefficients from downstream models where targets are ligands, receptors or target genes. Must be given if “load_downstream” is True.

Outputs:

all_coeffs: Dictionary containing dataframe consisting of coefficients for each target gene all_se: Dictionary containing dataframe consisting of standard errors for each target gene

return_intercepts() None | numpy.ndarray | Dict[str, numpy.ndarray][source]#

Return final intercepts for all fitted models.