spateo.tools.CCI_effects_modeling.MuSIC¶
Modeling cell-cell communication using a regression model that is considerate of the spatial heterogeneity of (and thus the context-dependency of the relationships of) the response variable.
Classes¶
Spatially weighted regression on spatial omics data with parallel processing. Runs after being called |
Module Contents¶
- class spateo.tools.CCI_effects_modeling.MuSIC.MuSIC(parser: argparse.ArgumentParser, args_list: List[str] | None = None, verbose: bool = True, save_subsampling: bool = True)[source]¶
Spatially weighted regression on spatial omics data with parallel processing. Runs after being called from the command line.
- Parameters:
- comm
MPI communicator object initialized with mpi4py, to control parallel processing operations
- parser
ArgumentParser object initialized with argparse, to parse command line arguments for arguments pertinent to modeling.
- args_list
If parser is provided by function call, the arguments to parse must be provided as a separate list. It is recommended to use the return from :func define_spateo_argparse() for this.
- verbose
Set True to print updates to screen. Will be set False when initializing downstream analysis object, which inherits from this class but for which the information is generally not as useful.
- save_subsampling
Set True to save the subsampled data to a .json file. Defaults to True, recommended to set True for ease of access to the subsampling results.
- mod_type[source]¶
The type of model that will be employed- this dictates how the data will be processed and prepared. Options:
“niche”: Spatially-aware, uses categorical cell type labels as independent variables.
- “lr”: Spatially-aware, essentially uses the combination of receptor expression in the “target” cell
and spatially lagged ligand expression in the neighboring cells as independent variables.
- “ligand”: Spatially-aware, essentially uses ligand expression in the neighboring cells as
independent variables.
“receptor”: Uses receptor expression in the “target” cell as independent variables.
- adata_path¶
Path to the AnnData object from which to extract data for modeling
- csv_path¶
Can also be used to specify path to non-AnnData .csv object. Assumes the first three columns contain x- and y-coordinates and then dependent variable values, in that order, with all subsequent columns containing independent variable values.
- normalize[source]¶
Set True to perform library size normalization, to set total counts in each cell to the same number (adjust for cell size).
- smooth[source]¶
Set True to correct for dropout effects by leveraging gene expression neighborhoods to smooth expression. It is advisable not to do this if performing Poisson or negative binomial regression.
- log_transform[source]¶
Set True if log-transformation should be applied to expression. It is advisable not to do this if performing Poisson or negative binomial regression.
- normalize_signaling¶
Set True to minmax scale the final ligand expression array (for :attr mod_type = “ligand”), or the final ligand-receptor array (for :attr mod_type = “lr”). This is recommended to associate downstream expression with rarer/less prevalent signaling mechanisms.
- target_expr_threshold[source]¶
Only used if :param mod_type is “lr” or “ligand” and :param targets_path is not given. When manually selecting targets, expression above a threshold percentage of cells will be used to filter to a smaller subset of interesting genes. Defaults to 0.2.
- multicollinear_threshold¶
Variance inflation factor threshold used to filter out multicollinear features. A value of 5 or 10 is recommended.
- custom_lig_path¶
Optional path to a .txt file containing a list of ligands for the model, separated by newlines. Only used if :attr mod_type is “lr” or “ligand” (and thus uses ligand expression directly in the inference). If not provided, will select ligands using a threshold based on expression levels in the data.
- custom_ligands¶
Optional list of ligands for the model, can be used as an alternative to :attr custom_lig_path. Only used if :attr mod_type is “lr” or “ligand”.
- custom_rec_path¶
Optional path to a .txt file containing a list of receptors for the model, separated by newlines. Only used if :attr mod_type is “lr” (and thus uses receptor expression directly in the inference). If not provided, will select receptors using a threshold based on expression levels in the data.
- custom_receptors¶
Optional list of receptors for the model, can be used as an alternative to :attr custom_rec_path. Only used if :attr mod_type is “lr”.
- custom_pathways_path¶
Rather than providing a list of receptors, can provide a list of signaling pathways- all receptors with annotations in this pathway will be included in the model. Only used if :attr mod_type is “lr”.
- custom_pathways¶
Optional list of signaling pathways for the model, can be used as an alternative to :attr custom_pathways_path. Only used if :attr mod_type is “lr”.
- targets_path¶
Optional path to a .txt file containing a list of prediction target genes for the model, separated by newlines. If not provided, targets will be strategically selected from the given receptors.
- custom_targets¶
Optional list of prediction target genes for the model, can be used as an alternative to :attr targets_path.
- init_betas_path¶
Optional path to a .json file or .csv file containing initial coefficient values for the model for each target variable. If encoded in .json, keys should be target gene names, values should be numpy arrays containing coefficients. If encoded in .csv, columns should be target gene names. Initial coefficients should have shape [n_features, ].
- cci_dir¶
Full path to the directory containing cell-cell communication databases
- species[source]¶
Selects the cell-cell communication database the relevant ligands will be drawn from. Options: “human”, “mouse”.
- output_path¶
Full path name for the .csv file in which results will be saved
- coords_key¶
Key in .obsm of the AnnData object that contains the coordinates of the cells
- group_key¶
Key in .obs of the AnnData object that contains the category grouping for each cell
- group_subset¶
Subset of cell types to include in the model (provided as a whitespace-separated list in command line). If given, will consider only cells of these types in modeling. Defaults to all cell types.
- covariate_keys¶
Can be used to optionally provide any number of keys in .obs or .var containing a continuous covariate (e.g. expression of a particular TF, avg. distance from a perturbed cell, etc.)
- total_counts_key¶
Entry in
adata
.obs that contains total counts for each cell. Required if subsetting by total counts.
- total_counts_threshold¶
Threshold for total counts to subset cells by- cells with total counts greater than this threshold will be retained.
- bw[source]¶
Used to provide previously obtained bandwidth for the spatial kernel. Consists of either a distance value or N for the number of nearest neighbors. Pass “np.inf” if all other points should have the same spatial weight.
- kernel[source]¶
Type of kernel function used to weight observations; one of “bisquare”, “exponential”, “gaussian”, “quadratic”, “triangular” or “uniform”.
- n_neighbors_membrane_bound¶
For
mod_type
“ligand” or “lr”- ligand expression will be taken from the neighboring cells- this defines the number of cells to use for membrane-bound ligands.
- n_neighbors_secreted¶
For
mod_type
“ligand” or “lr”- ligand expression will be taken from the neighboring cells- this defines the number of cells to use for secreted or ECM ligands.
- use_expression_neighbors¶
The default for finding spatial neighborhoods for the modeling process is to use neighbors in physical space. If this argument is provided, expression will instead be used to find neighbors.
- bw_fixed¶
Set True for distance-based kernel function and False for nearest neighbor-based kernel function
- exclude_self¶
If True, ignore each sample itself when computing the kernel density estimation
- fit_intercept¶
Set True to include intercept in the model and False to exclude intercept
- load_and_process(upstream: bool = False)[source]¶
Load AnnData object and process it for modeling.
- Parameters:
- upstream
Set False if performing the actual model fitting process, True to define only the AnnData object for upstream purposes.
- downstream
Set True if setting up a downstream model- in this case, ligand/receptor preprocessing will be skipped.
- setup_downstream(adata: anndata.AnnData | None = None)[source]¶
Setup for downstream tasks- namely, models for inferring signaling-associated differential expression.
- define_sig_inputs(adata: anndata.AnnData | None = None, recompute: bool = False)[source]¶
For signaling-relevant models, define necessary quantities that will later be used to define the independent variable array- the one-hot cell-type array, the ligand expression array and the receptor expression array.
- Parameters:
- recompute
Re-calculate all quantities and re-save even if already-existing file can be found in path
- run_subsample(verbose: bool = True, y: pandas.DataFrame | None = None)[source]¶
To combat computational intensiveness of this regressive protocol, subsampling will be performed in cases where there are >= 5000 cells or in cases where specific cell types are manually selected for fitting- local fit will be performed only on this subset under the assumption that discovered signals will not be significantly different for the subsampled data.
- New Attributes:
subsampled_indices: Dictionary containing indices of the subsampled cells for each dependent variable n_samples_subsampled: Dictionary containing number of samples to be fit (not total number of samples) for
each dependent variable
- subsampled_sample_names: Dictionary containing lists of names of the subsampled cells for each dependent
variable
- neighboring_unsampled: Dictionary containing a mapping between each unsampled point and the closest
sampled point
- map_new_cells()[source]¶
There may be instances where new cells are added to an AnnData object that has already been fit to- in this instance, accelerate the process by using neighboring results to project model fit to the new cells.
- _set_search_range()[source]¶
Set the search range for the bandwidth selection procedure.
- Parameters:
- y
Array of dependent variable values, used to determine the search range for the bandwidth selection
- _compute_all_wi(bw: float | int, bw_fixed: bool | None = None, exclude_self: bool | None = None, kernel: str | None = None, verbose: bool = False) scipy.sparse.spmatrix [source]¶
Compute spatial weights for all samples in the dataset given a specified bandwidth.
- Parameters:
- bw
Bandwidth for the spatial kernel
- fixed_bw
Whether the bandwidth considers a uniform distance for each sample (True) or a nonconstant distance for each sample that depends on the number of neighbors (False). If not given, will default to self.fixed_bw.
- exclude_self
Whether to include each sample itself as one of its nearest neighbors. If not given, will default to self.exclude_self.
- kernel
Kernel to use for the spatial weights. If not given, will default to self.kernel.
- verbose
Whether to display messages during runtime
- Returns:
Array of weights for all samples in the dataset
- Return type:
wi
- local_fit(i: int, y: numpy.ndarray, X: numpy.ndarray, bw: float | int, y_label: str, coords: numpy.ndarray | None = None, mask_indices: numpy.ndarray | None = None, feature_mask: numpy.ndarray | None = None, final: bool = False, fit_predictor: bool = False) numpy.ndarray | List[float] [source]¶
Fit a local regression model for each sample.
- Parameters:
- i
Index of sample for which local regression model is to be fitted
- y
Response variable
- X
Independent variable array
- bw
Bandwidth for the spatial kernel
- y_label
Name of the response variable
- coords
Can be optionally used to provide coordinates for samples- used if subsampling was performed to maintain all original sample coordinates (to take original neighborhoods into account)
- mask_indices
Can be optionally used to provide indices of samples to mask out of the dataset
- feature_mask
Can be optionally used to provide a mask for features to mask out of the dataset
- final
Set True to indicate that no additional parameter selection needs to be performed; the model can be fit and more stats can be returned.
- fit_predictor
Set True to indicate that dependent variable to fit is a linear predictor rather than a true response variable
- Returns:
A single output will be given for each case, and can contain either betas or a list w/ combinations of the following:
i: Index of sample for which local regression model was fitted
- diagnostic: Portion of the output to be used for diagnostic purposes- for Gaussian regression,
this is the residual for the fitted response variable value compared to the observed value. For non-Gaussian generalized linear regression, this is the fitted response variable value (which will be used to compute deviance and log-likelihood later on).
- hat_i: Row i of the hat matrix, which is the effect of deleting sample i from the dataset on the
estimated predicted value for sample i
- bw_diagnostic: Output to be used for diagnostic purposes during bandwidth selection- for Gaussian
regression, this is the squared residual, for non-Gaussian generalized linear regression, this is the fitted response variable value. One of the returns if :param final is False
betas: Estimated coefficients for sample i
- leverages: Leverages for sample i, representing the influence of each independent variable on the
predicted values (linear predictor for GLMs, response variable for Gaussian regression).
- find_optimal_bw(range_lowest: float, range_highest: float, function: Callable) float [source]¶
Perform golden section search to find the optimal bandwidth.
- Parameters:
- range_lowest
Lower bound of the search range
- range_highest
Upper bound of the search range
- function
Function to be minimized
- Returns:
Optimal bandwidth
- Return type:
bw
- mpi_fit(y: numpy.ndarray | None, X: numpy.ndarray | None, X_labels: List[str], y_label: str, bw: float | int, coords: numpy.ndarray | None = None, mask_indices: numpy.ndarray | None = None, feature_mask: numpy.ndarray | None = None, final: bool = False, fit_predictor: bool = False) None [source]¶
Fit local regression model for each sample in parallel, given a specified bandwidth.
- Parameters:
- y
Response variable
- X
Independent variable array- if not given, will default to :attr X. Note that if object was initialized using an AnnData object, this will be overridden with :attr X even if a different array is given.
- X_labels
Optional list of labels for the features in the X array. Needed if :attr X passed to the function is not identical to the dependent variable array compiled in preprocessing.
- y_label
Used to provide a unique ID for the dependent variable for saving purposes and to query keys from various dictionaries
- bw
Bandwidth for the spatial kernel
- coords
Coordinates of each point in the X array
- mask_indices
Optional array used to mask out indices in the fitting process
- feature_mask
Optional array used to mask out features in the fitting process
- final
Set True to indicate that no additional parameter selection needs to be performed; the model can be fit and more stats can be returned.
- fit_predictor
Set True to indicate that dependent variable to fit is a linear predictor rather than a true response variable
- fit(y: pandas.DataFrame | None = None, X: numpy.ndarray | None = None, fit_predictor: bool = False, verbose: bool = True) Tuple[None | Dict[str, numpy.ndarray], Dict[str, float]] | None [source]¶
For each column of the dependent variable array, fit model. If given bandwidth, run :func SWR.mpi_fit() with the given bandwidth. Otherwise, compute optimal bandwidth using :func SWR.find_optimal_bw(), minimizing AICc.
- Parameters:
- y
Optional dataframe, can be used to provide dependent variable array directly to the fit function. If None, will use :attr targets_expr computed using the given AnnData object to create this (each individual column will serve as an independent variable). Needed to be given as a dataframe so that column(s) are labeled, so each result can be associated with a labeled dependent variable.
- X
Optional array, can be used to provide dependent variable array directly to the fit function. If None, will use :attr X computed using the given AnnData object and the type of the model to create.
- n_feat
Optional int, can be used to specify one column of the X array to fit to.
- init_betas
Optional dictionary containing arrays with initial values for the coefficients. Keys should correspond to target genes and values should be arrays of shape [n_features, 1].
- fit_predictor
Set True to indicate that dependent variable to fit is a linear predictor rather than a response variable
- verbose
Set True to print out information about the bandwidth selection and/or fitting process.
- predict(input: pandas.DataFrame | None = None, coeffs: numpy.ndarray | Dict[str, pandas.DataFrame] | None = None, adjust_for_subsampling: bool = False) pandas.DataFrame [source]¶
Given input data and learned coefficients, predict the dependent variables.
- Parameters:
- input
Input data to be predicted on.
- coeffs
Coefficients to be used in the prediction. If None, will attempt to load the coefficients learned in the fitting process from file.
- compute_aicc_linear(RSS: float, trace_hat: float, n_samples: int | None = None) float [source]¶
Compute the corrected Akaike Information Criterion (AICc) for the linear GWR model.
- compute_aicc_glm(ll: float, trace_hat: float, n_samples: int | None = None) float [source]¶
Compute the corrected Akaike Information Criterion (AICc) for the generalized linear GWR models. Given by: :math AICc = -2*log-likelihood + 2k + (2k(k+1))/(n_eff-k-1).
- Parameters:
- ll
Model log-likelihood
- trace_hat
Trace of the hat matrix
- n_samples
Number of samples model was fitted to
- output_diagnostics(aicc: float | None = None, ENP: float | None = None, r_squared: float | None = None, deviance: float | None = None, y_label: str | None = None) None [source]¶
Output diagnostic information about the GWR model.
- save_results(data: numpy.ndarray, header: str, label: str | None) None [source]¶
Save the results of the GWR model to file, and return the coefficients.
- Parameters:
- data
Elements of data to save to .csv
- header
Column names
- label
Optional, can be used to provide unique ID to save file- notably used when multiple dependent variables with different names are fit during this process.
- Returns:
Model coefficients
- Return type:
betas
- predict_and_save(input: numpy.ndarray | None = None, coeffs: numpy.ndarray | Dict[str, pandas.DataFrame] | None = None, adjust_for_subsampling: bool = True)[source]¶
Given input data and learned coefficients, predict the dependent variables and then save the output.
- Parameters:
- input
Input data to be predicted on.
- coeffs
Coefficients to be used in the prediction. If None, will attempt to load the coefficients learned in the fitting process from file.
- adjust_for_subsampling
Set True if subsampling was performed; this indicates that the coefficients for the subsampled points need to be extended to the neighboring non-sampled points.
- return_outputs(adjust_for_subsampling: bool = True, load_for_interpreter: bool = False, load_from_downstream: Literal['ligand', 'receptor', 'target_gene'] | None = None) Tuple[Dict[str, pandas.DataFrame], Dict[str, pandas.DataFrame]] [source]¶
Return final coefficients for all fitted models.
- Parameters:
- adjust_for_subsampling
Set True if subsampling was performed; this indicates that the coefficients for the subsampled points need to be extended to the neighboring non-sampled points.
- load_for_interpreter
Set True if this is being called from within instance of :class MuSIC_Interpreter.
- load_from_downstream
Set to “ligand”, “receptor”, or “target_gene” to load coefficients from downstream models where targets are ligands, receptors or target genes. Must be given if “load_downstream” is True.
- Outputs:
all_coeffs: Dictionary containing dataframe consisting of coefficients for each target gene all_se: Dictionary containing dataframe consisting of standard errors for each target gene
- return_intercepts() None | numpy.ndarray | Dict[str, numpy.ndarray] [source]¶
Return final intercepts for all fitted models.