spateo.tools.CCI_effects_modeling.MuSIC
=======================================
.. py:module:: spateo.tools.CCI_effects_modeling.MuSIC
.. autoapi-nested-parse::
Modeling cell-cell communication using a regression model that is considerate of the spatial heterogeneity of (and thus
the context-dependency of the relationships of) the response variable.
Classes
-------
.. autoapisummary::
spateo.tools.CCI_effects_modeling.MuSIC.MuSIC
Module Contents
---------------
.. py:class:: MuSIC(parser: argparse.ArgumentParser, args_list: Optional[List[str]] = None, verbose: bool = True, save_subsampling: bool = True)
Spatially weighted regression on spatial omics data with parallel processing. Runs after being called
from the command line.
:param comm: MPI communicator object initialized with mpi4py, to control parallel processing operations
:param parser: ArgumentParser object initialized with argparse, to parse command line arguments for arguments
pertinent to modeling.
:param args_list: If parser is provided by function call, the arguments to parse must be provided as a separate
list. It is recommended to use the return from :func `define_spateo_argparse()` for this.
:param verbose: Set True to print updates to screen. Will be set False when initializing downstream analysis object,
which inherits from this class but for which the information is generally not as useful.
:param save_subsampling: Set True to save the subsampled data to a .json file. Defaults to True, recommended to set
True for ease of access to the subsampling results.
.. attribute:: mod_type
The type of model that will be employed- this dictates how the data will be processed and
prepared. Options:
- "niche": Spatially-aware, uses categorical cell type labels as independent variables.
- "lr": Spatially-aware, essentially uses the combination of receptor expression in the "target" cell
and spatially lagged ligand expression in the neighboring cells as independent variables.
- "ligand": Spatially-aware, essentially uses ligand expression in the neighboring cells as
independent variables.
- "receptor": Uses receptor expression in the "target" cell as independent variables.
.. attribute:: adata_path
Path to the AnnData object from which to extract data for modeling
.. attribute:: csv_path
Can also be used to specify path to non-AnnData .csv object. Assumes the first three columns
contain x- and y-coordinates and then dependent variable values, in that order, with all subsequent
columns containing independent variable values.
.. attribute:: normalize
Set True to perform library size normalization, to set total counts in each cell to the same
number (adjust for cell size).
.. attribute:: smooth
Set True to correct for dropout effects by leveraging gene expression neighborhoods to smooth
expression. It is advisable not to do this if performing Poisson or negative binomial regression.
.. attribute:: log_transform
Set True if log-transformation should be applied to expression. It is advisable not to do
this if performing Poisson or negative binomial regression.
.. attribute:: normalize_signaling
Set True to minmax scale the final ligand expression array (for :attr `mod_type` =
"ligand"), or the final ligand-receptor array (for :attr `mod_type` = "lr"). This is recommended to
associate downstream expression with rarer/less prevalent signaling mechanisms.
.. attribute:: target_expr_threshold
Only used if :param `mod_type` is "lr" or "ligand" and :param `targets_path` is not
given. When manually selecting targets, expression above a threshold percentage of cells will be used to
filter to a smaller subset of interesting genes. Defaults to 0.2.
.. attribute:: multicollinear_threshold
Variance inflation factor threshold used to filter out multicollinear features. A
value of 5 or 10 is recommended.
.. attribute:: custom_lig_path
Optional path to a .txt file containing a list of ligands for the model, separated by
newlines. Only used if :attr `mod_type` is "lr" or "ligand" (and thus uses ligand expression directly in
the inference). If not provided, will select ligands using a threshold based on expression
levels in the data.
.. attribute:: custom_ligands
Optional list of ligands for the model, can be used as an alternative to :attr
`custom_lig_path`. Only used if :attr `mod_type` is "lr" or "ligand".
.. attribute:: custom_rec_path
Optional path to a .txt file containing a list of receptors for the model, separated by
newlines. Only used if :attr `mod_type` is "lr" (and thus uses receptor expression directly in the
inference). If not provided, will select receptors using a threshold based on expression
levels in the data.
.. attribute:: custom_receptors
Optional list of receptors for the model, can be used as an alternative to :attr
`custom_rec_path`. Only used if :attr `mod_type` is "lr".
.. attribute:: custom_pathways_path
Rather than providing a list of receptors, can provide a list of signaling pathways-
all receptors with annotations in this pathway will be included in the model. Only used if :attr `mod_type`
is "lr".
.. attribute:: custom_pathways
Optional list of signaling pathways for the model, can be used as an alternative to :attr
`custom_pathways_path`. Only used if :attr `mod_type` is "lr".
.. attribute:: targets_path
Optional path to a .txt file containing a list of prediction target genes for the model,
separated by newlines. If not provided, targets will be strategically selected from the given receptors.
.. attribute:: custom_targets
Optional list of prediction target genes for the model, can be used as an alternative to
:attr `targets_path`.
.. attribute:: init_betas_path
Optional path to a .json file or .csv file containing initial coefficient values for the model
for each target variable. If encoded in .json, keys should be target gene names, values should be numpy
arrays containing coefficients. If encoded in .csv, columns should be target gene names. Initial
coefficients should have shape [n_features, ].
.. attribute:: cci_dir
Full path to the directory containing cell-cell communication databases
.. attribute:: species
Selects the cell-cell communication database the relevant ligands will be drawn from. Options:
"human", "mouse".
.. attribute:: output_path
Full path name for the .csv file in which results will be saved
.. attribute:: coords_key
Key in .obsm of the AnnData object that contains the coordinates of the cells
.. attribute:: group_key
Key in .obs of the AnnData object that contains the category grouping for each cell
.. attribute:: group_subset
Subset of cell types to include in the model (provided as a whitespace-separated list in
command line). If given, will consider only cells of these types in modeling. Defaults to all cell types.
.. attribute:: covariate_keys
Can be used to optionally provide any number of keys in .obs or .var containing a continuous
covariate (e.g. expression of a particular TF, avg. distance from a perturbed cell, etc.)
.. attribute:: total_counts_key
Entry in :attr:`adata` .obs that contains total counts for each cell. Required if subsetting
by total counts.
.. attribute:: total_counts_threshold
Threshold for total counts to subset cells by- cells with total counts greater than
this threshold will be retained.
.. attribute:: bw
Used to provide previously obtained bandwidth for the spatial kernel. Consists of either a distance
value or N for the number of nearest neighbors. Pass "np.inf" if all other points should have the same
spatial weight.
.. attribute:: minbw
For use in automated bandwidth selection- the lower-bound bandwidth to test.
.. attribute:: maxbw
For use in automated bandwidth selection- the upper-bound bandwidth to test.
.. attribute:: distr
Distribution family for the dependent variable; one of "gaussian", "poisson", "nb"
.. attribute:: kernel
Type of kernel function used to weight observations; one of "bisquare", "exponential", "gaussian",
"quadratic", "triangular" or "uniform".
.. attribute:: n_neighbors_membrane_bound
For :attr:`mod_type` "ligand" or "lr"- ligand expression will be taken from the
neighboring cells- this defines the number of cells to use for membrane-bound ligands.
.. attribute:: n_neighbors_secreted
For :attr:`mod_type` "ligand" or "lr"- ligand expression will be taken from the
neighboring cells- this defines the number of cells to use for secreted or ECM ligands.
.. attribute:: use_expression_neighbors
The default for finding spatial neighborhoods for the modeling process is to
use neighbors in physical space. If this argument is provided, expression will instead be used to find
neighbors.
.. attribute:: bw_fixed
Set True for distance-based kernel function and False for nearest neighbor-based kernel function
.. attribute:: exclude_self
If True, ignore each sample itself when computing the kernel density estimation
.. attribute:: fit_intercept
Set True to include intercept in the model and False to exclude intercept
.. py:attribute:: logger
.. py:attribute:: parser
.. py:attribute:: args_list
.. py:attribute:: verbose
.. py:attribute:: save_subsampling
.. py:attribute:: mod_type
:value: None
.. py:attribute:: species
:value: None
.. py:attribute:: ligands
:value: None
.. py:attribute:: receptors
:value: None
.. py:attribute:: targets
:value: None
.. py:attribute:: normalize
:value: None
.. py:attribute:: smooth
:value: None
.. py:attribute:: log_transform
:value: None
.. py:attribute:: target_expr_threshold
:value: None
.. py:attribute:: coords
:value: None
.. py:attribute:: groups
:value: None
.. py:attribute:: y
:value: None
.. py:attribute:: X
:value: None
.. py:attribute:: bw
:value: None
.. py:attribute:: minbw
:value: None
.. py:attribute:: maxbw
:value: None
.. py:attribute:: distr
:value: None
.. py:attribute:: kernel
:value: None
.. py:attribute:: n_samples
:value: None
.. py:attribute:: n_features
:value: None
.. py:attribute:: set_up
:value: False
.. py:attribute:: X_df
:value: None
.. py:attribute:: adata
:value: None
.. py:attribute:: cell_categories
:value: None
.. py:attribute:: clip
:value: None
.. py:attribute:: cof_db
:value: None
.. py:attribute:: ct_vec
:value: None
.. py:attribute:: feature_distance
:value: None
.. py:attribute:: feature_names
:value: None
.. py:attribute:: grn
:value: None
.. py:attribute:: ligands_expr
:value: None
.. py:attribute:: ligands_expr_nonlag
:value: None
.. py:attribute:: lr_db
:value: None
.. py:attribute:: lr_pairs
:value: None
.. py:attribute:: n_samples_subsampled
:value: None
.. py:attribute:: n_samples_subset
:value: None
.. py:attribute:: neighboring_unsampled
:value: None
.. py:attribute:: optimal_bw
:value: None
.. py:attribute:: r_tf_db
:value: None
.. py:attribute:: receptors_expr
:value: None
.. py:attribute:: sample_names
:value: None
.. py:attribute:: subsampled
:value: None
.. py:attribute:: subsampled_sample_names
:value: None
.. py:attribute:: subset
:value: None
.. py:attribute:: subset_indices
:value: None
.. py:attribute:: subset_sample_names
:value: None
.. py:attribute:: targets_expr
:value: None
.. py:attribute:: tf_tf_db
:value: None
.. py:attribute:: x_chunk
:value: None
.. py:method:: _set_up_model(verbose: bool = True)
.. py:method:: parse_stgwr_args()
Parse command line arguments for arguments pertinent to modeling.
.. py:method:: load_and_process(upstream: bool = False)
Load AnnData object and process it for modeling.
:param upstream: Set False if performing the actual model fitting process, True to define only the AnnData
object for upstream purposes.
:param downstream: Set True if setting up a downstream model- in this case, ligand/receptor preprocessing will
be skipped.
.. py:method:: setup_downstream(adata: Optional[anndata.AnnData] = None)
Setup for downstream tasks- namely, models for inferring signaling-associated differential expression.
.. py:method:: define_sig_inputs(adata: Optional[anndata.AnnData] = None, recompute: bool = False)
For signaling-relevant models, define necessary quantities that will later be used to define the independent
variable array- the one-hot cell-type array, the ligand expression array and the receptor expression array.
:param recompute: Re-calculate all quantities and re-save even if already-existing file can be found in path
.. py:method:: run_subsample(verbose: bool = True, y: Optional[pandas.DataFrame] = None)
To combat computational intensiveness of this regressive protocol, subsampling will be performed in cases
where there are >= 5000 cells or in cases where specific cell types are manually selected for fitting- local
fit will be performed only on this subset under the assumption that discovered signals will not be
significantly different for the subsampled data.
New Attributes:
subsampled_indices: Dictionary containing indices of the subsampled cells for each dependent variable
n_samples_subsampled: Dictionary containing number of samples to be fit (not total number of samples) for
each dependent variable
subsampled_sample_names: Dictionary containing lists of names of the subsampled cells for each dependent
variable
neighboring_unsampled: Dictionary containing a mapping between each unsampled point and the closest
sampled point
.. py:method:: map_new_cells()
There may be instances where new cells are added to an AnnData object that has already been fit to- in
this instance, accelerate the process by using neighboring results to project model fit to the new cells.
.. py:method:: _set_search_range()
Set the search range for the bandwidth selection procedure.
:param y: Array of dependent variable values, used to determine the search range for the bandwidth selection
.. py:method:: _compute_all_wi(bw: Union[float, int], bw_fixed: Optional[bool] = None, exclude_self: Optional[bool] = None, kernel: Optional[str] = None, verbose: bool = False) -> scipy.sparse.spmatrix
Compute spatial weights for all samples in the dataset given a specified bandwidth.
:param bw: Bandwidth for the spatial kernel
:param fixed_bw: Whether the bandwidth considers a uniform distance for each sample (True) or a nonconstant
distance for each sample that depends on the number of neighbors (False). If not given, will default
to self.fixed_bw.
:param exclude_self: Whether to include each sample itself as one of its nearest neighbors. If not given,
will default to self.exclude_self.
:param kernel: Kernel to use for the spatial weights. If not given, will default to self.kernel.
:param verbose: Whether to display messages during runtime
:returns: Array of weights for all samples in the dataset
:rtype: wi
.. py:method:: local_fit(i: int, y: numpy.ndarray, X: numpy.ndarray, bw: Union[float, int], y_label: str, coords: Optional[numpy.ndarray] = None, mask_indices: Optional[numpy.ndarray] = None, feature_mask: Optional[numpy.ndarray] = None, final: bool = False, fit_predictor: bool = False) -> Union[numpy.ndarray, List[float]]
Fit a local regression model for each sample.
:param i: Index of sample for which local regression model is to be fitted
:param y: Response variable
:param X: Independent variable array
:param bw: Bandwidth for the spatial kernel
:param y_label: Name of the response variable
:param coords: Can be optionally used to provide coordinates for samples- used if subsampling was performed to
maintain all original sample coordinates (to take original neighborhoods into account)
:param mask_indices: Can be optionally used to provide indices of samples to mask out of the dataset
:param feature_mask: Can be optionally used to provide a mask for features to mask out of the dataset
:param final: Set True to indicate that no additional parameter selection needs to be performed; the model can
be fit and more stats can be returned.
:param fit_predictor: Set True to indicate that dependent variable to fit is a linear predictor rather than a
true response variable
:returns: A single output will be given for each case, and can contain either `betas` or a list w/ combinations of
the following:
- i: Index of sample for which local regression model was fitted
- diagnostic: Portion of the output to be used for diagnostic purposes- for Gaussian regression,
this is the residual for the fitted response variable value compared to the observed value. For
non-Gaussian generalized linear regression, this is the fitted response variable value (which
will be used to compute deviance and log-likelihood later on).
- hat_i: Row i of the hat matrix, which is the effect of deleting sample i from the dataset on the
estimated predicted value for sample i
- bw_diagnostic: Output to be used for diagnostic purposes during bandwidth selection- for Gaussian
regression, this is the squared residual, for non-Gaussian generalized linear regression,
this is the fitted response variable value. One of the returns if :param `final` is False
- betas: Estimated coefficients for sample i
- leverages: Leverages for sample i, representing the influence of each independent variable on the
predicted values (linear predictor for GLMs, response variable for Gaussian regression).
.. py:method:: find_optimal_bw(range_lowest: float, range_highest: float, function: Callable) -> float
Perform golden section search to find the optimal bandwidth.
:param range_lowest: Lower bound of the search range
:param range_highest: Upper bound of the search range
:param function: Function to be minimized
:returns: Optimal bandwidth
:rtype: bw
.. py:method:: mpi_fit(y: Optional[numpy.ndarray], X: Optional[numpy.ndarray], X_labels: List[str], y_label: str, bw: Union[float, int], coords: Optional[numpy.ndarray] = None, mask_indices: Optional[numpy.ndarray] = None, feature_mask: Optional[numpy.ndarray] = None, final: bool = False, fit_predictor: bool = False) -> None
Fit local regression model for each sample in parallel, given a specified bandwidth.
:param y: Response variable
:param X: Independent variable array- if not given, will default to :attr `X`. Note that if object was initialized
using an AnnData object, this will be overridden with :attr `X` even if a different array is given.
:param X_labels: Optional list of labels for the features in the X array. Needed if :attr `X` passed to the
function is not identical to the dependent variable array compiled in preprocessing.
:param y_label: Used to provide a unique ID for the dependent variable for saving purposes and to query keys
from various dictionaries
:param bw: Bandwidth for the spatial kernel
:param coords: Coordinates of each point in the X array
:param mask_indices: Optional array used to mask out indices in the fitting process
:param feature_mask: Optional array used to mask out features in the fitting process
:param final: Set True to indicate that no additional parameter selection needs to be performed; the model can
be fit and more stats can be returned.
:param fit_predictor: Set True to indicate that dependent variable to fit is a linear predictor rather than a
true response variable
.. py:method:: fit(y: Optional[pandas.DataFrame] = None, X: Optional[numpy.ndarray] = None, fit_predictor: bool = False, verbose: bool = True) -> Optional[Tuple[Union[None, Dict[str, numpy.ndarray]], Dict[str, float]]]
For each column of the dependent variable array, fit model. If given bandwidth, run :func
`SWR.mpi_fit()` with the given bandwidth. Otherwise, compute optimal bandwidth using :func
`SWR.find_optimal_bw()`, minimizing AICc.
:param y: Optional dataframe, can be used to provide dependent variable array directly to the fit function. If
None, will use :attr `targets_expr` computed using the given AnnData object to create this (each
individual column will serve as an independent variable). Needed to be given as a dataframe so that
column(s) are labeled, so each result can be associated with a labeled dependent variable.
:param X: Optional array, can be used to provide dependent variable array directly to the fit function. If
None, will use :attr `X` computed using the given AnnData object and the type of the model to create.
:param n_feat: Optional int, can be used to specify one column of the X array to fit to.
:param init_betas: Optional dictionary containing arrays with initial values for the coefficients. Keys should
correspond to target genes and values should be arrays of shape [n_features, 1].
:param fit_predictor: Set True to indicate that dependent variable to fit is a linear predictor rather than a
response variable
:param verbose: Set True to print out information about the bandwidth selection and/or fitting process.
.. py:method:: predict(input: Optional[pandas.DataFrame] = None, coeffs: Optional[Union[numpy.ndarray, Dict[str, pandas.DataFrame]]] = None, adjust_for_subsampling: bool = False) -> pandas.DataFrame
Given input data and learned coefficients, predict the dependent variables.
:param input: Input data to be predicted on.
:param coeffs: Coefficients to be used in the prediction. If None, will attempt to load the coefficients learned
in the fitting process from file.
.. py:method:: compute_aicc_linear(RSS: float, trace_hat: float, n_samples: Optional[int] = None) -> float
Compute the corrected Akaike Information Criterion (AICc) for the linear GWR model.
.. py:method:: compute_aicc_glm(ll: float, trace_hat: float, n_samples: Optional[int] = None) -> float
Compute the corrected Akaike Information Criterion (AICc) for the generalized linear GWR models. Given by:
:math AICc = -2*log-likelihood + 2k + (2k(k+1))/(n_eff-k-1).
:param ll: Model log-likelihood
:param trace_hat: Trace of the hat matrix
:param n_samples: Number of samples model was fitted to
.. py:method:: output_diagnostics(aicc: Optional[float] = None, ENP: Optional[float] = None, r_squared: Optional[float] = None, deviance: Optional[float] = None, y_label: Optional[str] = None) -> None
Output diagnostic information about the GWR model.
.. py:method:: save_results(data: numpy.ndarray, header: str, label: Optional[str]) -> None
Save the results of the GWR model to file, and return the coefficients.
:param data: Elements of data to save to .csv
:param header: Column names
:param label: Optional, can be used to provide unique ID to save file- notably used when multiple dependent
variables with different names are fit during this process.
:returns: Model coefficients
:rtype: betas
.. py:method:: predict_and_save(input: Optional[numpy.ndarray] = None, coeffs: Optional[Union[numpy.ndarray, Dict[str, pandas.DataFrame]]] = None, adjust_for_subsampling: bool = True)
Given input data and learned coefficients, predict the dependent variables and then save the output.
:param input: Input data to be predicted on.
:param coeffs: Coefficients to be used in the prediction. If None, will attempt to load the coefficients learned
in the fitting process from file.
:param adjust_for_subsampling: Set True if subsampling was performed; this indicates that the coefficients for
the subsampled points need to be extended to the neighboring non-sampled points.
.. py:method:: return_outputs(adjust_for_subsampling: bool = True, load_for_interpreter: bool = False, load_from_downstream: Optional[Literal['ligand', 'receptor', 'target_gene']] = None) -> Tuple[Dict[str, pandas.DataFrame], Dict[str, pandas.DataFrame]]
Return final coefficients for all fitted models.
:param adjust_for_subsampling: Set True if subsampling was performed; this indicates that the coefficients for
the subsampled points need to be extended to the neighboring non-sampled points.
:param load_for_interpreter: Set True if this is being called from within instance of :class `MuSIC_Interpreter`.
:param load_from_downstream: Set to "ligand", "receptor", or "target_gene" to load coefficients from downstream
models where targets are ligands, receptors or target genes. Must be given if "load_downstream" is True.
Outputs:
all_coeffs: Dictionary containing dataframe consisting of coefficients for each target gene
all_se: Dictionary containing dataframe consisting of standard errors for each target gene
.. py:method:: return_intercepts() -> Union[None, numpy.ndarray, Dict[str, numpy.ndarray]]
Return final intercepts for all fitted models.