spateo.tools.CCI_effects_modeling.MuSIC ======================================= .. py:module:: spateo.tools.CCI_effects_modeling.MuSIC .. autoapi-nested-parse:: Modeling cell-cell communication using a regression model that is considerate of the spatial heterogeneity of (and thus the context-dependency of the relationships of) the response variable. Classes ------- .. autoapisummary:: spateo.tools.CCI_effects_modeling.MuSIC.MuSIC Module Contents --------------- .. py:class:: MuSIC(parser: argparse.ArgumentParser, args_list: Optional[List[str]] = None, verbose: bool = True, save_subsampling: bool = True) Spatially weighted regression on spatial omics data with parallel processing. Runs after being called from the command line. :param comm: MPI communicator object initialized with mpi4py, to control parallel processing operations :param parser: ArgumentParser object initialized with argparse, to parse command line arguments for arguments pertinent to modeling. :param args_list: If parser is provided by function call, the arguments to parse must be provided as a separate list. It is recommended to use the return from :func `define_spateo_argparse()` for this. :param verbose: Set True to print updates to screen. Will be set False when initializing downstream analysis object, which inherits from this class but for which the information is generally not as useful. :param save_subsampling: Set True to save the subsampled data to a .json file. Defaults to True, recommended to set True for ease of access to the subsampling results. .. attribute:: mod_type The type of model that will be employed- this dictates how the data will be processed and prepared. Options: - "niche": Spatially-aware, uses categorical cell type labels as independent variables. - "lr": Spatially-aware, essentially uses the combination of receptor expression in the "target" cell and spatially lagged ligand expression in the neighboring cells as independent variables. - "ligand": Spatially-aware, essentially uses ligand expression in the neighboring cells as independent variables. - "receptor": Uses receptor expression in the "target" cell as independent variables. .. attribute:: adata_path Path to the AnnData object from which to extract data for modeling .. attribute:: csv_path Can also be used to specify path to non-AnnData .csv object. Assumes the first three columns contain x- and y-coordinates and then dependent variable values, in that order, with all subsequent columns containing independent variable values. .. attribute:: normalize Set True to perform library size normalization, to set total counts in each cell to the same number (adjust for cell size). .. attribute:: smooth Set True to correct for dropout effects by leveraging gene expression neighborhoods to smooth expression. It is advisable not to do this if performing Poisson or negative binomial regression. .. attribute:: log_transform Set True if log-transformation should be applied to expression. It is advisable not to do this if performing Poisson or negative binomial regression. .. attribute:: normalize_signaling Set True to minmax scale the final ligand expression array (for :attr `mod_type` = "ligand"), or the final ligand-receptor array (for :attr `mod_type` = "lr"). This is recommended to associate downstream expression with rarer/less prevalent signaling mechanisms. .. attribute:: target_expr_threshold Only used if :param `mod_type` is "lr" or "ligand" and :param `targets_path` is not given. When manually selecting targets, expression above a threshold percentage of cells will be used to filter to a smaller subset of interesting genes. Defaults to 0.2. .. attribute:: multicollinear_threshold Variance inflation factor threshold used to filter out multicollinear features. A value of 5 or 10 is recommended. .. attribute:: custom_lig_path Optional path to a .txt file containing a list of ligands for the model, separated by newlines. Only used if :attr `mod_type` is "lr" or "ligand" (and thus uses ligand expression directly in the inference). If not provided, will select ligands using a threshold based on expression levels in the data. .. attribute:: custom_ligands Optional list of ligands for the model, can be used as an alternative to :attr `custom_lig_path`. Only used if :attr `mod_type` is "lr" or "ligand". .. attribute:: custom_rec_path Optional path to a .txt file containing a list of receptors for the model, separated by newlines. Only used if :attr `mod_type` is "lr" (and thus uses receptor expression directly in the inference). If not provided, will select receptors using a threshold based on expression levels in the data. .. attribute:: custom_receptors Optional list of receptors for the model, can be used as an alternative to :attr `custom_rec_path`. Only used if :attr `mod_type` is "lr". .. attribute:: custom_pathways_path Rather than providing a list of receptors, can provide a list of signaling pathways- all receptors with annotations in this pathway will be included in the model. Only used if :attr `mod_type` is "lr". .. attribute:: custom_pathways Optional list of signaling pathways for the model, can be used as an alternative to :attr `custom_pathways_path`. Only used if :attr `mod_type` is "lr". .. attribute:: targets_path Optional path to a .txt file containing a list of prediction target genes for the model, separated by newlines. If not provided, targets will be strategically selected from the given receptors. .. attribute:: custom_targets Optional list of prediction target genes for the model, can be used as an alternative to :attr `targets_path`. .. attribute:: init_betas_path Optional path to a .json file or .csv file containing initial coefficient values for the model for each target variable. If encoded in .json, keys should be target gene names, values should be numpy arrays containing coefficients. If encoded in .csv, columns should be target gene names. Initial coefficients should have shape [n_features, ]. .. attribute:: cci_dir Full path to the directory containing cell-cell communication databases .. attribute:: species Selects the cell-cell communication database the relevant ligands will be drawn from. Options: "human", "mouse". .. attribute:: output_path Full path name for the .csv file in which results will be saved .. attribute:: coords_key Key in .obsm of the AnnData object that contains the coordinates of the cells .. attribute:: group_key Key in .obs of the AnnData object that contains the category grouping for each cell .. attribute:: group_subset Subset of cell types to include in the model (provided as a whitespace-separated list in command line). If given, will consider only cells of these types in modeling. Defaults to all cell types. .. attribute:: covariate_keys Can be used to optionally provide any number of keys in .obs or .var containing a continuous covariate (e.g. expression of a particular TF, avg. distance from a perturbed cell, etc.) .. attribute:: total_counts_key Entry in :attr:`adata` .obs that contains total counts for each cell. Required if subsetting by total counts. .. attribute:: total_counts_threshold Threshold for total counts to subset cells by- cells with total counts greater than this threshold will be retained. .. attribute:: bw Used to provide previously obtained bandwidth for the spatial kernel. Consists of either a distance value or N for the number of nearest neighbors. Pass "np.inf" if all other points should have the same spatial weight. .. attribute:: minbw For use in automated bandwidth selection- the lower-bound bandwidth to test. .. attribute:: maxbw For use in automated bandwidth selection- the upper-bound bandwidth to test. .. attribute:: distr Distribution family for the dependent variable; one of "gaussian", "poisson", "nb" .. attribute:: kernel Type of kernel function used to weight observations; one of "bisquare", "exponential", "gaussian", "quadratic", "triangular" or "uniform". .. attribute:: n_neighbors_membrane_bound For :attr:`mod_type` "ligand" or "lr"- ligand expression will be taken from the neighboring cells- this defines the number of cells to use for membrane-bound ligands. .. attribute:: n_neighbors_secreted For :attr:`mod_type` "ligand" or "lr"- ligand expression will be taken from the neighboring cells- this defines the number of cells to use for secreted or ECM ligands. .. attribute:: use_expression_neighbors The default for finding spatial neighborhoods for the modeling process is to use neighbors in physical space. If this argument is provided, expression will instead be used to find neighbors. .. attribute:: bw_fixed Set True for distance-based kernel function and False for nearest neighbor-based kernel function .. attribute:: exclude_self If True, ignore each sample itself when computing the kernel density estimation .. attribute:: fit_intercept Set True to include intercept in the model and False to exclude intercept .. py:attribute:: logger .. py:attribute:: parser .. py:attribute:: args_list .. py:attribute:: verbose .. py:attribute:: save_subsampling .. py:attribute:: mod_type :value: None .. py:attribute:: species :value: None .. py:attribute:: ligands :value: None .. py:attribute:: receptors :value: None .. py:attribute:: targets :value: None .. py:attribute:: normalize :value: None .. py:attribute:: smooth :value: None .. py:attribute:: log_transform :value: None .. py:attribute:: target_expr_threshold :value: None .. py:attribute:: coords :value: None .. py:attribute:: groups :value: None .. py:attribute:: y :value: None .. py:attribute:: X :value: None .. py:attribute:: bw :value: None .. py:attribute:: minbw :value: None .. py:attribute:: maxbw :value: None .. py:attribute:: distr :value: None .. py:attribute:: kernel :value: None .. py:attribute:: n_samples :value: None .. py:attribute:: n_features :value: None .. py:attribute:: set_up :value: False .. py:attribute:: X_df :value: None .. py:attribute:: adata :value: None .. py:attribute:: cell_categories :value: None .. py:attribute:: clip :value: None .. py:attribute:: cof_db :value: None .. py:attribute:: ct_vec :value: None .. py:attribute:: feature_distance :value: None .. py:attribute:: feature_names :value: None .. py:attribute:: grn :value: None .. py:attribute:: ligands_expr :value: None .. py:attribute:: ligands_expr_nonlag :value: None .. py:attribute:: lr_db :value: None .. py:attribute:: lr_pairs :value: None .. py:attribute:: n_samples_subsampled :value: None .. py:attribute:: n_samples_subset :value: None .. py:attribute:: neighboring_unsampled :value: None .. py:attribute:: optimal_bw :value: None .. py:attribute:: r_tf_db :value: None .. py:attribute:: receptors_expr :value: None .. py:attribute:: sample_names :value: None .. py:attribute:: subsampled :value: None .. py:attribute:: subsampled_sample_names :value: None .. py:attribute:: subset :value: None .. py:attribute:: subset_indices :value: None .. py:attribute:: subset_sample_names :value: None .. py:attribute:: targets_expr :value: None .. py:attribute:: tf_tf_db :value: None .. py:attribute:: x_chunk :value: None .. py:method:: _set_up_model(verbose: bool = True) .. py:method:: parse_stgwr_args() Parse command line arguments for arguments pertinent to modeling. .. py:method:: load_and_process(upstream: bool = False) Load AnnData object and process it for modeling. :param upstream: Set False if performing the actual model fitting process, True to define only the AnnData object for upstream purposes. :param downstream: Set True if setting up a downstream model- in this case, ligand/receptor preprocessing will be skipped. .. py:method:: setup_downstream(adata: Optional[anndata.AnnData] = None) Setup for downstream tasks- namely, models for inferring signaling-associated differential expression. .. py:method:: define_sig_inputs(adata: Optional[anndata.AnnData] = None, recompute: bool = False) For signaling-relevant models, define necessary quantities that will later be used to define the independent variable array- the one-hot cell-type array, the ligand expression array and the receptor expression array. :param recompute: Re-calculate all quantities and re-save even if already-existing file can be found in path .. py:method:: run_subsample(verbose: bool = True, y: Optional[pandas.DataFrame] = None) To combat computational intensiveness of this regressive protocol, subsampling will be performed in cases where there are >= 5000 cells or in cases where specific cell types are manually selected for fitting- local fit will be performed only on this subset under the assumption that discovered signals will not be significantly different for the subsampled data. New Attributes: subsampled_indices: Dictionary containing indices of the subsampled cells for each dependent variable n_samples_subsampled: Dictionary containing number of samples to be fit (not total number of samples) for each dependent variable subsampled_sample_names: Dictionary containing lists of names of the subsampled cells for each dependent variable neighboring_unsampled: Dictionary containing a mapping between each unsampled point and the closest sampled point .. py:method:: map_new_cells() There may be instances where new cells are added to an AnnData object that has already been fit to- in this instance, accelerate the process by using neighboring results to project model fit to the new cells. .. py:method:: _set_search_range() Set the search range for the bandwidth selection procedure. :param y: Array of dependent variable values, used to determine the search range for the bandwidth selection .. py:method:: _compute_all_wi(bw: Union[float, int], bw_fixed: Optional[bool] = None, exclude_self: Optional[bool] = None, kernel: Optional[str] = None, verbose: bool = False) -> scipy.sparse.spmatrix Compute spatial weights for all samples in the dataset given a specified bandwidth. :param bw: Bandwidth for the spatial kernel :param fixed_bw: Whether the bandwidth considers a uniform distance for each sample (True) or a nonconstant distance for each sample that depends on the number of neighbors (False). If not given, will default to self.fixed_bw. :param exclude_self: Whether to include each sample itself as one of its nearest neighbors. If not given, will default to self.exclude_self. :param kernel: Kernel to use for the spatial weights. If not given, will default to self.kernel. :param verbose: Whether to display messages during runtime :returns: Array of weights for all samples in the dataset :rtype: wi .. py:method:: local_fit(i: int, y: numpy.ndarray, X: numpy.ndarray, bw: Union[float, int], y_label: str, coords: Optional[numpy.ndarray] = None, mask_indices: Optional[numpy.ndarray] = None, feature_mask: Optional[numpy.ndarray] = None, final: bool = False, fit_predictor: bool = False) -> Union[numpy.ndarray, List[float]] Fit a local regression model for each sample. :param i: Index of sample for which local regression model is to be fitted :param y: Response variable :param X: Independent variable array :param bw: Bandwidth for the spatial kernel :param y_label: Name of the response variable :param coords: Can be optionally used to provide coordinates for samples- used if subsampling was performed to maintain all original sample coordinates (to take original neighborhoods into account) :param mask_indices: Can be optionally used to provide indices of samples to mask out of the dataset :param feature_mask: Can be optionally used to provide a mask for features to mask out of the dataset :param final: Set True to indicate that no additional parameter selection needs to be performed; the model can be fit and more stats can be returned. :param fit_predictor: Set True to indicate that dependent variable to fit is a linear predictor rather than a true response variable :returns: A single output will be given for each case, and can contain either `betas` or a list w/ combinations of the following: - i: Index of sample for which local regression model was fitted - diagnostic: Portion of the output to be used for diagnostic purposes- for Gaussian regression, this is the residual for the fitted response variable value compared to the observed value. For non-Gaussian generalized linear regression, this is the fitted response variable value (which will be used to compute deviance and log-likelihood later on). - hat_i: Row i of the hat matrix, which is the effect of deleting sample i from the dataset on the estimated predicted value for sample i - bw_diagnostic: Output to be used for diagnostic purposes during bandwidth selection- for Gaussian regression, this is the squared residual, for non-Gaussian generalized linear regression, this is the fitted response variable value. One of the returns if :param `final` is False - betas: Estimated coefficients for sample i - leverages: Leverages for sample i, representing the influence of each independent variable on the predicted values (linear predictor for GLMs, response variable for Gaussian regression). .. py:method:: find_optimal_bw(range_lowest: float, range_highest: float, function: Callable) -> float Perform golden section search to find the optimal bandwidth. :param range_lowest: Lower bound of the search range :param range_highest: Upper bound of the search range :param function: Function to be minimized :returns: Optimal bandwidth :rtype: bw .. py:method:: mpi_fit(y: Optional[numpy.ndarray], X: Optional[numpy.ndarray], X_labels: List[str], y_label: str, bw: Union[float, int], coords: Optional[numpy.ndarray] = None, mask_indices: Optional[numpy.ndarray] = None, feature_mask: Optional[numpy.ndarray] = None, final: bool = False, fit_predictor: bool = False) -> None Fit local regression model for each sample in parallel, given a specified bandwidth. :param y: Response variable :param X: Independent variable array- if not given, will default to :attr `X`. Note that if object was initialized using an AnnData object, this will be overridden with :attr `X` even if a different array is given. :param X_labels: Optional list of labels for the features in the X array. Needed if :attr `X` passed to the function is not identical to the dependent variable array compiled in preprocessing. :param y_label: Used to provide a unique ID for the dependent variable for saving purposes and to query keys from various dictionaries :param bw: Bandwidth for the spatial kernel :param coords: Coordinates of each point in the X array :param mask_indices: Optional array used to mask out indices in the fitting process :param feature_mask: Optional array used to mask out features in the fitting process :param final: Set True to indicate that no additional parameter selection needs to be performed; the model can be fit and more stats can be returned. :param fit_predictor: Set True to indicate that dependent variable to fit is a linear predictor rather than a true response variable .. py:method:: fit(y: Optional[pandas.DataFrame] = None, X: Optional[numpy.ndarray] = None, fit_predictor: bool = False, verbose: bool = True) -> Optional[Tuple[Union[None, Dict[str, numpy.ndarray]], Dict[str, float]]] For each column of the dependent variable array, fit model. If given bandwidth, run :func `SWR.mpi_fit()` with the given bandwidth. Otherwise, compute optimal bandwidth using :func `SWR.find_optimal_bw()`, minimizing AICc. :param y: Optional dataframe, can be used to provide dependent variable array directly to the fit function. If None, will use :attr `targets_expr` computed using the given AnnData object to create this (each individual column will serve as an independent variable). Needed to be given as a dataframe so that column(s) are labeled, so each result can be associated with a labeled dependent variable. :param X: Optional array, can be used to provide dependent variable array directly to the fit function. If None, will use :attr `X` computed using the given AnnData object and the type of the model to create. :param n_feat: Optional int, can be used to specify one column of the X array to fit to. :param init_betas: Optional dictionary containing arrays with initial values for the coefficients. Keys should correspond to target genes and values should be arrays of shape [n_features, 1]. :param fit_predictor: Set True to indicate that dependent variable to fit is a linear predictor rather than a response variable :param verbose: Set True to print out information about the bandwidth selection and/or fitting process. .. py:method:: predict(input: Optional[pandas.DataFrame] = None, coeffs: Optional[Union[numpy.ndarray, Dict[str, pandas.DataFrame]]] = None, adjust_for_subsampling: bool = False) -> pandas.DataFrame Given input data and learned coefficients, predict the dependent variables. :param input: Input data to be predicted on. :param coeffs: Coefficients to be used in the prediction. If None, will attempt to load the coefficients learned in the fitting process from file. .. py:method:: compute_aicc_linear(RSS: float, trace_hat: float, n_samples: Optional[int] = None) -> float Compute the corrected Akaike Information Criterion (AICc) for the linear GWR model. .. py:method:: compute_aicc_glm(ll: float, trace_hat: float, n_samples: Optional[int] = None) -> float Compute the corrected Akaike Information Criterion (AICc) for the generalized linear GWR models. Given by: :math AICc = -2*log-likelihood + 2k + (2k(k+1))/(n_eff-k-1). :param ll: Model log-likelihood :param trace_hat: Trace of the hat matrix :param n_samples: Number of samples model was fitted to .. py:method:: output_diagnostics(aicc: Optional[float] = None, ENP: Optional[float] = None, r_squared: Optional[float] = None, deviance: Optional[float] = None, y_label: Optional[str] = None) -> None Output diagnostic information about the GWR model. .. py:method:: save_results(data: numpy.ndarray, header: str, label: Optional[str]) -> None Save the results of the GWR model to file, and return the coefficients. :param data: Elements of data to save to .csv :param header: Column names :param label: Optional, can be used to provide unique ID to save file- notably used when multiple dependent variables with different names are fit during this process. :returns: Model coefficients :rtype: betas .. py:method:: predict_and_save(input: Optional[numpy.ndarray] = None, coeffs: Optional[Union[numpy.ndarray, Dict[str, pandas.DataFrame]]] = None, adjust_for_subsampling: bool = True) Given input data and learned coefficients, predict the dependent variables and then save the output. :param input: Input data to be predicted on. :param coeffs: Coefficients to be used in the prediction. If None, will attempt to load the coefficients learned in the fitting process from file. :param adjust_for_subsampling: Set True if subsampling was performed; this indicates that the coefficients for the subsampled points need to be extended to the neighboring non-sampled points. .. py:method:: return_outputs(adjust_for_subsampling: bool = True, load_for_interpreter: bool = False, load_from_downstream: Optional[Literal['ligand', 'receptor', 'target_gene']] = None) -> Tuple[Dict[str, pandas.DataFrame], Dict[str, pandas.DataFrame]] Return final coefficients for all fitted models. :param adjust_for_subsampling: Set True if subsampling was performed; this indicates that the coefficients for the subsampled points need to be extended to the neighboring non-sampled points. :param load_for_interpreter: Set True if this is being called from within instance of :class `MuSIC_Interpreter`. :param load_from_downstream: Set to "ligand", "receptor", or "target_gene" to load coefficients from downstream models where targets are ligands, receptors or target genes. Must be given if "load_downstream" is True. Outputs: all_coeffs: Dictionary containing dataframe consisting of coefficients for each target gene all_se: Dictionary containing dataframe consisting of standard errors for each target gene .. py:method:: return_intercepts() -> Union[None, numpy.ndarray, Dict[str, numpy.ndarray]] Return final intercepts for all fitted models.