spateo.tools.CCI_effects_modeling.regression_utils ================================================== .. py:module:: spateo.tools.CCI_effects_modeling.regression_utils .. autoapi-nested-parse:: Auxiliary functions to aid in the interpretation functions for the spatial and spatially-lagged regression models. Functions --------- .. autoapisummary:: spateo.tools.CCI_effects_modeling.regression_utils.sparse_dot spateo.tools.CCI_effects_modeling.regression_utils.sparse_element_by_element spateo.tools.CCI_effects_modeling.regression_utils.sparse_minmax_scale spateo.tools.CCI_effects_modeling.regression_utils.sparse_add_pseudocount spateo.tools.CCI_effects_modeling.regression_utils.compute_betas spateo.tools.CCI_effects_modeling.regression_utils.compute_betas_local spateo.tools.CCI_effects_modeling.regression_utils.iwls spateo.tools.CCI_effects_modeling.regression_utils.weighted_binary_crossentropy spateo.tools.CCI_effects_modeling.regression_utils.logistic_objective spateo.tools.CCI_effects_modeling.regression_utils.golden_section_search spateo.tools.CCI_effects_modeling.regression_utils.library_scaling_factors spateo.tools.CCI_effects_modeling.regression_utils.softplus spateo.tools.CCI_effects_modeling.regression_utils.multicollinearity_check spateo.tools.CCI_effects_modeling.regression_utils.assign_significance spateo.tools.CCI_effects_modeling.regression_utils.wald_test spateo.tools.CCI_effects_modeling.regression_utils.multitesting_correction spateo.tools.CCI_effects_modeling.regression_utils.get_fisher_inverse spateo.tools.CCI_effects_modeling.regression_utils.run_permutation_test spateo.tools.CCI_effects_modeling.regression_utils.permutation_testing spateo.tools.CCI_effects_modeling.regression_utils.mae spateo.tools.CCI_effects_modeling.regression_utils.mse Module Contents --------------- .. py:function:: sparse_dot(a: Union[numpy.ndarray, scipy.sparse.csr_matrix, scipy.sparse.csc_matrix], b: Union[numpy.ndarray, scipy.sparse.csr_matrix, scipy.sparse.csc_matrix], return_array: bool = True) Matrix multiplication function to deal with sparse and dense objects :param a: First of two matrices to be multiplied, can be sparse or dense :param b: Second of two matrices to be multiplied, can be sparse or dense :param return_array: Set True to return dense array :returns: Matrix product of a and b :rtype: prod .. py:function:: sparse_element_by_element(a: Union[numpy.ndarray, scipy.sparse.csr_matrix, scipy.sparse.csc_matrix], b: Union[numpy.ndarray, scipy.sparse.csr_matrix, scipy.sparse.csc_matrix], return_array: bool = True) Element-by-element multiplication function to deal with either sparse or dense objects. :param a: First of two matrices to be multiplied, can be sparse or dense :param b: Second of two matrices to be multiplied, can be sparse or dense :param return_array: Set True to return dense array :returns: Element-wise multiplied product of a and b :rtype: prod .. py:function:: sparse_minmax_scale(a: Union[scipy.sparse.csr_matrix, scipy.sparse.csc_matrix]) Column-wise minmax scaling of a sparse matrix. .. py:function:: sparse_add_pseudocount(a: Union[scipy.sparse.csr_matrix, scipy.sparse.csc_matrix], pseudocount: float = 1.0) Add pseudocount to sparse matrix. .. py:function:: compute_betas(y: Union[numpy.ndarray, scipy.sparse.csr_matrix, scipy.sparse.csc_matrix], x: Union[numpy.ndarray, scipy.sparse.csr_matrix, scipy.sparse.csc_matrix], ridge_lambda: float = 0.0, clip: float = 5.0) Maximum likelihood estimation procedure, to be used in iteratively weighted least squares to compute the regression coefficients for a given set of dependent and independent variables. Can be combined with either Lasso (L1), Ridge (L2), or Elastic Net (L1 + L2) regularization. Source: Iteratively (Re)weighted Least Squares (IWLS), Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). Geographically weighted regression: the analysis of spatially varying relationships. :param y: Array of shape [n_samples,]; dependent variable :param x: Array of shape [n_samples, n_features]; independent variables :param ridge_lambda: Regularization parameter for Ridge regression. Higher values will tend to shrink coefficients further towards zero. :param clip: Float; upper and lower bound to constrain betas and prevent numerical overflow :returns: Array of shape [n_features,]; regression coefficients :rtype: betas .. py:function:: compute_betas_local(y: numpy.ndarray, x: numpy.ndarray, w: numpy.ndarray, ridge_lambda: float = 0.0, clip: Optional[float] = None) Maximum likelihood estimation procedure, to be used in iteratively weighted least squares to compute the regression coefficients for a given set of dependent and independent variables while accounting for spatial heterogeneity in the dependent variable. Source: Iteratively (Re)weighted Least Squares (IWLS), Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). Geographically weighted regression: the analysis of spatially varying relationships. :param y: Array of shape [n_samples,]; dependent variable :param x: Array of shape [n_samples, n_features]; independent variables :param ridge_lambda: Regularization parameter for Ridge regression. Higher values will tend to shrink coefficients further towards zero. :param w: Array of shape [n_samples, 1]; spatial weights matrix :param clip: Float; upper and lower bound to constrain betas and prevent numerical overflow :returns: Array of shape [n_features,]; regression coefficients pseudoinverse: Array of shape [n_samples, n_samples]; Moore-Penrose pseudoinverse of the X matrix cov_inverse: Array of shape [n_samples, n_samples]; inverse of the covariance matrix :rtype: betas .. py:function:: iwls(y: Union[numpy.ndarray, scipy.sparse.csr_matrix, scipy.sparse.csc_matrix], x: Union[numpy.ndarray, scipy.sparse.csr_matrix, scipy.sparse.csc_matrix], distr: Literal['gaussian', 'poisson', 'nb', 'binomial'] = 'gaussian', init_betas: Optional[numpy.ndarray] = None, offset: Optional[numpy.ndarray] = None, tol: float = 1e-08, clip: Optional[Union[float, numpy.ndarray]] = None, threshold: float = 0.0001, max_iter: int = 200, spatial_weights: Optional[numpy.ndarray] = None, i: Optional[int] = None, link: Optional[spateo.tools.CCI_effects_modeling.distributions.Link] = None, ridge_lambda: Optional[float] = None, mask: Optional[numpy.ndarray] = None) Iteratively weighted least squares (IWLS) algorithm to compute the regression coefficients for a given set of dependent and independent variables. Source: Iteratively (Re)weighted Least Squares (IWLS), Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). Geographically weighted regression: the analysis of spatially varying relationships. :param y: Array of shape [n_samples, 1]; dependent variable :param x: Array of shape [n_samples, n_features]; independent variables :param distr: Distribution family for the dependent variable; one of "gaussian", "poisson", "nb", "binomial" :param init_betas: Array of shape [n_features,]; initial regression coefficients :param offset: Optional array of shape [n_samples,]; if provided, will be added to the linear predictor. This is meant to deal with differences in scale that are not caused by the predictor variables, e.g. by differences in library size :param tol: Convergence tolerance :param clip: Sets magnitude of the upper and lower bound to constrain betas and prevent numerical overflow. Either one floating point value or an array for sample-specific clipping. :param threshold: Coefficients with absolute values below this threshold will be set to zero (as these are insignificant) :param max_iter: Maximum number of iterations if convergence is not reached :param spatial_weights: Array of shape [n_samples, 1]; weights to transform observations from location i for a geographically-weighted regression :param i: Optional integer; index of the observation to be used as the center of the geographically-weighted regression. Required if "clip" is an array. :param link: Link function for the distribution family. If None, will default to the default value for the specified distribution family. :param variance: Variance function for the distribution family. If None, will default to the default value for the specified distribution family. :param ridge_lambda: Ridge regularization parameter. :returns: Array of shape [n_features, 1]; regression coefficients y_hat: Array of shape [n_samples, 1]; predicted values of the dependent variable wx: Array of shape [n_samples, 1]; weighted independent variables n_iter: Number of iterations completed upon convergence w_final: Array of shape [n_samples, 1]; final spatial weights used for IWLS. linear_predictor_final: Array of shape [n_samples, 1]; final unadjusted linear predictor used for IWLS. Only returned if "spatial_weights" is not None. adjusted_predictor_final: Array of shape [n_samples, 1]; final adjusted linear predictor used for IWLS. Only returned if "spatial_weights" is not None. pseudoinverse: Array of shape [n_samples, n_samples]; optional influence matrix that is only returned if "spatial_weights" is not None. The pseudoinverse is the Moore-Penrose pseudoinverse of the X matrix. inv: Array of shape [n_samples, n_samples]; the inverse covariance matrix (for Gaussian modeling) or the inverse Fisher matrix (for GLM models). :rtype: betas .. py:function:: weighted_binary_crossentropy(y_true: numpy.ndarray, y_pred: numpy.ndarray, weight_0: float = 1.0, weight_1: float = 1.0) Custom binary cross-entropy loss function with class weights. :param y_true: True binary labels :param y_pred: Predicted probabilities :param weight_0: Weight for class 0 (negative class) :param weight_1: Weight for class 1 (positive class) :returns: Weighted binary cross-entropy loss .. py:function:: logistic_objective(threshold: float, proba: numpy.ndarray, y_true: numpy.ndarray) For binomial regression models with IWLS, the objective function is the weighted sum of recall and specificity. :param threshold: Threshold for converting predicted probabilities to binary predictions :param proba: Predicted probabilities from logistic model :param y_true: True binary labels :returns: Weighted sum of recall and specificity :rtype: score .. py:function:: golden_section_search(func: Callable, a: float, b: float, tol: float = 1e-05, min_or_max: str = 'min') Find the extremum of a function within a specified range using Golden Section Search. :param func: The function to find the extremum of. :param a: Lower bound of the range. :param b: Upper bound of the range. :param tol: Tolerance for stopping criterion. :param min_or_max: Whether to find the minimum or maximum of the function. :returns: The x-value of the function's extremum. .. py:function:: library_scaling_factors(offset: Optional[numpy.ndarray] = None, counts: Optional[Union[numpy.ndarray, scipy.sparse.csr_matrix]] = None, distr: Literal['gaussian', 'poisson', 'nb'] = 'gaussian') Get offset values to account for differences in library sizes when comparing expression levels between samples. If the offset is not provided, it calculates the offset based on library sizes. The offset is the logarithm of the library sizes. :param offset: Offset values. If provided, it is returned as is. If None, the offset is calculated based on library sizes. :param counts: Gene expression array :param distr: Distribution of the data. Defaults to "gaussian", but can also be "poisson" or "nb". :returns: Array of shape [n_samples, ] containing offset values. :rtype: offset .. py:function:: softplus(z: numpy.ndarray) Numerically stable version of log(1 + exp(z)). .. py:function:: multicollinearity_check(X: pandas.DataFrame, thresh: float = 5.0, logger: Optional = None) Checks for multicollinearity in dependent variable array, and drops the most multicollinear features until all features have VIF less than a given threshold. :param X: Dependent variable array, in dataframe format :param thresh: VIF threshold; features with values greater than this value will be removed from the regression :param logger: If not provided, will create a new logger :returns: Dependent variable array following filtering :rtype: X .. py:function:: assign_significance(row) .. py:function:: wald_test(theta_mle: Union[float, numpy.ndarray], theta_sd: Union[float, numpy.ndarray], theta0: Union[float, numpy.ndarray] = 0) -> numpy.ndarray Perform Wald test, informing whether a given coefficient deviates significantly from the supplied reference value (theta0), based on the standard deviation of the posterior of the parameter estimate. Function from diffxpy: https://github.com/theislab/diffxpy :param theta_mle: Maximum likelihood estimation of given parameter by feature :param theta_sd: Standard deviation of the maximum likelihood estimation :param theta0: Value(s) to test theta_mle against. Must be either a single number or an array w/ equal number of entries to theta_mle. :returns: p-values for each feature, indicating whether the feature's coefficient deviates significantly from the reference value :rtype: pvals .. py:function:: multitesting_correction(pvals: Union[List[float], numpy.ndarray], method: str = 'fdr_bh', alpha: float = 0.05) -> numpy.ndarray In the case of testing multiple hypotheses from the same experiment, perform multiple test correction to adjust q-values. Function from diffxpy: https://github.com/theislab/diffxpy :param pvals: Uncorrected p-values; must be given as a one-dimensional array :param method: Method to use for correction. Available methods can be found in the documentation for statsmodels.stats.multitest.multipletests(), and are also listed below (in correct case) for convenience: - Named methods: - bonferroni - sidak - holm-sidak - holm - simes-hochberg - hommel - Abbreviated methods: - fdr_bh: Benjamini-Hochberg correction - fdr_by: Benjamini-Yekutieli correction - fdr_tsbh: Two-stage Benjamini-Hochberg - fdr_tsbky: Two-stage Benjamini-Krieger-Yekutieli method :param alpha: Family-wise error rate (FWER) Returns qval: p-values post-correction .. py:function:: get_fisher_inverse(x: numpy.ndarray, y: numpy.ndarray) -> numpy.ndarray Computes the Fisher matrix that measures the amount of information each feature in x provides about y- that is, whether the log-likelihood is sensitive to change in the parameter x. Function derived from diffxpy: https://github.com/theislab/diffxpy :param x: Array of shape [n_samples, n_features]; independent variable array :param fitted: Array of shape [n_samples, 1] or [n_samples, n_variables]; estimated dependent variable :returns: np.ndarray :rtype: inverse_fisher .. py:function:: run_permutation_test(data, thresh, subset_rows=None, subset_cols=None) Permutes the input data array and calculates whether the mean of the permuted array is higher than the provided value. :param data: Input array or sparse matrix :param thresh: Value to compare against the permuted means :param subset_rows: Optional indices to subset the rows of 'data' :param subset_cols: Optional indices to subset the columns of 'data' :returns: Boolean indicating whether the mean of the permuted data is higher than 'thresh' :rtype: is_higher .. py:function:: permutation_testing(data: Union[numpy.ndarray, scipy.sparse.csr_matrix], n_permutations: int = 10000, n_jobs: int = 1, subset_rows: Optional[Union[numpy.ndarray, List[int]]] = None, subset_cols: Optional[Union[numpy.ndarray, List[int]]] = None) -> float Permutes the input array and calculates the p-value based on the number of times the mean of the permuted array is higher than the provided value. :param data: Input array or sparse matrix :param n_permutations: Number of permutations :param n_jobs: Number of parallel jobs to use (-1 for all available CPUs) :param subset_rows: Optional indices to subset the rows of 'data' (to take the mean value of only the subset of interest) :param subset_cols: Optional indices to subset the columns of 'data' (to take the mean value of only the subset of interest) :returns: The calculated p-value. :rtype: pval .. py:function:: mae(y_true, y_pred) -> float Mean absolute error- in this context, actually log1p mean absolute error :param y_true: Regression model output :param y_pred: Observed values for the dependent variable :returns: Mean absolute error value across all samples :rtype: mae .. py:function:: mse(y_true, y_pred) -> float Mean squared error- in this context, actually log1p mean squared error :param y_true: Regression model output :param y_pred: Observed values for the dependent variable :returns: Mean squared error value across all samples :rtype: mse