spateo.tools.CCI_effects_modeling.regression_utils¶
Auxiliary functions to aid in the interpretation functions for the spatial and spatially-lagged regression models.
Functions¶
|
Matrix multiplication function to deal with sparse and dense objects |
|
Element-by-element multiplication function to deal with either sparse or dense objects. |
Column-wise minmax scaling of a sparse matrix. |
|
|
Add pseudocount to sparse matrix. |
|
Maximum likelihood estimation procedure, to be used in iteratively weighted least squares to compute the |
|
Maximum likelihood estimation procedure, to be used in iteratively weighted least squares to compute the |
|
Iteratively weighted least squares (IWLS) algorithm to compute the regression coefficients for a given set of |
|
Custom binary cross-entropy loss function with class weights. |
|
For binomial regression models with IWLS, the objective function is the weighted sum of recall and specificity. |
|
Find the extremum of a function within a specified range using Golden Section Search. |
|
Get offset values to account for differences in library sizes when comparing expression levels between samples. |
|
Numerically stable version of log(1 + exp(z)). |
|
Checks for multicollinearity in dependent variable array, and drops the most multicollinear features until |
|
|
|
Perform Wald test, informing whether a given coefficient deviates significantly from the |
|
In the case of testing multiple hypotheses from the same experiment, perform multiple test correction to adjust |
|
Computes the Fisher matrix that measures the amount of information each feature in x provides about y- that is, |
|
Permutes the input data array and calculates whether the mean of the permuted array is higher than the |
|
Permutes the input array and calculates the p-value based on the number of times the mean of the permuted |
|
Mean absolute error- in this context, actually log1p mean absolute error |
|
Mean squared error- in this context, actually log1p mean squared error |
Module Contents¶
- spateo.tools.CCI_effects_modeling.regression_utils.sparse_dot(a: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, b: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, return_array: bool = True)[source]¶
Matrix multiplication function to deal with sparse and dense objects
- Parameters:
- a
First of two matrices to be multiplied, can be sparse or dense
- b
Second of two matrices to be multiplied, can be sparse or dense
- return_array
Set True to return dense array
- Returns:
Matrix product of a and b
- Return type:
prod
- spateo.tools.CCI_effects_modeling.regression_utils.sparse_element_by_element(a: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, b: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, return_array: bool = True)[source]¶
Element-by-element multiplication function to deal with either sparse or dense objects.
- Parameters:
- a
First of two matrices to be multiplied, can be sparse or dense
- b
Second of two matrices to be multiplied, can be sparse or dense
- return_array
Set True to return dense array
- Returns:
Element-wise multiplied product of a and b
- Return type:
prod
- spateo.tools.CCI_effects_modeling.regression_utils.sparse_minmax_scale(a: scipy.sparse.csr_matrix | scipy.sparse.csc_matrix)[source]¶
Column-wise minmax scaling of a sparse matrix.
- spateo.tools.CCI_effects_modeling.regression_utils.sparse_add_pseudocount(a: scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, pseudocount: float = 1.0)[source]¶
Add pseudocount to sparse matrix.
- spateo.tools.CCI_effects_modeling.regression_utils.compute_betas(y: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, x: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, ridge_lambda: float = 0.0, clip: float = 5.0)[source]¶
Maximum likelihood estimation procedure, to be used in iteratively weighted least squares to compute the regression coefficients for a given set of dependent and independent variables. Can be combined with either Lasso (L1), Ridge (L2), or Elastic Net (L1 + L2) regularization.
Source: Iteratively (Re)weighted Least Squares (IWLS), Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). Geographically weighted regression: the analysis of spatially varying relationships.
- Parameters:
- y
Array of shape [n_samples,]; dependent variable
- x
Array of shape [n_samples, n_features]; independent variables
- ridge_lambda
Regularization parameter for Ridge regression. Higher values will tend to shrink coefficients further towards zero.
- clip
Float; upper and lower bound to constrain betas and prevent numerical overflow
- Returns:
Array of shape [n_features,]; regression coefficients
- Return type:
betas
- spateo.tools.CCI_effects_modeling.regression_utils.compute_betas_local(y: numpy.ndarray, x: numpy.ndarray, w: numpy.ndarray, ridge_lambda: float = 0.0, clip: float | None = None)[source]¶
Maximum likelihood estimation procedure, to be used in iteratively weighted least squares to compute the regression coefficients for a given set of dependent and independent variables while accounting for spatial heterogeneity in the dependent variable.
Source: Iteratively (Re)weighted Least Squares (IWLS), Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). Geographically weighted regression: the analysis of spatially varying relationships.
- Parameters:
- y
Array of shape [n_samples,]; dependent variable
- x
Array of shape [n_samples, n_features]; independent variables
- ridge_lambda
Regularization parameter for Ridge regression. Higher values will tend to shrink coefficients further towards zero.
- w
Array of shape [n_samples, 1]; spatial weights matrix
- clip
Float; upper and lower bound to constrain betas and prevent numerical overflow
- Returns:
Array of shape [n_features,]; regression coefficients pseudoinverse: Array of shape [n_samples, n_samples]; Moore-Penrose pseudoinverse of the X matrix cov_inverse: Array of shape [n_samples, n_samples]; inverse of the covariance matrix
- Return type:
betas
- spateo.tools.CCI_effects_modeling.regression_utils.iwls(y: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, x: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, distr: Literal['gaussian', 'poisson', 'nb', 'binomial'] = 'gaussian', init_betas: numpy.ndarray | None = None, offset: numpy.ndarray | None = None, tol: float = 1e-08, clip: float | numpy.ndarray | None = None, threshold: float = 0.0001, max_iter: int = 200, spatial_weights: numpy.ndarray | None = None, i: int | None = None, link: spateo.tools.CCI_effects_modeling.distributions.Link | None = None, ridge_lambda: float | None = None, mask: numpy.ndarray | None = None)[source]¶
Iteratively weighted least squares (IWLS) algorithm to compute the regression coefficients for a given set of dependent and independent variables.
Source: Iteratively (Re)weighted Least Squares (IWLS), Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). Geographically weighted regression: the analysis of spatially varying relationships.
- Parameters:
- y
Array of shape [n_samples, 1]; dependent variable
- x
Array of shape [n_samples, n_features]; independent variables
- distr
Distribution family for the dependent variable; one of “gaussian”, “poisson”, “nb”, “binomial”
- init_betas
Array of shape [n_features,]; initial regression coefficients
- offset
Optional array of shape [n_samples,]; if provided, will be added to the linear predictor. This is meant to deal with differences in scale that are not caused by the predictor variables, e.g. by differences in library size
- tol
Convergence tolerance
- clip
Sets magnitude of the upper and lower bound to constrain betas and prevent numerical overflow. Either one floating point value or an array for sample-specific clipping.
- threshold
Coefficients with absolute values below this threshold will be set to zero (as these are insignificant)
- max_iter
Maximum number of iterations if convergence is not reached
- spatial_weights
Array of shape [n_samples, 1]; weights to transform observations from location i for a geographically-weighted regression
- i
Optional integer; index of the observation to be used as the center of the geographically-weighted regression. Required if “clip” is an array.
- link
Link function for the distribution family. If None, will default to the default value for the specified distribution family.
- variance
Variance function for the distribution family. If None, will default to the default value for the specified distribution family.
- ridge_lambda
Ridge regularization parameter.
- Returns:
Array of shape [n_features, 1]; regression coefficients y_hat: Array of shape [n_samples, 1]; predicted values of the dependent variable wx: Array of shape [n_samples, 1]; weighted independent variables n_iter: Number of iterations completed upon convergence w_final: Array of shape [n_samples, 1]; final spatial weights used for IWLS. linear_predictor_final: Array of shape [n_samples, 1]; final unadjusted linear predictor used for IWLS. Only
returned if “spatial_weights” is not None.
- adjusted_predictor_final: Array of shape [n_samples, 1]; final adjusted linear predictor used for IWLS. Only
returned if “spatial_weights” is not None.
- pseudoinverse: Array of shape [n_samples, n_samples]; optional influence matrix that is only returned if
”spatial_weights” is not None. The pseudoinverse is the Moore-Penrose pseudoinverse of the X matrix.
- inv: Array of shape [n_samples, n_samples]; the inverse covariance matrix (for Gaussian modeling) or the
inverse Fisher matrix (for GLM models).
- Return type:
betas
- spateo.tools.CCI_effects_modeling.regression_utils.weighted_binary_crossentropy(y_true: numpy.ndarray, y_pred: numpy.ndarray, weight_0: float = 1.0, weight_1: float = 1.0)[source]¶
Custom binary cross-entropy loss function with class weights.
- Parameters:
- y_true
True binary labels
- y_pred
Predicted probabilities
- weight_0
Weight for class 0 (negative class)
- weight_1
Weight for class 1 (positive class)
- Returns:
Weighted binary cross-entropy loss
- spateo.tools.CCI_effects_modeling.regression_utils.logistic_objective(threshold: float, proba: numpy.ndarray, y_true: numpy.ndarray)[source]¶
For binomial regression models with IWLS, the objective function is the weighted sum of recall and specificity.
- Parameters:
- threshold
Threshold for converting predicted probabilities to binary predictions
- proba
Predicted probabilities from logistic model
- y_true
True binary labels
- Returns:
Weighted sum of recall and specificity
- Return type:
score
- spateo.tools.CCI_effects_modeling.regression_utils.golden_section_search(func: Callable, a: float, b: float, tol: float = 1e-05, min_or_max: str = 'min')[source]¶
Find the extremum of a function within a specified range using Golden Section Search.
- Parameters:
- func
The function to find the extremum of.
- a
Lower bound of the range.
- b
Upper bound of the range.
- tol
Tolerance for stopping criterion.
- min_or_max
Whether to find the minimum or maximum of the function.
- Returns:
The x-value of the function’s extremum.
- spateo.tools.CCI_effects_modeling.regression_utils.library_scaling_factors(offset: numpy.ndarray | None = None, counts: numpy.ndarray | scipy.sparse.csr_matrix | None = None, distr: Literal['gaussian', 'poisson', 'nb'] = 'gaussian')[source]¶
Get offset values to account for differences in library sizes when comparing expression levels between samples.
If the offset is not provided, it calculates the offset based on library sizes. The offset is the logarithm of the library sizes.
- Parameters:
- offset
Offset values. If provided, it is returned as is. If None, the offset is calculated based on library sizes.
- counts
Gene expression array
- distr
Distribution of the data. Defaults to “gaussian”, but can also be “poisson” or “nb”.
- Returns:
Array of shape [n_samples, ] containing offset values.
- Return type:
offset
- spateo.tools.CCI_effects_modeling.regression_utils.softplus(z: numpy.ndarray)[source]¶
Numerically stable version of log(1 + exp(z)).
- spateo.tools.CCI_effects_modeling.regression_utils.multicollinearity_check(X: pandas.DataFrame, thresh: float = 5.0, logger: Optional = None)[source]¶
Checks for multicollinearity in dependent variable array, and drops the most multicollinear features until all features have VIF less than a given threshold.
- Parameters:
- X
Dependent variable array, in dataframe format
- thresh
VIF threshold; features with values greater than this value will be removed from the regression
- logger
If not provided, will create a new logger
- Returns:
Dependent variable array following filtering
- Return type:
X
- spateo.tools.CCI_effects_modeling.regression_utils.wald_test(theta_mle: float | numpy.ndarray, theta_sd: float | numpy.ndarray, theta0: float | numpy.ndarray = 0) numpy.ndarray [source]¶
Perform Wald test, informing whether a given coefficient deviates significantly from the supplied reference value (theta0), based on the standard deviation of the posterior of the parameter estimate.
Function from diffxpy: https://github.com/theislab/diffxpy
- Parameters:
- theta_mle
Maximum likelihood estimation of given parameter by feature
- theta_sd
Standard deviation of the maximum likelihood estimation
- theta0
Value(s) to test theta_mle against. Must be either a single number or an array w/ equal number of entries to theta_mle.
- Returns:
- p-values for each feature, indicating whether the feature’s coefficient deviates significantly from
the reference value
- Return type:
pvals
- spateo.tools.CCI_effects_modeling.regression_utils.multitesting_correction(pvals: List[float] | numpy.ndarray, method: str = 'fdr_bh', alpha: float = 0.05) numpy.ndarray [source]¶
In the case of testing multiple hypotheses from the same experiment, perform multiple test correction to adjust q-values.
Function from diffxpy: https://github.com/theislab/diffxpy
- Parameters:
- pvals
Uncorrected p-values; must be given as a one-dimensional array
- method
Method to use for correction. Available methods can be found in the documentation for statsmodels.stats.multitest.multipletests(), and are also listed below (in correct case) for convenience:
- Named methods:
bonferroni
sidak
holm-sidak
holm
simes-hochberg
hommel
- Abbreviated methods:
fdr_bh: Benjamini-Hochberg correction
fdr_by: Benjamini-Yekutieli correction
fdr_tsbh: Two-stage Benjamini-Hochberg
fdr_tsbky: Two-stage Benjamini-Krieger-Yekutieli method
- alpha
Family-wise error rate (FWER)
- Returns
qval: p-values post-correction
- spateo.tools.CCI_effects_modeling.regression_utils.get_fisher_inverse(x: numpy.ndarray, y: numpy.ndarray) numpy.ndarray [source]¶
Computes the Fisher matrix that measures the amount of information each feature in x provides about y- that is, whether the log-likelihood is sensitive to change in the parameter x.
Function derived from diffxpy: https://github.com/theislab/diffxpy
- Parameters:
- x
Array of shape [n_samples, n_features]; independent variable array
- fitted
Array of shape [n_samples, 1] or [n_samples, n_variables]; estimated dependent variable
- Returns:
np.ndarray
- Return type:
inverse_fisher
- spateo.tools.CCI_effects_modeling.regression_utils.run_permutation_test(data, thresh, subset_rows=None, subset_cols=None)[source]¶
- Permutes the input data array and calculates whether the mean of the permuted array is higher than the
provided value.
- Parameters:
- data
Input array or sparse matrix
- thresh
Value to compare against the permuted means
- subset_rows
Optional indices to subset the rows of ‘data’
- subset_cols
Optional indices to subset the columns of ‘data’
- Returns:
Boolean indicating whether the mean of the permuted data is higher than ‘thresh’
- Return type:
is_higher
- spateo.tools.CCI_effects_modeling.regression_utils.permutation_testing(data: numpy.ndarray | scipy.sparse.csr_matrix, n_permutations: int = 10000, n_jobs: int = 1, subset_rows: numpy.ndarray | List[int] | None = None, subset_cols: numpy.ndarray | List[int] | None = None) float [source]¶
- Permutes the input array and calculates the p-value based on the number of times the mean of the permuted
array is higher than the provided value.
- Parameters:
- data
Input array or sparse matrix
- n_permutations
Number of permutations
- n_jobs
Number of parallel jobs to use (-1 for all available CPUs)
- subset_rows
Optional indices to subset the rows of ‘data’ (to take the mean value of only the subset of interest)
- subset_cols
Optional indices to subset the columns of ‘data’ (to take the mean value of only the subset of interest)
- Returns:
The calculated p-value.
- Return type:
pval
- spateo.tools.CCI_effects_modeling.regression_utils.mae(y_true, y_pred) float [source]¶
Mean absolute error- in this context, actually log1p mean absolute error
- Parameters:
- y_true
Regression model output
- y_pred
Observed values for the dependent variable
- Returns:
Mean absolute error value across all samples
- Return type:
mae
- spateo.tools.CCI_effects_modeling.regression_utils.mse(y_true, y_pred) float [source]¶
Mean squared error- in this context, actually log1p mean squared error
- Parameters:
- y_true
Regression model output
- y_pred
Observed values for the dependent variable
- Returns:
Mean squared error value across all samples
- Return type:
mse