spateo.tools.CCI_effects_modeling.regression_utils#

Auxiliary functions used by the interpretation functions for the spatial and spatially-lagged regression models.

Module Contents#

Functions#

sparse_dot(a, b[, return_array])

Matrix multiplication function to deal with sparse and dense objects

sparse_element_by_element(a, b[, return_array])

Element-by-element multiplication function to deal with either sparse or dense objects.

sparse_minmax_scale(a)

Column-wise minmax scaling of a sparse matrix.

sparse_add_pseudocount(a[, pseudocount])

Add pseudocount to sparse matrix.

compute_betas(y, x[, ridge_lambda, clip])

Maximum likelihood estimation procedure, to be used in iteratively weighted least squares to compute the regression coefficients for a given set of dependent and independent variables.

compute_betas_local(y, x, w[, ridge_lambda, clip])

Maximum likelihood estimation procedure, to be used in iteratively weighted least squares to compute the regression coefficients while accounting for spatial heterogeneity in the dependent variable.

iwls(y, x[, distr, init_betas, offset, tol, clip, ...])

Iteratively weighted least squares (IWLS) algorithm to compute the regression coefficients for a given set of dependent and independent variables.

weighted_binary_crossentropy(y_true, y_pred[, ...])

Custom binary cross-entropy loss function with class weights.

logistic_objective(threshold, proba, y_true)

For binomial regression models with IWLS, the objective function is the weighted sum of recall and specificity.

golden_section_search(func, a, b[, tol, min_or_max])

Find the extremum of a function within a specified range using Golden Section Search.

library_scaling_factors([offset, counts, distr])

Get offset values to account for differences in library sizes when comparing expression levels between samples.

softplus(z)

Numerically stable version of log(1 + exp(z)).

multicollinearity_check(X[, thresh, logger])

Checks for multicollinearity in dependent variable array, and drops the most multicollinear features until all features have VIF less than a given threshold.

assign_significance(row)

wald_test(→ numpy.ndarray)

Perform Wald test, informing whether a given coefficient deviates significantly from the supplied reference value.

multitesting_correction(→ numpy.ndarray)

In the case of testing multiple hypotheses from the same experiment, perform multiple test correction to adjust q-values.

get_fisher_inverse(→ numpy.ndarray)

Computes the Fisher matrix that measures the amount of information each feature in x provides about y; that is, whether the log-likelihood is sensitive to change in the parameter x.

run_permutation_test(data, thresh[, subset_rows, ...])

Permutes the input data array and calculates whether the mean of the permuted array is higher than the provided value.

permutation_testing(→ float)

Permutes the input array and calculates the p-value based on the number of times the mean of the permuted array is higher than the provided value.

mae(→ float)

Mean absolute error; in this context, the log1p mean absolute error.

mse(→ float)

Mean squared error; in this context, the log1p mean squared error.

spateo.tools.CCI_effects_modeling.regression_utils.sparse_dot(a: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, b: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, return_array: bool = True)[source]#

Matrix multiplication function to deal with sparse and dense objects

Parameters:
a

First of two matrices to be multiplied, can be sparse or dense

b

Second of two matrices to be multiplied, can be sparse or dense

return_array

Set True to return dense array

Returns:

Matrix product of a and b

Return type:

prod
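
A minimal usage sketch based on the signature documented above; the random matrices here are placeholders:

```python
import numpy as np
from scipy import sparse

from spateo.tools.CCI_effects_modeling.regression_utils import sparse_dot

# One sparse and one dense operand; the helper handles either combination.
a = sparse.random(100, 20, density=0.1, format="csr")
b = np.random.default_rng(0).normal(size=(20, 5))

prod = sparse_dot(a, b, return_array=True)  # dense array of shape (100, 5)
```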

spateo.tools.CCI_effects_modeling.regression_utils.sparse_element_by_element(a: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, b: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, return_array: bool = True)[source]#

Element-by-element multiplication function to deal with either sparse or dense objects.

Parameters:
a

First of two matrices to be multiplied, can be sparse or dense

b

Second of two matrices to be multiplied, can be sparse or dense

return_array

Set True to return dense array

Returns:

Element-wise multiplied product of a and b

Return type:

prod

spateo.tools.CCI_effects_modeling.regression_utils.sparse_minmax_scale(a: scipy.sparse.csr_matrix | scipy.sparse.csc_matrix)[source]#

Column-wise minmax scaling of a sparse matrix.

spateo.tools.CCI_effects_modeling.regression_utils.sparse_add_pseudocount(a: scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, pseudocount: float = 1.0)[source]#

Add pseudocount to sparse matrix.
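
As a point of reference for what these two helpers are documented to do, a dense NumPy sketch of column-wise min-max scaling and of adding a pseudocount to the stored entries of a sparse matrix; the library's internals may differ, e.g. in how all-zero columns or implicit zeros are treated:

```python
import numpy as np
from scipy import sparse

a = sparse.csr_matrix(np.array([[0.0, 2.0], [4.0, 6.0], [8.0, 10.0]]))

# Column-wise min-max scaling (dense reference computation).
dense = a.toarray()
col_min, col_max = dense.min(axis=0), dense.max(axis=0)
col_range = np.where(col_max > col_min, col_max - col_min, 1.0)
scaled = (dense - col_min) / col_range

# Pseudocount added to the explicitly stored (nonzero) entries; whether implicit
# zeros also receive the pseudocount is an implementation detail not shown here.
with_pseudo = a.copy()
with_pseudo.data = with_pseudo.data + 1.0
```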

spateo.tools.CCI_effects_modeling.regression_utils.compute_betas(y: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, x: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, ridge_lambda: float = 0.0, clip: float = 5.0)[source]#

Maximum likelihood estimation procedure, to be used in iteratively weighted least squares to compute the regression coefficients for a given set of dependent and independent variables. Can be combined with either Lasso (L1), Ridge (L2), or Elastic Net (L1 + L2) regularization.

Source: Iteratively (Re)weighted Least Squares (IWLS), Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). Geographically weighted regression: the analysis of spatially varying relationships.

Parameters:
y

Array of shape [n_samples,]; dependent variable

x

Array of shape [n_samples, n_features]; independent variables

ridge_lambda

Regularization parameter for Ridge regression. Higher values will tend to shrink coefficients further towards zero.

clip

Float; upper and lower bound to constrain betas and prevent numerical overflow

Returns:

Array of shape [n_features,]; regression coefficients

Return type:

betas
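
The ridge-penalized least-squares solution described here can be written as betas = (X^T X + lambda * I)^-1 X^T y. A small NumPy sketch of that closed form (illustrative, not the library's exact code path):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))                      # [n_samples, n_features]
y = x @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

ridge_lambda = 0.1
xtx = x.T @ x + ridge_lambda * np.eye(x.shape[1])  # X^T X + lambda * I
betas = np.linalg.solve(xtx, x.T @ y)              # closed-form ridge solution
betas = np.clip(betas, -5.0, 5.0)                  # mirrors the documented `clip` bound
```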

spateo.tools.CCI_effects_modeling.regression_utils.compute_betas_local(y: numpy.ndarray, x: numpy.ndarray, w: numpy.ndarray, ridge_lambda: float = 0.0, clip: float | None = None)[source]#

Maximum likelihood estimation procedure, to be used in iteratively weighted least squares to compute the regression coefficients for a given set of dependent and independent variables while accounting for spatial heterogeneity in the dependent variable.

Source: Iteratively (Re)weighted Least Squares (IWLS), Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). Geographically weighted regression: the analysis of spatially varying relationships.

Parameters:
y

Array of shape [n_samples,]; dependent variable

x

Array of shape [n_samples, n_features]; independent variables

ridge_lambda

Regularization parameter for Ridge regression. Higher values will tend to shrink coefficients further towards zero.

w

Array of shape [n_samples, 1]; spatial weights matrix

clip

Float; upper and lower bound to constrain betas and prevent numerical overflow

Returns:

Array of shape [n_features,]; regression coefficients

pseudoinverse: Array of shape [n_samples, n_samples]; Moore-Penrose pseudoinverse of the X matrix

cov_inverse: Array of shape [n_samples, n_samples]; inverse of the covariance matrix

Return type:

betas
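
A sketch of the corresponding geographically weighted (GWR-style) estimate for a single location i, betas_i = (X^T W X + lambda * I)^-1 X^T W y, where W holds the spatial weights; illustrative only, with a synthetic kernel standing in for real spatial weights:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(150, 3))                      # [n_samples, n_features]
y = x @ np.array([0.8, -1.2, 0.3]) + rng.normal(scale=0.1, size=150)
w = np.exp(-rng.uniform(size=(150, 1)))            # spatial kernel weights around location i

ridge_lambda = 0.0
xw = x * w                                         # row-wise weighting: W X
xtwx = xw.T @ x + ridge_lambda * np.eye(x.shape[1])
pseudoinverse = np.linalg.solve(xtwx, xw.T)        # maps y -> local betas
betas = pseudoinverse @ y
```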

spateo.tools.CCI_effects_modeling.regression_utils.iwls(y: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, x: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, distr: Literal[gaussian, poisson, nb, binomial] = 'gaussian', init_betas: numpy.ndarray | None = None, offset: numpy.ndarray | None = None, tol: float = 1e-08, clip: float | numpy.ndarray | None = None, threshold: float = 0.0001, max_iter: int = 200, spatial_weights: numpy.ndarray | None = None, i: int | None = None, link: spateo.tools.CCI_effects_modeling.distributions.Link | None = None, ridge_lambda: float | None = None, mask: numpy.ndarray | None = None)[source]#

Iteratively weighted least squares (IWLS) algorithm to compute the regression coefficients for a given set of dependent and independent variables.

Source: Iteratively (Re)weighted Least Squares (IWLS), Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). Geographically weighted regression: the analysis of spatially varying relationships.

Parameters:
y

Array of shape [n_samples, 1]; dependent variable

x

Array of shape [n_samples, n_features]; independent variables

distr

Distribution family for the dependent variable; one of “gaussian”, “poisson”, “nb”, “binomial”

init_betas

Array of shape [n_features,]; initial regression coefficients

offset

Optional array of shape [n_samples,]; if provided, will be added to the linear predictor. This is meant to deal with differences in scale that are not caused by the predictor variables, e.g. by differences in library size

tol

Convergence tolerance

clip

Sets magnitude of the upper and lower bound to constrain betas and prevent numerical overflow. Either one floating point value or an array for sample-specific clipping.

threshold

Coefficients with absolute values below this threshold will be set to zero (as these are insignificant)

max_iter

Maximum number of iterations if convergence is not reached

spatial_weights

Array of shape [n_samples, 1]; weights to transform observations from location i for a geographically-weighted regression

i

Optional integer; index of the observation to be used as the center of the geographically-weighted regression. Required if “clip” is an array.

link

Link function for the distribution family. If None, will default to the default value for the specified distribution family.

variance

Variance function for the distribution family. If None, will default to the default value for the specified distribution family.

ridge_lambda

Ridge regularization parameter.

Returns:

Array of shape [n_features, 1]; regression coefficients

y_hat: Array of shape [n_samples, 1]; predicted values of the dependent variable

wx: Array of shape [n_samples, 1]; weighted independent variables

n_iter: Number of iterations completed upon convergence

w_final: Array of shape [n_samples, 1]; final spatial weights used for IWLS.

linear_predictor_final: Array of shape [n_samples, 1]; final unadjusted linear predictor used for IWLS. Only returned if “spatial_weights” is not None.

adjusted_predictor_final: Array of shape [n_samples, 1]; final adjusted linear predictor used for IWLS. Only returned if “spatial_weights” is not None.

pseudoinverse: Array of shape [n_samples, n_samples]; optional influence matrix that is only returned if “spatial_weights” is not None. The pseudoinverse is the Moore-Penrose pseudoinverse of the X matrix.

inv: Array of shape [n_samples, n_samples]; the inverse covariance matrix (for Gaussian modeling) or the inverse Fisher matrix (for GLM models).

Return type:

betas
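
A hypothetical call following the documented signature; because the returned tuple grows when spatial_weights is supplied (see the Returns section above), only the first element is unpacked here:

```python
import numpy as np
from spateo.tools.CCI_effects_modeling.regression_utils import iwls

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))
rate = np.exp(x @ np.array([0.4, -0.2, 0.1]))
y = rng.poisson(lam=rate).reshape(-1, 1)           # [n_samples, 1] counts

result = iwls(y, x, distr="poisson", tol=1e-8, max_iter=200)
betas = result[0]                                  # remaining elements per the Returns section above
```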

spateo.tools.CCI_effects_modeling.regression_utils.weighted_binary_crossentropy(y_true: numpy.ndarray, y_pred: numpy.ndarray, weight_0: float = 1.0, weight_1: float = 1.0)[source]#

Custom binary cross-entropy loss function with class weights.

Parameters:
y_true

True binary labels

y_pred

Predicted probabilities

weight_0

Weight for class 0 (negative class)

weight_1

Weight for class 1 (positive class)

Returns:

Weighted binary cross-entropy loss
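
For reference, class-weighted binary cross-entropy has the form -mean(w1 * y * log(p) + w0 * (1 - y) * log(1 - p)). An illustrative re-implementation (not necessarily identical to the library's):

```python
import numpy as np

def weighted_bce(y_true, y_pred, weight_0=1.0, weight_1=1.0, eps=1e-7):
    # Clip probabilities away from 0/1 to keep the logs finite.
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(weight_1 * y_true * np.log(p) + weight_0 * (1 - y_true) * np.log(1.0 - p))

loss = weighted_bce(np.array([0, 1, 1, 0]), np.array([0.1, 0.8, 0.6, 0.3]), weight_1=2.0)
```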

spateo.tools.CCI_effects_modeling.regression_utils.logistic_objective(threshold: float, proba: numpy.ndarray, y_true: numpy.ndarray)[source]#

For binomial regression models with IWLS, the objective function is the weighted sum of recall and specificity.

Parameters:
threshold

Threshold for converting predicted probabilities to binary predictions

proba

Predicted probabilities from logistic model

y_true

True binary labels

Returns:

Weighted sum of recall and specificity

Return type:

score
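
An illustrative version of such a threshold objective, computing recall and specificity at a candidate threshold; the library's exact weighting and sign convention are not shown here, so treat this purely as a sketch:

```python
import numpy as np

def threshold_score(threshold, proba, y_true, w_recall=0.5, w_spec=0.5):
    # Binarize predicted probabilities at the candidate threshold.
    y_pred = (proba >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return w_recall * recall + w_spec * specificity
```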

spateo.tools.CCI_effects_modeling.regression_utils.golden_section_search(func, a, b[, tol, min_or_max])[source]#

Find the extremum of a function within a specified range using Golden Section Search.

Parameters:
func

The function to find the extremum of.

a

Lower bound of the range.

b

Upper bound of the range.

tol

Tolerance for stopping criterion.

min_or_max

Whether to find the minimum or maximum of the function.

Returns:

The x-value of the function’s extremum.
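
A compact golden-section search, assuming a unimodal function on [a, b]; illustrative, and defaults such as the tolerance are placeholders rather than the library's values:

```python
import math

def golden_section(func, a, b, tol=1e-5, min_or_max="min"):
    invphi = (math.sqrt(5) - 1) / 2                # 1 / golden ratio
    sign = 1.0 if min_or_max == "min" else -1.0    # flip comparisons to maximize
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while abs(b - a) > tol:
        if sign * func(c) < sign * func(d):
            b = d                                  # extremum lies in [a, d]
        else:
            a = c                                  # extremum lies in [c, b]
        c, d = b - invphi * (b - a), a + invphi * (b - a)
    return (a + b) / 2

best_threshold = golden_section(lambda t: (t - 0.37) ** 2, 0.0, 1.0)  # ~0.37
```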

spateo.tools.CCI_effects_modeling.regression_utils.library_scaling_factors(offset: numpy.ndarray | None = None, counts: numpy.ndarray | scipy.sparse.csr_matrix | None = None, distr: Literal[gaussian, poisson, nb] = 'gaussian')[source]#

Get offset values to account for differences in library sizes when comparing expression levels between samples.

If the offset is not provided, it calculates the offset based on library sizes. The offset is the logarithm of the library sizes.

Parameters:
offset

Offset values. If provided, it is returned as is. If None, the offset is calculated based on library sizes.

counts

Gene expression array

distr

Distribution of the data. Defaults to “gaussian”, but can also be “poisson” or “nb”.

Returns:

Array of shape [n_samples, ] containing offset values.

Return type:

offset
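
A dense sketch of the documented behavior when no offset is supplied and a count distribution is used: the offset is the log of each sample's library size:

```python
import numpy as np

counts = np.array([[3, 0, 5],
                   [10, 2, 8],
                   [1, 1, 2]])                 # [n_samples, n_genes]
library_sizes = counts.sum(axis=1)             # total counts per sample
offset = np.log(library_sizes)                 # shape [n_samples,]
```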

spateo.tools.CCI_effects_modeling.regression_utils.softplus(z: numpy.ndarray)[source]#

Numerically stable version of log(1 + exp(z)).
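
A standard numerically stable formulation, shown for reference; the library's implementation may differ in detail:

```python
import numpy as np

def softplus_stable(z):
    # Equivalent to log(1 + exp(z)) without overflow for large |z|.
    return np.maximum(z, 0) + np.log1p(np.exp(-np.abs(z)))
```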

spateo.tools.CCI_effects_modeling.regression_utils.multicollinearity_check(X: pandas.DataFrame, thresh: float = 5.0, logger: Optional = None)[source]#

Checks for multicollinearity in dependent variable array, and drops the most multicollinear features until all features have VIF less than a given threshold.

Parameters:
X

Dependent variable array, in dataframe format

thresh

VIF threshold; features with values greater than this value will be removed from the regression

logger

If not provided, will create a new logger

Returns:

Dependent variable array following filtering

Return type:

X
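
A sketch of VIF-based filtering using statsmodels, which is the usual way such a check is implemented; this is not necessarily the library's exact loop:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, thresh: float = 5.0) -> pd.DataFrame:
    X = X.copy()
    while X.shape[1] > 1:
        vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
        worst = max(range(len(vifs)), key=lambda i: vifs[i])
        if vifs[worst] <= thresh:
            break
        X = X.drop(columns=[X.columns[worst]])  # remove the most collinear feature
    return X
```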

spateo.tools.CCI_effects_modeling.regression_utils.assign_significance(row)[source]#
spateo.tools.CCI_effects_modeling.regression_utils.wald_test(theta_mle: float | numpy.ndarray, theta_sd: float | numpy.ndarray, theta0: float | numpy.ndarray = 0) numpy.ndarray[source]#

Perform Wald test, informing whether a given coefficient deviates significantly from the supplied reference value (theta0), based on the standard deviation of the posterior of the parameter estimate.

Function from diffxpy: https://github.com/theislab/diffxpy

Parameters:
theta_mle

Maximum likelihood estimation of given parameter by feature

theta_sd

Standard deviation of the maximum likelihood estimation

theta0

Value(s) to test theta_mle against. Must be either a single number or an array w/ equal number of entries to theta_mle.

Returns:

p-values for each feature, indicating whether the feature’s coefficient deviates significantly from the reference value

Return type:

pvals
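
The two-sided Wald p-value can be reproduced from the standard normal CDF; a small reference computation (illustrative, with made-up estimates):

```python
import numpy as np
from scipy import stats

theta_mle = np.array([0.8, -0.1])      # per-feature MLEs
theta_sd = np.array([0.2, 0.3])        # their standard deviations
theta0 = 0.0                           # reference value

wald_stat = (theta_mle - theta0) / theta_sd
pvals = 2.0 * (1.0 - stats.norm.cdf(np.abs(wald_stat)))
```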

spateo.tools.CCI_effects_modeling.regression_utils.multitesting_correction(pvals: List[float] | numpy.ndarray, method: str = 'fdr_bh', alpha: float = 0.05) numpy.ndarray[source]#

In the case of testing multiple hypotheses from the same experiment, perform multiple test correction to adjust q-values.

Function from diffxpy: https://github.com/theislab/diffxpy

Parameters:
pvals

Uncorrected p-values; must be given as a one-dimensional array

method

Method to use for correction. Available methods can be found in the documentation for statsmodels.stats.multitest.multipletests(), and are also listed below (in correct case) for convenience:

  • Named methods:
    • bonferroni

    • sidak

    • holm-sidak

    • holm

    • simes-hochberg

    • hommel

  • Abbreviated methods:
    • fdr_bh: Benjamini-Hochberg correction

    • fdr_by: Benjamini-Yekutieli correction

    • fdr_tsbh: Two-stage Benjamini-Hochberg

    • fdr_tsbky: Two-stage Benjamini-Krieger-Yekutieli method

alpha

Family-wise error rate (FWER)

Returns:

p-values post-correction

Return type:

qval
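
Since the listed methods are the ones exposed by statsmodels.stats.multitest.multipletests(), a direct call to that statsmodels function with one of the listed methods looks like:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.04, 0.2, 0.8])
reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
```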

spateo.tools.CCI_effects_modeling.regression_utils.get_fisher_inverse(x: numpy.ndarray, y: numpy.ndarray) numpy.ndarray[source]#

Computes the Fisher matrix that measures the amount of information each feature in x provides about y; that is, whether the log-likelihood is sensitive to change in the parameter x.

Function derived from diffxpy: https://github.com/theislab/diffxpy

Parameters:
x

Array of shape [n_samples, n_features]; independent variable array

y

Array of shape [n_samples, 1] or [n_samples, n_variables]; estimated dependent variable

Returns:

The inverse Fisher information matrix, as an np.ndarray

Return type:

inverse_fisher
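
For the linear-Gaussian case, the Fisher information is X^T X / sigma^2 and its inverse is the coefficient covariance used downstream in the Wald test. A reference computation (illustrative, not the library's exact code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = x @ np.array([1.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=100)

beta_hat = np.linalg.lstsq(x, y, rcond=None)[0]
sigma2 = np.var(y - x @ beta_hat)                  # residual variance estimate
inverse_fisher = np.linalg.inv(x.T @ x / sigma2)   # coefficient covariance estimate
```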

spateo.tools.CCI_effects_modeling.regression_utils.run_permutation_test(data, thresh, subset_rows=None, subset_cols=None)[source]#

Permutes the input data array and calculates whether the mean of the permuted array is higher than the provided value.

Parameters:
data

Input array or sparse matrix

thresh

Value to compare against the permuted means

subset_rows

Optional indices to subset the rows of ‘data’

subset_cols

Optional indices to subset the columns of ‘data’

Returns:

Boolean indicating whether the mean of the permuted data is higher than ‘thresh’

Return type:

is_higher

spateo.tools.CCI_effects_modeling.regression_utils.permutation_testing(data: numpy.ndarray | scipy.sparse.csr_matrix, n_permutations: int = 10000, n_jobs: int = 1, subset_rows: numpy.ndarray | List[int] | None = None, subset_cols: numpy.ndarray | List[int] | None = None) → float[source]#

Permutes the input array and calculates the p-value based on the number of times the mean of the permuted array is higher than the provided value.

Parameters:
data

Input array or sparse matrix

n_permutations

Number of permutations

n_jobs

Number of parallel jobs to use (-1 for all available CPUs)

subset_rows

Optional indices to subset the rows of ‘data’ (to take the mean value of only the subset of interest)

subset_cols

Optional indices to subset the columns of ‘data’ (to take the mean value of only the subset of interest)

Returns:

The calculated p-value.

Return type:

pval
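
An illustrative permutation p-value for a subset of rows and columns, computed the way the two functions above describe (shuffle the data, compare the subset mean to the observed one); the "+1" smoothing is a common convention and an assumption here, not necessarily the library's:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 50))
rows, cols = np.arange(20), np.arange(5)           # subset of interest
observed = data[np.ix_(rows, cols)].mean()

n_permutations = 1000
perm_means = np.empty(n_permutations)
for k in range(n_permutations):
    shuffled = rng.permutation(data.ravel()).reshape(data.shape)
    perm_means[k] = shuffled[np.ix_(rows, cols)].mean()

# "+1" smoothing keeps the p-value strictly positive (assumed convention).
pval = (np.sum(perm_means >= observed) + 1) / (n_permutations + 1)
```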

spateo.tools.CCI_effects_modeling.regression_utils.mae(y_true, y_pred) float[source]#

Mean absolute error; in this context, the log1p mean absolute error.

Parameters:
y_true

Regression model output

y_pred

Observed values for the dependent variable

Returns:

Mean absolute error value across all samples

Return type:

mae

spateo.tools.CCI_effects_modeling.regression_utils.mse(y_true, y_pred) float[source]#

Mean squared error; in this context, the log1p mean squared error.

Parameters:
y_true

Regression model output

y_pred

Observed values for the dependent variable

Returns:

Mean squared error value across all samples

Return type:

mse
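
Reference formulas for the two log1p-transformed metrics described above (illustrative re-implementations, not the library's code):

```python
import numpy as np

def log1p_mae(y_true, y_pred):
    # Mean absolute error on log1p-transformed values.
    return float(np.mean(np.abs(np.log1p(y_true) - np.log1p(y_pred))))

def log1p_mse(y_true, y_pred):
    # Mean squared error on log1p-transformed values.
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))
```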