spateo.tools.CCI_effects_modeling.regression_utils#

Auxiliary functions used by the interpretation functions for the spatial and spatially-lagged regression models.

Module Contents#

Functions#

sparse_dot(a, b[, return_array])

Matrix multiplication function to deal with sparse and dense objects

sparse_element_by_element(a, b[, return_array])

Element-by-element multiplication function to deal with either sparse or dense objects.

sparse_minmax_scale(a)

Column-wise minmax scaling of a sparse matrix.

sparse_add_pseudocount(a[, pseudocount])

Add pseudocount to sparse matrix.

compute_betas(y, x[, ridge_lambda, clip])

Maximum likelihood estimation procedure, to be used in iteratively weighted least squares to compute the regression coefficients for a given set of dependent and independent variables.

compute_betas_local(y, x, w[, ridge_lambda, clip])

Maximum likelihood estimation procedure, to be used in iteratively weighted least squares to compute the regression coefficients while accounting for spatial heterogeneity in the dependent variable.

iwls(y, x[, distr, init_betas, offset, tol, clip, ...])

Iteratively weighted least squares (IWLS) algorithm to compute the regression coefficients for a given set of dependent and independent variables.

weighted_binary_crossentropy(y_true, y_pred[, ...])

Custom binary cross-entropy loss function with class weights.

logistic_objective(threshold, proba, y_true)

For binomial regression models with IWLS, the objective function is the weighted sum of recall and specificity.

golden_section_search(func, a, b[, tol, min_or_max])

Find the extremum of a function within a specified range using Golden Section Search.

library_scaling_factors([offset, counts, distr])

Get offset values to account for differences in library sizes when comparing expression levels between samples.

softplus(z)

Numerically stable version of log(1 + exp(z)).

multicollinearity_check(X[, thresh, logger])

Checks for multicollinearity in dependent variable array, and drops the most multicollinear features until all features have VIF less than a given threshold.

assign_significance(row)

wald_test(→ numpy.ndarray)

Perform Wald test, informing whether a given coefficient deviates significantly from the supplied reference value.

multitesting_correction(→ numpy.ndarray)

In the case of testing multiple hypotheses from the same experiment, perform multiple test correction to adjust q-values.

get_fisher_inverse(→ numpy.ndarray)

Computes the Fisher matrix that measures the amount of information each feature in x provides about y; that is, whether the log-likelihood is sensitive to change in the parameter x.

run_permutation_test(data, thresh[, subset_rows, ...])

Permutes the input data array and calculates whether the mean of the permuted array is higher than the provided value.

permutation_testing(→ float)

Permutes the input array and calculates the p-value based on the number of times the mean of the permuted array is higher than the provided value.

mae(→ float)

Mean absolute error; in this context, the log1p mean absolute error.

mse(→ float)

Mean squared error; in this context, the log1p mean squared error.

spateo.tools.CCI_effects_modeling.regression_utils.sparse_dot(a: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, b: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, return_array: bool = True)[source]#

Matrix multiplication function to deal with sparse and dense objects

Parameters:
a

First of two matrices to be multiplied, can be sparse or dense

b

Second of two matrices to be multiplied, can be sparse or dense

return_array

Set True to return dense array

Returns:

Matrix product of a and b

Return type:

prod
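
A minimal usage sketch based on the signature documented above; the random matrices here are placeholders:

```python
import numpy as np
from scipy import sparse

from spateo.tools.CCI_effects_modeling.regression_utils import sparse_dot

# One sparse and one dense operand; the helper handles either combination.
a = sparse.random(100, 20, density=0.1, format="csr")
b = np.random.default_rng(0).normal(size=(20, 5))

prod = sparse_dot(a, b, return_array=True)  # dense array of shape (100, 5)
```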

spateo.tools.CCI_effects_modeling.regression_utils.sparse_element_by_element(a: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, b: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, return_array: bool = True)[source]#

Element-by-element multiplication function to deal with either sparse or dense objects.

Parameters:
a

First of two matrices to be multiplied, can be sparse or dense

b

Second of two matrices to be multiplied, can be sparse or dense

return_array

Set True to return dense array

Returns:

Element-wise multiplied product of a and b

Return type:

prod

spateo.tools.CCI_effects_modeling.regression_utils.sparse_minmax_scale(a: scipy.sparse.csr_matrix | scipy.sparse.csc_matrix)[source]#

Column-wise minmax scaling of a sparse matrix.

spateo.tools.CCI_effects_modeling.regression_utils.sparse_add_pseudocount(a: scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, pseudocount: float = 1.0)[source]#

Add pseudocount to sparse matrix.
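
As a point of reference for what these two helpers are documented to do, a dense NumPy sketch of column-wise min-max scaling and of adding a pseudocount to the stored entries of a sparse matrix; the library's internals may differ, e.g. in how all-zero columns or implicit zeros are treated:

```python
import numpy as np
from scipy import sparse

a = sparse.csr_matrix(np.array([[0.0, 2.0], [4.0, 6.0], [8.0, 10.0]]))

# Column-wise min-max scaling (dense reference computation).
dense = a.toarray()
col_min, col_max = dense.min(axis=0), dense.max(axis=0)
col_range = np.where(col_max > col_min, col_max - col_min, 1.0)
scaled = (dense - col_min) / col_range

# Pseudocount added to the explicitly stored (nonzero) entries; whether implicit
# zeros also receive the pseudocount is an implementation detail not shown here.
with_pseudo = a.copy()
with_pseudo.data = with_pseudo.data + 1.0
```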

spateo.tools.CCI_effects_modeling.regression_utils.compute_betas(y: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, x: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, ridge_lambda: float = 0.0, clip: float = 5.0)[source]#

Maximum likelihood estimation procedure, to be used in iteratively weighted least squares to compute the regression coefficients for a given set of dependent and independent variables. Can be combined with either Lasso (L1), Ridge (L2), or Elastic Net (L1 + L2) regularization.

Source: Iteratively (Re)weighted Least Squares (IWLS), Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). Geographically weighted regression: the analysis of spatially varying relationships.

Parameters:
y

Array of shape [n_samples,]; dependent variable

x

Array of shape [n_samples, n_features]; independent variables

ridge_lambda

Regularization parameter for Ridge regression. Higher values will tend to shrink coefficients further towards zero.

clip

Float; upper and lower bound to constrain betas and prevent numerical overflow

Returns:

Array of shape [n_features,]; regression coefficients

Return type:

betas
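
The ridge-penalized least-squares solution described here can be written as betas = (X^T X + lambda * I)^-1 X^T y. A small NumPy sketch of that closed form (illustrative, not the library's exact code path):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))                      # [n_samples, n_features]
y = x @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

ridge_lambda = 0.1
xtx = x.T @ x + ridge_lambda * np.eye(x.shape[1])  # X^T X + lambda * I
betas = np.linalg.solve(xtx, x.T @ y)              # closed-form ridge solution
betas = np.clip(betas, -5.0, 5.0)                  # mirrors the documented `clip` bound
```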

spateo.tools.CCI_effects_modeling.regression_utils.compute_betas_local(y: numpy.ndarray, x: numpy.ndarray, w: numpy.ndarray, ridge_lambda: float = 0.0, clip: float | None = None)[source]#

Maximum likelihood estimation procedure, to be used in iteratively weighted least squares to compute the regression coefficients for a given set of dependent and independent variables while accounting for spatial heterogeneity in the dependent variable.

Source: Iteratively (Re)weighted Least Squares (IWLS), Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). Geographically weighted regression: the analysis of spatially varying relationships.

Parameters:
y

Array of shape [n_samples,]; dependent variable

x

Array of shape [n_samples, n_features]; independent variables

ridge_lambda

Regularization parameter for Ridge regression. Higher values will tend to shrink coefficients further towards zero.

w

Array of shape [n_samples, 1]; spatial weights matrix

clip

Float; upper and lower bound to constrain betas and prevent numerical overflow

Returns:

Array of shape [n_features,]; regression coefficients

pseudoinverse: Array of shape [n_samples, n_samples]; Moore-Penrose pseudoinverse of the X matrix

cov_inverse: Array of shape [n_samples, n_samples]; inverse of the covariance matrix

Return type:

betas
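
A sketch of the corresponding geographically weighted (GWR-style) estimate for a single location i, betas_i = (X^T W X + lambda * I)^-1 X^T W y, where W holds the spatial weights; illustrative only, with a synthetic kernel standing in for real spatial weights:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(150, 3))                      # [n_samples, n_features]
y = x @ np.array([0.8, -1.2, 0.3]) + rng.normal(scale=0.1, size=150)
w = np.exp(-rng.uniform(size=(150, 1)))            # spatial kernel weights around location i

ridge_lambda = 0.0
xw = x * w                                         # row-wise weighting: W X
xtwx = xw.T @ x + ridge_lambda * np.eye(x.shape[1])
pseudoinverse = np.linalg.solve(xtwx, xw.T)        # maps y -> local betas
betas = pseudoinverse @ y
```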

spateo.tools.CCI_effects_modeling.regression_utils.iwls(y: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, x: numpy.ndarray | scipy.sparse.csr_matrix | scipy.sparse.csc_matrix, distr: Literal[gaussian, poisson, nb, binomial] = 'gaussian', init_betas: numpy.ndarray | None = None, offset: numpy.ndarray | None = None, tol: float = 1e-08, clip: float | numpy.ndarray | None = None, threshold: float = 0.0001, max_iter: int = 200, spatial_weights: numpy.ndarray | None = None, i: int | None = None, link: spateo.tools.CCI_effects_modeling.distributions.Link | None = None, ridge_lambda: float | None = None, mask: numpy.ndarray | None = None)[source]#

Iteratively weighted least squares (IWLS) algorithm to compute the regression coefficients for a given set of dependent and independent variables.

Source: Iteratively (Re)weighted Least Squares (IWLS), Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). Geographically weighted regression: the analysis of spatially varying relationships.

Parameters:
y

Array of shape [n_samples, 1]; dependent variable

x

Array of shape [n_samples, n_features]; independent variables

distr

Distribution family for the dependent variable; one of “gaussian”, “poisson”, “nb”, “binomial”

init_betas

Array of shape [n_features,]; initial regression coefficients

offset

Optional array of shape [n_samples,]; if provided, will be added to the linear predictor. This is meant to deal with differences in scale that are not caused by the predictor variables, e.g. by differences in library size

tol

Convergence tolerance

clip

Sets magnitude of the upper and lower bound to constrain betas and prevent numerical overflow. Either one floating point value or an array for sample-specific clipping.

threshold

Coefficients with absolute values below this threshold will be set to zero (as these are insignificant)

max_iter

Maximum number of iterations if convergence is not reached

spatial_weights

Array of shape [n_samples, 1]; weights to transform observations from location i for a geographically-weighted regression

i

Optional integer; index of the observation to be used as the center of the geographically-weighted regression. Required if “clip” is an array.

link

Link function for the distribution family. If None, will default to the default value for the specified distribution family.

variance

Variance function for the distribution family. If None, will default to the default value for the specified distribution family.

ridge_lambda

Ridge regularization parameter.

Returns:

Array of shape [n_features, 1]; regression coefficients

y_hat: Array of shape [n_samples, 1]; predicted values of the dependent variable

wx: Array of shape [n_samples, 1]; weighted independent variables

n_iter: Number of iterations completed upon convergence

w_final: Array of shape [n_samples, 1]; final spatial weights used for IWLS.

linear_predictor_final: Array of shape [n_samples, 1]; final unadjusted linear predictor used for IWLS. Only returned if “spatial_weights” is not None.

adjusted_predictor_final: Array of shape [n_samples, 1]; final adjusted linear predictor used for IWLS. Only returned if “spatial_weights” is not None.

pseudoinverse: Array of shape [n_samples, n_samples]; optional influence matrix that is only returned if “spatial_weights” is not None. The pseudoinverse is the Moore-Penrose pseudoinverse of the X matrix.

inv: Array of shape [n_samples, n_samples]; the inverse covariance matrix (for Gaussian modeling) or the inverse Fisher matrix (for GLM models).

Return type:

betas
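
A hypothetical call following the documented signature; because the returned tuple grows when spatial_weights is supplied (see the Returns section above), only the first element is unpacked here:

```python
import numpy as np
from spateo.tools.CCI_effects_modeling.regression_utils import iwls

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))
rate = np.exp(x @ np.array([0.4, -0.2, 0.1]))
y = rng.poisson(lam=rate).reshape(-1, 1)           # [n_samples, 1] counts

result = iwls(y, x, distr="poisson", tol=1e-8, max_iter=200)
betas = result[0]                                  # remaining elements per the Returns section above
```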

spateo.tools.CCI_effects_modeling.regression_utils.weighted_binary_crossentropy(y_true: numpy.ndarray, y_pred: numpy.ndarray, weight_0: float = 1.0, weight_1: float = 1.0)[source]#

Custom binary cross-entropy loss function with class weights.

Parameters:
y_true

True binary labels

y_pred

Predicted probabilities

weight_0

Weight for class 0 (negative class)

weight_1

Weight for class 1 (positive class)

Returns:

Weighted binary cross-entropy loss
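
For reference, class-weighted binary cross-entropy has the form -mean(w1 * y * log(p) + w0 * (1 - y) * log(1 - p)). An illustrative re-implementation (not necessarily identical to the library's):

```python
import numpy as np

def weighted_bce(y_true, y_pred, weight_0=1.0, weight_1=1.0, eps=1e-7):
    # Clip probabilities away from 0/1 to keep the logs finite.
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(weight_1 * y_true * np.log(p) + weight_0 * (1 - y_true) * np.log(1.0 - p))

loss = weighted_bce(np.array([0, 1, 1, 0]), np.array([0.1, 0.8, 0.6, 0.3]), weight_1=2.0)
```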

spateo.tools.CCI_effects_modeling.regression_utils.logistic_objective(threshold: float, proba: numpy.ndarray, y_true: numpy.ndarray)[source]#

For binomial regression models with IWLS, the objective function is the weighted sum of recall and specificity.

Parameters:
threshold

Threshold for converting predicted probabilities to binary predictions

proba

Predicted probabilities from logistic model

y_true

True binary labels

Returns:

Weighted sum of recall and specificity

Return type:

score
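
An illustrative version of such a threshold objective, computing recall and specificity at a candidate threshold; the library's exact weighting and sign convention are not shown here, so treat this purely as a sketch:

```python
import numpy as np

def threshold_score(threshold, proba, y_true, w_recall=0.5, w_spec=0.5):
    # Binarize predicted probabilities at the candidate threshold.
    y_pred = (proba >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return w_recall * recall + w_spec * specificity
```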

spateo.tools.CCI_effects_modeling.regression_utils.golden_section_search(func, a, b[, tol, min_or_max])[source]#

Find the extremum of a function within a specified range using Golden Section Search.

Parameters:
func

The function to find the extremum of.

a

Lower bound of the range.

b

Upper bound of the range.

tol

Tolerance for stopping criterion.

min_or_max

Whether to find the minimum or maximum of the function.

Returns:

The x-value of the function’s extremum.
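
A compact golden-section search, assuming a unimodal function on [a, b]; illustrative, and defaults such as the tolerance are placeholders rather than the library's values:

```python
import math

def golden_section(func, a, b, tol=1e-5, min_or_max="min"):
    invphi = (math.sqrt(5) - 1) / 2                # 1 / golden ratio
    sign = 1.0 if min_or_max == "min" else -1.0    # flip comparisons to maximize
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while abs(b - a) > tol:
        if sign * func(c) < sign * func(d):
            b = d                                  # extremum lies in [a, d]
        else:
            a = c                                  # extremum lies in [c, b]
        c, d = b - invphi * (b - a), a + invphi * (b - a)
    return (a + b) / 2

best_threshold = golden_section(lambda t: (t - 0.37) ** 2, 0.0, 1.0)  # ~0.37
```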

spateo.tools.CCI_effects_modeling.regression_utils.library_scaling_factors(offset: numpy.ndarray | None = None, counts: numpy.ndarray | scipy.sparse.csr_matrix | None = None, distr: Literal[gaussian, poisson, nb] = 'gaussian')[source]#

Get offset values to account for differences in library sizes when comparing expression levels between samples.

If the offset is not provided, it calculates the offset based on library sizes. The offset is the logarithm of the library sizes.

Parameters:
offset

Offset values. If provided, it is returned as is. If None, the offset is calculated based on library sizes.

counts

Gene expression array

distr

Distribution of the data. Defaults to “gaussian”, but can also be “poisson” or “nb”.

Returns:

Array of shape [n_samples, ] containing offset values.

Return type:

offset
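
A dense sketch of the documented behavior when no offset is supplied and a count distribution is used: the offset is the log of each sample's library size:

```python
import numpy as np

counts = np.array([[3, 0, 5],
                   [10, 2, 8],
                   [1, 1, 2]])                 # [n_samples, n_genes]
library_sizes = counts.sum(axis=1)             # total counts per sample
offset = np.log(library_sizes)                 # shape [n_samples,]
```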

spateo.tools.CCI_effects_modeling.regression_utils.softplus(z: numpy.ndarray)[source]#

Numerically stable version of log(1 + exp(z)).
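
A standard numerically stable formulation, shown for reference; the library's implementation may differ in detail:

```python
import numpy as np

def softplus_stable(z):
    # Equivalent to log(1 + exp(z)) without overflow for large |z|.
    return np.maximum(z, 0) + np.log1p(np.exp(-np.abs(z)))
```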

spateo.tools.CCI_effects_modeling.regression_utils.multicollinearity_check(X: pandas.DataFrame, thresh: float = 5.0, logger: Optional = None)[source]#

Checks for multicollinearity in dependent variable array, and drops the most multicollinear features until all features have VIF less than a given threshold.

Parameters:
X

Dependent variable array, in dataframe format

thresh

VIF threshold; features with values greater than this value will be removed from the regression

logger

If not provided, will create a new logger

Returns:

Dependent variable array following filtering

Return type:

X
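
A sketch of VIF-based filtering using statsmodels, which is the usual way such a check is implemented; this is not necessarily the library's exact loop:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, thresh: float = 5.0) -> pd.DataFrame:
    X = X.copy()
    while X.shape[1] > 1:
        vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
        worst = max(range(len(vifs)), key=lambda i: vifs[i])
        if vifs[worst] <= thresh:
            break
        X = X.drop(columns=[X.columns[worst]])  # remove the most collinear feature
    return X
```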

spateo.tools.CCI_effects_modeling.regression_utils.assign_significance(row)[source]#
spateo.tools.CCI_effects_modeling.regression_utils.wald_test(theta_mle: float | numpy.ndarray, theta_sd: float | numpy.ndarray, theta0: float | numpy.ndarray = 0) numpy.ndarray[source]#

Perform Wald test, informing whether a given coefficient deviates significantly from the supplied reference value (theta0), based on the standard deviation of the posterior of the parameter estimate.

Function from diffxpy: https://github.com/theislab/diffxpy

Parameters:
theta_mle

Maximum likelihood estimation of given parameter by feature

theta_sd

Standard deviation of the maximum likelihood estimation

theta0

Value(s) to test theta_mle against. Must be either a single number or an array w/ equal number of entries to theta_mle.

Returns:

p-values for each feature, indicating whether the feature’s coefficient deviates significantly from the reference value

Return type:

pvals
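
The two-sided Wald p-value can be reproduced from the standard normal CDF; a small reference computation (illustrative, with made-up estimates):

```python
import numpy as np
from scipy import stats

theta_mle = np.array([0.8, -0.1])      # per-feature MLEs
theta_sd = np.array([0.2, 0.3])        # their standard deviations
theta0 = 0.0                           # reference value

wald_stat = (theta_mle - theta0) / theta_sd
pvals = 2.0 * (1.0 - stats.norm.cdf(np.abs(wald_stat)))
```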

spateo.tools.CCI_effects_modeling.regression_utils.multitesting_correction(pvals: List[float] | numpy.ndarray, method: str = 'fdr_bh', alpha: float = 0.05) numpy.ndarray[source]#

In the case of testing multiple hypotheses from the same experiment, perform multiple test correction to adjust q-values.

Function from diffxpy: https://github.com/theislab/diffxpy

Parameters:
pvals

Uncorrected p-values; must be given as a one-dimensional array

method

Method to use for correction. Available methods can be found in the documentation for statsmodels.stats.multitest.multipletests(), and are also listed below (in correct case) for convenience:

  • Named methods:
    • bonferroni

    • sidak

    • holm-sidak

    • holm

    • simes-hochberg

    • hommel

  • Abbreviated methods:
    • fdr_bh: Benjamini-Hochberg correction

    • fdr_by: Benjamini-Yekutieli correction

    • fdr_tsbh: Two-stage Benjamini-Hochberg

    • fdr_tsbky: Two-stage Benjamini-Krieger-Yekutieli method

alpha

Family-wise error rate (FWER)

Returns:

p-values post-correction

Return type:

qval
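
Since the listed methods are the ones exposed by statsmodels.stats.multitest.multipletests(), a direct call to that statsmodels function with one of the listed methods looks like:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.04, 0.2, 0.8])
reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
```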

spateo.tools.CCI_effects_modeling.regression_utils.get_fisher_inverse(x: numpy.ndarray, y: numpy.ndarray) numpy.ndarray[source]#

Computes the Fisher matrix that measures the amount of information each feature in x provides about y; that is, whether the log-likelihood is sensitive to change in the parameter x.

Function derived from diffxpy: https://github.com/theislab/diffxpy

Parameters:
x

Array of shape [n_samples, n_features]; independent variable array

y

Array of shape [n_samples, 1] or [n_samples, n_variables]; estimated dependent variable

Returns:

The inverse Fisher information matrix, as an np.ndarray

Return type:

inverse_fisher
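
For the linear-Gaussian case, the Fisher information is X^T X / sigma^2 and its inverse is the coefficient covariance used downstream in the Wald test. A reference computation (illustrative, not the library's exact code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = x @ np.array([1.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=100)

beta_hat = np.linalg.lstsq(x, y, rcond=None)[0]
sigma2 = np.var(y - x @ beta_hat)                  # residual variance estimate
inverse_fisher = np.linalg.inv(x.T @ x / sigma2)   # coefficient covariance estimate
```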

spateo.tools.CCI_effects_modeling.regression_utils.run_permutation_test(data, thresh, subset_rows=None, subset_cols=None)[source]#

Permutes the input data array and calculates whether the mean of the permuted array is higher than the provided value.

Parameters:
data

Input array or sparse matrix

thresh

Value to compare against the permuted means

subset_rows

Optional indices to subset the rows of ‘data’

subset_cols

Optional indices to subset the columns of ‘data’

Returns:

Boolean indicating whether the mean of the permuted data is higher than ‘thresh’

Return type:

is_higher

spateo.tools.CCI_effects_modeling.regression_utils.permutation_testing(data: numpy.ndarray | scipy.sparse.csr_matrix, n_permutations: int = 10000, n_jobs: int = 1, subset_rows: numpy.ndarray | List[int] | None = None, subset_cols: numpy.ndarray | List[int] | None = None) → float[source]#

Permutes the input array and calculates the p-value based on the number of times the mean of the permuted array is higher than the provided value.

Parameters:
data

Input array or sparse matrix

n_permutations

Number of permutations

n_jobs

Number of parallel jobs to use (-1 for all available CPUs)

subset_rows

Optional indices to subset the rows of ‘data’ (to take the mean value of only the subset of interest)

subset_cols

Optional indices to subset the columns of ‘data’ (to take the mean value of only the subset of interest)

Returns:

The calculated p-value.

Return type:

pval
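
An illustrative permutation p-value for a subset of rows and columns, computed the way the two functions above describe (shuffle the data, compare the subset mean to the observed one); the "+1" smoothing is a common convention and an assumption here, not necessarily the library's:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 50))
rows, cols = np.arange(20), np.arange(5)           # subset of interest
observed = data[np.ix_(rows, cols)].mean()

n_permutations = 1000
perm_means = np.empty(n_permutations)
for k in range(n_permutations):
    shuffled = rng.permutation(data.ravel()).reshape(data.shape)
    perm_means[k] = shuffled[np.ix_(rows, cols)].mean()

# "+1" smoothing keeps the p-value strictly positive (assumed convention).
pval = (np.sum(perm_means >= observed) + 1) / (n_permutations + 1)
```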

spateo.tools.CCI_effects_modeling.regression_utils.mae(y_true, y_pred) float[source]#

Mean absolute error; in this context, the log1p mean absolute error.

Parameters:
y_true

Regression model output

y_pred

Observed values for the dependent variable

Returns:

Mean absolute error value across all samples

Return type:

mae

spateo.tools.CCI_effects_modeling.regression_utils.mse(y_true, y_pred) float[source]#

Mean squared error; in this context, the log1p mean squared error.

Parameters:
y_true

Regression model output

y_pred

Observed values for the dependent variable

Returns:

Mean squared error value across all samples

Return type:

mse
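
Reference formulas for the two log1p-transformed metrics described above (illustrative re-implementations, not the library's code):

```python
import numpy as np

def log1p_mae(y_true, y_pred):
    # Mean absolute error on log1p-transformed values.
    return float(np.mean(np.abs(np.log1p(y_true) - np.log1p(y_pred))))

def log1p_mse(y_true, y_pred):
    # Mean squared error on log1p-transformed values.
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))
```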