spateo.preprocessing.normalize

Functions to either scale single-cell data or normalize such that the row-wise sums are identical.

Module Contents

Functions

_normalize_data(X, counts[, after, copy, rows, round])

Row-wise or column-wise normalization of sparse data array.

normalize_total(→ Union[anndata.AnnData, Dict[str, ...)

Normalize counts per cell.

calcFactorRLE(→ numpy.ndarray)

Calculate scaling factors using the Relative Log Expression (RLE) method. Python implementation of the same-named function from edgeR.

calcFactorQuantile(→ numpy.ndarray)

Calculate scaling factors using the Quantile method. Python implementation of the same-named function from edgeR.

calcFactorTMM(→ float)

Calculate scaling factors using the Trimmed Mean of M-values (TMM) method. Python implementation of the same-named function from edgeR.

calcFactorTMMwsp(→ float)

Calculate scaling factors using the Trimmed Mean of M-values with singleton pairing (TMMwsp) method. Python implementation of the same-named function from edgeR.

calcNormFactors(→ numpy.ndarray)

Function to scale normalize RNA-Seq data for count matrices.

factor_normalization(adata[, norm_factors, ...])

Wrapper to apply factor normalization to AnnData object.

spateo.preprocessing.normalize._normalize_data(X, counts, after=None, copy=False, rows=True, round=False)

Row-wise or column-wise normalization of sparse data array.

Parameters:
X

Sparse data array to modify.

counts

Array of shape [1, n], where n is the number of buckets or number of genes, containing the total counts in each cell or for each gene, respectively.

after

Target total count for each cell or each gene. Defaults to None, in which case each observation (cell) is scaled to a total count equal to the median of the total counts across observations (cells) before normalization.

copy

Whether to operate on a copy of X.

rows

Whether to perform normalization over rows (normalize each cell to have the same total count number) or over columns (normalize each gene to have the same total count number).

round

Whether to round to three decimal places to more exactly match the desired number of total counts.
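The row-wise case described above can be illustrated with a minimal, dependency-free sketch. It uses plain Python lists rather than a sparse array, and the helper name normalize_rows is hypothetical, not part of spateo. It assumes the default target is the median of all row totals and that all-zero rows are left untouched; the library may handle both details differently.

```python
from statistics import median


def normalize_rows(matrix, target=None):
    """Scale each row so that its sum equals `target` (default: median row sum)."""
    totals = [sum(row) for row in matrix]
    if target is None:
        # Assumed default: median of the pre-normalization row totals.
        target = median(totals)
    normalized = []
    for row, total in zip(matrix, totals):
        # Assumption: rows with a zero total are left unchanged.
        scale = target / total if total > 0 else 1.0
        normalized.append([value * scale for value in row])
    return normalized


counts = [
    [1.0, 3.0, 0.0],  # total 4
    [2.0, 2.0, 4.0],  # total 8
    [0.0, 6.0, 6.0],  # total 12
]
normalized = normalize_rows(counts)
row_sums = [sum(row) for row in normalized]
```

After the call, every row sums to the median of the original totals; passing an explicit target (for example target=100.0) rescales to that value instead.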

spateo.preprocessing.normalize.normalize_total(adata: anndata.AnnData, target_sum: float | None = None, norm_factor: numpy.ndarray | None = None, exclude_highly_expressed: bool = False, max_fraction: float = 0.05, key_added: str | None = None, layer: str | None = None, inplace: bool = True, copy: bool = False) → anndata.AnnData | Dict[str, numpy.ndarray]

Normalize counts per cell. Normalize each cell by total counts over all genes, so that every cell has the same total count after normalization.

If exclude_highly_expressed=True, very highly expressed genes are excluded from the computation of the normalization factor (size factor) for each cell. This is useful because such genes can strongly influence the resulting normalized values for all other genes.

Parameters:
adata

The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.

target_sum

Desired sum of counts for each cell after normalization. If None, each observation (cell) will have a total count equal to the median of the total counts across observations (cells) before normalization. 1e4 is a reasonable choice; if no value is given, the target is derived from the library sizes as described.

norm_factor

Optional array of shape n_obs × 1, where n_obs is the number of observations (cells). Each entry contains a pre-computed normalization factor for that cell.

exclude_highly_expressed

Exclude (very) highly expressed genes for the computation of the normalization factor for each cell. A gene is considered highly expressed if it has more than max_fraction of the total counts in at least one cell.

max_fraction

If exclude_highly_expressed=True, this is the cutoff threshold for excluding genes.

key_added

Name of the field in adata.obs where the normalization factor is stored.

layer

Layer to normalize instead of X. If None, X is normalized.

inplace

Whether to update adata or return dictionary with normalized copies of adata.X and adata.layers.

copy

Whether to operate on a copy of the input object. Not compatible with inplace=False.

Returns:

Returns dictionary with normalized copies of adata.X and adata.layers or updates adata with normalized version of the original adata.X and adata.layers, depending on inplace.
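The effect of exclude_highly_expressed and max_fraction can be sketched without any dependencies. The function name size_factors is hypothetical, and the logic follows only the parameter descriptions above (a gene is dropped from the size-factor computation if it exceeds max_fraction of the total counts in at least one cell); it is not taken from the spateo source.

```python
def size_factors(matrix, exclude_highly_expressed=False, max_fraction=0.05):
    """Return (per-cell totals, gene keep-mask), optionally ignoring dominant genes."""
    n_genes = len(matrix[0])
    totals = [sum(row) for row in matrix]
    keep = [True] * n_genes
    if exclude_highly_expressed:
        for row, total in zip(matrix, totals):
            for gene, value in enumerate(row):
                # Flag a gene that exceeds max_fraction of any single cell's total.
                if total > 0 and value > max_fraction * total:
                    keep[gene] = False
    # Size factors are the totals over the genes that were kept.
    factors = [sum(v for v, k in zip(row, keep) if k) for row in matrix]
    return factors, keep


counts = [
    [90.0, 5.0, 5.0],    # gene 0 dominates this cell
    [10.0, 45.0, 45.0],
]
plain_factors, _ = size_factors(counts)
robust_factors, kept = size_factors(
    counts, exclude_highly_expressed=True, max_fraction=0.5
)
```

In this toy input both cells have the same raw total, so plain size factors hide the difference between them; once the dominant gene is excluded, the size factors for the remaining genes differ.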

spateo.preprocessing.normalize.calcFactorRLE(data: numpy.ndarray) → numpy.ndarray

Calculate scaling factors using the Relative Log Expression (RLE) method. Python implementation of the same-named function from edgeR:

Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.

Parameters:
data

An array-like object representing the data matrix.

Returns:

An array of scaling factors for each cell

Return type:

factors
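As an illustration of the RLE idea (median of ratios to a per-gene geometric mean), here is a minimal sketch in plain Python. The name rle_factors is hypothetical, and the sketch assumes rows are samples and columns are genes. It returns the raw medians only; any further rescaling by library size that the library performs is not shown.

```python
import math
from statistics import median


def rle_factors(matrix):
    """Median-of-ratios factors; rows are samples, columns are genes (assumed)."""
    n_samples = len(matrix)
    n_genes = len(matrix[0])
    geo_means = []
    for gene in range(n_genes):
        column = [matrix[sample][gene] for sample in range(n_samples)]
        if min(column) > 0:
            log_mean = sum(math.log(v) for v in column) / n_samples
            geo_means.append(math.exp(log_mean))
        else:
            # Any zero count makes the geometric mean zero; the gene is skipped.
            geo_means.append(0.0)
    factors = []
    for row in matrix:
        ratios = [v / gm for v, gm in zip(row, geo_means) if gm > 0]
        factors.append(median(ratios))
    return factors


counts = [
    [10.0, 20.0, 30.0, 0.0],
    [20.0, 40.0, 60.0, 5.0],
]
factors = rle_factors(counts)
```

The second sample has twice the counts of the first for every gene with a non-zero geometric mean, so its factor is twice as large.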

spateo.preprocessing.normalize.calcFactorQuantile(data: numpy.ndarray, lib_size: float, p: float = 0.95) → numpy.ndarray

Calculate scaling factors using the Quantile method. Python implementation of the same-named function from edgeR:

Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.

Parameters:
data

An array-like object representing the data matrix.

lib_size

The library size or total count to normalize against.

p

The quantile value (default: 0.95, per the function signature).

Returns:

An array of scaling factors for each cell

Return type:

factors
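The quantile method can be sketched as "take the p-th quantile of each sample's counts and divide by its library size". The helpers below are hypothetical and use plain Python lists. The sketch assumes one library size per sample and linear interpolation between order statistics, and p is a required argument here rather than relying on any default.

```python
import math


def quantile(values, p):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def quantile_factors(matrix, lib_sizes, p):
    """Per-sample quantile of the counts, scaled by that sample's library size."""
    return [quantile(row, p) / size for row, size in zip(matrix, lib_sizes)]


counts = [
    [0.0, 10.0, 20.0, 30.0, 40.0],
    [0.0, 20.0, 40.0, 60.0, 80.0],
]
lib_sizes = [sum(row) for row in counts]
factors = quantile_factors(counts, lib_sizes, p=0.75)
```

Because the second sample is an exact multiple of the first, dividing by library size gives both samples the same factor.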

spateo.preprocessing.normalize.calcFactorTMM(obs: float | numpy.ndarray, ref: float | numpy.ndarray, libsize_obs: float | None = None, libsize_ref: float | None = None, logratioTrim: float = 0.3, sumTrim: float = 0.05, doWeighting: bool = True, Acutoff: float = -10000000000.0) → float

Calculate scaling factors using the Trimmed Mean of M-values (TMM) method. Python implementation of the same-named function from edgeR:

Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.

Parameters:
obs

An array-like object representing the observed library counts.

ref

An array-like object representing the reference library counts.

libsize_obs

The library size of the observed library (default: sum of observed counts).

libsize_ref

The library size of the reference library (default: sum of reference counts).

logratioTrim

The fraction of extreme log-ratios to be trimmed (default: 0.3).

sumTrim

The fraction of observations to be trimmed at the extremes of absolute expression (default: 0.05).

doWeighting

Whether to perform weighted TMM estimation (default: True).

Acutoff

The cutoff value for removing infinite values (default: -1e10).

Returns:

floating point scaling factor

Return type:

factor
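The following is a minimal sketch of the TMM computation, modelled on the edgeR procedure and written with the standard library only. The names tmm_factor and average_ranks are hypothetical, and the sketch assumes obs and ref are equal-length sequences of counts and always uses their sums as library sizes. It is meant to show how the trimming parameters interact, not to reproduce the spateo implementation.

```python
import math


def average_ranks(values):
    """1-based ranks, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def tmm_factor(obs, ref, logratio_trim=0.3, sum_trim=0.05,
               do_weighting=True, a_cutoff=-1e10):
    n_obs, n_ref = sum(obs), sum(ref)
    m_values, a_values, variances = [], [], []
    for o, r in zip(obs, ref):
        if o <= 0 or r <= 0:
            continue  # the log-ratio would be infinite or undefined
        m = math.log2((o / n_obs) / (r / n_ref))                 # M: log-ratio
        a = (math.log2(o / n_obs) + math.log2(r / n_ref)) / 2    # A: abs. expression
        if a <= a_cutoff:
            continue
        m_values.append(m)
        a_values.append(a)
        # Approximate variance of M, used as an inverse weight.
        variances.append((n_obs - o) / n_obs / o + (n_ref - r) / n_ref / r)
    if not m_values or max(abs(m) for m in m_values) < 1e-6:
        return 1.0
    n = len(m_values)
    lo_m = math.floor(n * logratio_trim) + 1
    hi_m = n + 1 - lo_m
    lo_a = math.floor(n * sum_trim) + 1
    hi_a = n + 1 - lo_a
    rank_m, rank_a = average_ranks(m_values), average_ranks(a_values)
    num = den = 0.0
    for m, v, rm, ra in zip(m_values, variances, rank_m, rank_a):
        # Keep only genes inside both trimming windows.
        if lo_m <= rm <= hi_m and lo_a <= ra <= hi_a:
            weight = 1.0 / v if do_weighting else 1.0
            num += m * weight
            den += weight
    return 2 ** (num / den) if den > 0 else 1.0


ref = [100.0] * 10
obs = [1100.0] + [100.0] * 9          # one gene is strongly up in obs
factor = tmm_factor(obs, ref)
same = tmm_factor([2 * x for x in ref], ref)
```

In this toy case the single strongly expressed gene doubles the observed library size without changing the other nine genes. The trimmed mean ignores that gene, and multiplying the observed library size (2000) by the resulting factor gives an effective size that matches the reference (1000).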

spateo.preprocessing.normalize.calcFactorTMMwsp(obs: float | numpy.ndarray, ref: float | numpy.ndarray, libsize_obs: float | None = None, libsize_ref: float | None = None, logratioTrim: float = 0.3, sumTrim: float = 0.05, doWeighting: bool = True) → float

Calculate scaling factors using the Trimmed Mean of M-values with singleton pairing (TMMwsp) method. Python implementation of the same-named function from edgeR:

Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.

Parameters:
obs

An array-like object representing the observed library counts.

ref

An array-like object representing the reference library counts.

libsize_obs

The library size of the observed library (default: sum of observed counts).

libsize_ref

The library size of the reference library (default: sum of reference counts).

logratioTrim

The fraction of extreme log-ratios to be trimmed (default: 0.3).

sumTrim

The fraction of observations to be trimmed at the extremes of absolute expression (default: 0.05).

doWeighting

Whether to perform weighted TMM estimation (default: True).

Returns:

floating point scale factor

Return type:

factor

spateo.preprocessing.normalize.calcNormFactors(counts: numpy.ndarray | scipy.sparse.spmatrix, lib_size: numpy.ndarray | None = None, method: str = 'TMM', refColumn: int | None = None, logratioTrim: float = 0.3, sumTrim: float = 0.05, doWeighting: bool = True, Acutoff: float = -10000000000.0, p: float = 0.75) → numpy.ndarray

Function to scale-normalize RNA-Seq data for count matrices. This is a Python translation of the same-named R function from the edgeR package.

Parameters:
counts

Array or sparse array of shape [n_samples, n_features] containing gene expression data. Note that a sparse array will be converted to dense before calculations.

lib_size

The library sizes for each sample.

method

The normalization method. Can be:

- “TMM”: trimmed mean of M-values
- “TMMwsp”: trimmed mean of M-values with singleton pairing
- “RLE”: relative log expression
- “upperquartile”: the quantile method

Defaults to “TMM”.

refColumn

Optional reference column for normalization

logratioTrim

For TMM normalization, the fraction of extreme log-ratios to be trimmed (default: 0.3).

sumTrim

For TMM normalization, the fraction of observations to be trimmed at the extremes of absolute expression (default: 0.05).

doWeighting

Whether to perform weighted TMM estimation (default: True).

Acutoff

For TMM normalization, the cutoff value for removing infinite values (default: -1e10).

p

Parameter for upper quartile normalization. Defaults to 0.75.

Returns:

The normalization factors for each sample.

Return type:

factors
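In edgeR, the per-sample factors are rescaled so that their product is one (equivalently, their geometric mean is one) and are then multiplied by the library sizes to give effective library sizes. Whether this Python translation applies the same final rescaling is an assumption here; the sketch below only illustrates that step, with a hypothetical helper name and made-up input values.

```python
import math


def rescale_to_unit_product(factors):
    """Divide by the geometric mean so the factors multiply to one."""
    geo_mean = math.exp(sum(math.log(f) for f in factors) / len(factors))
    return [f / geo_mean for f in factors]


raw_factors = [1.0, 2.0, 4.0]            # illustrative values only
scaled = rescale_to_unit_product(raw_factors)

lib_sizes = [2000.0, 1000.0, 500.0]      # illustrative values only
# Effective library size = raw library size times normalization factor.
effective = [size * f for size, f in zip(lib_sizes, scaled)]
```

Rescaling leaves the ratios between samples unchanged; it only fixes the overall scale so that factors from different runs are comparable.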

spateo.preprocessing.normalize.factor_normalization(adata: anndata.AnnData, norm_factors: numpy.ndarray | None = None, compute_norm_factors: bool = False, **kwargs)

Wrapper to apply factor normalization to AnnData object.

Parameters:
adata

The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.

norm_factors

Array of shape (n_obs,) containing the normalization factor for each sample. If not given, the factors are computed using calcNormFactors with any relevant arguments given in kwargs.

compute_norm_factors

Set to True to compute (or recompute) normalization factors using calcNormFactors.

**kwargs

Keyword arguments to pass to calcNormFactors or normalize_total. Options:

- lib_size: The library sizes for each sample.
- method: The normalization method. Can be “TMM” (trimmed mean of M-values), “TMMwsp” (trimmed mean of M-values with singleton pairing), “RLE” (relative log expression), or “upperquartile” (the quantile method). Defaults to “TMM”.
- refColumn: Optional reference column for normalization.
- logratioTrim: For TMM normalization, the fraction of extreme log-ratios to be trimmed (default: 0.3).
- sumTrim: For TMM normalization, the fraction of observations to be trimmed at the extremes of absolute expression (default: 0.05).
- doWeighting: Whether to perform weighted TMM estimation (default: True).
- Acutoff: For TMM normalization, the cutoff value for removing infinite values (default: -1e10).
- p: Parameter for upper quartile normalization. Defaults to 0.75.
- target_sum: Desired sum of counts for each cell after normalization. If None, each observation (cell) will have a total count equal to the median of the total counts across observations (cells) before normalization.
- exclude_highly_expressed: Exclude (very) highly expressed genes from the computation of the normalization factor for each cell. A gene is considered highly expressed if it has more than max_fraction of the total counts in at least one cell.
- max_fraction: If exclude_highly_expressed=True, this is the cutoff threshold for excluding genes.
- key_added: Name of the field in adata.obs where the normalization factor is stored.
- layer: Layer to normalize instead of X. If None, X is normalized.
- inplace: Whether to update adata or return a dictionary with normalized copies of adata.X and adata.layers.
- copy: Whether to operate on a copy of the input object. Not compatible with inplace=False.

Returns:

The normalized AnnData object.

Return type:

adata