spateo.preprocessing.normalize
Functions to either scale single-cell data or normalize such that the row-wise sums are identical.
Module Contents
Functions
| _normalize_data | Row-wise or column-wise normalization of sparse data array. |
| normalize_total | Normalize counts per cell. |
| calcFactorRLE | Calculate scaling factors using the Relative Log Expression (RLE) method. Python implementation of the same-named function from edgeR. |
| calcFactorQuantile | Calculate scaling factors using the Quantile method. Python implementation of the same-named function from edgeR. |
| calcFactorTMM | Calculate scaling factors using the Trimmed Mean of M-values (TMM) method. Python implementation of the same-named function from edgeR. |
| calcFactorTMMwsp | Calculate scaling factors using the Trimmed Mean of M-values with singleton pairing (TMMwsp) method. Python implementation of the same-named function from edgeR. |
| calcNormFactors | Function to scale normalize RNA-Seq data for count matrices. |
| factor_normalization | Wrapper to apply factor normalization to AnnData object. |
- spateo.preprocessing.normalize._normalize_data(X, counts, after=None, copy=False, rows=True, round=False)
Row-wise or column-wise normalization of sparse data array.
- Parameters:
- X
Sparse data array to modify.
- counts
Array of shape [1, n], where n is the number of buckets or number of genes, containing the total counts in each cell or for each gene, respectively.
- after
Target total count for each cell (or for each gene, when rows=False). Defaults to None, in which case each observation (cell) will have a total count equal to the median of total counts for observations (cells) before normalization.
- copy
Whether to operate on a copy of X.
- rows
Whether to perform normalization over rows (normalize each cell to have the same total count number) or over columns (normalize each gene to have the same total count number).
- round
Whether to round to three decimal places to more exactly match the desired number of total counts.
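The row-wise case can be illustrated with a minimal NumPy sketch. It is not Spateo's implementation: it uses a dense array instead of a sparse one, the helper name normalize_rows_sketch is hypothetical, and taking the median over non-empty cells only is an assumption.

```python
import numpy as np

def normalize_rows_sketch(X, after=None):
    """Scale each row (cell) so that its total equals `after` (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    counts = X.sum(axis=1)                      # total counts per cell
    if after is None:
        # Assumption: the default target is the median total of non-empty cells.
        after = np.median(counts[counts > 0])
    safe = np.where(counts > 0, counts, 1.0)    # avoid dividing by zero for empty cells
    return X * (after / safe)[:, None]

X = np.array([[1, 3], [2, 2], [10, 30]])
X_norm = normalize_rows_sketch(X)               # every row now sums to the median total
```

Passing an explicit after value (for example 1e4) replaces the median-based target.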
- spateo.preprocessing.normalize.normalize_total(adata: anndata.AnnData, target_sum: float | None = None, norm_factor: numpy.ndarray | None = None, exclude_highly_expressed: bool = False, max_fraction: float = 0.05, key_added: str | None = None, layer: str | None = None, inplace: bool = True, copy: bool = False) -> anndata.AnnData | Dict[str, numpy.ndarray]
Normalize counts per cell. Normalize each cell by total counts over all genes, so that every cell has the same total count after normalization.
If exclude_highly_expressed=True, very highly expressed genes are excluded from the computation of the normalization factor (size factor) for each cell. This is meaningful as these can strongly influence the resulting normalized values for all other genes.
- Parameters:
- adata
The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
- target_sum
Desired sum of counts for each cell post-normalization. If None, each observation (cell) will have, after normalization, a total count equal to the median of total counts for observations (cells) before normalization. 1e4 is a reasonable choice; if not given, a suitable value is derived from the library sizes.
- norm_factor
Optional array of shape n_obs × 1, where n_obs is the number of observations (cells). Each entry contains a pre-computed normalization factor for that cell.
- exclude_highly_expressed
Exclude (very) highly expressed genes for the computation of the normalization factor for each cell. A gene is considered highly expressed if it has more than max_fraction of the total counts in at least one cell.
- max_fraction
If exclude_highly_expressed=True, this is the cutoff threshold for excluding genes.
- key_added
Name of the field in adata.obs where the normalization factor is stored.
- layer
Layer to normalize instead of X. If None, X is normalized.
- inplace
Whether to update adata or return dictionary with normalized copies of adata.X and adata.layers.
- copy
Whether to modify copied input object. Not compatible with inplace=False.
- Returns:
Returns dictionary with normalized copies of adata.X and adata.layers or updates adata with normalized version of the original adata.X and adata.layers, depending on inplace.
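The effect of exclude_highly_expressed and max_fraction can be sketched as follows. This is an illustration of the rule stated above (a gene is excluded if it exceeds max_fraction of the total counts in at least one cell), written against a dense NumPy array; the helper name size_factors_sketch is hypothetical and the code is not taken from Spateo.

```python
import numpy as np

def size_factors_sketch(X, max_fraction=0.05):
    """Per-cell totals computed without highly expressed genes (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    totals = X.sum(axis=1)
    # Fraction of each cell's total count contributed by each gene.
    frac = X / np.where(totals > 0, totals, 1.0)[:, None]
    # A gene is "highly expressed" if it exceeds the cutoff in at least one cell.
    highly = (frac > max_fraction).any(axis=0)
    return X[:, ~highly].sum(axis=1), highly

X = np.array([[90, 5, 5],
              [10, 45, 45]])
totals, highly = size_factors_sketch(X, max_fraction=0.5)
```

Here the first gene takes 90% of the first cell's counts, so it is left out of the per-cell totals used as size factors.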
- spateo.preprocessing.normalize.calcFactorRLE(data: numpy.ndarray) -> numpy.ndarray
Calculate scaling factors using the Relative Log Expression (RLE) method. Python implementation of the same-named function from edgeR:
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.
- Parameters:
- data
An array-like object representing the data matrix.
- Returns:
An array of scaling factors for each cell
- Return type:
factors
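As a rough illustration of the RLE idea (median ratio of each sample to a per-gene geometric-mean reference), the sketch below assumes samples are rows and genes are columns. The helper name rle_factors_sketch is hypothetical, and the code is a simplified reading of the edgeR method, not Spateo's source.

```python
import numpy as np

def rle_factors_sketch(data):
    """Median ratio of each sample to the per-gene geometric mean (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    with np.errstate(divide="ignore"):
        # Geometric mean of each gene across samples; zero if any sample has a zero count.
        gm = np.exp(np.log(data).mean(axis=0))
    usable = gm > 0                              # only genes observed in every sample
    return np.median(data[:, usable] / gm[usable], axis=1)

data = np.array([[1, 2, 4, 0],
                 [2, 4, 8, 5]])
factors = rle_factors_sketch(data)
```

In edgeR these raw values are subsequently divided by the library sizes; that step is omitted here.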
- spateo.preprocessing.normalize.calcFactorQuantile(data: numpy.ndarray, lib_size: float, p: float = 0.95) -> numpy.ndarray
Calculate scaling factors using the Quantile method. Python implementation of the same-named function from edgeR:
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.
- Parameters:
- data
An array-like object representing the data matrix.
- lib_size
The library size or total count to normalize against.
- p
The quantile value (default: 0.95, per the function signature).
- Returns:
An array of scaling factors for each cell
- Return type:
factors
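The quantile method reduces to taking the p-th quantile of each sample's counts and dividing by that sample's library size. The sketch below assumes samples are rows and that lib_size holds one value per sample; quantile_factors_sketch is a hypothetical name, not a Spateo function.

```python
import numpy as np

def quantile_factors_sketch(data, lib_size, p=0.75):
    """p-th quantile of each sample's counts, scaled by library size (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    q = np.quantile(data, p, axis=1)             # one quantile per sample (row)
    return q / np.asarray(lib_size, dtype=float)

data = np.array([[0, 1, 2, 3, 4],
                 [0, 2, 4, 6, 8]])
factors = quantile_factors_sketch(data, lib_size=data.sum(axis=1), p=0.75)
```

The second sample is the first one sequenced twice as deeply, so both receive the same factor.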
- spateo.preprocessing.normalize.calcFactorTMM(obs: float | numpy.ndarray, ref: float | numpy.ndarray, libsize_obs: float | None = None, libsize_ref: float | None = None, logratioTrim: float = 0.3, sumTrim: float = 0.05, doWeighting: bool = True, Acutoff: float = -10000000000.0) -> float
Calculate scaling factors using the Trimmed Mean of M-values (TMM) method. Python implementation of the same-named function from edgeR:
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.
- Parameters:
- obs
An array-like object representing the observed library counts.
- ref
An array-like object representing the reference library counts.
- libsize_obs
The library size of the observed library (default: sum of observed counts).
- libsize_ref
The library size of the reference library (default: sum of reference counts).
- logratioTrim
The fraction of extreme log-ratios to be trimmed (default: 0.3).
- sumTrim
The fraction of extreme log-ratios to be trimmed based on the absolute expression (default: 0.05).
- doWeighting
Whether to perform weighted TMM estimation (default: True).
- Acutoff
The cutoff value for removing infinite values (default: -1e10).
- Returns:
floating point scaling factor
- Return type:
factor
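The roles of logratioTrim, sumTrim, doWeighting and Acutoff are easier to see in code. The following is a simplified sketch of the TMM calculation as described for edgeR, not Spateo's source: the helper name tmm_factor_sketch is hypothetical, library sizes are always taken as the sums of the inputs, and ties are ranked ordinally rather than by average rank.

```python
import numpy as np

def tmm_factor_sketch(obs, ref, logratio_trim=0.3, sum_trim=0.05,
                      weighted=True, a_cutoff=-1e10):
    """Trimmed mean of M-values between one sample and a reference (hypothetical helper)."""
    obs = np.asarray(obs, dtype=float)
    ref = np.asarray(ref, dtype=float)
    n_obs, n_ref = obs.sum(), ref.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        log_r = np.log2((obs / n_obs) / (ref / n_ref))             # M-values
        abs_e = (np.log2(obs / n_obs) + np.log2(ref / n_ref)) / 2  # A-values
        var = (n_obs - obs) / n_obs / obs + (n_ref - ref) / n_ref / ref
    # Drop genes with zero counts (non-finite values) or very low abundance.
    fin = np.isfinite(log_r) & np.isfinite(abs_e) & (abs_e > a_cutoff)
    log_r, abs_e, var = log_r[fin], abs_e[fin], var[fin]
    if log_r.size == 0 or np.max(np.abs(log_r)) < 1e-6:
        return 1.0
    n = log_r.size
    lo_l = int(np.floor(n * logratio_trim)) + 1
    hi_l = n + 1 - lo_l
    lo_s = int(np.floor(n * sum_trim)) + 1
    hi_s = n + 1 - lo_s
    rank_r = np.argsort(np.argsort(log_r)) + 1   # 1-based ordinal ranks
    rank_e = np.argsort(np.argsort(abs_e)) + 1
    keep = (rank_r >= lo_l) & (rank_r <= hi_l) & (rank_e >= lo_s) & (rank_e <= hi_s)
    if not keep.any():
        return 1.0
    if weighted:
        f = np.sum(log_r[keep] / var[keep]) / np.sum(1.0 / var[keep])
    else:
        f = np.mean(log_r[keep])
    return float(2.0 ** f)

ref = np.arange(10, 101, 10)
same = tmm_factor_sketch(2 * ref, ref)           # same composition, deeper sequencing
obs = ref.copy()
obs[-1] = 1000                                   # one gene strongly up-regulated
shifted = tmm_factor_sketch(obs, ref)
```

In the second call the single up-regulated gene is trimmed away, so the factor reflects only the composition shift it causes in the remaining genes.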
- spateo.preprocessing.normalize.calcFactorTMMwsp(obs: float | numpy.ndarray, ref: float | numpy.ndarray, libsize_obs: float | None = None, libsize_ref: float | None = None, logratioTrim: float = 0.3, sumTrim: float = 0.05, doWeighting: bool = True) -> float
Calculate scaling factors using the Trimmed Mean of M-values with singleton pairing (TMMwsp) method. Python implementation of the same-named function from edgeR:
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.
- Parameters:
- obs
An array-like object representing the observed library counts.
- ref
An array-like object representing the reference library counts.
- libsize_obs
The library size of the observed library (default: sum of observed counts).
- libsize_ref
The library size of the reference library (default: sum of reference counts).
- logratioTrim
The fraction of extreme log-ratios to be trimmed (default: 0.3).
- sumTrim
The fraction of extreme log-ratios to be trimmed based on the absolute expression (default: 0.05).
- doWeighting
Whether to perform weighted TMM estimation (default: True).
- Returns:
floating point scale factor
- Return type:
factor
- spateo.preprocessing.normalize.calcNormFactors(counts: numpy.ndarray | scipy.sparse.spmatrix, lib_size: numpy.ndarray | None = None, method: str = 'TMM', refColumn: int | None = None, logratioTrim: float = 0.3, sumTrim: float = 0.05, doWeighting: bool = True, Acutoff: float = -10000000000.0, p: float = 0.75) -> numpy.ndarray
Function to scale normalize RNA-Seq data for count matrices. This is a Python translation of an R function from edgeR package.
- Parameters:
- counts
Array or sparse array of shape [n_samples, n_features] containing gene expression data. Note that a sparse array will be converted to dense before calculations.
- lib_size
The library sizes for each sample.
- method
The normalization method. Can be:
- “TMM”: trimmed mean of M-values,
- “TMMwsp”: trimmed mean of M-values with singleton pairings,
- “RLE”: relative log expression, or
- “upperquartile”: using the quantile method.
Defaults to “TMM”.
- refColumn
Optional reference column for normalization
- logratioTrim
For TMM normalization, the fraction of extreme log-ratios to be trimmed (default: 0.3).
- sumTrim
For TMM normalization, the fraction of extreme log-ratios to be trimmed based on the absolute expression (default: 0.05).
- doWeighting
Whether to perform weighted TMM estimation (default: True).
- Acutoff
For TMM normalization, the cutoff value for removing infinite values (default: -1e10).
- p
Parameter for upper quartile normalization. Defaults to 0.75.
- Returns:
The normalization factors for each sample.
- Return type:
factors
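Two conventions from edgeR's calcNormFactors are worth illustrating, on the assumption that this translation follows them (the page does not confirm it): when refColumn is not given, the reference sample is the one whose scaled upper quartile is closest to the mean across samples, and the final factors are rescaled so that their geometric mean is one. Both helper names below are hypothetical, and samples are assumed to be rows.

```python
import numpy as np

def pick_reference_sketch(counts):
    """Index of the sample whose scaled upper quartile is closest to the mean (hypothetical)."""
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1)
    f75 = np.quantile(counts, 0.75, axis=1) / lib_size
    return int(np.argmin(np.abs(f75 - f75.mean())))

def rescale_factors_sketch(factors):
    """Divide by the geometric mean so that the factors multiply to one (hypothetical)."""
    f = np.asarray(factors, dtype=float)
    return f / np.exp(np.mean(np.log(f)))

counts = np.array([[0, 1, 2, 3, 4],
                   [0, 0, 0, 5, 5],
                   [1, 2, 3, 4, 10]])
ref_index = pick_reference_sketch(counts)
rescaled = rescale_factors_sketch([1.0, 2.0, 4.0])
```

Rescaling keeps the factors comparable across runs, because only their ratios carry information.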
- spateo.preprocessing.normalize.factor_normalization(adata: anndata.AnnData, norm_factors: numpy.ndarray | None = None, compute_norm_factors: bool = False, **kwargs)
Wrapper to apply factor normalization to AnnData object.
- Parameters:
- adata
The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
- norm_factors
Array of shape (n_obs,), the normalization factors for each sample. If not given, will compute using calcNormFactors and any arguments given to kwargs.
- compute_norm_factors
Set True to compute (or recompute) normalization factors using calcNormFactors.
- **kwargs
Keyword arguments to pass to calcNormFactors or normalize_total.
Options passed to calcNormFactors:
lib_size: The library sizes for each sample.
method: The normalization method. Can be “TMM” (trimmed mean of M-values), “TMMwsp” (trimmed mean of M-values with singleton pairings), “RLE” (relative log expression), or “upperquartile” (using the quantile method). Defaults to “TMM” if not given.
refColumn: Optional reference column for normalization.
logratioTrim: For TMM normalization, the fraction of extreme log-ratios to be trimmed (default: 0.3).
sumTrim: For TMM normalization, the fraction of extreme log-ratios to be trimmed based on the absolute expression (default: 0.05).
doWeighting: Whether to perform weighted TMM estimation (default: True).
Acutoff: For TMM normalization, the cutoff value for removing infinite values (default: -1e10).
p: Parameter for upper quartile normalization. Defaults to 0.75.
Options passed to normalize_total:
target_sum: Desired sum of counts for each cell post-normalization. If None, each observation (cell) will have, after normalization, a total count equal to the median of total counts for observations (cells) before normalization.
exclude_highly_expressed: Exclude (very) highly expressed genes for the computation of the normalization factor for each cell. A gene is considered highly expressed if it has more than max_fraction of the total counts in at least one cell.
max_fraction: If exclude_highly_expressed=True, this is the cutoff threshold for excluding genes.
key_added: Name of the field in adata.obs where the normalization factor is stored.
layer: Layer to normalize instead of X. If None, X is normalized.
inplace: Whether to update adata or return dictionary with normalized copies of adata.X and adata.layers.
copy: Whether to modify copied input object. Not compatible with inplace=False.
- Returns:
The normalized AnnData object.
- Return type:
adata
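How per-sample factors enter the final scaling is not spelled out on this page. One common convention, used by edgeR, is the effective library size: library size multiplied by the normalization factor. The sketch below illustrates that convention on a dense array; whether Spateo combines the factors in exactly this way is an assumption, and apply_factors_sketch is a hypothetical name.

```python
import numpy as np

def apply_factors_sketch(X, norm_factors, target_sum=None):
    """Scale each cell by target_sum / (library size * factor) (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    lib_size = X.sum(axis=1)
    # Assumption: effective library size, following the edgeR convention.
    effective = lib_size * np.asarray(norm_factors, dtype=float)
    if target_sum is None:
        target_sum = np.median(effective)
    return X * (target_sum / effective)[:, None]

X = np.array([[1, 3],
              [2, 6]])
plain = apply_factors_sketch(X, norm_factors=[1.0, 1.0])     # reduces to total-count scaling
adjusted = apply_factors_sketch(X, norm_factors=[2.0, 0.5])  # factors shift the row totals
```

With all factors equal to one, this reduces to plain total-count normalization.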