spateo.preprocessing.normalize
==============================

.. py:module:: spateo.preprocessing.normalize

.. autoapi-nested-parse::

   Functions to either scale single-cell data or normalize such that the row-wise sums are identical.


Functions
---------

.. autoapisummary::

   spateo.preprocessing.normalize._normalize_data
   spateo.preprocessing.normalize.normalize_total
   spateo.preprocessing.normalize.calcFactorRLE
   spateo.preprocessing.normalize.calcFactorQuantile
   spateo.preprocessing.normalize.calcFactorTMM
   spateo.preprocessing.normalize.calcFactorTMMwsp
   spateo.preprocessing.normalize.calcNormFactors
   spateo.preprocessing.normalize.factor_normalization
   spateo.preprocessing.normalize.calc_mean_and_var
   spateo.preprocessing.normalize.calc_expm1
   spateo.preprocessing.normalize.select_hvf_seurat_single
   spateo.preprocessing.normalize.select_hvf_seurat


Module Contents
---------------

.. py:function:: _normalize_data(X, counts, after=None, copy=False, rows=True, round=False)

   Row-wise or column-wise normalization of a sparse data array.

   :param X: Sparse data array to modify.
   :param counts: Array of shape [1, n], where n is the number of buckets or number of genes, containing the total counts in each cell or for each gene, respectively.
   :param after: Target total count for each gene or each cell. Defaults to `None`, in which case each observation (cell) will have a total count equal to the median of total counts for observations (cells) before normalization.
   :param copy: Whether to operate on a copy of X.
   :param rows: Whether to perform normalization over rows (normalize each cell to have the same total count number) or over columns (normalize each gene to have the same total count number).
   :param round: Whether to round to three decimal places to more exactly match the desired number of total counts.


.. py:function:: normalize_total(adata: anndata.AnnData, target_sum: Optional[float] = None, norm_factor: Optional[numpy.ndarray] = None, exclude_highly_expressed: bool = False, max_fraction: float = 0.05, key_added: Optional[str] = None, layer: Optional[str] = None, inplace: bool = True, copy: bool = False) -> Union[anndata.AnnData, Dict[str, numpy.ndarray]]

   Normalize counts per cell.

   Normalize each cell by total counts over all genes, so that every cell has the same total count after normalization. If `exclude_highly_expressed=True`, very highly expressed genes are excluded from the computation of the normalization factor (size factor) for each cell. This is meaningful because such genes can strongly influence the resulting normalized values for all other genes.

   :param adata: The annotated data matrix of shape `n_obs` × `n_vars`. Rows correspond to cells and columns to genes.
   :param target_sum: Desired sum of counts for each cell post-normalization. If `None`, after normalization, each observation (cell) will have a total count equal to the median of total counts for observations (cells) before normalization. 1e4 is a commonly suitable value.
   :param norm_factor: Optional array of shape `n_obs` × `1`, where `n_obs` is the number of observations (cells). Each entry contains a pre-computed normalization factor for that cell.
   :param exclude_highly_expressed: Exclude (very) highly expressed genes from the computation of the normalization factor for each cell. A gene is considered highly expressed if it has more than `max_fraction` of the total counts in at least one cell.
   :param max_fraction: If `exclude_highly_expressed=True`, this is the cutoff threshold for excluding genes.
   :param key_added: Name of the field in `adata.obs` where the normalization factor is stored.
   :param layer: Layer to normalize instead of `X`. If `None`, `X` is normalized.
   :param inplace: Whether to update `adata` in place or return a dictionary with normalized copies of `adata.X` and `adata.layers`.
   :param copy: Whether to operate on a copy of the input object. Not compatible with `inplace=False`.

   :returns: A dictionary with normalized copies of `adata.X` and `adata.layers`, or `adata` updated with the normalized version of the original `adata.X` and `adata.layers`, depending on `inplace`.


.. py:function:: calcFactorRLE(data: numpy.ndarray) -> numpy.ndarray

   Calculate scaling factors using the Relative Log Expression (RLE) method.

   Python implementation of the same-named function from edgeR:

   Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.

   :param data: An array-like object representing the data matrix.

   :returns: An array of scaling factors, one per cell.
   :rtype: numpy.ndarray


.. py:function:: calcFactorQuantile(data: numpy.ndarray, lib_size: float, p: float = 0.95) -> numpy.ndarray

   Calculate scaling factors using the Quantile method.

   Python implementation of the same-named function from edgeR:

   Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.

   :param data: An array-like object representing the data matrix.
   :param lib_size: The library size or total count to normalize against.
   :param p: The quantile value (default: 0.95).

   :returns: An array of scaling factors, one per cell.
   :rtype: numpy.ndarray


.. py:function:: calcFactorTMM(obs: Union[float, numpy.ndarray], ref: Union[float, numpy.ndarray], libsize_obs: Optional[float] = None, libsize_ref: Optional[float] = None, logratioTrim: float = 0.3, sumTrim: float = 0.05, doWeighting: bool = True, Acutoff: float = -10000000000.0) -> float

   Calculate scaling factors using the Trimmed Mean of M-values (TMM) method.

   Python implementation of the same-named function from edgeR:

   Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.

   :param obs: An array-like object representing the observed library counts.
   :param ref: An array-like object representing the reference library counts.
   :param libsize_obs: The library size of the observed library (default: sum of observed counts).
   :param libsize_ref: The library size of the reference library (default: sum of reference counts).
   :param logratioTrim: The fraction of extreme log-ratios to be trimmed (default: 0.3).
   :param sumTrim: The fraction of extreme log-ratios to be trimmed based on the absolute expression (default: 0.05).
   :param doWeighting: Whether to perform weighted TMM estimation (default: True).
   :param Acutoff: The cutoff value for removing infinite values (default: -1e10).

   :returns: Floating-point scaling factor.
   :rtype: float


.. py:function:: calcFactorTMMwsp(obs: Union[float, numpy.ndarray], ref: Union[float, numpy.ndarray], libsize_obs: Optional[float] = None, libsize_ref: Optional[float] = None, logratioTrim: float = 0.3, sumTrim: float = 0.05, doWeighting: bool = True) -> float

   Calculate scaling factors using the Trimmed Mean of M-values with singleton pairing (TMMwsp) method.

   Python implementation of the same-named function from edgeR:

   Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.

   :param obs: An array-like object representing the observed library counts.
   :param ref: An array-like object representing the reference library counts.
   :param libsize_obs: The library size of the observed library (default: sum of observed counts).
   :param libsize_ref: The library size of the reference library (default: sum of reference counts).
   :param logratioTrim: The fraction of extreme log-ratios to be trimmed (default: 0.3).
   :param sumTrim: The fraction of extreme log-ratios to be trimmed based on the absolute expression (default: 0.05).
   :param doWeighting: Whether to perform weighted TMM estimation (default: True).

   :returns: Floating-point scaling factor.
   :rtype: float


.. py:function:: calcNormFactors(counts: Union[numpy.ndarray, scipy.sparse.spmatrix], lib_size: Optional[numpy.ndarray] = None, method: str = 'TMM', refColumn: Optional[int] = None, logratioTrim: float = 0.3, sumTrim: float = 0.05, doWeighting: bool = True, Acutoff: float = -10000000000.0, p: float = 0.75) -> numpy.ndarray

   Compute scale-normalization factors for RNA-seq count matrices.

   This is a Python translation of an R function from the edgeR package.

   :param counts: Array or sparse array of shape [n_samples, n_features] containing gene expression data. Note that a sparse array will be converted to dense before calculations.
   :param lib_size: The library sizes for each sample.
   :param method: The normalization method. Can be:

      - "TMM": trimmed mean of M-values,
      - "TMMwsp": trimmed mean of M-values with singleton pairings,
      - "RLE": relative log expression, or
      - "upperquartile": using the quantile method.

      Defaults to "TMM".
   :param refColumn: Optional reference column for normalization.
   :param logratioTrim: For TMM normalization, the fraction of extreme log-ratios to be trimmed (default: 0.3).
   :param sumTrim: For TMM normalization, the fraction of extreme log-ratios to be trimmed based on the absolute expression (default: 0.05).
   :param doWeighting: Whether to perform weighted TMM estimation (default: True).
   :param Acutoff: For TMM normalization, the cutoff value for removing infinite values (default: -1e10).
   :param p: Parameter for upper quartile normalization. Defaults to 0.75.

   :returns: The normalization factors for each sample.
   :rtype: numpy.ndarray


.. py:function:: factor_normalization(adata: anndata.AnnData, norm_factors: Optional[numpy.ndarray] = None, compute_norm_factors: bool = False, **kwargs)

   Wrapper to apply factor normalization to an AnnData object.

   :param adata: The annotated data matrix of shape `n_obs` × `n_vars`. Rows correspond to cells and columns to genes.
   :param norm_factors: Array of shape (`n_obs`, ), the normalization factors for each sample. If not given, they are computed using :func:`calcNormFactors` and any arguments given to `kwargs`.
   :param compute_norm_factors: Set True to compute (or recompute) normalization factors using :func:`calcNormFactors`.
   :param \*\*kwargs: Keyword arguments to pass to :func:`calcNormFactors` or :func:`normalize_total`. Options:

      - lib_size: The library sizes for each sample.
      - method: The normalization method. Can be:

        - "TMM": trimmed mean of M-values,
        - "TMMwsp": trimmed mean of M-values with singleton pairings,
        - "RLE": relative log expression, or
        - "upperquartile": using the quantile method.

        Defaults to "TMM".
      - refColumn: Optional reference column for normalization.
      - logratioTrim: For TMM normalization, the fraction of extreme log-ratios to be trimmed (default: 0.3).
      - sumTrim: For TMM normalization, the fraction of extreme log-ratios to be trimmed based on the absolute expression (default: 0.05).
      - doWeighting: Whether to perform weighted TMM estimation (default: True).
      - Acutoff: For TMM normalization, the cutoff value for removing infinite values (default: -1e10).
      - p: Parameter for upper quartile normalization. Defaults to 0.75.
      - target_sum: Desired sum of counts for each cell post-normalization. If `None`, after normalization, each observation (cell) will have a total count equal to the median of total counts for observations (cells) before normalization.
      - exclude_highly_expressed: Exclude (very) highly expressed genes from the computation of the normalization factor for each cell.
        A gene is considered highly expressed if it has more than `max_fraction` of the total counts in at least one cell.
      - max_fraction: If `exclude_highly_expressed=True`, this is the cutoff threshold for excluding genes.
      - key_added: Name of the field in `adata.obs` where the normalization factor is stored.
      - layer: Layer to normalize instead of `X`. If `None`, `X` is normalized.
      - inplace: Whether to update `adata` in place or return a dictionary with normalized copies of `adata.X` and `adata.layers`.
      - copy: Whether to operate on a copy of the input object. Not compatible with `inplace=False`.

   :returns: The normalized AnnData object.
   :rtype: anndata.AnnData


.. py:function:: calc_mean_and_var(X: Union[scipy.sparse.csr_matrix, numpy.ndarray], axis: int)

.. py:function:: calc_expm1(X: Union[scipy.sparse.csr_matrix, numpy.ndarray]) -> numpy.ndarray

   Compute the exponential minus one, `exp(X) - 1`.


.. py:function:: select_hvf_seurat_single(X: Union[scipy.sparse.csr_matrix, numpy.ndarray], n_top: int, min_disp: float, max_disp: float, min_mean: float, max_mean: float)

   Highly variable feature (HVF) selection for one channel using the Seurat method.


.. py:function:: select_hvf_seurat(data: anndata.AnnData, n_top: int = 2000, min_disp: float = 0.5, max_disp: float = np.inf, min_mean: float = 0.0125, max_mean: float = 7) -> None

   Select highly variable features using the Seurat method.
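
As a rough illustration of the per-cell total-count normalization described for `normalize_total` above, the following NumPy sketch scales each row of a small dense count matrix to a common total. It does not call spateo and is not its implementation: the helper name `normalize_total_sketch` is hypothetical, the input is assumed to be dense, and taking the median over non-empty cells only is an assumption made for this example.

```python
import numpy as np


def normalize_total_sketch(X, target_sum=None, exclude_highly_expressed=False,
                           max_fraction=0.05):
    """Hypothetical helper: scale each cell (row) to a common total count."""
    X = np.asarray(X, dtype=float)
    counts_per_cell = X.sum(axis=1)

    if exclude_highly_expressed:
        # A gene counts as highly expressed if it holds more than
        # `max_fraction` of the total counts in at least one cell.
        highly_expressed = (X > counts_per_cell[:, None] * max_fraction).any(axis=0)
        # Size factors are then computed from the remaining genes only.
        counts_per_cell = X[:, ~highly_expressed].sum(axis=1)

    if target_sum is None:
        # Assumption for this sketch: median over cells with nonzero counts.
        target_sum = np.median(counts_per_cell[counts_per_cell > 0])

    size_factors = counts_per_cell / target_sum
    # Leave all-zero cells untouched instead of dividing by zero.
    safe_factors = np.where(size_factors == 0, 1.0, size_factors)
    return X / safe_factors[:, None], size_factors


counts = np.array([[1, 3, 6],
                   [2, 2, 4],
                   [10, 20, 70]])

# Default: every cell ends up with the median of the original totals.
normalized, factors = normalize_total_sketch(counts)

# Fixed target, e.g. 1e4 counts per cell.
normalized_1e4, _ = normalize_total_sketch(counts, target_sum=1e4)

# Excluding highly expressed genes: only the remaining genes sum to the target.
normalized_excl, _ = normalize_total_sketch(
    counts, exclude_highly_expressed=True, max_fraction=0.5
)
```

When highly expressed genes are excluded, all genes are still rescaled, but only the retained genes sum to the target in each cell, which is why a few dominant genes no longer drive the size factor.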