Publications

Score matching for differential abundance testing of compositional high-throughput sequencing data

Published in bioRxiv, 2024

The class of a-b power interaction models, proposed by Yu et al. (2024), provides a general framework for modeling sparse compositional count data with pairwise feature interactions. This class includes many distributions as special cases and enables modeling of zero counts through power transformations, making it particularly suitable for modern high-throughput sequencing data with excess zeros, including single-cell RNA-Seq and microbial amplicon data. Here, we present an extension of this class of models that allows inclusion of covariate information, thus enabling accurate characterization of covariate dependencies in heterogeneous populations. Combining this model with a tailored differential abundance (DA) test leads to a novel DA testing scheme, cosmoDA, that can reduce false positive detection rate caused by correlated features. cosmoDA uses penalized generalized score matching for parsimonious model fitting. We show on simulated benchmarks that cosmoDA can accurately estimate feature interactions in the presence of population heterogeneity and significantly reduces the false discovery rate when testing for differential abundance of correlated features. Using single-cell and amplicon data, we illustrate cosmoDA's ability to estimate data-adaptive Box-Cox-type data transformations and assess the impact of zero replacement and power transformations on downstream differential abundance results. cosmoDA is available at https://github.com/bio-datascience/cosmoDA.

Recommended citation: Ostner, J., Li, H., and Müller, C. L. (2024). Score matching for differential abundance testing of compositional high-throughput sequencing data. bioRxiv, 2024.12.05.627006. doi: 10.1101/2024.12.05.627006 https://www.biorxiv.org/content/10.1101/2024.12.05.627006v1

Pertpy: an end-to-end framework for perturbation analysis

Published in bioRxiv, 2024

Advances in single-cell technology have enabled the measurement of cell-resolved molecular states across a variety of cell lines and tissues under a plethora of genetic, chemical, environmental, or disease perturbations. Current methods focus on differential comparison or are specific to a particular task in a multi-condition setting with purely statistical perspectives. The quickly growing number, size, and complexity of such studies requires a scalable analysis framework that takes existing biological context into account. Here, we present pertpy, a Python-based modular framework for the analysis of large-scale perturbation single-cell experiments. Pertpy provides access to harmonized perturbation datasets and metadata databases along with numerous fast and user-friendly implementations of both established and novel methods such as automatic metadata annotation or perturbation distances to efficiently analyze perturbation data. As part of the scverse ecosystem, pertpy interoperates with existing libraries for the analysis of single-cell data and is designed to be easily extended.

Recommended citation: Heumos, L., Ji, Y., May, L., Green, T., Zhang, X., Wu, X., et al. (2024). Pertpy: an end-to-end framework for perturbation analysis. bioRxiv, 2024.08.04.606516. doi: 10.1101/2024.08.04.606516 https://www.biorxiv.org/content/10.1101/2024.08.04.606516v1

BacSC: A general workflow for bacterial single-cell RNA sequencing data analysis

Published in bioRxiv, 2024

Bacterial single-cell RNA sequencing has the potential to elucidate within-population heterogeneity of prokaryotes, as well as their interaction with host systems. Despite conceptual similarities, the statistical properties of bacterial single-cell datasets are highly dependent on the protocol, making proper processing essential to tap their full potential. We present BacSC, a fully data-driven computational pipeline that processes bacterial single-cell data without requiring manual intervention. BacSC performs data-adaptive quality control and variance stabilization, selects suitable parameters for dimension reduction, neighborhood embedding, and clustering, and provides false discovery rate control in differential gene expression testing. We validated BacSC on a broad selection of bacterial single-cell datasets spanning multiple protocols and species. Here, BacSC detected subpopulations in Klebsiella pneumoniae, found matching structures of Pseudomonas aeruginosa under regular and low-iron conditions, and better represented subpopulation dynamics of Bacillus subtilis. BacSC thus simplifies statistical processing of bacterial single-cell data and reduces the danger of incorrect processing.

Recommended citation: Ostner, J., Kirk, T., Olayo-Alarcon, R., Thöming, J., Rosenthal, A. Z., Häussler, S., et al. (2024). BacSC: A general workflow for bacterial single-cell RNA sequencing data analysis. bioRxiv. doi: 10.1101/2024.06.22.600071 http://biorxiv.org/lookup/doi/10.1101/2024.06.22.600071

MetaIBS - large-scale amplicon-based meta analysis of irritable bowel syndrome

Published in bioRxiv, 2024

Irritable Bowel Syndrome (IBS) is a chronic functional bowel disorder causing abdominal discomfort, as well as transit deregulation with constipation and/or diarrhea. The pathophysiology of IBS is poorly understood and believed to be multifactorial. The role of gut microbiota in IBS has been investigated in several case-control studies, in particular via 16S rRNA amplicon sequencing surveys. These studies, however, have not yet led to a consistent picture of significant changes in gut microbial compositions across health and disease. One key bottleneck is the modest cohort sizes of most individual studies and a high diversity of experimental, bioinformatics, and statistical analysis approaches across studies. We address these shortcomings by presenting MetaIBS, an open-access data repository and associated meta-analysis workflow of thirteen 16S rRNA amplicon datasets comprising both fecal matter and sigmoid biopsy samples spanning ~2,500 IBS and healthy individuals. MetaIBS includes a tailored computational framework that (i) enables coherent de novo processing and taxonomic assignments of the raw 16S rRNA amplicon reads across experimental protocols and sequencing technologies, and (ii) statistical workflows for visualization and analysis at different taxonomic ranks and data granularity. Our statistical meta-analysis shows that popular high-level microbiome summary statistics, including Firmicutes/Bacteroidota ratios or diversity indices, are insufficient for reliable discrimination between IBS patients and healthy controls. Fine-grained multi-method differential abundance and classification analysis, however, can identify sets of differentially abundant taxa that replicate across multiple datasets, including Coprococcus eutactus and Alistipes finegoldii. MetaIBS provides a curated and reproducible data and (meta-)analysis resource for amplicon-based IBS research at unprecedented scale. MetaIBS allows assessing the heterogeneity of IBS cohorts across multiple experimental protocols, sample types, and IBS phenotypes. Our framework will likely contribute to more coherent insights into the role of the microbiome in IBS and the discovery of reliable microbial IBS biomarkers for follow-up functional and translational studies.

Recommended citation: Carcy, S., Ostner, J., Tran, V., Menden, M., and Müller, C. L. (2024). "MetaIBS - large-scale amplicon-based meta analysis of irritable bowel syndrome". bioRxiv, 2024.01.22.575775. doi: 10.1101/2024.01.22.575775 https://www.biorxiv.org/content/10.1101/2024.01.22.575775v1

Best practices for single-cell analysis across modalities

Published in Nature Reviews Genetics, 2023

Recent advances in single-cell technologies have enabled high-throughput molecular profiling of cells across modalities and locations. Single-cell transcriptomics data can now be complemented by chromatin accessibility, surface protein expression, adaptive immune receptor repertoire profiling and spatial information. The increasing availability of single-cell data across modalities has motivated the development of novel computational methods to help analysts derive biological insights. As the field grows, it becomes increasingly difficult to navigate the vast landscape of tools and analysis steps. Here, we summarize independent benchmarking studies of unimodal and multimodal single-cell analysis across modalities to suggest comprehensive best-practice workflows for the most common analysis steps. Where independent benchmarks are not available, we review and contrast popular methods. Our article serves as an entry point for novices in the field of single-cell (multi-)omic analysis and guides advanced users to the most recent best practices.

Recommended citation: Heumos, L., Schaar, A. C., Lance, C., Litinetskaya, A., Drost, F., Zappia, L., et al. (2023). "Best practices for single-cell analysis across modalities". Nat. Rev. Genet., 1–23. https://www.nature.com/articles/s41576-023-00586-w

tascCODA: Bayesian Tree-Aggregated Analysis of Compositional Amplicon and Single-Cell Data

Published in Frontiers in Genetics, 2021

Accurate generative statistical modeling of count data is of critical relevance for the analysis of biological datasets from high-throughput sequencing technologies. Important instances include the modeling of microbiome compositions from amplicon sequencing surveys and the analysis of cell type compositions derived from single-cell RNA sequencing. Microbial and cell type abundance data share remarkably similar statistical features, including their inherent compositionality and a natural hierarchical ordering of the individual components from taxonomic or cell lineage tree information, respectively. To this end, we introduce a Bayesian model for tree-aggregated amplicon and single-cell compositional data analysis (tascCODA) that seamlessly integrates hierarchical information and experimental covariate data into the generative modeling of compositional count data. By combining latent parameters based on the tree structure with spike-and-slab Lasso penalization, tascCODA can determine covariate effects across different levels of the population hierarchy in a data-driven parsimonious way. In the context of differential abundance testing, we validate tascCODA's excellent performance on a comprehensive set of synthetic benchmark scenarios. Our analyses on human single-cell RNA-seq data from ulcerative colitis patients and amplicon data from patients with irritable bowel syndrome, respectively, identified aggregated cell type and taxon compositional changes that were more predictive and parsimonious than those proposed by other schemes. We posit that tascCODA constitutes a valuable addition to the growing statistical toolbox for generative modeling and analysis of compositional changes in microbial or cell population data.

Recommended citation: Ostner, J., Carcy, S., and Müller, C. L. (2021). "tascCODA: Bayesian Tree-Aggregated Analysis of Compositional Amplicon and Single-Cell Data." Front. Genet. 12, 766405. https://www.frontiersin.org/articles/10.3389/fgene.2021.766405/full

scCODA is a Bayesian model for compositional single-cell data analysis

Published in Nature Communications, 2021

Compositional changes of cell types are main drivers of biological processes. Their detection through single-cell experiments is difficult due to the compositionality of the data and low sample sizes. We introduce scCODA (https://github.com/theislab/scCODA), a Bayesian model addressing these issues enabling the study of complex cell type effects in disease, and other stimuli. scCODA demonstrated excellent detection performance, while reliably controlling for false discoveries, and identified experimentally verified cell type changes that were missed in original analyses.

Recommended citation: Büttner, M., Ostner, J., Müller, C. L., Theis, F. J., and Schubert, B. (2021). "scCODA is a Bayesian model for compositional single-cell data analysis." Nat. Commun. 12, 6876. https://www.nature.com/articles/s41467-021-27150-6