Aim: Data-driven analysis using large-scale NGS data
Genome-wide analysis using next-generation sequencing (NGS) is a central method in genomics and epigenomics. NGS analysis has led to important discoveries about the dynamic regulation of the genome and its dysregulation in disease.
To understand the cooperative regulation of genomic functions, we have been implementing computational analyses for various NGS assays, including:
- ChIP-seq: protein-DNA binding and histone modifications,
- DNase-seq and ATAC-seq: open chromatin,
- Bisulfite-seq: DNA methylation,
- RNA-seq: gene expression,
- Exome-seq: gene mutation,
- Hi-C (Micro-C): three-dimensional chromatin folding,
- ChIA-PET and Hi-ChIP: chromatin looping, and
- Single-cell analysis: cellular heterogeneity and cell differentiation trajectory.
We focus on "data-driven NGS analysis" that aims at ground-breaking biological discoveries from large NGS datasets (hundreds of samples) without relying on existing biological knowledge.
We aim to develop a pipeline for multi-NGS omics analysis that integratively analyzes large datasets from multiple NGS assays, and to discover higher-order biological information, e.g., variation in the interactions among multiple DNA-binding factors across tissue types.
Current research themes
- Development of integrated analysis methods using multiple NGS assays (e.g., epigenome, gene expression, and 3D structure)
- Whole-genome annotation based on various epigenomic data (e.g., histone modifications) to identify novel functional regulatory regions
- Development of methods for detailed analysis of 3D structural data (e.g., Hi-C and Micro-C)
- Enhancer-promoter interaction prediction from large-scale epigenomic data
- Functional analysis of cohesin and CTCF, essential proteins for the regulation of chromosomal structure and gene transcription
- Trajectory analysis using single-cell data to investigate cell fate regulation
- Gene network reconstruction and comparative network analysis using single-cell data
Current projects
Development of a robust computational platform for data-driven epigenome analysis
We aim to develop a computational platform for comparative epigenomic analysis of datasets obtained from multiple NGS assays (multi-NGS omics, Figure 1). This platform integrates ChIP-seq, RNA-seq, and Hi-C with other types of genomic data (e.g., GWAS) and implements semi-automated genome annotation (SAGA) to characterize the whole genome in more detail. We will also develop a data imputation strategy that predicts the "true enrichment pattern" of ChIP-seq and other epigenome data by leveraging information from other relevant cell lines and epigenome marks available in existing databases. This approach can reduce technical noise in raw NGS data (de-noising) and generate in silico data for samples that are missing from a large dataset (data reconstruction).
This platform can greatly reduce the cost of both NGS data generation and computational analysis for large-scale epigenome analysis, especially for precious biological samples (e.g., clinical samples of rare diseases).
Using this system, we will collaborate with other AMED groups to obtain new biological insights into early life stages. We also participate in the International Human Epigenome Consortium (IHEC), where we work with leading experts on integrative analysis.
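The imputation idea described above (predicting a missing track from other cell lines that share the same epigenome mark) can be illustrated with a deliberately naive baseline. This is a minimal sketch, not the strategy the platform implements: the function name `impute_track`, the toy cell-type labels, and the signal values are all hypothetical, and the predictor simply averages the same mark over the other cell types, bin by bin.

```python
from statistics import mean


def impute_track(tracks, target_cell, target_mark):
    """Naive baseline: average the same mark over all other cell types.

    `tracks` maps (cell type, mark) to a list of per-bin signal values.
    All tracks are assumed to share the same genomic binning.
    """
    donors = [
        signal
        for (cell, mark), signal in tracks.items()
        if mark == target_mark and cell != target_cell
    ]
    if not donors:
        raise ValueError(f"no other cell type has data for {target_mark}")
    # zip(*donors) walks the donor tracks bin by bin.
    return [mean(bins) for bins in zip(*donors)]


# Toy data (hypothetical values): signal over four genomic bins.
tracks = {
    ("cellA", "H3K27ac"): [0.0, 2.0, 8.0, 1.0],
    ("cellB", "H3K27ac"): [1.0, 4.0, 6.0, 1.0],
    ("cellA", "H3K4me3"): [0.0, 9.0, 1.0, 0.0],
}

# "cellC" has no H3K27ac experiment, so its track is reconstructed in silico.
imputed = impute_track(tracks, "cellC", "H3K27ac")
print(imputed)
```

A realistic imputation method would also use the other marks measured in the target cell type; the average over cell types is only a reference point against which such methods are commonly compared.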
Links:
- Advanced Research & Development Programs for Medical Innovation (AMED-PRIME) "Understanding of the biological phenomena and responses at the early life stages to improve the quality of health and medical care"
- IHEC Team Japan
Figure 1: A comparative epigenomic analysis platform
Functional analysis of cohesin complex
From the biological perspective, we are interested in the gene expression and three-dimensional chromatin folding regulated by the cohesin complex. Cohesin is involved in gene regulation via mediation and/or insulation of enhancer-promoter loops, the formation of topologically associating domains (TADs) by loop extrusion, and RNA polymerase II elongation, cooperating with various factors such as the cohesin loader, cohesin acetyltransferase, the insulator-binding factor CTCF, and the super-elongation complex (Figure 2). Although mutations in cohesin and the cohesin loader cause Cornelia de Lange syndrome (CdLS), a developmental syndrome with a complex phenotype, the underlying molecular mechanism remains unclear.
Figure 2: The proposed models for cohesin function.
To clarify the shared and unique roles of cohesin and related factors, we performed a large-scale comparative analysis using Hi-C, RNA-seq, and ChIP-seq after depleting cohesin and related factors.
We categorized genes and genomic regions into several clusters and identified candidate regions in which the whole three-dimensional structure is differentially regulated among factors (Figure 3) [Nakato et al., bioRxiv, 2022].
Figure 3: Left: Hierarchical genome structure and multi-Hi-C clustering. Right: A visualization of an example region.
Past works
Single-cell analysis
Single-cell analysis is a powerful technique for characterizing cellular heterogeneity in tissue cells and the trajectory (cell fate) in differentiating cells.
In our project, we developed a computational platform for single-cell analysis using Docker (Figure 4; the latest version of the image is now available as "ShortCake"). By downloading the Docker image, users can run various tools for scRNA-seq and scATAC-seq on a single platform. Because this pipeline does not rely on a large cluster server, it is not affected by server trouble or power outages.
This platform can be used through a command-line interface or a graphical one (Jupyter Notebook and RStudio).
Figure 4: Overview of single-cell analysis platform.
We also developed a new approach for reconstructing gene regulatory networks from sparse single-cell transcriptome data [Nakajima et al., NAR, 2021] (Figure 5). The gene-community-based comparison of multiple networks can find candidate marker genes that cannot be identified by gene-based comparison.
Figure 5: Gene network reconstruction and comparative analysis from sparse scRNA-seq data.
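To give a feel for what a community-based comparison of two networks can look like, the sketch below scores each gene by how much its community membership differs between two networks. It is an illustration under stated assumptions, not the method of Nakajima et al.: the communities are assumed to be already computed by some community detection step, the function names `community_of` and `community_jaccard` and the toy gene labels are hypothetical, and the Jaccard index is a scoring choice made for this example.

```python
def community_of(gene, communities):
    """Return the set of genes sharing a community with `gene` (empty if absent)."""
    for members in communities:
        if gene in members:
            return set(members)
    return set()


def community_jaccard(gene, communities_a, communities_b):
    """Jaccard similarity of a gene's community partners in two networks.

    A score near 0 means the gene sits among different partners in the two
    networks. Genes with no partners in either network score 1.0.
    """
    a = community_of(gene, communities_a) - {gene}
    b = community_of(gene, communities_b) - {gene}
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy communities (hypothetical) detected in two networks over the same genes.
net_a = [{"g1", "g2", "g3"}, {"g4", "g5"}]
net_b = [{"g1", "g2"}, {"g3", "g4", "g5"}]

genes = sorted(set().union(*net_a, *net_b))
scores = {g: community_jaccard(g, net_a, net_b) for g in genes}

# The gene whose community context changed the most between the networks.
most_changed = min(scores, key=scores.get)
print(scores, most_changed)
```

In this toy example `g3` moves from one community to the other, so it receives the lowest score even though nothing about `g3` alone (for instance its own expression level) was compared.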
Link:
Grant-in-Aid for Scientific Research on Innovative Areas "Integrated analysis and regulation of cellular diversity"
Epigenome database project for human vascular endothelial cells (ECs)
As a part of the International Human Epigenome Consortium (IHEC), we cataloged gene expression and active histone marks in different types of human endothelial cells (ECs) from multiple donors (Figure 6) and developed a database site.
Using this database, we performed a comprehensive analysis with chromatin interaction data to understand the diverse phenotypes and physiological functions of ECs.
We developed a robust procedure for comparative epigenome analysis that circumvents variations at the individual level and technical noise. Through this approach, we identified 3,765 EC-specific enhancers, some of which were associated with disease-associated genetic variants (GWAS).
We identified various candidate marker genes for each EC type. Notably, many homeobox genes were differentially expressed across EC types, and their expression was correlated with the relative position of each body organ (Figure 7) [Nakato et al., Epigenetics & Chromatin, 2019]. This reflects the developmental origins of ECs and their roles in angiogenesis, vasculogenesis, and wound healing.
Figure 6: Schematic illustration of the cardiovascular system.
The illustration shows the nine EC types and 33 individual samples (indicated by the prefix “EC”) used in this study.
Figure 7: Differential expression of HOX genes across nine EC types.
Left: heatmaps visualizing the gene expression of four HOX clusters.
Right: read distribution of the enhancer marker H3K27ac (highlighted in orange) around the HOXD cluster. Red arcs indicate chromatin loops obtained by ChIA-PET analysis.
DROMPAplus: easy-to-handle ChIP-seq pipeline tool
We have been developing a ChIP-seq pipeline tool named DROMPAplus [Nakato et al., Methods, 2020]. DROMPAplus can be used for quality checks, PCR-bias filtering, read normalization, peak calling, visualization, and various other analyses of ChIP-seq data. DROMPAplus has been specially designed to be easy to handle for users without a strong bioinformatics background (Figure 8).
See the DROMPAplus website for more information.
Figure 8: Summary of DROMPAplus
SSP: Quality assessment tool for ChIP-seq data
In large-scale NGS analysis, the quality of input samples is critical, and checking samples one by one is difficult; therefore, objective quality metrics are essential to filter out poor-quality data automatically. However, in the EC epigenome project (see above), poor-quality data were present even after removing the low-quality samples identified by recommended quality metrics.
Therefore, we developed SSP, a new quality assessment tool for ChIP-seq analysis (Figure 9; [Nakato et al., Bioinformatics, 2018]). SSP provides a quantitative and sensitive score of the signal-to-noise ratio (S/N) for both point- and broad-source factors, which can be standardized across diverse cell types and read depths. SSP also provides an effective criterion to determine whether a specific normalization or rejection is required for each sample, which cannot be estimated by other currently available quality metrics.
Figure 9: SSP: Sensitive and robust assessment of ChIP-seq read distribution using a strand-shift profile
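The general idea of a strand-shift profile can be sketched in a few lines: forward-strand read positions are shifted step by step toward the reverse-strand positions, and the overlap is scored at each shift. In ChIP-seq data the overlap tends to peak near the fragment length. The code below is a minimal illustration, not the SSP implementation: the function name `strand_shift_profile`, the Jaccard scoring, and the toy coordinates are assumptions made for this example, and the actual score definitions are given in Nakato et al., Bioinformatics, 2018.

```python
def strand_shift_profile(forward_starts, reverse_starts, max_shift):
    """Jaccard overlap between shifted forward-strand and reverse-strand positions.

    Positions are the 5' ends of mapped reads on one chromosome.
    Returns a list whose index d holds the score at shift d (0..max_shift).
    """
    fwd = set(forward_starts)
    rev = set(reverse_starts)
    profile = []
    for d in range(max_shift + 1):
        shifted = {p + d for p in fwd}
        inter = len(shifted & rev)
        union = len(shifted | rev)
        profile.append(inter / union if union else 0.0)
    return profile


# Toy reads (hypothetical): three fragments of length 150 plus one
# unpaired background read on each strand.
forward = [100, 300, 700, 1500]
reverse = [250, 450, 850, 5000]

profile = strand_shift_profile(forward, reverse, max_shift=300)

# The shift with the highest overlap estimates the fragment length.
best_shift = max(range(len(profile)), key=profile.__getitem__)
print(best_shift, profile[best_shift])
```

In this toy example the profile is zero everywhere except at the fragment length. In real data, the height of that peak relative to the background level at large shifts is what carries the signal-to-noise information.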