Aim: Data-driven analysis using large-scale NGS data
"Genome" is all the genetic information passed down through generations. It contains all the genes needed for life activities. Different tissues in our body, such as bones, nerves, and skin, express the appropriate set of genes they need, and the functional units that control this expression such as enhancers are also included in the genome. These functional regions are subject to chemical modifications (epigenome) that act as "switches" controlling the level and timing of gene expression. While the genome sequence is identical in all cells of the body, the epigenome shows different patterns in different tissues and plays a critical role in ensuring their proper function. It is fascinating to note that the human genome sequence contains a wealth of information about life that doesn't even fill the data capacity of a DVD.
There are individual variations in the genome sequence that affect traits such as alcohol tolerance, eye color, and hair color. In addition, harmful mutations or abnormalities in the genome or epigenome can lead to various diseases, including cancer. "Genomics" and "epigenomics" are the fields that study the mechanisms that determine which genomic regions are critical for specific biological activities, and how various mutations affect them. Our lab focuses on genomics and epigenomics research.
Genome-wide analysis using next-generation sequencing (NGS, above) is a central method in the fields of genomics and epigenomics. It can capture various kinds of genomic information, including gene transcription levels, protein-DNA binding, DNA methylation, genome replication, and the three-dimensional structure of the genome. NGS analysis has led to important discoveries about the dynamic regulation of the genome and its dysregulation in disease.
With the rapid increase in available genomic and epigenomic data in recent years, there is a great demand for "data-driven large-scale NGS analysis" that can simultaneously analyze a large amount of NGS data to make breakthrough discoveries that challenge previous understandings. In contrast to hypothesis-driven analysis, which uses experiments to test working hypotheses based on existing knowledge, data-driven analysis makes new discoveries by exploiting features in the data itself, without relying on existing knowledge. It has the potential to elucidate complex and unexpected mechanisms, for example, "a particular protein interacts with an unexpected protein in a particular tissue, and the loss of this interaction leads to the onset of a particular type of disease".
List of NGS assays
- ChIP-seq: protein-DNA binding and histone modifications,
- DNase-seq and ATAC-seq: open chromatin,
- Bisulfite-seq: DNA methylation,
- RNA-seq: gene expression,
- Exome-seq: gene mutation,
- Hi-C (Micro-C): three-dimensional chromatin folding,
- ChIA-PET and Hi-ChIP: chromatin looping, and
- Single-cell analysis: cellular heterogeneity and cell differentiation trajectory.
So, how do we extract biologically meaningful information from a vast amount of NGS data? Even a single NGS sample contains information at the whole-genome level, and with hundreds or thousands of samples the data volume becomes truly enormous. In addition, the structure and characteristics of the data vary among NGS assays, and data from technically challenging experiments often vary significantly in quality. Since the "true" results are unknown, applying supervised learning is often difficult. Currently, extracting reliable and meaningful insights from large amounts of data, including poor-quality data, is extremely challenging. Despite the high demand, only a few highly skilled researchers can perform such analyses. To draw a parallel with a restaurant, it is like having a variety of ingredients (data) that have not been prepared for cooking (data formatting), with too few utensils (tools) and chefs (analysts) to cook them.
To overcome this problem, our lab is developing methods to realize data-driven large-scale NGS analysis. We are particularly interested in the relationship between epigenomic states (ChIP-seq), three-dimensional genome structures (Hi-C, Micro-C), and gene expression states (RNA-seq), and we want to understand how these fluctuate during disease and cell differentiation processes. In collaboration with experts from other fields such as life sciences, physics and mathematics, we delve into the still unsolved mysteries of the genome.
Current projects
Development of a robust computational platform for data-driven epigenome analysis
We aim to develop a computational platform for comparative epigenomic analysis of datasets obtained from multiple NGS assays (multi-NGS omics, Figure). This platform will integrate ChIP-seq, RNA-seq, Hi-C and other types of genomic data (e.g., GWAS annotation) and implement semi-automated genome annotation (SAGA) to characterize the entire genomic region in more detail. We will also develop a data imputation strategy that predicts the enrichment pattern of ChIP-seq and other epigenomic data by leveraging information from other relevant cell lines and epigenomic marks available in existing databases. This approach can reduce technical noise in raw NGS data and generate virtual data of missing samples in a large dataset in silico. This platform can significantly reduce the cost of both NGS data generation and computational analysis for large-scale epigenomic analysis, especially for valuable biological samples (e.g., clinical samples of rare diseases).
- AMED-PRIME "Understanding of the biological phenomena and responses at the early life stages to improve the quality of health and medical care"
Figure: A comparative epigenomic analysis platform
We have developed a new method, HiC1Dmetrics, which can efficiently extract a variety of one-dimensional features from Hi-C data. In this method, we have developed several new metrics that can quantitatively extract specific 3D structures, such as chromatin hubs. These metrics are effective for integrating 3D information into the SAGA approach.
Figure: Overview of HiC1Dmetrics
IHEC project
We also participate in the International Human Epigenome Consortium (IHEC), where we collaborate with leading experts on integrative analysis.
- International Human Epigenome Consortium (IHEC)
- IHEC Team Japan
Functional analysis of cohesin complex
From a biological perspective, we are interested in how the cohesin complex regulates gene expression and three-dimensional chromatin folding. Cohesin is involved in gene regulation through various mechanisms, including the mediation or insulation of enhancer-promoter loops, the formation of topologically associating domains (TADs) by loop extrusion, and RNA polymerase II elongation, in cooperation with factors such as the cohesin loader, cohesin acetyltransferase, the insulator-binding factor CTCF, and the super-elongation complex (Figure). Although mutations in cohesin and the cohesin loader cause Cornelia de Lange syndrome (CdLS), a developmental syndrome with a complex phenotype, the underlying molecular mechanism is still unclear.
Figure: The proposed models for cohesin function.
We found that a small fraction of cohesin binding sites, and the chromatin loops they mediate, are located in intragenic regions and are negatively correlated with transcriptional activity, i.e., transcriptional activation of the host gene attenuates cohesin binding. To characterize these decreased intragenic cohesin sites (DICs), we performed a large-scale multi-omics analysis of more than 100 NGS samples, consisting of ChIP-seq, RNA-seq, Hi-C and ChIA-PET data. The analysis suggested that cohesin negatively regulates transcription, possibly by binding RNA polymerase II (Pol II) or inhibiting its elongation, and that this function appears to vary in patients with cohesin-related disease.
Figure A: Binding at decreased intragenic cohesin sites (DICs) is attenuated by gene activation. B: We implemented a machine learning approach to characterize DICs using multi-omics data.
We have also performed a large-scale comparative analysis using Hi-C, RNA-seq, and ChIP-seq data from cells depleted of cohesin and related factors to clarify their shared and unique roles. We categorized genes and genomic regions into several clusters and identified candidate regions in which the overall three-dimensional structure is differentially regulated among factors. For this analysis, we developed a new 3D genome analysis method, CustardPy. The analysis revealed that there are multiple classes of TAD boundaries regulated by different factors, that variations in long-range interactions between TADs correlate with their epigenomic states, and that the genomic abundance of cohesin differs significantly between compartments A and B.
Figure left: We generated multiple samples with depletion of cohesin and related factors and generated multi-omics data. Right: A visualization of the multi-scale insulation score analysis. Top: Hi-C contact map. Middle: Enrichment of factors and histone modifications. Bottom: Multi-scale insulation score. Red regions indicate insulated TAD boundaries.
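As a rough illustration of the multi-scale insulation score shown in the figure, the sketch below uses the commonly described sliding-square definition: for each bin, average the contacts that cross it within a window, then repeat for several window sizes. This is a minimal sketch on a toy matrix, not the CustardPy implementation; the function names `insulation_score` and `multiscale_insulation`, the window sizes, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def insulation_score(contact, window):
    """Log2 insulation score per bin for a square window of `window` bins."""
    n = contact.shape[0]
    raw = np.full(n, np.nan)  # bins too close to the matrix edge stay NaN
    for i in range(window, n - window):
        # Contacts between the `window` bins upstream of bin i and the
        # `window` bins downstream of it, i.e. contacts that cross bin i.
        square = contact[i - window:i, i + 1:i + window + 1]
        raw[i] = square.mean()
    # Normalize by the chromosome-wide mean; strongly negative = insulated.
    return np.log2(raw / np.nanmean(raw))

def multiscale_insulation(contact, windows):
    """Stack insulation scores for several window sizes (scales x bins)."""
    return np.vstack([insulation_score(contact, w) for w in windows])

# Toy contact map (hypothetical data): two 10-bin domains with dense
# contacts inside each domain and sparse contacts between them.
toy = np.ones((20, 20))
toy[:10, :10] = 10.0
toy[10:, 10:] = 10.0

windows = [2, 3, 4]  # assumed window sizes, in bins
scores = multiscale_insulation(toy, windows)

# The most insulated bin at each scale should sit at the domain boundary.
boundaries = [int(np.nanargmin(row)) for row in scores]
print(boundaries)
```

On real Hi-C data the matrix would first be normalized, and bins with zero contacts would need a pseudocount before taking the logarithm; both steps are omitted here for brevity.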
Various analyses using single-cell methods
Single-cell analysis, which observes genomic information at the single-cell level, is used to observe cellular heterogeneity in tissues, cell differentiation trajectory, and stochastic fluctuations in gene expression. We are working on the following projects mainly using single-cell gene expression data (scRNA-seq).
- Time-course analysis of hepatocellular fibrosis initiation and healing (Figure A)
- Network analysis using gene co-expression and mutual exclusivity (Figure B)
- Trajectory analysis using stem cell differentiation system with cohesin disease model (Figure C)
Figure A: Time-course scRNA-seq analysis for mouse liver. B: Gene co-expression network estimated from scRNA-seq. C: Trajectory analysis using differentiating stem cells.
Past works
Single-cell analysis pipeline ShortCake [Nakato, Jikken Igaku (Experimental Medicine), 2021]
We have developed ShortCake, a computational platform for single-cell analysis. By downloading the Docker image, users can run a variety of single-cell analysis tools on any platform. ShortCake can be used from the CUI (command line) or a GUI (Jupyter notebook and RStudio). Since ShortCake does not rely on a large cluster server, it is not affected by server failures and shutdowns.
Figure: Overview of ShortCake
We have also developed a new approach, EEISP, which robustly estimates gene co-expression and mutual exclusivity from sparse scRNA-seq data. By applying this method to glioblastoma stem cell data and conducting a comparative analysis of gene networks between stem cells and non-stem cells, we identified several new marker gene candidates of cancer stem cells.
Figure: Gene co-expression network estimation and network comparison from sparse scRNA-seq data
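To make the idea of co-expression versus mutual exclusivity in sparse data concrete, the sketch below binarizes a count matrix (detected or not detected) and compares how often two genes are detected in the same cell with what independence would predict. This is a simple observed/expected ratio chosen for illustration, not the EEISP scoring scheme; the function name `cooccurrence_log_ratio`, the pseudocount, and the toy counts are assumptions.

```python
import numpy as np

def cooccurrence_log_ratio(counts, pseudocount=0.5):
    """Log2 observed/expected co-detection for every gene pair.

    counts: cells x genes matrix of read or UMI counts.
    Positive values suggest co-expression, negative values suggest
    mutual exclusivity.
    """
    detected = (counts > 0).astype(float)   # binarize sparse counts
    n_cells = detected.shape[0]
    # Number of cells in which both genes of a pair are detected.
    observed = detected.T @ detected
    # Expected number of such cells if the two genes were independent.
    per_gene = detected.sum(axis=0)
    expected = np.outer(per_gene, per_gene) / n_cells
    # The pseudocount avoids log(0) for pairs never seen together.
    return np.log2((observed + pseudocount) / (expected + pseudocount))

# Toy data (hypothetical): genes A and B are detected in the same four
# cells, while gene C is detected only in the other four cells.
toy_counts = np.array([
    [5, 3, 0],
    [2, 4, 0],
    [1, 6, 0],
    [7, 2, 0],
    [0, 0, 4],
    [0, 0, 9],
    [0, 0, 3],
    [0, 0, 5],
])

ratio = cooccurrence_log_ratio(toy_counts)
print("A-B:", ratio[0, 1])  # positive: co-detected more often than expected
print("A-C:", ratio[0, 2])  # negative: never detected together
```

A gene network can then be drawn by keeping only the pairs whose score exceeds a chosen threshold in either direction; a real analysis would also need a significance test, which this sketch leaves out.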
Epigenome database project for human vascular endothelial cells (ECs)
As a part of the International Human Epigenome Consortium (IHEC), we cataloged gene expression and active histone marks in different types of human endothelial cells (ECs) from multiple donors and developed a database site.
- Human Endothelial Epigenome Database
Figure: Schematic illustration of the cardiovascular system.
The nine EC types and 33 individual samples (indicated by the prefix “EC”) used in this study are shown.
Using this database, we performed a comprehensive analysis with chromatin interaction data to understand the diverse phenotypes and physiological functions of ECs.
We developed a robust procedure for comparative epigenome analysis that circumvents variations at the individual level and technical noise. Through this approach, we identified 3,765 EC-specific enhancers, and some of them were associated with disease-associated genetic variations (GWAS).
We also identified various candidate marker genes for each EC type. Notably, many homeobox genes were differentially expressed across EC types, and their expression was correlated with the relative position of each body organ. This reflects the developmental origins of ECs and their roles in angiogenesis, vasculogenesis, and wound healing.
Figure 7: Differential expression of HOX genes across nine EC types.
Left: heatmaps visualizing the gene expression of four HOX clusters.
Right: read distribution of the enhancer marker H3K27ac (highlighted in orange) around the HOXD cluster. Red arcs indicate chromatin loops obtained by ChIA-PET analysis.
Unlike other NGS assays such as RNA-seq, ChIP-seq analysis often requires generating multiple datasets from a single sample (e.g., for multiple histone modifications), which leads to many samples and cumbersome analysis. We have developed DROMPAplus, a pipeline tool for efficient comparative analysis of such large numbers of ChIP-seq datasets. It supports various quality assessments, fragment length estimation, PCR bias filtering, normalization, peak extraction, and visualization.
Figure: Summary of DROMPAplus
In large-scale NGS analysis, the quality of input samples is absolutely critical, and one-by-one checking is difficult; therefore, objective quality metrics are essential to filter poor-quality data automatically. However, in the EC epigenome project (see above), poor-quality data were present even after removing the low-quality samples identified by recommended quality metrics.
Therefore, we developed SSP, a new quality assessment tool for ChIP-seq analysis. SSP provides a quantifiable and sensitive score of the signal-to-noise ratio (S/N) for both point- and broad-source factors, which can be standardized across diverse cell types and read depths. SSP also provides an effective criterion to determine whether a specific normalization or rejection is required for each sample, which cannot be estimated by other currently available quality metrics.
Figure: SSP: Sensitive and robust assessment of ChIP-seq read distribution using a strand-shift profile
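The general idea behind a strand-shift profile can be sketched as follows: in ChIP-seq, forward-strand and reverse-strand reads flank each bound fragment, so shifting forward-strand positions by the fragment length makes them line up with reverse-strand positions. The code below computes, for each shift, the fraction of forward-strand positions that coincide with a reverse-strand position. It is a minimal sketch of this idea on simulated data, not the scoring used by SSP; the function name `strand_shift_profile`, the genome size, and the fragment length of 150 bp are assumptions.

```python
import numpy as np

def strand_shift_profile(fwd_pos, rev_pos, max_shift):
    """Fraction of forward-strand positions matched at each shift.

    fwd_pos, rev_pos: 5' end positions of forward and reverse reads.
    Returns an array of length max_shift + 1, indexed by shift in bp.
    """
    fwd = np.unique(fwd_pos)
    rev = np.unique(rev_pos)
    profile = np.empty(max_shift + 1)
    for shift in range(max_shift + 1):
        # A forward position "matches" if a reverse read ends exactly
        # `shift` bp downstream of it.
        profile[shift] = np.isin(fwd + shift, rev).mean()
    return profile

# Simulated reads (hypothetical data): 300 fragments of 150 bp observed
# from both ends, plus 300 background reads per strand at random positions.
rng = np.random.default_rng(0)
genome_size = 1_000_000
fragment_length = 150
starts = rng.integers(0, genome_size, size=300)
fwd_reads = np.concatenate([starts, rng.integers(0, genome_size, size=300)])
rev_reads = np.concatenate([starts + fragment_length,
                            rng.integers(0, genome_size, size=300)])

profile = strand_shift_profile(fwd_reads, rev_reads, max_shift=300)
estimated_length = int(np.argmax(profile))
print(estimated_length, profile[estimated_length])
```

The peak height relative to the flat background away from the peak gives an intuition for signal-to-noise: a sample dominated by background reads would show a much lower peak at the fragment length.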