
How does BioInformatics Technology handle the sequencing analysis process? - Interview with MGI Senior Vice-President Dr. Ni Ming

Release date: 2023-04-23 | Writer: MGI

As sequencing technologies and data analysis tools continue to advance, it is more important than ever to ensure your sequencing data is being handled appropriately. Analyzing sequencing data is a complex process, and current platforms are capable of performing diverse tasks such as assembling genomes, quantifying transcripts, interpreting detailed experiments, and much more. MGI has introduced a variety of internationally competitive BIT (BioInformatics Technology) products, such as the bioinformatics analysis software MegaBOLT/ZBOLT/ZBOLT Pro.

MGI Senior Vice-President Dr. Ni Ming shared the workings of these platforms in his interview with SEQanswers.

1. What is your approach to quality control of sequencing data and how do you ensure that the data is of a high enough quality for resulting downstream analyses?

NGS is a bit different from other businesses. All products and analysis services are based on the sequencing data, which means that the quality of this kind of data is vital for most applications. MGI and Complete Genomics ensure quality control in almost all steps of sequencing:

Firstly, our sequencing strategy is based on DNA nanoballs, which minimizes the number of PCR cycles and so avoids introducing errors during DNA copying. Following that, we use automation, which significantly reduces manual operations during lab work. During data analysis, we use our in-house software SOAPnuke, which is publicly available on GitHub and integrated into MegaBOLT for customer use and testing. It applies a number of key criteria, such as Q30, GC content and adapter contamination rate, to evaluate sequencing data quality before and after data processing, so that only good-quality data is fed into subsequent analyses. This improves the accuracy and reliability of the insights derived from data analyses. In addition, all parameters can be modified to cater to different customer needs.

Finally, we have in place management systems covering the whole sequencing workflow – from sample submission, to sample management, laboratory management, data analysis and reporting. The schematic diagram above represents our fully automated sequencing and analysis process, with integrated, accelerated one-stop analysis.

a) Comprehensive data filtering and detailed visualization

Once the raw data is generated, it is processed with our self-developed, full-featured SOAPnuke software, which filters out reads that contain adapter sequence, are of low quality, or have a high N content, in order to obtain high-quality data. At the same time, statistics on the preprocessed data are presented visually, for example the proportion of Q30 bases (i.e. bases with an error rate of 0.1%) and the base composition and quality of each sequencing cycle, for a better understanding of the data.
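The three filters described above can be sketched in a few lines of Python. This is only an illustration of the idea, not SOAPnuke's implementation: the `Read` type, the function names, the adapter string and every threshold are assumptions chosen for the example. It also shows why Q30 corresponds to a 0.1% error rate, since a Phred score Q maps to an error probability of 10^(-Q/10).

```python
from typing import NamedTuple


class Read(NamedTuple):
    """One sequencing read (hypothetical type for this sketch)."""
    name: str
    seq: str
    qual: str  # Phred+33 encoded quality string


def phred_scores(qual):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in qual]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def keep_read(read, adapter, max_n_frac=0.05, low_q=20, max_low_q_frac=0.5):
    """Return True if the read passes all three filters.

    All thresholds are illustrative assumptions, not SOAPnuke defaults.
    """
    if not read.seq:
        return False
    if adapter in read.seq:  # adapter contamination
        return False
    if read.seq.count("N") / len(read.seq) > max_n_frac:  # high N content
        return False
    scores = phred_scores(read.qual)
    low = sum(1 for s in scores if s < low_q)
    return low / len(scores) <= max_low_q_frac  # low overall quality


def filter_reads(reads, adapter):
    """Keep only the reads that pass keep_read."""
    return [r for r in reads if keep_read(r, adapter)]


def q30_fraction(reads):
    """Fraction of bases with a Phred score of at least 30."""
    scores = [s for r in reads for s in phred_scores(r.qual)]
    return sum(1 for s in scores if s >= 30) / len(scores) if scores else 0.0
```

Calling `q30_fraction` on the reads before and after `filter_reads` gives the kind of before/after comparison mentioned above.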

b) Flexible parameters adjustment

Relevant parameters can be flexibly adjusted. Because datasets from the same kinds of library and sequencer often share the same data characteristics, users can apply the same parameters to similar data and build business processes according to platform characteristics, arriving at a suitable parameter scheme for subsequent production work.

c) Accelerated quality control process

To optimize the whole data quality control process, we have designed a streamlined pipeline that splits large sequencing datasets and processes the parts with automatic parallelization.
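As a rough illustration of data splitting with parallel processing, the sketch below cuts a list of quality strings into chunks, computes a partial Q30 count for each chunk in a worker pool, and merges the partial results. The function names and chunk size are assumptions for the example. A thread pool is used only to keep the sketch self-contained; a CPU-bound production pipeline would more likely use separate processes or native code.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def qc_chunk(quality_strings):
    """Count Q30 bases and total bases in one chunk (Phred+33 encoding)."""
    q30 = 0
    total = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            if ord(ch) - 33 >= 30:
                q30 += 1
    return q30, total


def parallel_q30(quality_strings, chunk_size=2, workers=4):
    """Compute the overall Q30 fraction by merging per-chunk counts."""
    chunks = split_into_chunks(quality_strings, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(qc_chunk, chunks))
    q30 = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return q30 / total if total else 0.0
```

Because each chunk returns counts rather than a ratio, the merged result is identical to what a single pass over the whole dataset would give.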

2. How do you evaluate the quality and reliability of generated assemblies or alignments?

Apart from establishing strict quality control in every step, we have also taken measures to ensure data quality and reliability throughout the entire data lifecycle, from acquisition to disposal.

a) Unified task scheduling and data management

Unified task scheduling and data management software is used during automatic analysis, and the integrity of the result files and reports is verified during scheduling to ensure the reliability of results. In addition, automatic re-analysis of each task improves the fault tolerance and success rate of the system.
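The combination of integrity checking and automatic re-analysis can be pictured as a small retry loop. The sketch below is a generic pattern under stated assumptions, not MGI's scheduler: `run_with_retry` and `result_is_intact` are hypothetical names, and "intact" is simplified to "the result file exists and is not empty".

```python
import os


def result_is_intact(path):
    """Simplified integrity check: the file exists and is not empty."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


def run_with_retry(task, result_path, max_attempts=3):
    """Run task(result_path) until its result passes the integrity check.

    Returns the number of the attempt that succeeded, or raises
    RuntimeError once max_attempts have been used up.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            task(result_path)
        except Exception as exc:
            print(f"attempt {attempt} raised {exc!r}, re-running")
            continue
        if result_is_intact(result_path):
            return attempt
        print(f"attempt {attempt} produced an incomplete result, re-running")
    raise RuntimeError(f"no intact result after {max_attempts} attempts")
```

A real scheduler would typically use a stronger check, such as a checksum or a completion marker written by the analysis step.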

b) Key indicators and comparison against gold standards

In terms of analysis, conventional indicators such as mapping rate and coverage are reported to evaluate alignment quality. Alignments are also evaluated indirectly through the accuracy of variant detection: the variant calls for a standard sample are compared against gold standards, or against the results of the commonly used GATK Best Practices pipeline.
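Comparing variant calls against a gold standard usually comes down to counting true positives, false positives and false negatives, and deriving precision and recall (sensitivity) from them. The sketch below assumes each variant is a `(chrom, pos, ref, alt)` tuple and uses exact matching; the function name is hypothetical, and dedicated benchmarking tools additionally normalize variant representation and handle genotypes.

```python
def benchmark_calls(called, truth):
    """Compare called variants with a truth set using exact matching.

    Each variant is a (chrom, pos, ref, alt) tuple.
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and in the truth set
    fp = len(called - truth)   # called but not in the truth set
    fn = len(truth - called)   # in the truth set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}
```

Running the same comparison for two aligners on the same standard sample makes the precision versus sensitivity trade-off between them explicit.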

c) Multiple alignment tools for users’ selection

At the same time, we provide different alignment software to choose from, some with better precision in variant detection and others with better sensitivity. Users can select tools based on their specific business needs.

3. How do you perform transcript quantification and also differential expression analysis?

Our strategy in MegaBOLT is to help clients with analyses that require a lot of computing time and resources. The pipeline we use is heavily based on GATK Best Practices, which is well-recognized and publicly available. The output of our RNA-seq pipeline is the standard gene expression matrix for each sample. Other downstream analyses of gene expression are also included, e.g. heatmaps, cluster analysis, etc.

For differential expression analysis, many software tools are available, and the choice highly depends on the customer’s project design, e.g. selection of control/treatment pairs, grouping of different samples, etc. The customer can use the standard expression matrix as input for most of the tools, e.g. DEGseq, DESeq, edgeR, EBSeq, NOISeq and PoissonDis.
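To make the role of the expression matrix and the sample grouping concrete, here is a deliberately simple sketch that normalizes raw counts to counts per million (CPM) and computes a log2 fold change between a control group and a treatment group. It is not a substitute for the tools listed above, which model count dispersion and perform statistical testing; the function names, the matrix layout (`sample -> gene -> count`) and the pseudocount are all assumptions for illustration.

```python
import math


def counts_per_million(counts):
    """Scale one sample's raw gene counts to counts per million."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}


def log2_fold_change(matrix, control, treatment, pseudocount=1.0):
    """Log2 ratio of mean CPM in treatment samples over control samples.

    matrix maps sample name -> {gene: raw count}; control and treatment
    are lists of sample names. The pseudocount avoids division by zero.
    """
    cpm = {sample: counts_per_million(matrix[sample])
           for sample in control + treatment}
    genes = matrix[control[0]].keys()
    result = {}
    for gene in genes:
        mean_c = sum(cpm[s][gene] for s in control) / len(control)
        mean_t = sum(cpm[s][gene] for s in treatment) / len(treatment)
        result[gene] = math.log2((mean_t + pseudocount) / (mean_c + pseudocount))
    return result
```

The `control` and `treatment` lists are where the project design mentioned above enters the analysis: the same matrix yields different comparisons depending on how samples are grouped.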

4. What types of visualization tools are available and how do you decide which is best for your data?

Our mission at MGI is to minimize the customer’s manual work and maximize data quality and security by providing visualization tools which cover the whole NGS workflow: the ZTRON series of products. Using the ZTRON series, we can manage biological samples and laboratories, as well as manage the data generated by sequencing machines, analyze it, and generate relevant reports.

For each of the tools, we have dedicated a lot of effort and resources to development, testing and launch. These tools all have unique designs and comprehensive functions, so it is difficult for us to pick a so-called “best”. If I were doing bioinformatics, I would find PaaZ most useful as it allows checking and safe deletion. Perfectly integrated into the whole automated system, it supports customized pipelines in its user interface and therefore requires the least bioinformatics expertise.

The analysis system provides a visual interface for the scheduling of bioinformatic analyses:

• Visual view of analysis progress, status, and logs;

• Automatic management of genetic data, backup, and customization of governance rules;

• Customization of the bioinformatic analysis pipeline through “drag and drop” in a visual interface. According to the quality control standards provided by the user, the quality control status of each sample can be set and displayed.

In the QC process, statistical tables and figures can be used to present data performance visually. During sample analysis, users can rely on tools like samtools to view the detailed status of the alignment, and visualize results such as variant calls in IGV.

5. What are some of the latest trends in sequencing data analysis and how do you stay up-to-date with these developments?

In recent years, an increasing number of scientists have been approaching the entirety of biology from a trans-omics angle. By integrating DNA, epigenetic, RNA, protein and other molecular measurements through trans-omics data analysis, a complete picture of the cell can be obtained. Not only does this provide new scientific insights that cannot be found via a single-omics approach, it also offers different perspectives for discovery across multiple biological levels. With the continuous reduction in sequencing costs, the scale of trans-omics applications has increased dramatically, resulting in unprecedented challenges for large-scale sample collection, preservation, sequencing data production and analysis platforms. To address this challenge, we have developed an integrated platform for storage, reading, computing and usage, which provides multiple kinds of hardware and software support and thereby addresses the bottleneck in large-scale trans-omics data analysis and production. The applications of sequencing data analysis are primarily demonstrated in the domains of computing and usage.

The following illustrates the three latest trends in sequencing data analysis and the efforts made by MGI and Complete Genomics to stay up-to-date with these trends.

1) The prevalence of large-scale trans-omics datasets has created a large demand for computing acceleration. Moreover, by using the complete sequence of a human genome (T2T), more accurate and comprehensive analysis results can be obtained. To achieve calculation, storage, and management for large-population genomics, it is necessary to create highly cost-effective, high-density, and highly scalable technology and products. Therefore, we have independently developed the MegaBOLT series and ZTRON series products. Among them, the MegaBOLT/ZBOLT/ZBOLT Pro bioinformatics analysis accelerators adopt a parallel computing architecture with multiple pipelines and are over 300 times faster than classic analysis algorithms. Their analytic capacity can reach up to 5,000/17,000/70,000 WGS per year, respectively, with a daily throughput of approximately 1TB/5TB/20TB of genetic data. Combined with our sequencer products, they can achieve highly efficient genetic data management. For example, the DNBSEQ-T7* platform produces 60 sets of WGS data per run, which then take only half a day to be processed by the ZBOLT Pro. Additionally, we provide the ZTRON genetic data center all-in-one machine as a “one-stop-shop” for large-population projects with tens of thousands, hundreds of thousands, or millions of samples, fused with the ZBOLT bioinformatics analysis accelerator (also capable of high-performance data management) to maximize cost reduction and fully accelerate genomic data processing.

2) The prevalence of large-scale trans-omics datasets also creates increased requirements for data security and management. When running large-scale trans-omics data analysis and management, data security must be considered. Managing life-science data and sharing it in a secure, privacy-preserving way are major challenges for big data in genomics. The technological bottlenecks include difficulties in tracing personal data, protecting security, and sharing across information islands. Currently, countries around the world have introduced relevant personal data privacy protection laws and regulations, such as GDPR in Europe, HIPAA in the United States, and China's "Data Security Law" and "Personal Information Protection Law". For data security and privacy protection, genomic data-related products need to follow the principle of ‘Privacy by Design’, which emphasizes privacy and security from the very beginning of the process: from design, to safe and efficient computing, storage, and encryption of data, to safe and efficient transmission and management. So far, we have completed the research and development of products based on the above three points, fully protecting customer data privacy and security.

In addition, large-scale trans-omics research not only places heavy demands on the innovative development of software and hardware for storage, reading, computing and usage, it also brings new development opportunities for laboratory management, which is gradually moving towards digitization and intelligence. For this reason, we have established the ZLIMS four-layer laboratory management architecture to provide laboratories with full-process, full-cycle management from sample to experimental results. The four layers are environmental management, equipment management, application management, and data management. ZLIMS has been successfully applied in large-scale sequencing laboratories with a million samples.

3) The breakthrough of AI technology has promoted the integration of AI and bioinformatics. In this regard, we have always kept pace with cutting-edge technology. For instance, by utilizing the self-developed MegaBOLT-DV deep learning variant calling algorithm and training its model on our own dataset, we can obtain more accurate results. Further, we look forward to new opportunities that will be brought about by the integration of GPT-4 and bioinformatics applications.

In the future, in response to the characteristics of large-scale trans-omics research, such as numerous samples, lengthy cycles, complex projects, and voluminous data, MGI and Complete Genomics will continue to develop a complete set of digitalized core tools for life sciences to facilitate efficient trans-omics research.

6. What steps do you take to ensure that your analysis and pipelines are accurate and reproducible?

In terms of accuracy, as mentioned earlier, QC preprocessing ensures that the data entering analysis is as accurate as possible. The use of the T2T genome and of alignment software with a higher level of accuracy adds to this, while AI algorithms help to continuously improve accuracy.

In terms of reproducibility, variant detection software may randomly down-sample high-depth data and make use of random functions, both of which can produce non-reproducible results. You can disable down-sampling, modify the use of random functions, etc. to ensure reproducibility.
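One common way to keep random down-sampling reproducible is to fix the seed of the random number generator, so the same reads are selected on every run. The sketch below illustrates that idea only; it is an assumption about what "modifying the use of random functions" could look like, not a description of how any particular variant caller behaves, and `downsample` is a hypothetical function.

```python
import random


def downsample(reads, max_depth, seed=None):
    """Randomly keep at most max_depth reads, preserving their order.

    With a fixed seed the same reads are chosen on every run; with
    seed=None the selection differs from run to run.
    """
    if len(reads) <= max_depth:
        return list(reads)  # nothing to down-sample
    rng = random.Random(seed)  # private generator, no global state touched
    chosen = sorted(rng.sample(range(len(reads)), max_depth))
    return [reads[i] for i in chosen]
```

Using a private `random.Random` instance rather than the module-level functions keeps the result independent of any other code that draws random numbers in the same process.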

*Unless otherwise informed, StandardMPS and CoolMPS sequencing reagents, and sequencers for use with such reagents are not available in Germany, Spain, UK, Sweden, Italy, Czech Republic, Switzerland and Hong Kong (CoolMPS is available in Hong Kong).

*Products are provided for Research Use Only. Not for use in diagnostic procedures (except as specifically noted).
