October 6, 2017 | Use Case
How complete is your bacterial whole-genome assembly?
The tremendous reduction in sequencing costs has resulted in an escalating number of genomes being sequenced or re-sequenced. As of October 2017, there are more than 100,000 complete and draft genomes available at the National Center for Biotechnology Information (NCBI).
The majority of these are small microbial genomes of less than 10 Mb, sequenced using Next Generation Sequencing (NGS) technology and then assembled de novo. Obtaining a complete catalog of genes is the first and foremost step in a genome annotation project. However, assembling draft genomes from short sequencing reads remains very challenging, and the completeness of the resulting assemblies is often questionable.
Arkgene aims to provide our users with a user-friendly solution for evaluating the completeness of their draft genomes, through the integration of the BUSCO (Benchmarking Universal Single-Copy Orthologs; doi.org/10.1093/bioinformatics/btv351) tool into the platform. BUSCO has recently emerged as a common tool for evaluating the completeness of genome assemblies, with more than 200 cumulative citations as of 2017 (statistics based on Web of Science). The BUSCO algorithm finds open reading frames (ORFs) in the assembled genome and compares them to core gene datasets from specific lineages, such as bacteria, eukaryota, metazoa, fungi and plants.
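To make the idea of a completeness score concrete, here is a minimal sketch of how percentages could be tallied once each expected core gene has been classified. It is a toy illustration and not BUSCO's own code: the function name, the status labels and the ten-gene example are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical status labels, one per expected core gene (an assumption for
# this sketch; BUSCO's own output files may label categories differently).
CATEGORIES = ("complete", "fragmented", "missing")


def summarize_statuses(statuses):
    """Return the percentage of core genes falling into each category."""
    total = len(statuses)
    if total == 0:
        raise ValueError("no core gene statuses supplied")
    counts = Counter(statuses)
    return {
        category: round(100.0 * counts.get(category, 0) / total, 1)
        for category in CATEGORIES
    }


# Made-up example: 10 expected core genes, 8 found intact,
# 1 found only in part, 1 not found at all.
example_statuses = ["complete"] * 8 + ["fragmented"] + ["missing"]
summary = summarize_statuses(example_statuses)
print(summary)
```

A draft genome in which nearly all expected core genes are found intact is likely to be close to complete, while a large share of fragmented or missing genes points to gaps in the assembly.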
To cite an example, a postgraduate student has sequenced five bacterial genomes, each with a genome size of 5 Mb. The student needs to know how complete these de novo draft genomes are, which is also an important prerequisite for publication in a research journal. BUSCO is a suitable tool for this purpose. However, the student has no bioinformatics server set up and no prior knowledge of command-line execution or Linux/UNIX-like operating systems, and so is unable to run a BUSCO analysis.
With Arkgene, the student can easily upload genome files, or access files already uploaded to Arkgene, and then execute a job to run BUSCO. All of these steps are done through a simple, interactive user interface, offering an alternative to troublesome command-line execution or operating a virtual machine to run the BUSCO tool. All the student needs to do is select the input file from his/her folder directory in Arkgene, specify the running mode (genome, transcriptome or protein) and key in the output name, before clicking the “Go” button to start the BUSCO run. It took less than 3 minutes to run a BUSCO assessment for a 5 Mb genome!
At the end of the job, the student is provided with a few text files containing the complete list of BUSCOs (complete, fragmented and missing) and a summary report. Most importantly, a publication-ready image of the BUSCO summary is generated for use in his/her research paper.
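For users who want to collect the scores from several genomes into one table, the summary report can be read programmatically. The sketch below assumes the report contains a one-line score string in the compact notation BUSCO commonly uses (C: complete, S: single-copy, D: duplicated, F: fragmented, M: missing, n: number of core genes searched). The function name and the sample line are made up for illustration, and the exact layout of the report produced on Arkgene may differ.

```python
import re

# Compact score notation, e.g. "C:98.0%[S:97.5%,D:0.5%],F:1.0%,M:1.0%,n:200".
# Assumption: the summary report contains one line in this shape.
SCORE_PATTERN = re.compile(
    r"C:(?P<complete>[\d.]+)%"
    r"\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,"
    r"M:(?P<missing>[\d.]+)%,"
    r"n:(?P<total>\d+)"
)


def parse_busco_scores(line):
    """Extract percentages and the core gene count from a score line."""
    match = SCORE_PATTERN.search(line)
    if match is None:
        raise ValueError("no BUSCO score string found in: %r" % line)
    scores = {key: float(value) for key, value in match.groupdict().items()}
    scores["total"] = int(match.group("total"))  # a count, not a percentage
    return scores


# Made-up sample line, not a real result.
example_line = "C:98.0%[S:97.5%,D:0.5%],F:1.0%,M:1.0%,n:200"
scores = parse_busco_scores(example_line)
print(scores["complete"], scores["fragmented"], scores["missing"], scores["total"])
```

Running the parser over the summary reports of all five genomes in the example would give one row of scores per genome, ready to be pasted into a supplementary table.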