BioHealthBase.org

GBrowse Genome Browser

GBrowse Genome Browser Introduction

GBrowse is among the first web-based genome browsers to be widely used outside the project for which it was developed. It was originally developed for WormBase (www.wormbase.org) and was released to a wider audience as a standalone project in January 2002 (1). Since then, the developers have released updates steadily, with user-interface improvements in the newer versions. Among the notable improvements is support for next-generation sequencing (NGS) data, which was integrated into GBrowse version 2.0, released in January 2010. Its popularity rests on its support for both DNA-seq and RNA-seq NGS alignments and on its ability to display data at multiple resolutions, ranging from a histogram of an entire chromosome down to individual base pairs.

In GBrowse, NGS data can be uploaded directly to the browser or attached through a linked URL. Uploaded and linked sequencing data can be made public or shared selectively with collaborators. Version 2.0 also marked a significant shift in data sharing, particularly through a new client-side architecture that improved browser interactivity and performance. Because the proliferation of genome sequencing has produced volumes of data that would overwhelm any individual researcher, GBrowse makes it possible to share sequence data with a wider audience (2). The baseline requirement is that researchers make the bare DNA sequence, its properties, and its associated annotations accessible through bioinformatics tools such as genome browsers.

GBrowse vs. other genome browsers

GBrowse represents a newer class of genome browser, since its raison d’être is to serve as a user interface to third-party genome databases (3). Initially, genome browsers were tied to species-specific databases, each with its own parochial software infrastructure, such as WormBase, FlyBase and the Saccharomyces Genome Database (SGD), as well as to private companies like Celera. Changes in the code bases used by the model organism community then gave rise to a software standardization movement, the Generic Model Organism Database project (GMOD) (4). GMOD has been highly successful: GBrowse is now used by hundreds of sites, including the aforementioned model organism databases. Currently, GBrowse is the principal data browser for the model organism ENCODE (modENCODE) data coordinating center and is also used by the International HapMap Consortium.

The GBrowse user interface features three horizontal display panels: the chromosome overview, the region view and the detailed view. Data are organized into tracks, and individual features are drawn as glyphs within tracks, so that they can be displayed in all panels at different zoom levels. GBrowse’s interface offers a rich set of core functions, such as vertical re-ordering of tracks, in-line track configuration, popup balloon tooltips, and a convenient track-sharing function invoked by clicking a button. Like Ensembl, GBrowse supports mouse-based navigation through rubber-band selection of sequence regions in all three panels.

GBrowse configuration

GBrowse offers an array of features that are not available in other core genome browsers. The UCSC Genome Browser programmers developed the “wiggle” track to cope with dense quantitative data such as microarray measurements. It combines a relational database with data serialization to render quantitative data more efficiently, making such data tractable for real-time and collaborative genome browsing. GBrowse has adopted the same approach (5). GBrowse also supports custom tracks and the upload of third-party annotations in a manner similar to the UCSC Genome Browser. It can act as both a DAS server and a DAS client, and it supports additional DAS-like protocols that allow more control over configuration and graphical rendering, especially for sequence features more complex than those DAS supports.
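To make the idea of a dense quantitative track concrete, the following is a minimal sketch of reading the plain-text fixedStep layout of UCSC's wiggle format, in which a declaration line gives the chromosome, start and step, and each following line holds one value. The function name and the sample data are illustrative assumptions; this is not the code path GBrowse or UCSC use internally, and variableStep input is not handled.

```python
def parse_fixed_step(lines):
    """Parse fixedStep wiggle text into (chrom, start, end, value) tuples.

    Coordinates are 1-based and inclusive, as in the wiggle text format.
    """
    records = []
    chrom, pos, step, span = None, None, 1, 1
    for raw in lines:
        line = raw.strip()
        # Skip blanks, comments, and the optional track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith("fixedStep"):
            # Declaration line: key=value pairs after the keyword.
            fields = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = fields["chrom"]
            pos = int(fields["start"])
            step = int(fields.get("step", 1))
            span = int(fields.get("span", 1))
            continue
        if chrom is None:
            raise ValueError("data line before fixedStep declaration")
        records.append((chrom, pos, pos + span - 1, float(line)))
        pos += step
    return records


# Made-up sample data for illustration only.
example = """track type=wiggle_0 name="demo"
fixedStep chrom=chrI start=101 step=10 span=5
1.5
2.0
0.25
""".splitlines()

for record in parse_fixed_step(example):
    print(record)
```

Each value is expanded into an explicit interval, which is the form a browser needs before it can bin the data for a given zoom level.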

Tracks can be shared with the click of a button, which produces a popup balloon containing a URL; when that URL is passed to another GBrowse instance, the track can be viewed in that separate browser. Apart from an interactive and versatile user interface, GBrowse has additional advantages over other browsers. First, GBrowse is fully decoupled from the underlying data sources, which makes it easier to add new data sources. Second, GBrowse is easy to install and uses simple configuration files. A typical installation can be done within a few hours, while full configuration and optimization can be completed in a day or two. GBrowse can be customized through hooks that are accessible via the configuration files. Currently, over 70 glyphs are available for displaying graphical data within the genome browser, and the style and color of a track’s glyphs can be controlled dynamically by inserting Perl callbacks into the configuration file. For example, a callback can change the background color within a single track according to feature type, zoom level, or other context.
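The callback mechanism can be illustrated with a short sketch. In GBrowse itself the callbacks are Perl subroutines placed in the configuration file; the Python below only mirrors the idea, and every name in it (`Feature`, `track_style`, `style_for`, the option keys and color values) is a hypothetical stand-in rather than part of GBrowse.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    kind: str   # e.g. "gene" or "pseudogene"
    start: int  # 1-based, inclusive
    end: int


def resolve_option(option, feature):
    """Return a fixed option value, or call it with the feature if it is a callback."""
    return option(feature) if callable(option) else option


# Hypothetical track definition: plain values are static, callables are
# evaluated once per feature at drawing time.
track_style = {
    "glyph": "box",
    "bgcolor": lambda f: "gray" if f.kind == "pseudogene" else "blue",
    "height": lambda f: 10 if f.end - f.start + 1 >= 1000 else 6,
}


def style_for(feature, style=track_style):
    """Compute the concrete drawing options for one feature."""
    return {name: resolve_option(value, feature) for name, value in style.items()}


print(style_for(Feature("pseudogene", 1, 500)))
print(style_for(Feature("gene", 1, 1000)))
```

Keeping static values and callbacks in the same table is what lets one track mix fixed settings with per-feature decisions.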

Installation and use

GBrowse is a Linux-based browser and will run on any recent Linux distribution and hardware. At least 4 GB of RAM and 200 GB of free disk space are recommended for proper viewing of large BAM files and gene annotation databases (6). It can be installed from source code or from binaries using the ‘apt’ and ‘rpm’ package managers. Although many installation guides exist, the most hassle-free method is to run GBrowse in one of the prebuilt virtual machines. This provides full functionality and performance without modifying your existing system. You can either download the virtual machine to your local desktop or laptop or run it on Amazon’s EC2 cloud. Running it locally requires the VirtualBox virtualization software, which is free and runs on Windows, Linux and Macintosh computers.

GBrowse is intended for groups that want to display and share genome annotations in an environment that gives access to different formats without preinstalling desktop software. It is therefore suitable for public websites as well as for the websites of small-to-medium collaborations among geographically separate groups (7). It is well suited to collaborative environments in which some annotation tracks are public while others are private or restricted to particular individuals or groups. Such collaboration is possible because GBrowse provides a high-level security model and integrates with a variety of popular enterprise authentication systems.

Types of genome databases

GBrowse provides users with access to genome assemblies obtained from the NCBI and external sequencing centers at its server location. The annotations contain reference sequences and working draft assemblies for a large collection of genomes, and the browser also provides portals to ENCODE data (2003 to 2012) and to the Neanderthal project. GBrowse’s annotations are drawn from several sources, including the NCBI’s RefSeq and the Encyclopedia of DNA Elements (8). The annotation pipeline is responsible for more than half of the tracks in the browser. For instance, the GBrowse Genes set, an improved version of GBrowse Known Genes, is built on the parent pipeline. Most of the data in GBrowse consists of high-quality, moderately conservative gene annotations based on data drawn from RefSeq, GenBank and UniProt. The comparative tracks are another example of UCSC annotations available in GBrowse.

The comparative tracks are designed to display a summary of sequence conservation among species, and they include chain and net tracks that highlight alignments and rearrangements. The GBrowse annotation for the human genome sequence features conservation data for a 28-way vertebrate genome alignment (9). Currently, GBrowse supports over 56 species, with the human genome browser hosting over 178 tracks. It features a simple user interface. The typical entry point is a home page that displays details of each species, provides background information about the genome build, and allows searching by either keyword or position.

Data storage and management

The browser design features a single window, in which detailed annotations are displayed in one panel and the navigation and search functions occupy other parts of the page. Depending on the genome under review, a chromosome overview of the displayed region is also shown at the top. As in most genome browsers, the sequence is oriented horizontally in GBrowse, and the view can be zoomed in or out and panned right or left. Genes or specific chromosome positions can be found using the search box and navigation buttons at the top of the page. Data are organized into linear, horizontal tracks containing glyphs that represent individual sequence features. There is usually a large number of tracks, but this is offset by grouping the tracks into logical categories, each of which can be opened or collapsed by the user.
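The grouping of tracks into collapsible categories can be sketched as a small data structure. The track and category names below are invented for illustration, and the functions are a simplified model of the behaviour described above, not GBrowse's implementation.

```python
from collections import defaultdict

# Hypothetical (track name, category) pairs in display order.
TRACKS = [
    ("Gene models", "Genes"),
    ("Pseudogenes", "Genes"),
    ("ESTs", "Expression"),
    ("RNA-seq coverage", "Expression"),
    ("SNPs", "Variation"),
]


def group_tracks(tracks):
    """Group track names by category, preserving the original order."""
    groups = defaultdict(list)
    for name, category in tracks:
        groups[category].append(name)
    return dict(groups)


def visible_tracks(groups, collapsed):
    """List the tracks a user sees once the collapsed categories are hidden."""
    return [
        name
        for category, names in groups.items()
        if category not in collapsed
        for name in names
    ]


groups = group_tracks(TRACKS)
print(visible_tracks(groups, collapsed={"Expression"}))
```

Collapsing a category hides all of its tracks at once, which is how a long track list stays manageable.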

Additionally, the browser supports custom tracks via upload of third-party annotation files or through the table browser. Outbound data sharing via the Distributed Annotation Service (DAS) protocol is also supported. Once users have chosen a combination of tracks and selected the display options, they can save the session to a user account, where it can be kept private or shared with colleagues. GBrowse provides access to data in two ways: bulk data can be downloaded from the FTP (File Transfer Protocol) site, which can be accessed via web browsers or dedicated FTP client software (10), while custom track generation and selective data downloads are supported by the table browser interface.

Data search and retrieval

End users access GBrowse via a web page and select the feature types they wish to view from a list of checkboxes. A single command then fetches the genome information that matches the user’s preferences. To find regions of interest, GBrowse has a versatile search function that accepts chromosome names as well as associated nomenclature. Users can search by chromosome name and the corresponding organism as displayed on the front page, and the search recognizes the IDs of landmarks known to the system. In addition, it accepts a contig name, a clone accession number, a GenBank accession number, a gene symbol, a genetic marker name, a SNP ID, or indeed any other unique feature name known to the database (11). It is also possible to customize the display in GBrowse, add third-party features, and access the BioPerl and Bio::Graphics software libraries, as well as the Bio::DB::GFF database, which are integrated in the underlying layers of GBrowse.
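How a search box might tell a coordinate range from a feature name can be shown with a small parser. This is a sketch under the assumption that ranges are written as `landmark:start..end` (a hyphen is also accepted here); the function name and return shape are illustrative choices, not GBrowse's own API.

```python
import re

# landmark:start..end or landmark:start-end, with optional thousands separators
REGION = re.compile(r"([^:\s]+):(\d[\d,]*)(?:\.\.|-)(\d[\d,]*)")


def parse_landmark(query):
    """Split a search string into (name, start, end).

    start and end are None when the query is a bare name (a gene symbol,
    accession number, marker name and so on) that must be looked up in
    the database instead of being read as coordinates.
    """
    query = query.strip()
    match = REGION.fullmatch(query)
    if match is None:
        return (query, None, None)
    name, start, end = match.groups()
    start = int(start.replace(",", ""))
    end = int(end.replace(",", ""))
    # Accept reversed ranges by normalising to start <= end.
    if start > end:
        start, end = end, start
    return (name, start, end)


print(parse_landmark("chrI:1,000..2,000"))
print(parse_landmark("unc-9"))
```

A name such as `unc-9` contains a hyphen but no colon, so it falls through to the name lookup rather than being mistaken for a range.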

Scope of use

Alongside genetic and physical map rendering tools, genome browsers run on web sites and are among the common applications used by bioinformatics research communities. Various genome browsers can be downloaded from online sources, provided that the user’s machine meets the minimum requirements specified for the browser. Some genome browser databases have been redesigned several times with a focus on providing the most accurate data to the research community (12).

GBrowse is one of a class of genome browsers that work with different genome databases, a design motivated by the proliferation of different data models and interfaces. The GBrowse design emphasizes portability while remaining extensible at several modular levels. It does not rely on proprietary software but on open-source software such as MySQL and the BioPerl libraries (13). GBrowse is written in Perl, a language familiar to the beginning and intermediate programmers who are likely to be involved in setting up such systems. The language also makes database backends easy to customize.

Conclusion

As discussed, GBrowse can be integrated into a more comprehensive model organism database project along two complementary paths. The first path is to use it as a completely stand-alone product. Within this scheme, the database used by the GBrowse back end is kept separate from the project’s own database. Incoming links then use a standard URL calling convention to identify the target region and the series of feature types to be displayed. GBrowse is currently the primary genome browser for WormBase and FlyBase, displacing previously used browsers. Judging by its current adoption, GBrowse is set to replace conventional genome browsers, considering that institutions such as Texas A&M University (E. coli), Ingenium AG (mouse), Bristol-Myers Squibb (D. melanogaster), and the Medical College of Wisconsin (R. rattus) use it as the primary browser in their respective research projects (14).
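An incoming link of this kind can be assembled with the standard library. The sketch below assumes, for illustration only, that the region is passed in a `name` parameter and the feature types in a `type` parameter; the base URL is a placeholder, and the parameter names and separators should be checked against the documentation of the GBrowse version in use.

```python
from urllib.parse import urlencode


def build_link(base_url, landmark, start, end, feature_types):
    """Build a link that asks a browser instance to show a region and some tracks."""
    params = {
        "name": f"{landmark}:{start}..{end}",  # assumed parameter name
        "type": " ".join(feature_types),       # assumed parameter name
    }
    # Keep ':' readable in the region; spaces are encoded as '+'.
    return f"{base_url}?{urlencode(params, safe=':')}"


# Placeholder host and data source, not a real installation.
link = build_link("http://example.org/gbrowse/worm", "chrI", 1000, 2000, ["Genes", "ESTs"])
print(link)
```

Because the link carries only a region and a list of feature types, the calling site needs no knowledge of the database behind the browser.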

 

References

1. Blankenberg D, Coraor N, Von Kuster G, et al. Integrating diverse databases into an unified analysis framework: A Galaxy approach. The journal of biological databases and curation. 2011; 12(01).
2. Lewis S, Searle S, Harris N, Gibson M, Iyer V, Richter J, et al. Apollo: a sequence annotation editor. Genome Biol. 2002; 3(12).
3. Torarinsson E. Genome browsers. Methods in molecular biology. 2008; 703(14).
4. Dolan ME, Holden CC, Beard MK, Bult CJ. Genomes as geography: using GIS technology to build interactive genome feature maps. BioMed Central. 2009; 12(4).
5. Hermida L, Philippsen P. The Ashbya Genome Database (AGD) : a tool for the yeast community and genome biologists. Nucleic acids research. 2005; 33(19).
6. Arnaoudova EG, Bowens PJ, Chui RG, Dinkins RD, Hesse U, Jaromczyk JW, et al. Visualizing and sharing results in bioinformatics projects: GBrowse and GenBank exports. BioMed Central. 2009; 06(25).
7. McKay S, Vergara I, Stajich J. Using the Generic Synteny Browser (GBrowse_syn). Current protocols in bioinformatics. 2010; 9(12).
8. Laundy GJ, Bidwell JL. Mouse cytokine gene nucleotide sequence alignments, 2000. Part I. European Journal of Immunogenetics. 2000; 27(4).
9. Bairoch A, Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic acids research. 2000; 28(1).
10. Harris N. Genotator: a workbench for sequence annotation. Genome research. 1997; 7(7).
11. Camon E, Barrell D, Brooksbank C, Magrane M, Apweiler R. The Gene Ontology Annotation (GOA) project — application of GO in SWISS-PROT, TrEMBL and InterPro. Comparative and Functional Genomics. 2003; 4(1).
12. Fortna A, Gardiner K. Genomic sequence analysis tools: a user’s guide. Trends in Genetics. 2001; 17(3).
13. Dalpé G, Joly Y. Opportunities and Challenges Provided by Cloud Repositories for Bioinformatics-Enabled Drug Discovery. Drug Development Research. 2014; 75(6).
14. Stefan Stefanov JLaBG, Gold B. An Analysis Pipeline for Genome-wide Association Studies. Cancer Informatics. 2008; 1(6).