Bioinformatics Unit 2: Exercise


Unit 2: Genomics   Exercise 2


Sequence Assembly, Feature Mapping, and Gene Finding


Objectives:

1. Examine a sequence assembly program (CAP3).
2. Explore open source software for bioinformatics (EMBOSS).
3. Learn how to find a contig or BAC clone by sequence query using Ensembl.
4. Find genes in a contig using GrailExp.

Introduction:

This activity continues our look at the process of sequencing a whole genome, using the human genome project as a model. We examine some software packages and database interfaces used in various phases of sequence assembly and annotation.

We start with examination of CAP3, an open source DNA sequence assembly program. The program is described in Huang, X. and Madan, A. (1999) CAP3: A DNA Sequence Assembly Program, Genome Research, 9: 868-877. CAP3 is part of a package of programs for sequence annotation called the Analysis and Annotation Tool maintained by the authors at Michigan Tech.

Once sequence assembly is complete to the draft stage, annotation of the assemblies begins. The annotation stage is described in the section entitled "Annotating the Assemblies" in the assigned article "Introduction: putting it together."

The European Molecular Biology Open Software Suite (EMBOSS), an open source (non-commercial) package of programs, provides free software for genome annotation. EMBOSS is described in the paper by Rice P, Longden I, and Bleasby A. (2000) EMBOSS: the European Molecular Biology Open Software Suite, Trends Genet. 16(6): 276-277.

GrailExp (part of the Grail Experimental Gene Discovery Suite) is a software package developed specifically for gene finding using pattern recognition and expressed sequence tags (ESTs). The original paper, "Automated Gene Identification in Large-Scale Genomic Sequences" (Xu, Y. and Uberbacher, E.C., Journal of Computational Biology, Volume 4, Number 3, 1997), describes Grail as an algorithm for inferring gene structures from predicted exon candidates, based on ESTs and biological intuition and rules. The algorithm first finds all the related ESTs in the EST database (dbEST) for each predicted exon, then infers the boundaries of one or a series of genes from the available EST information and biological rules. Next, within each pair of gene boundaries, it constructs the gene models that are most consistent with the EST information. By exploiting EST information and biological rules, the algorithm can (1) model complicated multiple-gene structures, including embedded genes, (2) identify falsely predicted exons and locate missed exons, and (3) make more accurate exon boundary predictions.
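The idea of using shared ESTs to decide which predicted exons belong to the same gene can be illustrated with a small Python sketch. This is only a toy illustration under simplifying assumptions, not the GrailExp implementation: the exon names, the EST identifiers, and the rule "adjacent exons that share at least one EST belong to the same gene" are all made up for the example.

```python
from typing import Dict, List, Set


def group_exons_by_est(exon_ests: Dict[str, Set[str]]) -> List[List[str]]:
    """Group predicted exons (listed in genomic order) into candidate genes.

    Toy rule (an assumption for this sketch): an exon joins the current
    gene when it shares at least one EST with the exon just before it;
    otherwise it starts a new candidate gene.
    """
    genes: List[List[str]] = []
    previous_ests: Set[str] = set()
    for exon, ests in exon_ests.items():
        if genes and ests & previous_ests:
            genes[-1].append(exon)   # shared EST: same gene
        else:
            genes.append([exon])     # no shared EST: gene boundary
        previous_ests = ests
    return genes


# Hypothetical exon predictions and the ESTs that match each one.
exon_ests = {
    "exon1": {"EST_A"},
    "exon2": {"EST_A", "EST_B"},
    "exon3": {"EST_B"},
    "exon4": {"EST_C"},
    "exon5": {"EST_C"},
}

for number, gene in enumerate(group_exons_by_est(exon_ests), start=1):
    print(f"candidate gene {number}: {gene}")
```

In this example the first three exons are linked by EST_A and EST_B, so they form one candidate gene, while the last two form a second. The published algorithm also applies biological rules and corrects exon boundaries, which this sketch does not attempt.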

There are three major portals for genome sequence data: Ensembl (http://www.ensembl.org), the UCSC Genome Browser (http://genome.ucsc.edu) and NCBI (http://www.ncbi.nlm.nih.gov). Each portal has its own user interface. In this activity we use the Ensembl data portal.

Annotation is an ongoing process within the domain of bioinformatics. For a brief overview of other processes that begin with annotation, see "What happens when the human genome sequence is completed?"

There are summary questions at the end of this section. Read them through before you start the activities in this exercise. You can answer them as you go, or answer them afterward. Points = 5. Due at the end of class on September 30, 2003.

Preparation:

Review the assigned article "Introduction: putting it together," in A. Baxevanis and F. Collins (eds) (2002) The Genome User's Guide, Nature Genetics Supplement, September issue, pp. 5-8.

Activity:

Part A: Sequence assembly with CAP3.

1. CAP3 is part of the Sequence Analysis Server at Michigan Tech University (http://genome.cs.mtu.edu/). There is also an Analysis and Annotation Tool at this site (http://genome.cs.mtu.edu/aat/aat.html). A page describing the algorithm used by CAP3 may be found at http://genome.cs.mtu.edu/cap/cap3.html. For an example of the CAP3 input file format, see http://genome.cs.mtu.edu/cap/data/seq. The output file format is shown at http://genome.cs.mtu.edu/cap/data/cap3.out. A finished contig looks like the file at http://genome.cs.mtu.edu/cap/data/seq.cap.contigs.

2. Try submitting the example input file to the application service provider for CAP3 at http://deepc2.zool.iastate.edu/aat/cap/cap.html.
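To get a feel for what an assembler does with overlapping reads, here is a minimal Python sketch that finds the longest exact overlap between the end of one read and the start of another and merges them into a single contig. It is a simplified illustration only: the reads are invented, and CAP3 itself is more sophisticated (see the algorithm page linked in step 1). The function names and the `min_len` parameter are assumptions made for this example.

```python
def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `left` that exactly
    matches a prefix of `right`, or 0 if it is shorter than `min_len`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_len: int = 3) -> str:
    """Merge two reads into one contig if they overlap; return '' if not."""
    size = longest_overlap(left, right, min_len)
    if size == 0:
        return ""
    # Keep all of `left`, then add the part of `right` past the overlap.
    return left + right[size:]


# Two invented reads that share the six bases CGTTAG.
read_a = "ATGCGTACGTTAG"
read_b = "CGTTAGGCTAAC"

print("overlap length:", longest_overlap(read_a, read_b))
print("contig:", merge_reads(read_a, read_b))
```

Real reads contain sequencing errors, so a practical assembler has to tolerate mismatches in the overlap and decide which base to report at each position of the consensus. Keep this in mind when you look at the CAP3 output for Summary Question 2.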

 

Part B: Annotation using tools in EMBOSS.

1. The EMBOSS Homepage is at http://www.hgmp.mrc.ac.uk/Software/EMBOSS/index.html. A graphical user interface (GUI) is located at http://bioinfo.pbi.nrc.ca:8090/EMBOSS/.

2. Check out the list of programs available through this interface.

 

Part C: Find a contig through Ensembl and scan for genes with GrailExp.

1. The text file musA5_read.txt contains a short sequence "read" with vector DNA trimmed off. Use this sequence as a query to search for a contig or BAC in the mouse genome through Ensembl at http://www.ensembl.org. Hint: This sequence should give two matches: a poor match to a region of mouse chromosome 10, and a perfect match to a region of mouse chromosome 11.

2. Go to the GrailExp (Grail Experimental Gene Discovery Suite) homepage at http://compbio.ornl.gov/grailexp/. The Genome Analysis Toolkit incorporates a wide variety of analysis tools: exon and gene prediction tools, other kinds of feature recognition systems, and database homology search systems. The exon and gene recognition systems include Grail, GrailExp, and Genscan, along with the microbial gene prediction systems Generation and Glimmer. The Grail suite of tools, consisting of CpG island, polyA site, simple and complex repeat, and BAC end analysis tools, has also been incorporated. Also included are the NCBI STS E-PCR, RepeatMasker, and tRNAscan-SE systems. Database homology systems include NCBI BLAST and Beauty post-processing. Supported organisms include human, mouse, Arabidopsis, Drosophila, and most sequenced microbial organisms. Submit the BAC or contig you identified above for analysis by GrailExp.
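Web search forms can fail in confusing ways if a pasted sequence contains stray characters. Before submitting the read from step 1, you may want to check it with a short script such as the Python sketch below. The function name and the allowed alphabet (A, C, G, T, and N) are assumptions for this example, and the sample text is invented rather than taken from musA5_read.txt.

```python
def clean_sequence(text: str) -> str:
    """Strip FASTA header lines and whitespace, convert to uppercase,
    and confirm that only A, C, G, T, or N remain."""
    lines = [line.strip() for line in text.splitlines()]
    bases = "".join(line for line in lines if line and not line.startswith(">"))
    bases = bases.upper()
    unexpected = set(bases) - set("ACGTN")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return bases


# Invented sample text; to check the real read you could instead use
# open("musA5_read.txt").read(), assuming the file is in your working folder.
example = ">hypothetical_read\nacgtnacg\nTTGA\n"
sequence = clean_sequence(example)
print(len(sequence), "bases:", sequence)
```

If the function raises an error, look for leftover numbers, punctuation, or non-DNA characters in the file before pasting the sequence into the search form.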

 

Summary Questions:

Please limit your answers to the whole set of questions to a maximum of three double-spaced pages (12 pt font). This length should be sufficient for your comments and any appropriate copied and pasted examples. You need not retype or copy and paste the questions as part of your responses.

1. What are NCBI reference sequences and why are these designations necessary?

2. What output did you get from your submission to the CAP3 sequence assembler? Briefly explain how the consensus sequence was derived.

3. What gene-finding software is available in the EMBOSS suite?

4. What is the accession number of the contig containing the musA5 sequence? Briefly explain how you tracked this down, or where you got stuck.

5. Did Grail predict any genes in the contig or BAC sequence you submitted? If so, briefly describe the result. If not, why not?

Updated 09/25/2003 by thatcher@sonoma.edu; bchapman@classroomtools.com