Bptutorial.pl

From BioPerl

Jump to: navigation, search

Contents

NAME

Bioperl Tutorial - a tutorial for Bioperl

VERSION

1.5.1

AUTHOR

Written by Peter Schattner <schattner at alum.mit.edu>

Added to BioPerl Wiki by Jay Hannah (Pod::Simple::Wiki, format mediawiki)

Copyright Peter Schattner

Contributions, additions and corrections have been made to this document by the following individuals:

DESCRIPTION

This tutorial is an overview of Bioperl, it includes snippets of code and text from various Bioperl documents including module documentation, example scripts and "t" test scripts.

Introduction

Overview

Bioperl is a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications. As such, it does not include ready to use programs in the sense that many commercial packages and free web-based interfaces do (e.g. Entrez, SRS). On the other hand, Bioperl does provide reusable Perl modules that facilitate writing Perl scripts for sequence manipulation, accessing of databases using a range of data formats and execution and parsing of the results of various molecular biology programs including Blast, clustalw, TCoffee, genscan, ESTscan and HMMER. Consequently, bioperl enables developing scripts that can analyze large quantities of sequence data in ways that are typically difficult or impossible with web based systems.

In order to take advantage of bioperl, the user needs a basic understanding of the Perl programming language including an understanding of how to use Perl references, modules, objects and methods. If these concepts are unfamiliar the user is referred to any of the various introductory or intermediate books on perl. We've liked S. Holzmer's Perl Core Language, Coriolis Technology Press, for example. This tutorial is not intended to teach the fundamentals of perl to those with little or no experience in the Perl language. On the other hand, advanced knowledge of Perl - such as how to write a object-oriented Perl module - is not required for successfully using bioperl.

Bioperl is open source software that is still under active development. The advantages of open source software are well known. They include the ability to freely examine and modify source code and exemption from software licensing fees. However, since open source software is typically developed by a large number of volunteer programmers, the resulting code is often not as clearly organized and its user interface not as standardized as in many commercial products. In addition, in any project under active development, documentation may not keep up with the development of new features. Consequently the learning curve for actively developed, open source source software is sometimes steep.

This tutorial is intended to ease the learning curve for new users of bioperl. To that end the tutorial includes:

Descriptions 
What bioinformatics tasks can be handled with bioperl
Directions 
Where to find the methods to accomplish these tasks within the Bioperl package
Recommendations 
Where to go for additional information.

Quick getting started scripts

For newcomers and people who want to quickly evaluate whether this package is worth using in the first place, we have a very simple module which allows easy access to a small number of Bioperl's functionality in an easy to use manner. The Bio::Perl module provides some simple access functions. For example, this script will retrieve a swissprot sequence and write it out in fasta format

use Bio::Perl;
 
# this script will only work if you have an internet connection on the
# computer you're using, the databases you can get sequences from 
# are 'swiss', 'genbank', 'genpept', 'embl', and 'refseq'
 
$seq_object = get_sequence('swiss',"ROA1_HUMAN");
 
write_sequence(">roa1.fasta",'fasta',$seq_object);

That second argument to write_sequence, 'fasta', is the sequence format. You can choose among all the formats supported by SeqIO (see the SeqIO HOWTO for more details).

Another example is the ability to blast a sequence using the facilities as NCBI. Please be careful not to abuse the resources that NCBI provides and use this only for individual searches. If you want to do a large number of BLAST searches, please download the blast package and install it locally.

use Bio::Perl;
 
$seq = get_sequence('swiss',"ROA1_HUMAN");
 
# uses the default database - nr in this case
$blast_result = blast_sequence($seq);
 
write_blast(">roa1.blast",$blast_result);

Bio::Perl has a number of other easy-to-use functions, including

get_sequence        - gets a sequence from standard, internet accessible
                             databases
read_sequence       - reads a sequence from a file
read_all_sequences  - reads all sequences from a file 
new_sequence        - makes a Bioperl sequence just from a string
write_sequence      - writes a single or an array of sequence to a file
translate           - provides a translation of a sequence
translate_as_string - provides a translation of a sequence, returning back 
                      just the sequence as a string
blast_sequence      - BLASTs a sequence against standard databases at  
                      NCBI
write_blast         - writes a blast report out to a file

Using the Bio::Perl module, it is possible to manipulate sequence data in Bioperl without explicitly creating the Seq or SeqIO objects described later in this tutorial. However, only limited data manipulation is supported in this mode.

Look at the documentation in Bio::Perl by going 'perldoc Bio::Perl' to learn more about these functions. In all these cases, Bio::Perl accesses a subset of the underlying Bioperl functions (for example, translation in Bioperl can handle many different translation tables and provides different options for stop codon processing) - in most cases, most users will migrate to using the underlying Bioperl objects as their sophistication level increases, but Bio::Perl provides an easy on-ramp for newcomers and lazy experts alike. Also see examples/bioperl.pl for more examples of usage of this module.

Software requirements

What's required to run bioperl.

Minimal Bioperl installation (Bioperl "core" installation)

For a minimal installation of bioperl, you will need to have perl itself installed as well as the Bioperl "core modules". Bioperl has been tested primarily using Perl 5.6.1 and 5.8. The minimal Bioperl installation should still work under Perl 5.6.1.

In addition to a current version of perl, the new user of Bioperl is encouraged to have access to, and familiarity with, an interactive perl debugger. Bioperl is a large collection of complex interacting software objects. Stepping through a script with an interactive debugger is a very helpful way of seeing what is happening in such a complex software system - especially when the software is not behaving in the way that you expect. The free graphical debugger ptkdb is highly recommended - it's available as Devel::ptkdb from CPAN. The standard Perl distribution also contains a powerful interactive debugger with a command-line interface (use it like "perl -d <script>").

The Perl tool Data::Dumper used with the syntax:

use Data::Dumper;
print Dumper($seqobj);

can also be helpful for obtaining debugging information on Bioperl objects.

Complete installation

Some of the capabilities of Bioperl require software beyond that of the minimal installation. This additional software includes perl modules from CPAN, package-libraries from bioperl's auxiliary code-repositories, a Bioperl xs-extension, and several standard compiled bioinformatics programs.

Perl - extensions

For a complete listing of external Perl modules required by bioperl please see the INSTALL file in the Bioperl package (or http://bioperl.org/Core/Latest/INSTALL ).

Bioperl auxiliary repositories

Some features of Bioperl that require modules from bioperl's auxiliary code repositories. See section IV and references therein for further installation instructions for these modules.

Bioperl C extensions & external bioinformatics programs

Bioperl also uses several C programs for sequence alignment and local blast searching. To use these features of Bioperl you will need an ANSI C or Gnu C compiler as well as the actual program available from sources such as:

for Smith-Waterman alignments- the bioperl-ext package

for clustalw alignments- ftp://ftp.ebi.ac.uk/pub/software/unix/clustalw/ ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW/

for tcoffee alignments- http://igs-server.cnrs-mrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.html

for local blast searching- ftp://ftp.ncbi.nih.gov/blast/executables/release/

for EMBOSS applications - http://emboss.sourceforge.net/

Installation

For details see Bioperl's INSTALL file ( http://bioperl.org/Core/Latest/INSTALL ).

For the external programs (clustal, Tcoffee, ncbi-blast), there is an extra step:

Environmental variables 
Set the relevant environmental variable (CLUSTALDIR, TCOFFEEDIR or BLASTDIR) to the directory holding the executable in your startup file - e.g. in .bashrc or .tcshrc. For running local blasts, it is also necessary that the name of local-blast database directory is known to bioperl. This will typically happen automatically, but in case of difficulty, refer to the documentation in Bio::Tools::Run::StandAloneBlast.

The only likely complication (at least on unix systems) that may occur is if you are unable to obtain system level writing privileges. For instructions on modifying the installation in this case and for more details on the overall installation procedure, see the INSTALL file (or http://bioperl.org/Core/Latest/INSTALL ) in the Bioperl distribution as well as the README files for the external programs you want to use.

Additional comments for non-Unix users

Bioperl has mainly been developed and tested under various unix environments, including Linux and MacOS X. In addition, this tutorial has been written largely from a Unix perspective.

The Bioperl core has also been tested and should work under most versions of Microsoft Windows. For many windows users the Perl and bioperl distributions from Active State, at http://www.activestate.com has been quite helpful. Other windows users have had success running bioperl under Cygwin (http://www.cygwin.com). See the package's INSTALL.WIN file for more details (or http://bioperl.org/Core/Latest/INSTALL.WIN).

Many Bioperl features require the use of CPAN modules, compiled extensions or external programs. These features probably will not work under some or all of these other operating systems. If a script attempts to access these features from a non-unix OS, bioperl is designed to simply report that the desired capability is not available. However, since the testing of Bioperl in these environments has been limited, the script may well crash in a less graceful manner.

Places to look for additional documentation

This tutorial does not intend to be a comprehensive description of all the objects and methods available in bioperl. For that the reader is directed to the documentation included with each of the modules. A very useful interface for finding one's way within all the module documentation can be found at http://doc.bioperl.org/bioperl-live/. This interface lists all Bioperl modules and descriptions of all of their methods. In addition, beginner questions can often be answered by looking at the FAQ, INSTALL and README files ( http://bioperl.org/Core/Latest/faq.html and http://bioperl.org/Core/Latest/INSTALL and http://bioperl.org/Core/Latest/README ) in the top-level directory of the Bioperl distribution.

One potential problem in locating the correct documentation is that multiple methods in different modules may all share the same name. Moreover, because of perl's complex method of inheritance it is not often clear which of the identically named methods is being called by a given object. One way to resolve this question is by using the software described in Appendix "V.1".

For those who prefer more visual descriptions, http://bioperl.org/Core/Latest/modules.html also offers links to PDF files which contain class diagrams that describe how many of the bioperl objects related to one another (Version 1.0 Class Diagrams).

In addition, a Bioperl online course is available on the web at http://www.pasteur.fr/recherche/unites/sis/formation/bioperl. The user is also referred to numerous Bioperl scripts in the scripts/ and examples/ directories (see bioscripts.pod for a description of these scripts, or http://bioperl.org/Core/Latest/bioscripts.html).

Another source of focussed documentation is the HOWTO files, found at http://bioperl.open-bio.org/wiki/HOWTOs. Current topics include OBDA Access, SeqIO, SearchIO, Graphics, Features and Annotations, Trees, PAML, and Biopipe.

Bioperl objects

The purpose of this tutorial is to get you using Bioperl to solve real-life bioinformatics problems as quickly as possible. The aim is not to explain the structure of Bioperl objects or perl object-oriented programming in general. Indeed, the relationships among the Bioperl objects is not simple; however, understanding them in detail is fortunately not necessary for successfully using the package.

Nevertheless, a little familiarity with the Bioperl object bestiary can be very helpful even to the casual user of bioperl. For example there are (at least) eight different "sequence objects" - Seq, PrimarySeq, LocatableSeq, RelSegment, LiveSeq, LargeSeq, SeqI, and SeqWithQuality. Understanding the relationships among these objects - and why there are so many of them - will help you select the appropriate one to use in your script.

Brief descriptions (Seq, PrimarySeq, LocatableSeq, RelSegment, LiveSeq, LargeSeq, RichSeq, SeqWithQuality, SeqI)

This section describes various Bioperl sequence objects. Many people using Bioperl will never know, or need to know, what kind of sequence object they are using. This is because the SeqIO module, section "III.2.1", creates exactly the right type of object when given a file or a filehandle or a string. But if you're curious, or if you need to create a sequence object manually for some reason, then read on.

Seq is the central sequence object in bioperl. When in doubt this is probably the object that you want to use to describe a DNA, RNA or protein sequence in bioperl. Most common sequence manipulations can be performed with Seq. These capabilities are described in sections "III.3.1" and "III.7.1", or in Bio::Seq.

Seq objects may be created for you automatically when you read in a file containing sequence data using the SeqIO object. This procedure is described in section "III.2.1". In addition to storing its identification labels and the sequence itself, a Seq object can store multiple annotations and associated "sequence features", such as those contained in most Genbank and EMBL sequence files. This capability can be very useful - especially in development of automated genome annotation systems, see section "III.7.1".

On the other hand, if you need a script capable of simultaneously handling hundreds or thousands sequences at a time, then the overhead of adding annotations to each sequence can be significant. For such applications, you will want to use the PrimarySeq object. PrimarySeq is basically a stripped-down version of Seq. It contains just the sequence data itself and a few identifying labels (id, accession number, alphabet = dna, rna, or protein), and no features. For applications with hundreds or thousands or sequences, using PrimarySeq objects can significantly speed up program execution and decrease the amount of RAM the program requires. See Bio::PrimarySeq for more details.

RichSeq objects store additional annotations beyond those used by standard Seq objects. If you are using sources with very rich sequence annotation, you may want to consider using these objects which are described in section "III.7.1". RichSeq objects are created automatically when Genbank, EMBL, or Swissprot format files are read by SeqIO.

SeqWithQuality objects are used to manipulate sequences with quality data, like those produced by phred. These objects are described in section "III.7.6", Bio::Seq::RichSeqI, and in Bio::Seq::SeqWithQuality.

What is called a LocatableSeq object for historical reasons might be more appropriately called an "AlignedSeq" object. It is a Seq object which is part of a multiple sequence alignment. It has start and end positions indicating from where in a larger sequence it may have been extracted. It also may have gap symbols corresponding to the alignment to which it belongs. It is used by the alignment object SimpleAlign and other modules that use SimpleAlign objects (e.g. AlignIO.pm, pSW.pm).

In general you don't have to worry about creating LocatableSeq objects because they will be made for you automatically when you create an alignment (using pSW, Clustalw, Tcoffee, Lagan, or bl2seq) or when you input an alignment data file using AlignIO. However if you need to input a sequence alignment by hand (e.g. to build a SimpleAlign object), you will need to input the sequences as LocatableSeqs. Other sources of information include Bio::LocatableSeq, Bio::SimpleAlign, Bio::AlignIO, and Bio::Tools::pSW.

The RelSegment object is also a type of Bioperl Seq object. RelSegment objects are useful when you want to be able to manipulate the origin of the genomic coordinate system. This situation may occur when looking at a sub-sequence (e.g. an exon) which is located on a longer underlying underlying sequence such as a chromosome or a contig. Such manipulations may be important, for example when designing a graphical genome browser. If your code may need such a capability, look at the documentation Bio::DB::GFF::RelSegment which describes this feature in detail.

A LargeSeq object is a special type of Seq object used for handling very long sequences (e.g. > 100 MB). If you need to manipulate such long sequences see section "III.7.3" which describes LargeSeq objects, or Bio::Seq::LargeSeq.

A LiveSeq object is another specialized object for storing sequence data. LiveSeq addresses the problem of features whose location on a sequence changes over time. This can happen, for example, when sequence feature objects are used to store gene locations on newly sequenced genomes - locations which can change as higher quality sequencing data becomes available. Although a LiveSeq object is not implemented in the same way as a Seq object, LiveSeq does implement the SeqI interface (see below). Consequently, most methods available for Seq objects will work fine with LiveSeq objects. Section "III.7.4" and Bio::LiveSeq contain further discussion of LiveSeq objects.

SeqI objects are Seq "interface objects" (see section "II.4" and Bio::SeqI). They are used to ensure bioperl's compatibility with other software packages. SeqI and other interface objects are not likely to be relevant to the casual Bioperl user.

Location objects

A Location object is designed to be associated with a Sequence Feature object in order to show where the feature is on a longer sequence. Location objects can also be standalone objects used to described positions. The reason why these simple concepts have evolved into a collection of rather complicated objects is that:

1) Some objects have multiple locations or sub-locations (e.g. a gene's exons may have multiple start and stop locations) 2) In unfinished genomes, the precise locations of features is not known with certainty.

Bioperl's various Location objects address these complications. In addition there are CoordinatePolicy objects that allow the user to specify how to measure the length of a feature if its precise start and end coordinates are not known. In most cases, you will not need to worry about these complications if you are using Bioperl to handle simple features with well-defined start and stop locations. However, if you are using Bioperl to annotate partially or unfinished genomes or to read annotations of such genomes with bioperl, understanding the various Location objects will be important. See the documentation of the various modules in the Bio::Locations directory or Bio::Location::CoordinatePolicyI or section "III.7.1" for more information.

Interface objects and implementation objects

One goal of the design of Bioperl is to separate interface and implementation objects. An interface is solely the definition of what methods one can call on an object, without any knowledge of how it is implemented. An implementation is an actual, working implementation of an object. In languages like Java, interface definition is part of the language. In Perl, you have to roll your own.

In Bioperl, the interface objects usually have names like Bio::MyObjectI, with the trailing I indicating it is an interface object. The interface objects mainly provide documentation on what the interface is, and how to use it, without any implementations (though there are some exceptions). Although interface objects are not of much direct utility to the casual Bioperl user, being aware of their existence is useful since they are the basis to understanding how bioperl programs can communicate with other bioinformatics projects and computer languages such as Ensembl and biopython and biojava.

For more discussion of design and development issues please see the biodesign.pod file in the package or biodesign.html (http://bioperl.org/Core/Latest/biodesign.html).

Representing large sequences (LargeSeq)

Very large sequences present special problems to automated sequence-annotation storage and retrieval projects. Bioperl's LargeSeq object addresses this situation.

A LargeSeq object is a SeqI compliant object that stores a sequence as a series of files in a temporary directory (see sect "II.1" or Bio::SeqI for a definition of SeqI objects). The aim is to enable storing very large sequences (e.g. > 100 MBases) without running out of memory and, at the same time, preserving the familiar Bioperl Seq object interface. As a result, from the user's perspective, using a LargeSeq object is almost identical to using a Seq object. The principal difference is in the format used in the SeqIO calls. Another difference is that the user must remember to only read in small chunks of the sequence at one time. These differences are illustrated in the following code:

$seqio = new Bio::SeqIO(-format => 'largefasta',
                        -file   => 't/data/genomic-seq.fasta');
$pseq = $seqio->next_seq();
$plength = $pseq->length();
$last_4 = $pseq->subseq($plength-3,$plength);  # this is OK 

On the other hand, the next statement would probably cause the machine to run out of memory:

$lots_of_data = $pseq->seq();  # NOT OK for a large LargeSeq object 

Representing changing sequences (LiveSeq)

Data files with sequences that are frequently being updated present special problems to automated sequence-annotation storage and retrieval projects. Bioperl's LiveSeq object is designed to address this situation.

The LiveSeq object addresses the need for a sequence object capable of handling sequence data that may be changing over time. In such a sequence, the precise locations of features along the sequence may change. LiveSeq deals with this issue by re-implementing the sequence object internally as a "double linked chain." Each element of the chain is connected to other two elements (the PREVious and the NEXT one). There is no absolute position like in an array, hence if positions are important, they need to be computed (methods are provided). Otherwise it's easy to keep track of the elements with their "LABELs". There is one LABEL (think of it as a pointer) to each ELEMENT. The labels won't change after insertions or deletions of the chain. So it's always possible to retrieve an element even if the chain has been modified by successive insertions or deletions.

Although the implementation of the LiveSeq object is novel, its bioperl user interface is unchanged since LiveSeq implements a PrimarySeqI interface (recall PrimarySeq is the subset of Seq without annotations or SeqFeatures - see section "II.1" or Bio::PrimarySeq). Consequently syntax for using LiveSeq objects is familiar although a modified version of SeqIO called Bio::LiveSeq::IO::BioPerl needs to be used to actually load the data, e.g.:

$loader = Bio::LiveSeq::IO::BioPerl->load(-db   => "EMBL",
                                          -file => "t/data/factor7.embl");
$gene = $loader->gene2liveseq(-gene_name => "factor7");
$id = $gene->get_DNA->display_id ;
$maxstart = $gene->maxtranscript->start;

See Bio::LiveSeq::IO::BioPerl for more details.

Using Bioperl

Bioperl provides software modules for many of the typical tasks of bioinformatics programming. These include:

  • Accessing sequence data from local and remote databases
  • Transforming formats of database/ file records
  • Manipulating individual sequences
  • Searching for similar sequences
  • Creating and manipulating sequence alignments
  • Searching for genes and other structures on genomic DNA
  • Developing machine readable sequence annotations

The following sections describe how Bioperl can help perform all of these tasks.

Accessing sequence data from local and remote databases

Much of Bioperl is focused on sequence manipulation. However, before bioperl can manipulate sequences, it needs to have access to sequence data. Now one can directly enter data sequence data into a bioperl Seq object, eg:

$seq = Bio::Seq->new(-seq              => 'actgtggcgtcaact',
                     -description      => 'Sample Bio::Seq object',
                     -display_id       => 'something',
                     -accession_number => 'accnum',
                     -alphabet         => 'dna' );

However, in most cases, you will probably be accessing sequence data from some online data file or database.

Bioperl supports accessing remote databases as well as creating indices for accessing local databases. There are two general approaches to accomplishing this. If you know what kind of database the sequences are stored in (i.e. flat file, local relational database or a database accessed remotely over the internet), you can write a script that specifically accesses data from that kind of database. This approach is described in sections III.1.1 and III.1.2 for access from remote databases and local indexed flat files respectively. To explicitly access sequence data from a local relational database requires installing and setting up the modules in the bioperl-db library and the BioSQL schema, see "IV.3" for more information.

The other approach is to use the recently developed OBDA (Open Bioinformatics Data Access) Registry system. Using OBDA it is possible to import sequence data from a database without your needing to know whether the required database is flat-file or relational or even whether it is local or accessible only over the net. Descriptions of how to set up the necessary registry configuration file and access sequence data with the registry in described at http://bioperl.open-bio.org/wiki/HOWTO:OBDA.

Accessing remote databases (Bio::DB::GenBank, etc)

Accessing sequence data from the principal molecular biology databases is straightforward in bioperl. Data can be accessed by means of the sequence's accession number or id. Batch mode access is also supported to facilitate the efficient retrieval of multiple sequences. For retrieving data from genbank, for example, the code could be as follows:

$gb = new Bio::DB::GenBank;
 
# this returns a Seq object :
$seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
 
# this also returns a Seq object :
$seq2 = $gb->get_Seq_by_acc('AF303112');
 
# this returns a SeqIO object, which can be used to get a Seq object :
$seqio = $gb->get_Stream_by_id(["J00522","AF303112","2981014"]);
$seq3 = $seqio->next_seq;

See section "III.2.1" for information on using this SeqIO object.

Bioperl currently supports sequence data retrieval from the genbank, genpept, RefSeq, swissprot, and EMBL databases. See Bio::DB::GenBank, Bio::DB::GenPept, Bio::DB::SwissProt, Bio::DB::RefSeq and Bio::DB::EMBL for more information. A user can also specify a different database mirror for a database - this is especially relevant for the SwissProt resource where there are many ExPaSy mirrors. There are also configuration options for specifying local proxy servers for those behind firewalls.

The retrieval of NCBI RefSeqs sequences is supported through a special module called Bio::DB::RefSeq which actually queries an EBI server. Please see Bio::DB::RefSeq before using it as there are some caveats with RefSeq retrieval. RefSeq ids in Genbank begin with "NT_", "NC_", "NG_", "NM_", "NP_", "XM_", "XR_", or "XP_" (for more information see http://www.ncbi.nlm.nih.gov/projects/RefSeq/ ). Bio::DB::GenBank can be used to retrieve entries corresponding to these ids but bear in mind that these are not Genbank entries, strictly speaking. See Bio::DB::GenBank for special details on retrieving entries beginning with "NT_", these are specially formatted "CONTIG" entries.

Bioperl also supports retrieval from a remote Ace database. This capability requires the presence of the external AcePerl module. You need to download and install the aceperl module from http://stein.cshl.org/AcePerl/.

An additional module is available for accessing remote databases, BioFetch, which queries the dbfetch script at EBI. The available databases are EMBL, GenBank, or SWALL, and the entries can be retrieved in different formats as objects or streams (SeqIO objects), or as temporary files. See Bio::DB::BioFetch for the details.

Indexing and accessing local databases Bio::Index::*, bp_index.pl, bp_fetch.pl, Bio::DB::*)

Alternately, Bioperl permits indexing local sequence data files by means of the Bio::Index* or Bio::DB::Fasta objects. The following sequence data formats are supported Bio::Index: genbank, swissprot, pfam, embl and fasta. Once the set of sequences have been indexed using Bio::Index*, individual sequences can be accessed using syntax very similar to that described above for accessing remote databases. For example, if one wants to set up an indexed flat-file database of fasta files, and later wants then to retrieve one file, one could write scripts like:

# script 1: create the index
use Bio::Index::Fasta; # using fasta file format
use strict; # some users have reported that this is necessary
 
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
    -filename => $Index_File_Name,
    -write_flag => 1);
$inx->make_index(@ARGV);
 
# script 2: retrieve some files
use Bio::Index::Fasta;
use strict; # some users have reported that this is necessary
 
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new($Index_File_Name);
foreach  my $id (@ARGV) {
    my $seq = $inx->fetch($id);  # Returns Bio::Seq object
    # do something with the sequence
}

To facilitate the creation and use of more complex or flexible indexing systems, the Bioperl distribution includes two sample scripts in the scripts/index directory, bp_index.PLS and bp_fetch.PLS. These scripts can be used as templates to develop customized local data-file indexing systems.

Bioperl also supplies Bio::DB::Fasta as a means to index and query Fasta format files. It's similar in spirit to Bio::Index::Fasta but offers more methods, e.g.

use Bio::DB::Fasta;
use strict;
 
my $db = Bio::DB::Fasta->new($file);  # one file or many files
my $seqstring = $db->seq($id);        # get a sequence as string
my $seqobj = $db->get_Seq_by_id($id); # get a PrimarySeq obj
my $desc = $db->header($id);          # get the header, or description line 

See Bio::DB::Fasta for more information on this fully-featured module.

Both modules also offer the user the ability to designate a specific string within the fasta header as the desired id, such as the gi number within the string "gi|4556644|gb|X45555". Consider the following fasta-formatted sequence, in "test.fa":

>gi|523232|emb|AAC12345|sp|D12567 titin fragment
MHRHHRTGYSAAYGPLKJHGYVHFIMCVVVSWWASDVVTYIPLLLNNSSAGWKRWWWIIFGGE
GHGHHRTYSALWWPPLKJHGSKHFILCVKVSWLAKKERTYIPKKILLMMGGWWAAWWWI

By default Bio::Index::Fasta and Bio::DB::Fasta will use the first "word" they encounter in the fasta header as the retrieval key, in this case "gi|523232|emb|AAC12345|sp|D12567". What would be more useful as a key would be a single id. The code below will index the "test.fa" file and create an index file called "test.fa.idx" where the keys are the Swissprot, or "sp", identifiers.

$ENV{BIOPERL_INDEX_TYPE} = "SDBM_File";
# look for the index in the current directory
$ENV{BIOPERL_INDEX} = ".";
 
my $file_name = "test.fa";
my $inx = Bio::Index::Fasta->new( -filename   => $file_name . ".idx",
                                  -write_flag => 1 );
# pass a reference to the critical function to the Bio::Index object
$inx->id_parser(\&get_id);
# make the index
$inx->make_index($file_name);
 
# here is where the retrieval key is specified
sub get_id {
   my $header = shift;
   $header =~ /^>.*\bsp\|([A-Z]\d{5}\b)/;
   $1;
}

Here is how you would retrieve the sequence, as a Bio::Seq object:

my $seq = $inx->fetch("D12567");
print $seq->seq;

What if you wanted to retrieve a sequence using either a Swissprot id or a gi number and the fasta header was actually a concatenation of headers with multiple gi's and Swissprots?

>gi|523232|emb|AAC12345|sp|D12567|gi|7744242|sp|V11223 titin fragment

Modify the function that's passed to the id_parser() method:

sub get_id {
   my $header = shift;
   my (@sps) = $header =~ /^>.*\bsp\|([A-Z]\d{5})\b/g;
   my (@gis) = $header =~ /gi\|(\d+)\b/g;
   return (@sps,@gis);
}

The Bio::DB::Fasta module uses the same principle, but the syntax is slightly different, for example:

my $db = Bio::DB::Fasta->new('test.fa', -makeid=>\&make_my_id);
my $seqobj = $db->get_Seq_by_id($id);
 
sub make_my_id {
   my $description_line = shift;
   $description_line =~ /gi\|(\d+)\|emb\|(\w+)/;
   ($1,$2);
}

The core Bioperl installation does not support accessing sequences and data stored in relational databases. However, this capability is available with the auxiliary bioperl-db library. See section "IV.3" for more information.

Transforming sequence files (SeqIO)

A common - and tedious - bioinformatics task is that of converting sequence data among the many widely used data formats. Bioperl's SeqIO object, however, makes this chore a breeze. SeqIO can read a stream of sequences - located in a single or in multiple files - in a number of formats including Fasta, EMBL, GenBank, Swissprot, SCF, phd/phred, Ace, fastq, exp, chado, or raw (plain sequence). SeqIO can also parse tracefiles in alf, ztr, abi, ctf, and ctr format Once the sequence data has been read in with SeqIO, it is available to Bioperl in the form of Seq, PrimarySeq, or RichSeq objects, depending on what the sequence source is. Moreover, the sequence objects can then be written to another file using SeqIO in any of the supported data formats making data converters simple to implement, for example:

use Bio::SeqIO;
$in  = Bio::SeqIO->new(-file => "inputfilename",
                       -format => 'Fasta');
$out = Bio::SeqIO->new(-file => ">outputfilename",
                       -format => 'EMBL');
while ( my $seq = $in->next_seq() ) {$out->write_seq($seq); }

In addition, the Perl "tied filehandle" syntax is available to SeqIO, allowing you to use the standard <> and print operations to read and write sequence objects, eg:

$in  = Bio::SeqIO->newFh(-file => "inputfilename" ,
                         -format => 'fasta');
$out = Bio::SeqIO->newFh(-format => 'embl');
print $out $_ while <$in>;

If the "-format" argument isn't used then Bioperl will try to determine the format based on the file's suffix, in a case-insensitive manner. If there's no suffix available then SeqIO will attempt to guess the format based on actual content. If it can't determine the format then it will assume "fasta". A complete list of formats and suffixes can be found in the SeqIO HOWTO ( http://bioperl.open-bio.org/wiki/HOWTO:SeqIO ).

Transforming alignment files (AlignIO)

Data files storing multiple sequence alignments also appear in varied formats. Bio::AlignIO is the Bioperl object for conversion of alignment files. AlignIO is patterned on the SeqIO object and its commands have many of the same names as the commands in SeqIO. Just as in SeqIO the AlignIO object can be created with "-file" and "-format" options:

use Bio::AlignIO;
my $io = Bio::AlignIO->new(-file   => "receptors.aln",
                           -format => "clustalw" );

If the "-format" argument isn't used then Bioperl will try and determine the format based on the file's suffix, in a case-insensitive manner. Here is the current set of input formats:

Format Suffixes Comment
bl2seq
clustalw aln
emboss water needle
fasta fasta fast seq fa fsa nt aa
maf maf
mase Seaview
mega meg ega
meme meme
metafasta
msf msf pileup gcg GCG
nexus nexus nex
pfam pfam pfm Pfam
phylip phylip phlp phyl phy ph interleaved
po POA
prodom
psi psi PSI-BLAST
selex selex slx selx slex sx HMMER
stockholm stk Rfam, Pfam
XMFA xmfa
arp arp Arlequin

The emboss format refers to the output of the water, needle, matcher, stretcher, merger, and supermatcher applications. See "IV.2.1" on EMBOSS for more information, or http://emboss.sourceforge.net.

Unlike SeqIO AlignIO cannot create output files in every format. AlignIO currently supports output in these formats:

  • fasta
  • mase
  • selex
  • clustalw
  • msf
  • phylip (interleaved)
  • po
  • stockholm
  • XMFA
  • metafasta

Another significant difference between AlignIO and SeqIO is that AlignIO handles IO for only a single alignment at a time but SeqIO.pm handles IO for multiple sequences in a single stream. Syntax for AlignIO is almost identical to that of SeqIO:

use Bio::AlignIO;
$in  = Bio::AlignIO->new(-file => "inputfilename" ,
                         -format => 'fasta');
$out = Bio::AlignIO->new(-file => ">outputfilename",
                         -format => 'pfam');
while ( my $aln = $in->next_aln() ) { $out->write_aln($aln); }

The only difference is that the returned object reference, $aln, is to a SimpleAlign object rather than to a Seq object.

AlignIO also supports the tied filehandle syntax described above for SeqIO. See Bio::AlignIO, Bio::SimpleAlign, and section "III.5" on SimpleAlign for more information.

Manipulating sequences

Bioperl contains many modules with functions for sequence analysis. And if you cannot find the function you want in Bioperl you may be able to find it in EMBOSS or PISE , which are accessible through the bioperl-run auxiliary library (see "IV.2.1").

Manipulating sequence data with Seq methods

OK, so we know how to retrieve sequences and access them as sequence objects. Let's see how we can use sequence objects to manipulate our sequence data and retrieve information. Seq provides multiple methods for performing many common (and some not-so-common) tasks of sequence manipulation and data retrieval. Here are some of the most useful:

These methods return strings or may be used to set values:

$seqobj->display_id();       # the human read-able id of the sequence
$seqobj->seq();              # string of sequence
$seqobj->subseq(5,10);       # part of the sequence as a string
$seqobj->accession_number(); # when there, the accession number
$seqobj->alphabet();         # one of 'dna','rna','protein'
$seqobj->primary_id();       # a unique id for this sequence regardless
                             # of its display_id or accession number
$seqobj->description();      # a description of the sequence 

It is worth mentioning that some of these values correspond to specific fields of given formats. For example, the display_id() method returns the LOCUS name of a Genbank entry, the (\S+) following the > character in a Fasta file, the ID from a SwissProt file, and so on. The description() method will return the DEFINITION line of a Genbank file, the line following the display_id in a Fasta file, and the DE field in a SwissProt file.

The following methods return an array of Bio::SeqFeature objects:

$seqobj->get_SeqFeatures;      # The 'top level' sequence features
$seqobj->get_all_SeqFeatures;  # All sequence features, including sub-
                                # seq features 

For a comment annotation, you can use:

use Bio::Annotation::Comment;
 
$seq->annotation->add_Annotation('comment',
    Bio::Annotation::Comment->new(-text => 'some description');

For a reference annotation, you can use:

use Bio::Annotation::Reference;
 
$seq->annotation->add_Annotation('reference',
    Bio::Annotation::Reference->new(-authors  => 'author1,author2',
                                    -title    => 'title line',
                                    -location => 'location line',
                                    -medline  => 998122 );

Sequence features will be discussed further in section "III.7" on machine-readable sequence annotation. A general description of the object can be found in Bio::SeqFeature::Generic, and a description of related, top-level annotation is found in Bio::Annotation::Collection.

Additional sample code for obtaining sequence features can be found in the script gb2features.pl in the subdirectory examples/DB. Finally, there's a HOWTO on features and annotations ( http://bioperl.open-bio.org/wiki/HOWTO:Feature-Annotation ) and there's a section on features in the FAQ ( http://bioperl.org/Core/Latest/faq.html#5 ).

The following methods returns new sequence objects, but do not transfer the features from the starting object to the resulting feature:

$seqobj->trunc(5,10);  # truncation from 5 to 10 as new object
$seqobj->revcom;       # reverse complements sequence
$seqobj->translate;    # translation of the sequence 

Note that some methods return strings, some return arrays and some return objects. See Bio::Seq for more information.

Translating

Many of these methods are self-explanatory. However, the flexible translate() method needs some explanation. Translation in bioinformatics can mean two slightly different things,either translating a nucleotide sequence from start to end or translate the actual coding regions in mRNAs or cDNAs. The Bioperl implementation of sequence translation does both of these.

Any sequence object with alphabet 'dna' or 'rna' can be translated by simply using translate which will return a protein sequence object:

$prot_obj = $my_seq_object->translate;

All codons will be translated, including those before and after any initiation and termination codons. For example, ttttttatgccctaggggg will be translated to FFMP*G

However, the translate() method can also be passed several optional parameters to modify its behavior. For example, you can tell translate() to modify the characters used to represent terminator (default is *) and unknown amino acids (default is X).

$prot_obj = $my_seq_object->translate(-terminator => '-');
$prot_obj = $my_seq_object->translate(-unknown => '_');

You can also determine the frame of the translation. The default frame starts at the first nucleotide (frame 0). To get translation in the next frame we would write:

$prot_obj = $my_seq_object->translate(-frame => 1);

If we want to translate full coding regions (CDS) the way major nucleotide databanks EMBL, GenBank and DDBJ do it, the translate() method has to perform more checks. Specifically, translate() needs to confirm that the open reading frame has appropriate start and terminator codons at the very beginning and the very end of the sequence and that there are no terminator codons present within the sequence in frame 0. In addition, if the genetic code being used has an atypical (non-ATG) start codon, the translate() method needs to convert the initial amino acid to methionine. These checks and conversions are triggered by setting "complete" to 1:

$prot_obj = $my_seq_object->translate(-complete => 1);

If "complete" is set to true and the criteria for a proper CDS are not met, the method, by default, issues a warning. By setting "throw" to 1, one can instead instruct the program to die if an improper CDS is found, e.g.

$prot_obj = $my_seq_object->translate(-complete => 1,
                                      -throw => 1);

The codontable_id argument to translate() makes it possible to use alternative genetic codes. There are currently 16 codon tables defined, including 'Standard', 'Vertebrate Mitochondrial', 'Bacterial', 'Alternative Yeast Nuclear' and 'Ciliate, Dasycladacean and Hexamita Nuclear'. All these tables can be seen in Bio::Tools::CodonTable. For example, for mitochondrial translation:

$prot_obj = $seq_obj->translate(-codontable_id => 2);

You can also create a custom codon table and pass this to translate, the code will look something like this:

use Bio::Tools::CodonTable;
 
@custom_table =
    ( 'test1',
      'FFLLSSSSYY**CC*WLLLL**PPHHQQR*RRIIIFT*TT*NKKSSRRV*VVAA*ADDEE*GGG'
    );
 
$codon_table = Bio::Tools::CodonTable->new;
 
$id = $codon_table->add_table(@custom_table);
 
$prot_obj = $my_seq_object->translate(-codontable_id => $id);

See Bio::Tools::CodonTable for information on the format of a codon table.

translate() can also find the open reading frame (ORF) starting at the 1st initiation codon in the nucleotide sequence, regardless of its frame, and translate that:

$prot_obj = $my_seq_object->translate(-orf => 1);

Most of the codon tables, including the default codon table NCBI "Standard", have initiation codons in addition to ATG. To tell translate() to use only ATG or atg as the initiation codon set -start to "atg":

$prot_obj = $my_seq_object->translate(-orf => 1,
                                      -start => "atg" );

The -start argument only applies when -orf is set to 1.

Last trick. By default translate() will translate the termination codon to some special character (the default is *, but this can be reset using the -terminator argument).

When -complete is set to 1 this character is removed. So, with this:

$prot_obj = $my_seq_object->translate(-orf => 1,
                                      -complete => 1);

the sequence tttttatgccctaggggg will be translated to MP, not MP*.

See Bio::Tools::CodonTable and Bio::PrimarySeqI for more information on translation.

Obtaining basic sequence statistics (SeqStats,SeqWord)

In addition to the methods directly available in the Seq object, bioperl provides various helper objects to determine additional information about a sequence. For example, SeqStats object provides methods for obtaining the molecular weight of the sequence as well the number of occurrences of each of the component residues (bases for a nucleic acid or amino acids for a protein.) For nucleic acids, SeqStats also returns counts of the number of codons used. For example:

use Bio::Tools::SeqStats;
$seq_stats  = Bio::Tools::SeqStats->new($seqobj);
$weight = $seq_stats->get_mol_wt();
$monomer_ref = $seq_stats->count_monomers();
$codon_ref = $seq_stats->count_codons();  # for nucleic acid sequence 

Note: sometimes sequences will contain ambiguous codes. For this reason, get_mol_wt() returns a reference to a two element array containing a greatest lower bound and a least upper bound of the molecular weight.

The SeqWords object is similar to SeqStats and provides methods for calculating frequencies of "words" (e.g. tetramers or hexamers) within the sequence. See Bio::Tools::SeqStats and Bio::Tools::SeqWords for more information.

Identifying restriction enzyme sites (Bio::Restriction)

Another common sequence manipulation task for nucleic acid sequences is locating restriction enzyme cutting sites. Bioperl provides the Bio::Restriction::Enzyme, Bio::Restriction::EnzymeCollection, and Bio::Restriction::Analysis objects for this purpose. These modules replace the older module Bio::Tools::RestrictionEnzyme. A new collection of enzyme objects would be defined like this:

use Bio::Restriction::EnzymeCollection;
 my $all_collection = Bio::Restriction::EnzymeCollection;

Bioperl's default Restriction::EnzymeCollection object comes with data for more than 500 different Type II restriction enzymes. A list of the available enzyme names can be accessed using the available_list() method, but these are just the names, not the functional objects. You also have access to enzyme subsets. For example to select all available Enzyme objects with recognition sites that are six bases long one could write:

my $six_cutter_collection = $all_collection->cutters(6);
for my $enz ($six_cutter_collection){
   print $enz->name,"\t",$enz->site,"\t",$enz->overhang_seq,"\n";
   # prints name, recognition site, overhang
}

There are other methods that can be used to select sets of enzyme objects, such as unique_cutters() and blunt_enzymes(). You can also select a Enzyme object by name, like so:

my $ecori_enzyme = $all_collection->get_enzyme('EcoRI');

Once an appropriate enzyme has been selected, the sites for that enzyme on a given nucleic acid sequence can be obtained using the fragments() method. The syntax for performing this task is:

use Bio::Restriction::Analysis;
 my $analysis = Bio::Restriction::Analysis->new(-seq => $seq);
 # where $seq is the Bio::Seq object for the DNA to be cut
 @fragments =  $analysis->fragments($enzyme);
 # and @fragments will be an array of strings 

To get information on isoschizomers, methylation sites, microbe source, vendor or availability you will need to create your EnzymeCollection directly from a REBASE file, like this:

use Bio::Restriction::IO;
my $re_io = Bio::Restriction::IO->new(-file => $file,
                                      -format=>'withrefm');
my $rebase_collection = $re_io->read;

A REBASE file in the correct format can be found at ftp://ftp.neb.com/pub/rebase - it will have a name like "withrefm.308". If need be you can also create new enzymes, like this:

my $re = new Bio::Restriction::Enzyme(-enzyme => 'BioRI',
                                      -seq => 'GG^AATTCC');

For more informatation see Bio::Restriction::Enzyme, Bio::Restriction::EnzymeCollection, Bio::Restriction::Analysis, and Bio::Restriction::IO.

Identifying amino acid cleavage sites (Sigcleave)

For amino acid sequences we may be interested to know whether the amino acid sequence contains a cleavable signal sequence for directing the transport of the protein within the cell. SigCleave is a program (originally part of the EGCG molecular biology package) to predict signal sequences, and to identify the cleavage site based on the von Heijne algorithm.

The threshold setting controls the score reporting. If no value for threshold is passed in by the user, the code defaults to a reporting value of 3.5. SigCleave will only return score/position pairs which meet the threshold limit.

There are 2 accessor methods for this object. signals() will return a perl hash containing the sigcleave scores keyed by amino acid position. pretty_print() returns a formatted string similar to the output of the original sigcleave utility.

The syntax for using Sigcleave is as follows:

# create a Seq object, for example:
$seqobj = Bio::Seq->new(-seq => "AALLHHHHHHGGGGPPRTTTTTVVVVVVVVVVVVVVV");
 
use Bio::Tools::Sigcleave;
$sigcleave_object = new Bio::Tools::Sigcleave
    ( -seq          => $seqobj,
      -threshold    => 3.5,
      -description  => 'test sigcleave protein seq',
    );
%raw_results      = $sigcleave_object->signals;
$formatted_output = $sigcleave_object->pretty_print;

Please see Bio::Tools::Sigcleave for details.

Miscellaneous sequence utilities: OddCodes, SeqPattern

OddCodes:

For some purposes it's useful to have a listing of an amino acid sequence showing where the hydrophobic amino acids are located or where the positively charged ones are. Bioperl provides this capability via the module Bio::Tools::OddCodes.

For example, to quickly see where the charged amino acids are located along the sequence we perform:

use Bio::Tools::OddCodes;
$oddcode_obj = Bio::Tools::OddCodes->new($amino_obj);
$output = $oddcode_obj->charge();

The sequence will be transformed into a three-letter sequence (A,C,N) for negative (acidic), positive (basic), and neutral amino acids. For example the ACDEFGH would become NNAANNC.

For a more complete chemical description of the sequence one can call the chemical() method which turns sequence into one with an 8-letter chemical alphabet { A (acidic), L (aliphatic), M (amide), R (aromatic), C (basic), H (hydroxyl), I (imino), S (sulfur) }:

$output = $oddcode_obj->chemical();

In this case the sample sequence ACDEFGH would become LSAARAC.

OddCodes also offers translation into alphabets showing alternate characteristics of the amino acid sequence such as hydrophobicity, "functionality" or grouping using Dayhoff's definitions. See the documentation in Bio::Tools::OddCodes for further details.

SeqPattern:

The SeqPattern object is used to manipulate sequences using Perl regular expressions. A key motivation for SeqPattern is to have a way of generating a reverse complement of a nucleic acid sequence pattern that includes ambiguous bases and/or regular expressions. This capability leads to significant performance gains when pattern matching on both the sense and anti-sense strands of a query sequence are required. Typical syntax for using SeqPattern is shown below. For more information, there are several interesting examples in the script seq_pattern.pl in the examples/tools directory.

use Bio::Tools::SeqPattern;
$pattern     = '(CCCCT)N{1,200}(agggg)N{1,200}(agggg)';
$pattern_obj = new Bio::Tools::SeqPattern(-SEQ  => $pattern,
                                          -TYPE => 'dna');
$pattern_obj2 = $pattern_obj->revcom();
$pattern_obj->revcom(1); # returns expanded rev complement pattern. 

More detail can be found in Bio::Tools::SeqPattern.

Converting coordinate systems (Coordinate::Pair, RelSegment)

Coordinate system conversion is a common requirement, for example, when one wants to look at the relative positions of sequence features to one another and convert those relative positions to absolute coordinates along a chromosome or contig. Although coordinate conversion sounds pretty trivial it can get fairly tricky when one includes the possibilities of switching to coordinates on negative (i.e. Crick) strands and/or having a coordinate system terminate because you have reached the end of a clone or contig. Bioperl has two different approaches to coordinate-system conversion (based on the modules Bio::Coordinate::Pair and Bio::DB::GFF::RelSegment, respectively).

The Coordinate::Pair approach is somewhat more "low level". With it, you define an input coordinate system and an output coordinate system, where in each case a coordinate system is a triple of a start position, end position and strand. The end position is especially important when dealing with unfinished assemblies where the coordinate system ends when one reaches the end of the sequence of a clone or contig. Once one has defined the two coordinate systems, one defines a Coordinate::Pair to map between them. Then one can map positions between the coordinates systems with code such as this:

$input_coordinates = Bio::Location::Simple->new  
(-seq_id => 'propeptide', -start => 1000, -end => 2000, -strand=>1 );
 
$output_coordinates = Bio::Location::Simple->new  
(-seq_id => 'peptide', -start => 1100, -end => 2100, -strand=>1 );
 
$pair = Bio::Coordinate::Pair->new
(-in => $input_coordinates , -out => $output_coordinates );
 
$pos = Bio::Location::Simple->new (-start => 500, -end => 500 );
 
$res = $pair->map($pos);
$converted_start = $res->start;

In this example $res is also a Bio::Location object, as you'd expect. See the documentation for Bio::Coordinate::Pair and Bio::Coordinate::GeneMapper for more details.

The Bio::DB::GFF::RelSegment approach is designed more for handling coordinate transformations of sequence features rather than for transforming arbitrary coordinate systems. With Bio::DB::GFF::RelSegment you define a coordinate system relative to a specific feature (called the "refseq"). You also have access to the absolute coordinate system (typically of the entire chromosome.) You can determine the position of a feature relative to some other feature simply by redefining the relevant reference feature (i.e. the "refseq") with code like this:

$db = Bio::DB::GFF->new(-dsn     => 'dbi:mysql:elegans',
                        -adaptor => 'dbi:mysqlopt');
 
$segment = $db->segment('ZK909');
$relative_start = $segment->start;  # $relative_start = 1;
 
# Now retrieve the start position of ZK909 relative to feature ZK337
$segment->