HOWTO:Getting Genomic Sequences

From BioPerl

Jump to: navigation, search


Contents


Authors

Brian Osborne

bosborne11 at verizon.net

Chris Fields


Copyright

This document is copyright Brian Osborne. It can be copied and distributed under the terms of the Perl Artistic License.

Abstract

This is a HOWTO that talks about using Bioperl and tools related to Bioperl to get genomic sequence. There are a few different approaches, one uses files that you'll download to your own computer to query locally, others use remote, programmable interfaces or APIs. You should also see the EUtils Cookbook.

Using local Genbank and Entrez Gene files

You can download chromosomal, nucleotide files in FASTA format from NCBI (ftp://ftp.ncbi.nih.gov/genomes/) and get gene position data from Entrez Gene (see Using Bio::DB::EntrezGene to get genomic coordinates), then create indices of the fasta files using Bio::DB::Fasta. There is an example script, extract_genes.pl, that shows how this could be done. The query terms are limited to Gene id's in this example since the positional data is taken from Entrez Gene's gene2accession file.

Requirement: BioPerl.

Using the Perl API at ENSEMBL

You can connect remotely to the ENSEMBL database and query it using any name or identifier that's understood by ENSEMBL.

Requirements: BioPerl, the Ensembl Perl API, DBI, and DBD::mysql. See Ensembl API installation , and the Ensembl Perl API tutorials for information on installation and use.

Example script:

use strict;
use DBI;
use Getopt::Long;
use Bio::EnsEMBL::DBSQL::DBAdaptor;
 
my $identifier;
 
GetOptions( "n=s" => \$identifier );
 
my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
                       -user   => "anonymous",
                       -dbname => "homo_sapiens_core_49_36k",
                       -host   => "ensembldb.ensembl.org",
                       -pass   => "",			     
                       -driver => 'mysql',
                       -port   => 5306,      # use mysql 5 (mysql 4: port 3306)
                );
 
my $gene_adaptor = $db->get_GeneAdaptor;
my $slice_adaptor = $db->get_SliceAdaptor;
 
for my $gene (@{$gene_adaptor->fetch_all_by_external_name($identifier)}) {
  for my $trans (@{$gene->get_all_Transcripts}) {
	  my $seq = $slice_adaptor->fetch_by_region("chromosome",
		                           $trans->seq_region_name,
	                                   $trans->start,
		                           $trans->end);
	  print "\n",$seq->seq,"\n";
  }
}

You also have the option of using raw SQL when using the ENSEMBL API, the result is that this is a very powerful API for analyzing genomic data.

Notes on this example

  • This bit of code has not been extensively tested.
  • The fetch_all_by_external_name method does not accept a namespace or database name as an argument, so it lacks some precision. Be careful that your query returns just one sequence. Alternatively use a more precise SQL statement rather than fetch_all_by_external_name.
  • Make sure to use the latest version, not necessarily the -dbname argument homo_sapiens_core_49_36k used above. To get a listing of available databases using mysql:
   $ mysql -u anonymous -h ensembldb.ensembl.org
   Welcome to the MySQL monitor. ...

   mysql> show databases like "homo_sapiens_core%";
   mysql> show databases;

Using Bio::DB::GenBank when you have genomic coordinates

This is an example of how you can pull sequence chunks from GenBank, complete with GenBank's annotation, using Bio::SeqFeature::Generic objects generated by a Bio::Tools::RNAMotif factory.

Note that you could easily replace Bio::Tools::RNAMotif with anything that gives start, end, and strand information. While the last foreach loop just dumps sequence annotation information, it could be modified to add the sequence feature, determine the seqfeatures's genomic context to surrounding features, etc. For more information, see Bio::DB::GenBank.

Requirement: BioPerl.

#!perl 
 
use strict;
use Bio::Tools::RNAMotif; # or anything that gives start, end, strand info
use Bio::DB::GenBank;
use Bio::SeqIO;
 
my $factory = Bio::Tools::RNAMotif->new(-file=>'clean_RNAMotif.txt',
                                        -motiftag => 'protein_binding',
                                        -desctag => 'pyrR_BL'
                                        );
 
# array of Bio::SeqFeature::Generic objects generated by Bio::Tools::RNAMotif factory
my @motifs = ();
while(my $motif = $factory->next_prediction) {
    push @motifs, $motif;
}
# Start Bio::SeqIO factory
my $outfile = Bio::SeqIO->new(-file => '> temp.txt',
                           -format => 'genbank');
 
my @seqs = ();
foreach my $motif (@motifs) {
    my $strand = ($sf->strand == 1) ? 1 : 2;
    my $seqstart = $sf->start - 500;
    my $seqend = $sf->end + 500;    
    # Below is from Bio::DB::GenBank POD, with some modifications
    my $factory = Bio::DB::GenBank->new(-format => 'genbank',
                                   -seq_start => $seqstart,    # 500 bp upstream
                                   -seq_stop => $$seqend,     # 500 bp downstream
                                   -strand => $strand,     # 1=plus, 2=minus
                                  );
    my $seqin = $factory->get_Seq_by_acc($motif->seq_id);
    # store away files
    $outfile->write_seq($seqin);
    # may take lots of memory if you have many seqfeatures
    push @seqs, $seqin; 
    sleep 3;  # don't irritate NCBI
}
 
# from HOWTO:Feature-Annotation; gives seq annotation
foreach my $seq (@seqs) { 
    print $seq->accession_number,"\t",
           $seq->length,"\n"; 
    for my $feat_object ($seq->get_SeqFeatures) {
        next unless $feat_object->primary_tag eq "CDS";
        print "primary tag: ", $feat_object->primary_tag, "\n";          
            for my $tag ($feat_object->get_all_tags) {             
            print "  tag: ", $tag, "\n"; 
                for my $value ($feat_object->get_tag_values($tag)) {                
                print "    value: ", $value, "\n";             
            }          
        }       
    }
}

Using Bio::DB::EntrezGene to get genomic coordinates

You can get the coordinates of a given gene from Entrez Gene using the Bio::DB::EntrezGene module. This involves examining the Annotations associated with the gene (see the Feature-Annotation HOWTO for more information on Annotations) and finding the one labelled "Evidence Viewer", the data is found in a URL. The only identifier that the NCBI Entrez Gene API can use is a Gene id, formerly known as a LocusLink id.

Requirement: BioPerl.

Example code:

use strict;
use Bio::DB::EntrezGene;
 
my $id = shift or die "Id?\n"; # use a Gene id
 
my $db = new Bio::DB::EntrezGene;
 
my $seq = $db->get_Seq_by_id($id);
 
my $ac = $seq->annotation;
 
for my $ann ($ac->get_Annotations('dblink')) {
	if ($ann->database eq "Evidence Viewer") {
                # get the sequence identifier, the start, and the stop
		my ($contig,$from,$to) = $ann->url =~ 
		  /contig=([^&]+).+from=(\d+)&to=(\d+)/;
		print "$contig\t$from\t$to\n";
	}
}

This data is found in a Bio::Annotation::DBLink Annotation.

Once you have the coordinates you can use them to retrieve a sub-sequence either by using a local indexed file (e.g. Bio::DB::Fasta) or by retrieving the sequence from a remote database (e.g. Bio::DB::GenBank, Bio::DB::RefSeq and using subseq or trunc from Bio::PrimarySeq or Bio::PrimarySeqI (the first approach will give you the best performance).

Personal tools