get.seq: Resource temporarily unavailable (4).

Issue #215 resolved
Lars Skjærven created an issue

I've run into a few problems when fetching sequences from NR using get.seq. The other day download.file function returned an error every now and then for large numbers of sequences (terminating my get.seq() call). This was fixed with putting download.file into a tryCatch statement. Today there seems to be bigger problems with NR. Unfortunately download.file successfully fetches a file containing Resource temporarily unavailable (4).. If you do this for one sequence it's fine and an error message is printed.

> get.seq("Q877G8", db="nr")
trying URL 'http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=protein&val=Q877G8&report=fasta&retmode=text'
Content type 'text/plain' length unknown
opened URL

downloaded 38 bytes

Error in read.fasta(outfile) : 
  read.fasta: no '>' id lines found, check file format
In addition: Warning message:
In get.seq("Q877G8", db = "nr") : Removing existing file: seqs.fasta

For multiple sequences you'll end up with the following if the first download is successfull and the second is a failure:

> get.seq(c("Q877G8", "Q32L40"), db="nr")
trying URL 'http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=protein&val=Q877G8&report=fasta&retmode=text'
Content type 'text/plain' length unknown
opened URL

downloaded 645 bytes

trying URL 'http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=protein&val=Q32L40&report=fasta&retmode=text'
Content type 'text/plain' length unknown
opened URL

downloaded 38 bytes

                               1        .         .         .         .         50 
[Truncated_Name:1]gi|7458010   MSQQPGVLPENMKRYMGRDAQRMNILAGRIIAETVRSTLGPKGMDKMLVD
                               1        .         .         .         .         50 

                             501        .         .         .         .         550 
[Truncated_Name:1]gi|7458010   SAAESTEMLLRIDDVIAAEKLRGAPDMGDMGGMPGMGGMPGMMResourc
                             501        .         .         .         .         550 

                             551        .         .         580 
[Truncated_Name:1]gi|7458010   e temporarily unavailable (4)-
                             551        .         .         580 

Call:
  read.fasta(file = outfile)

Class:
  fasta

Alignment dimensions:
  1 sequence rows; 580 position columns (579 non-gap, 1 gap) 

+ attr: id, ali, call
Warning message:
In get.seq(c("Q877G8", "Q32L40"), db = "nr") :
  Removing existing file: seqs.fasta

Hopeless ... Is this something to pursue? have you experiences this or similar?

Comments (12)

  1. Barry Grant

    Nice alignment there!

    Probably a different mechanism is required for large multi-sequence requests when NCBI (or other resources that are being finickity). We should try to catch these errors and also having the ability to sensibly restart a failed get.seq() job might indeed be worth investigating.

    For PDB and SwissProt (and NCBI RefSeq) you could also try http://www.ebi.ac.uk/Tools/dbfetch/ e.g.

    http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=pdb&id=1bg2&format=fasta http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=pdb&id=1bg2%2C+2kin_A&format=fasta (note that ID 2kin_A can restrict to chain here).

    The NCBI http://eutils.ncbi.nlm.nih.gov/entrez/eutils/ is meant to have this ability but I have not used it for several years now.

  2. Xinqiu Yao

    I think I have had replied this issue but actually not. For the problem Lars mentioned, the latest update of get.seq() should have already solved it (See this commit). Basically, instead of writing directly to output, the updated function first writes a temporary file after each successful downloading, reads the file by 'read.fasta()' and then writes it to output if it is in valid FASTA format. In the following example, the first fetch succeeds and the second fails.

    get.seq(c("Q877G8", "Q32L4"), db="nr")
    
     Q877G8  Q32L4 
     FALSE   TRUE 
    Warning messages:
    1: In get.seq(c("Q877G8", "Q32L4"), db = "nr") :
      Failed to get sequence for Q32L4 (check the ID)
    2: In get.seq(c("Q877G8", "Q32L4"), db = "nr") :
      Not all downloads were sucesfull, see returned values
    

    For the download errors in large multi-sequence requests, I guess the reason is that each individual sequence is downloaded independently, resulting in an overall very frequent access to the servers. We probably need to think about downloading all sequences by accessing servers just once (something like we have done in 'pdb.annotate()'). Does any database mentioned above support such an interface?

    Also, keeping address info in the return for a future restart after each failure is very sensibly (a similar mechanism as 'blast.pdb()' and 'get.blast()'). Will keep investigating on it.

  3. Xinqiu Yao

    Hi,

    I have attempted on a new version of get.seq() using EBI server (See this commit). The major improvement is that, instead of repeatedly download each individual file, the new function fetches all sequences at once and therefore reduce the frequency to ping the server. This will solve the Resource temporarily unavailable problem reported above due to the too frequent access to the server when downloading many sequences.

    The returned value is the same as before, i.e. a 'fasta' object when all downloads are successful, or a 'logic' vector indicating which "IDs" do not have returned sequences.

    Some tests:

    source('~/bio3d/new_funs/get.seq.R')
    seq0 <- bio3d::get.seq(c("Q877G8", "Q32L40"))
    seq <- get.seq(c("Q877G8", "Q32L40"))
    identical(seq0, seq)
    [1] TRUE
    
    get.seq(c("Q877G8", "Q32L4"), db="nr")
    Q877G8  Q32L4 
     FALSE   TRUE 
    Warning messages:
    1: In get.seq(c("Q877G8", "Q32L4")) : Removing existing file: seqs.fasta
    2: In get.seq(c("Q877G8", "Q32L4")) :
      Not all downloads were successful. See returned values.
    
    # use 'uniprot'
    get.seq('PH4H_HUMAN', db='uniprot')
    sp|P00439|PH4H_HUMAN   MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAK
                           1        .         .         .         .         50 
    
                          51        .         .         .         .         100 
    sp|P00439|PH4H_HUMAN   VLRLFEENDVNLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRH
                          51        .         .         .         .         100 
    ...
    Call:
      read.fasta(file = outfile)
    
    Class:
      fasta
    
    Alignment dimensions:
      1 sequence rows; 452 position columns (452 non-gap, 0 gap) 
    
    + attr: id, ali, call
    
    # use 'pdb'
    get.seq(c('1tag', '1tnd_B'), db='pdb')
    PDB:1TAG_A   ARTVKLLLLGAGESGKSTIVKQMKIIHQDGYSLEECLEFIAIIYGNTLQSILAIVRAMTT
    PDB:1TND_B   ARTVKLLLLGAGESGKSTIVKQMKIIHQDGYSLEECLEFIAIIYGNTLQSILAIVRAMTT
                 ************************************************************ 
                 1        .         .         .         .         .         60 
    
                61        .         .         .         .         .         120 
    PDB:1TAG_A   LNIQYGDSARQDDARKLMHMADTIEEGTMPKEMSDIIQRLWKDSGIQACFDRASEYQLND
    PDB:1TND_B   LNIQYGDSARQDDARKLMHMADTIEEGTMPKEMSDIIQRLWKDSGIQACFDRASEYQLND
                 ************************************************************ 
                61        .         .         .         .         .         120 
    ...
    Call:
      read.fasta(file = outfile)
    
    Class:
      fasta
    
    Alignment dimensions:
      2 sequence rows; 324 position columns (324 non-gap, 0 gap) 
    
    + attr: id, ali, call
    

    Let me know what you think.

  4. Xinqiu Yao

    Not yet. The new function is on feature_ebi. Will have a look shortly and merge into master if possible.

  5. Xinqiu Yao

    I've merged the branch; See here.

    The new get.seq() has exactly the same interface as previous, but internally it uses both EBI (for 'uniprot' and 'pdb') and NCBI (for 'nr') web services and supports massive sequence downloading. Including NCBI server is to continuously support the 'nr' database, from which user can fetch sequences based on e.g. gi number or RefSeq accession number. Note that EBI does not support 'nr' and does not return FASTA sequences for 'RefSeq'.

    Please have a test also on the new function. Let me know if you found any problem!

  6. Log in to comment