lepbase-import is a set of scripts and modules that simplify loading genome data into lepbase, an Ensembl database for the Lepidoptera. The only required inputs are a Fasta format assembly file and a GFF format annotation file; optional additional files can be used as sources of gene/transcript names and descriptions.


  1. Create a database user with appropriate privileges
  2. Create a config file 'assembly_name.ini', using core/general_import.ini as a template
    • define remote and local code locations
    • define database connection details
    • define file names, types and locations
    • define meta information
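The real key names live in core/general_import.ini; purely as an illustration of the four kinds of settings listed above, a config sketch might look like the following (every section and key name here is an assumption, not the actual schema):

```ini
; Illustrative sketch only — copy core/general_import.ini for the real keys.
[FILES]
  SCAFFOLD = assembly.scaffolds.fa.gz
  GFF = annotation.gff3.gz
[DATABASE_CORE]
  NAME = species_core_1_85_1
  HOST = localhost
  PORT = 3306
  RW_USER = importer
  RW_PASS = secret
[META]
  SPECIES.SCIENTIFIC_NAME = Genus species
  ASSEMBLY.NAME = Gspe_1.0
```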
  3. Run the script 'core/' which will
    • import files
    • extract Fasta sequences embedded in the GFF
    • calculate and tabulate/plot CONTIG, SCAFFOLD and GFF summary statistics
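The CONTIG/SCAFFOLD summary statistics boil down to length distributions. As a minimal sketch (the function name is illustrative, not part of lepbase-import), N50 and related numbers can be computed from a list of sequence lengths like so:

```python
def assembly_stats(lengths):
    """Compute basic assembly summary statistics (span, count,
    longest sequence, N50) from a list of sequence lengths."""
    lengths = sorted(lengths, reverse=True)
    span = sum(lengths)
    # N50: length of the shortest sequence in the smallest set of
    # longest sequences that together cover at least half the span
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if running >= span / 2:
            n50 = length
            break
    return {"span": span, "count": len(lengths),
            "longest": lengths[0], "n50": n50}
```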
  4. Run the script 'core/' which will
    • set up the core db
    • load sequences
    • generate and load seq_region_synonyms
    • count loaded sequences to check against the CONTIG/SCAFFOLD stats
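The count check against the CONTIG/SCAFFOLD stats is essentially a row count per coordinate system in the core database; a query along these lines (host, user and database name are placeholders) uses the standard Ensembl core schema tables:

```shell
# Count loaded seq_regions per coord_system to compare with the
# CONTIG/SCAFFOLD totals from the summary-statistics step
mysql -h localhost -u importer -p -e \
  "SELECT cs.name, COUNT(*) FROM seq_region sr \
   JOIN coord_system cs USING (coord_system_id) \
   GROUP BY cs.name" species_core_db
```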
  5. Edit the config file as appropriate, based on inspection of the downloaded files/stats
    • add patterns to find gene names/descriptions in input files
    • add patterns to find transcript names/descriptions in input files
    • add patterns to find translation stable_ids (or transcript stable_ids will be used)
    • add patterns to handle dbxrefs in the gff file
    • set basic options for GFFTree in the ini file - TODO: support more options
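The patterns added in this step amount to regular expressions mapping stable_ids to names/descriptions in the provider's files. A sketch of the idea (the file format, pattern and function name are invented for illustration; the real patterns go in the ini file):

```python
import re

# Hypothetical pattern: capture a stable_id and a description from
# tab-separated lines like "GENE0001<TAB>cytochrome P450"
PATTERN = re.compile(r"^(\S+)\t(.+)$")

def read_descriptions(lines):
    """Build a stable_id -> description lookup from matching lines."""
    descriptions = {}
    for line in lines:
        match = PATTERN.match(line.rstrip("\n"))
        if match:
            descriptions[match.group(1)] = match.group(2)
    return descriptions
```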
  6. Run the script 'core/'
    • validate the input gff
    • weave in gene/transcript names/descriptions and write to new file
    • warn if duplicate stable_ids have been introduced
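The duplicate-stable_id warning comes down to counting IDs after the weave; conceptually (this helper is illustrative, not the script's actual code):

```python
from collections import Counter

def duplicate_stable_ids(stable_ids):
    """Return, sorted, the stable_ids that occur more than once —
    these would trigger the duplicate warning."""
    counts = Counter(stable_ids)
    return sorted(sid for sid, n in counts.items() if n > 1)
```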
  7. Run the script 'core/'
    • fetch gene descriptions
    • read, validate, add descriptions to, and rewrite the input GFF file
    • load gene models into the ensembl database
    • load dbxrefs into the ensembl database
    • load ontology xrefs into the ensembl database
  8. Run the script 'core/' (TODO: make compatible with ini parameter loading), which will
    • use the ensembl API to generate sequence files
    • save with SequenceServer-compatible headers
    • check whether these sequences match the provider's own protein and transcript files
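The cross-check against the provider's files can be as simple as comparing sequences keyed by ID; a hedged sketch of that comparison (not the actual script):

```python
def compare_sequences(exported, provider):
    """Compare two {id: sequence} dicts and report discrepancies:
    ids missing from the export, ids extra in the export, and ids
    whose sequences differ between the two sources."""
    missing = sorted(set(provider) - set(exported))
    extra = sorted(set(exported) - set(provider))
    changed = sorted(sid for sid in set(exported) & set(provider)
                     if exported[sid] != provider[sid])
    return {"missing": missing, "extra": extra, "changed": changed}
```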
  9. Run analyses to generate more xrefs with standardised commands - TODO: write a wrapper for this?
    • interproscan
    • blast protein sequences vs. uniprot
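Pending a wrapper, standardised commands for these analyses might look like the following (database paths, file names and thread counts are placeholders; the flags shown are standard NCBI BLAST+ and InterProScan options):

```shell
# BLAST proteins against a UniProt database; tabular output
# (-outfmt 6) is convenient for downstream xref loading
blastp -query proteins.fa -db uniprot_sprot \
       -evalue 1e-10 -outfmt 6 -num_threads 8 \
       -out proteins_vs_uniprot.tsv

# InterProScan with TSV output
interproscan.sh -i proteins.fa -f tsv -o proteins.interproscan.tsv
```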
  10. Run scripts to load xrefs from additional files and add descriptions if desired
    • TODO: blast
    • core/
    • TODO: generic csv/tsv
  11. Re-run the script '' to fetch updated descriptions