Human genome coordinates wrong or not properly referenced

Issue #75 new
Sebastian Burgstaller created an issue

The references indicate that the genomic coordinates come from NCBI gene build 107. But the coordinates on the Wikidata gene items do not match the coordinates in NCBI gene build 107.

They seem rather to match some Ensembl annotation release, so either the references or the coordinates themselves are wrong. The data are useless unless it is exactly clear what they are. That is especially important because I use them in the manuscript.

Looks like a general problem. Examples (very prominent genes): TP53: https://www.wikidata.org/wiki/Q14818098 http://www.ncbi.nlm.nih.gov/gene?cmd=retrieve&dopt=default&list_uids=7157

FOXP3: https://www.wikidata.org/wiki/Q21163319 http://www.ncbi.nlm.nih.gov/gene?cmd=retrieve&dopt=default&list_uids=50943

SLC45A2: https://www.wikidata.org/wiki/Q18040047 http://www.ncbi.nlm.nih.gov/gene?cmd=retrieve&dopt=default&list_uids=51151

Please also check the mouse coordinates!!

Comments (6)

  1. Andra Waagmeester

    The coordinates are imported directly from mygene.info. Could it be that coordinates from Ensembl and NCBI are merged? This might explain issue 44 as well. If this is the case, it is going to be difficult to solve, since the provenance is not stated at that level of detail. @newgene Where are the coordinates in mygene.info originally from?

  2. Sebastian Burgstaller reporter

    If you look at the data from mygene.info, several types of coordinates are returned, many of them based on Ensembl. This requires you to choose the right ones and to write the references so that they properly reflect what kind of coordinates they are.

  3. Sebastian Burgstaller reporter

    Priority changed to critical, because this reduces the usefulness of our data and the data are a central part of our manuscript under review.

  4. Andra Waagmeester

    For NCBI gene 7157 (TP53), mygene.info reports:

        "genomic_pos": {
            "strand": -1,
            "chr": "17",
            "start": 7661779,
            "end": 7687550
        },
        "genomic_pos_hg19": {
            "strand": -1,
            "chr": "17",
            "start": 7565097,
            "end": 7590856
        }
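
To check which of these coordinate sets a stored value corresponds to, the start/end pair can be compared against each field. The following is a minimal sketch, not part of the import bot: the function name `matching_fields` is hypothetical, and it assumes each `genomic_pos*` field is a single JSON object as in the output above.

```python
# Gene document fragment copied from the mygene.info output for NCBI gene 7157.
TP53_DOC = {
    "genomic_pos": {"strand": -1, "chr": "17", "start": 7661779, "end": 7687550},
    "genomic_pos_hg19": {"strand": -1, "chr": "17", "start": 7565097, "end": 7590856},
}


def matching_fields(gene_doc, start, end):
    """Return the names of coordinate fields whose start/end equal the given pair.

    Hypothetical helper: only fields whose name begins with "genomic_pos"
    and whose value is a dict are considered.
    """
    matches = []
    for field, pos in gene_doc.items():
        if not field.startswith("genomic_pos") or not isinstance(pos, dict):
            continue
        if (pos.get("start"), pos.get("end")) == (start, end):
            matches.append(field)
    return matches


if __name__ == "__main__":
    # A stored coordinate pair that equals the hg19 values above.
    print(matching_fields(TP53_DOC, 7565097, 7590856))
```

An empty result means the stored pair matches neither field, which would point to a different source or release.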
    

    Looking at the metadata service of mygene.info I do see the following provenance:

        "genome_assembly": {
            "zebrafish": "zv9",
            "human": "hg38",
            "rat": "rn4",
            "fruitfly": "dm3",
            "mouse": "mm10",
            "frog": "xenTro3",
            "pig": "susScr2",
            "nematode": "ce10"
        },
        "source": "genedoc_mygene_allspecies_20160104_3eu8r6mj",
        "src_version": {
            "cpdb": 31,
            "refseq": "68",
            "netaffy": "na35",
            "pharmgkb": "20151205",
            "ensembl": 83,
            "entrez": "20160102",
            "ucsc": "20160104",
            "uniprot": "20151210"
        }
    

    Yes, the coordinates are incorrect. Based on the JSON output by mygene.info, I don't know how to determine the source of a coordinate.

    Changing the reference to Ensembl is a rather easy fix, but the question is how to infer which coordinates are based on Ensembl and which on NCBI.

    @newgene Are all coordinates in the fields "genomic_pos" and "genomic_pos_hg19" from Ensembl? More generally, is there a way to know the original source (Ensembl or NCBI) of a specific field reported by mygene.info?

  5. Chunlei Wu

    @andrawaag Both "genomic_pos" and "genomic_pos_hg19" come from Ensembl. For human genes, "genomic_pos" is on GRCh38 (or hg38) and "genomic_pos_hg19" is on GRCh37 (or hg19).

    The "genome_assembly" entry in the metadata reports the genome assembly used for the "genomic_pos" field.
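
Given this answer, the provenance for each coordinate field could be derived from the gene document plus the metadata. The following is a minimal sketch under stated assumptions: `annotate_positions` and the output keys are hypothetical names, the Ensembl release from `src_version` is attached only to "genomic_pos", and the release behind the hg19 coordinates is left unset because the metadata shown in this thread does not state it.

```python
# Fragments copied from the mygene.info output and metadata quoted in this thread.
TP53_DOC = {
    "genomic_pos": {"strand": -1, "chr": "17", "start": 7661779, "end": 7687550},
    "genomic_pos_hg19": {"strand": -1, "chr": "17", "start": 7565097, "end": 7590856},
}
METADATA = {
    "genome_assembly": {"human": "hg38", "mouse": "mm10"},
    "src_version": {"ensembl": 83, "entrez": "20160102"},
}


def annotate_positions(gene_doc, metadata, species="human"):
    """Attach source and assembly to each coordinate field (hypothetical helper).

    Assumes, per the reply above, that both fields come from Ensembl and that
    "genome_assembly" describes the "genomic_pos" field only.
    """
    plan = [
        # (field name, assembly, Ensembl release if known)
        ("genomic_pos",
         metadata["genome_assembly"][species],
         metadata["src_version"]["ensembl"]),
        # The release used for hg19 coordinates is not given in the metadata.
        ("genomic_pos_hg19", "hg19", None),
    ]
    annotated = []
    for field, assembly, release in plan:
        pos = gene_doc.get(field)
        if pos is None:
            continue  # field absent for this gene
        annotated.append({
            "field": field,
            "source": "Ensembl",
            "assembly": assembly,
            "ensembl_release": release,
            **pos,
        })
    return annotated


if __name__ == "__main__":
    for record in annotate_positions(TP53_DOC, METADATA):
        print(record)
```

Each returned record carries enough information to write a reference that names Ensembl and the assembly instead of an NCBI gene build.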
