mismatch of md5sums between http and sql interfaces

Issue #23 resolved
Roy Stead
created an issue

There's a large number of items which now generate different md5sum values when updating a db from the imdb lists as compared to retrieving the same movie from the imdb website using the imdb id.

So, using http I'd expect this code snippet to produce an xrefkey which matches the one generated in the db for the same movie:

        imdb_item = i.get_movie(imdbId)
        if imdb_item:
            title = build_title(imdb_item, ptdf=1, _emptyString='').encode('utf_8')
            xrefkey = md5(title).hexdigest()

In fact, I get different key pairs for over 1600 videos. Has the pre-md5-hashing character encoding changed somewhere?

I've attached a list of 1645 titles which show this issue (the attachment shows the http-generated key, which differs from the md5sum value generated when updating the imdb lists), but one example is:

imdb_id 995036 (Einstein and Eddington (2008))
md5sum in db: 2571f7dcb309b0ca92bf288e630456bf
md5sum from http interface: c076ccb380a96f3f7278af6c850b73dd

As far as I can tell, there various items don't seem to have anything in common - affected titles range from 1923 (imdb_id 14142, The Hunchback of Notre Dame) to 2013 (Jay and Silent Bob Get Irish: The Swearing O' the Green, imdb_id 2759112) and their kind_ids vary. Some are movies, some are tv episodes or tv series.

Comments (4)

  1. Davide Alberani repo owner
    • changed status to open

    IMDb changed (both on the web server and on the plain text data files) how titles are composed.

    Now, titles with an imdbIndex (the roman number used to differentiate between movies with the same title released in the same year), the format is:
    The Title (I) (2013)

    Previsouly it was:
    The Title (2013/I)

    Being the title used to calculate and match the md5sum, it's normal that all these movies will not match anymore.

    There's very little we can do about it, without complicating too much the current code.

  2. Roy Stead reporter

    Ah ok, fair enough. I've changed my code to generate my own xrefkey (equivalent to your md5sum) now, so that this issue will no longer arise.

    FYI, I now do the following after an update. However, I'm in the middle of testing this against my live data so I can't be certain how robust it is. But it worked fine with my smaller test dataset:

    Firstly, I standardise title of the form "(#n.m)" to "Episode #n.m", since the imdb lists use the former format but the imdb website uses the latter and I want to ensure that the database reflects the website:

    update title set title=CONCAT('Episode #',SUBSTRING(title,INSTR(title,'#')+1,INSTR(title,'.')-INSTR(title,'#')-1), '.', SUBSTRING(title,INSTR(title,'.')+1,INSTR(title,')')-INSTR(title,'.')-1)) where episode_of_id is not null and title regexp '\(#[0-9]+.[0-9]+\)';

    Next, I create md5 hashes in xrefkey field in the four core tables:

    update title set xrefkey=MD5(CONCAT(title,':',IFNULL(imdb_index,'None'),':',IFNULL(year,'None'),':None:')) where episode_of_id is null and xrefkey is null;

    update title t, title s set t.xrefkey=MD5(CONCAT(t.title,':',IFNULL(t.imdb_index,'None'),':',IFNULL(t.year,'None'),':',IFNULL(s.title,'None'),':')) where t.episode_of_id is not null and t.episode_of_id=s.movie_id and t.xrefkey is null;

    update name set xrefkey=MD5(CONCAT(name,':',IFNULL(imdb_index,'None'),':')) where xrefkey is null;

    update char_name set xrefkey=MD5(CONCAT(name,':',IFNULL(imdb_index,'None'),':')) where xrefkey is null;

    update company_name set xrefkey=MD5(CONCAT(name,':None:')) where xrefkey is null;

    Note that I do NOT generate xrefkey values for aka_title and aka_name but instead copy the values from title and name to equivalent rows in aka_title and aka_name. Thus, everything with the same movie_id has the same xrefkey also.

    Note also that I don't use season_nr or episode_nr in generating the xrefkey for title table entries because I found a few places where the imdb list contains values for those but the imdb website does not. Look, for example, at season 5 of the UK series "The Apprentice".

    Finally, python-side I do this to retrieve the imdb info (using the imdbPy http interface) and then return the equivalent xrefkey which I use to match up the items after an update. The xrefkeys are retrieved only once per item and then stored in a local db unless the imdb data changes to prevent a match, in which case my code will have to look it up again at a later date:

    itemTypeLookup = { 'movie': 1, 'person': 2, 'character': 3, 'company': 4, 1: 'movie', 2: 'person', 3: 'character', 4: 'company'}

        if itemType == itemTypeLookup['movie']:
            imdb_item = i.get_movie(imdb_id)
        elif itemType == itemTypeLookup['person']:
            imdb_item = i.get_person(imdb_id)
        elif itemType == itemTypeLookup['character']:
            imdb_item = i.get_character(imdb_id)
        elif itemType == itemTypeLookup['company']:
            imdb_item = i.get_company(imdb_id)
    except Exception, e:
        imdb_item = None
    movie_xreftags = ['title', 'imdbIndex', 'year', 'series title']
    name_xreftags = ['name', 'imdbIndex']
    xrefkey = None
    title = None
    if imdb_item:
        s = ''
        if itemType == itemTypeLookup['movie']:
            title = imdb_item.get('title')
            for tag in movie_xreftags:
                s += "%s:" % imdb_item.get(tag)
            title = imdb_item.get('name')
            for tag in name_xreftags:
                s += "%s:" % imdb_item.get(tag)
    xrefkey = md5(s.encode("utf_8")).hexdigest()

    Note the utf_8 encoding matches the char set encoding used in my db, to ensure both calculations produce the same xrefkey value.

    And that's it. This approach seems more robust to me, since it depends on the parsed snippets of imdb data rather than on the imdb data being in any particular format.

  3. Roy Stead reporter

    Oh, quick notes on the above snippets. Obviously, I first add a column xrefkey to each of the tables (actually, I rename md5sum to xrefkey and then set it to null for all rows). Also, I rename your title.production_year field to title.year, for ease of use. So adjust my SQL above accordingly when running it on a "raw" imdbPy db.

  4. Log in to comment