Issue #50 closed

errors with the splitting

Anonymous created an issue

I've got the latest version, installed just yesterday, and I'm getting these splitting errors. I reduced the flush size to just 10,000, thinking it would be less demanding and avoid possible memory problems (albeit slower), but I'm still getting the errors [below].

I'm using a MySQL backend, running like this:

./imdbpy2sql.py --mysql-force-myisam -d /home/harlan/imdb_files/ -u 'mysql://root@localhost/imdb'

Since it's hitting duplicate entries, I wonder if the splitting code might be suspect, overlapping the halves by one record?

Errors follow:

SCANNING movies: "Spangas" (2007) {Het geheime dansgenootschap (#4.159)} (movieID: 1750001)
 * TOO MANY DATA (10000 items in MoviesCache), recursion: 1
   * SPLITTING (run 1 of 2), recursion: 1
 * TOO MANY DATA (4999 items in MoviesCache), recursion: 2
   * SPLITTING (run 1 of 2), recursion: 2
WARNING: unknown exception caught committing the data
WARNING: to the database; report this as a bug, since
WARNING: many data (2499 items) were lost: (1062, "Duplicate entry '1759890' for key 'PRIMARY'")
   * SPLITTING (run 2 of 2), recursion: 2
WARNING: unknown exception caught committing the data
WARNING: to the database; report this as a bug, since
WARNING: many data (2500 items) were lost: (1062, "Duplicate entry '1752572' for key 'PRIMARY'")
   * SPLITTING (run 2 of 2), recursion: 1
WARNING: unknown exception caught committing the data
WARNING: to the database; report this as a bug, since
WARNING: many data (5001 items) were lost: (1062, "Duplicate entry '1755982' for key 'PRIMARY'")

Comments (4)

  1. Davide Alberani repo owner

    Hi, you are right, and it's suspicious that the two halves have different lengths (4999 and 5001). Maybe you can try changing line 964 from "for x in xrange(1 + originalLength/2):" to "for x in xrange(1, 1 + originalLength/2):".
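
    Just to make the off-by-one explicit: xrange(1 + n/2) yields n/2 + 1 values, while xrange(1, 1 + n/2) yields exactly n/2, which matches the 5001/4999 split above. A quick check (Python 2, like the script):

        originalLength = 10000
        # current code: pops 5001 items into the first half, leaving 4999
        print len(list(xrange(1 + originalLength / 2)))     # -> 5001
        # suggested change: pops exactly 5000 items
        print len(list(xrange(1, 1 + originalLength / 2)))  # -> 5000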

    Anyway, the problem can be much more subtle; sometimes a single entry requires creating two entries in the database. That's the case for TV series episodes, for example: if we encounter an episode first, we have to create both its own entry and the entry for the series itself. Given that the code removes the first half of the data set and then processes the second half first, this may be exactly our scenario.
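
    A hypothetical sketch of that failure mode (invented IDs, and a plain dict standing in for the title table; this is not the real cache code):

        db = {}  # stands in for the title table, keyed by movieID

        def insert(movieID, title):
            if movieID in db:  # simulate the PRIMARY KEY constraint
                raise Exception("Duplicate entry '%s' for key 'PRIMARY'" % movieID)
            db[movieID] = title

        # invented IDs: 100 is a series, 101 one of its episodes
        firstHalf = {100: '"A Series" (2007)'}
        secondHalf = {101: '"A Series" (2007) {An Episode (#1.1)}'}

        # flush the second half first: the episode needs its series'
        # row too, so both get inserted
        insert(101, secondHalf[101])
        insert(100, '"A Series" (2007)')  # created on behalf of the episode

        # flushing the first half now fails: the series row already exists
        insert(100, firstHalf[100])  # -> Duplicate entry ... for key 'PRIMARY'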

    Maybe we can invert the order with something like:

                     firstHalf = {}
                     poptmpd = self._tmpDict.popitem
                     originalLength = len(self._tmpDict)
    -                for x in xrange(1, 1 + originalLength/2):
    +                for x in xrange(1 + originalLength/2):
                         k, v = poptmpd()
                         firstHalf[k] = v
    -                self._secondHalf = self._tmpDict
    -                self._tmpDict = firstHalf
                     print ' * TOO MANY DATA (%s items in %s), recursion: %s' % \
                                                            (originalLength,
                                                             self.className,
                                                             _recursionLevel)
                     print '   * SPLITTING (run 1 of 2), recursion: %s' % \
                                                             _recursionLevel
                     self.flush(quiet=quiet, _recursionLevel=_recursionLevel)
    -                self._tmpDict = self._secondHalf
    +                self._tmpDict = firstHalf
                     print '   * SPLITTING (run 2 of 2), recursion: %s' % \
                                                             _recursionLevel
                     self.flush(quiet=quiet, _recursionLevel=_recursionLevel)
    

    (not tested, I admit it)
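
    For clarity, here is how the whole block would read with the diff applied, as a standalone simplified sketch (MiniCache and _commit() are stand-ins invented for this example; the real code recursively calls flush() instead):

        class MiniCache(object):
            className = 'MiniCache'

            def __init__(self, data):
                self._tmpDict = data

            def _commit(self):
                # stand-in for the real database write
                self._tmpDict.clear()

            def flush(self, quiet=0, _recursionLevel=1):
                firstHalf = {}
                poptmpd = self._tmpDict.popitem
                originalLength = len(self._tmpDict)
                for x in xrange(1 + originalLength / 2):
                    k, v = poptmpd()
                    firstHalf[k] = v
                # what remains in self._tmpDict (the "second half")
                # is flushed first...
                print ' * TOO MANY DATA (%s items in %s), recursion: %s' % \
                        (originalLength, self.className, _recursionLevel)
                print '   * SPLITTING (run 1 of 2), recursion: %s' % \
                        _recursionLevel
                self._commit()
                # ...and only then the popped items.
                self._tmpDict = firstHalf
                print '   * SPLITTING (run 2 of 2), recursion: %s' % \
                        _recursionLevel
                self._commit()

        MiniCache(dict.fromkeys(xrange(10))).flush()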
