Tools for deduplicating WARC files. Largely based on the tools built by the Bibliotheca Alexandrina for this purpose, but modified for easier (or correct) use. See Youssef Eldakar's description of the tools for a little more information.
warcsum mostly worked correctly, but used
ints in a number of places where
they are inappropriate, mainly when dealing with file offsets. If you never use
a WARC larger than 2GB, there's no problem. My changes are just hacks to make it
work; it needs some cleanup.
warccollres has been reimplemented completely, since the arcalex
implementation seems to assume existing archive infrastructure so it can simply
make HTTP requests for WARC members, finding the URLs from a mysql database. I
don't have any of that, so my version (
warccollres.py) uses a sqlite database
and reads WARC files directly.
warcrefs suffered from the same problem as
warcsum where it used 32-bit file
offsets. I've probably fixed that, but haven't tested and it might be horrible.
warc package for Python was designed with Python 2 in mind, so I had to
hack it to work on Python 3. Might be better to replace with
warcat, but that
has its own issues (regarding extremely slow reading mostly).
First build everything:
$ cd gzmulti $ ./configure && make $ sudo make install $ cd ../warcsum $ ./configure && make $ sudo make install $ cd ../warcrefs $ mvn package
The following libraries will likely be required, in addition to a working C
build-essential on Debian-like Linux distributions):
- zlib (zlib1g-dev package on Debian, Ubuntu or similar Linux distributions)
- openssl (libssl-dev)
- mysqlclient (libmysqlclient-dev)
- curl (libcurl4-openssl-dev)
- libconfig (libconfig-dev)
You may need to add the library install directory to your load path so warcsum and friends can find gzmulti:
Now you can dedup. If you want to combine smaller WARCs into a big one, try
First generate hash digests of your input file's contents. We'll assume the
warcsum -i mega.warc.gz -o mega.warcsum
warccollres to determine which hash collisions are identical records,
and which are not. This can take a while.
python3 warccollres.py mega.warcsum
It will import the digest file into a database, then go to work finding
duplicates. If interrupted while finding duplicates, you can resume where it
left off by running
warccollres.py without any arguments.
When finished, it will write the manifest back out with duplicate pointers to
warccollres.txt. Then we run
warcrefs to rewrite the WARC.
java -jar warcrefs/target/warcrefs-1.0-SNAPSHOT-jar-with-dependencies.jar \ 8129 warccollres.txt `pwd`
..and hopefully that's it.
For performance reasons
mysqlclient, which implies
you need a database server for it. Configure connection information in
MYSQL_PARAMS. You'll want to ensure the database server is configured for
reasonable performance. In particular, ensure
large. Ideally, at least as large as your dataset so it can all live in RAM.
The code will automatically create indexes after importing data, but somebody more skilled with SQL than I am might be able to improve that.
You can use sqlite too, but the queries in the code will need rewriting since the dialects used are incompatible and it tends to be painfully slow when writing.