Netezza Unicode Scrubber
Getting The Code
The most recent version should be available from Bitbucket:
Netezza has relatively limited Unicode support, so before inserting Unicode data into a Netezza database it became necessary to quickly scan and fix Unicode text so that it could be safely imported.
Specifically, this utility:
- strips out any UTF-8-encoded U+FEFF characters
- replaces any character introduced in Unicode 5.0 or later with '?' (or the character of your choice)
- normalizes the unicode into Normalized Form C (NFC)
Since we usually operate on tons of data and potentially large Unicode strings, efficiency is fairly important. We wrote this in C to make it as fast as possible. (TODO: add Python bindings?)
NOTE: this was built with Linux in mind (though it has been compiled on Mac OS X). It may or may not work on your machine without some porting.
- hg clone https://www.bitbucket.org/lindenlab/nzscrub-utf8
- icu-config --version
You should have version 4.4 or later. If you don't have icu-config installed, see BUILDING below for more info.
- cd src && make
1. The Byte Order Mark, U+FEFF, is usually used at the beginning of Unicode files to signal byte ordering for UTF-16. However, it was also used in the early Unicode standard as the 'ZERO WIDTH NO-BREAK SPACE' character. This use has been deprecated, but we've noticed it scattered throughout our user data (possibly on purpose, possibly due to old/bad software).
Netezza only supports U+FEFF as a byte order mark at the beginning of a string and balks if the BOM appears anywhere else in the string.
2. Netezza 6.0 and earlier only supported the Unicode standard up to version 4.1.0. As a result, importing any string with a code point introduced in Unicode 5.0 or later causes the record to error out.
3. For collation efficiency, Netezza requires Unicode data to be in the Normalized Form C (NFC). So, instead of Á being encoded as <LATIN CAPITAL LETTER A>,<COMBINING ACUTE ACCENT> (U+0041,U+0301), it is instead encoded as <LATIN CAPITAL LETTER A WITH ACUTE> (U+00C1).
This was developed in straight C using gcc on Debian Squeeze. It should work with most versions of Linux, and I've compiled it without issue on Mac OS X. If you run into errors compiling on your *nix flavor, please let me know what you needed to do to compile.
NOTE: I have NOT developed this for Windows at all, but I'd be happy to accept patches!
This code uses IBM's libicu (version 4.4 or later), so you'll need that installed. You can install it using your Linux distribution's package manager, or get it from source here: http://site.icu-project.org/download
- make sure libicu is installed and icu-config is available in your path.
- cd src && make
By default, this will create two binaries:
- nzscrub-utf8 (dynamically linked)
- nzscrub-utf8-static (icu-statically linked)
If you need a completely static binary for older machines, run 'make nzscrub-static-full'.
Copy the desired src/nzscrub-utf8 binary to wherever you want to use it.
$ cd src/
$ ./nzscrub-utf8 test.txt -o fixed.txt
$ ./nzscrub-utf8 -h
nzscrub-utf8 - scrub unicode data for safe import into Netezza

USAGE: nzscrub-utf8 [opts] infile

  --help|-h          show this screen
  --version|-v       print version information and exit
  --outfile|-o FILE  put scrubbed data in FILE instead of stdout.
  --char|-c CHAR     use CHAR as a replacement char for bad unicode chars.
                     set to 0 to delete bad characters. (default = 0)
  --buff|-b BUFF     read files in chunks of BUFF characters. If this is not
                     set or is 0, the entire file (or NZS_MAX_READ_CHARS,
                     whichever is smaller) will be read into memory.
                     NZS_MAX_READ_CHARS = 1048576
  -x CONVERSION      specify one or more conversions to do. If unspecified,
                     ALL conversions will be run. CONVERSION must be one of:
                       'unsupp' - filter out unsupported unicode versions (see -u)
                       'feff'   - filter out U+feff characters
                       'nfc'    - normalize into Normalization Form C
                     multiple options can be given: -x feff -x nfc
  --uvers|-u VER     specify a cutoff for the 'unsupp' conversion above.
                     this should be a decimal delimited string of versions
                     representing the version from which you want to start
                     removing. For example, -u 5.0 will replace all characters
                     from Unicode 5.0 and later with the desired replacement
                     character. (default = '5.0')
  --summary|-s       print summary of changes

If infile is omitted, input is expected on STDIN.
All data is expected and written in UTF-8.
This code was developed by Steve Layland (aka Cru Linden) for Linden Lab, makers of Second Life. It is released AS IS under a derivative of the MIT license. See LICENSE.txt for specifics.