Bitbucket is a code hosting site with unlimited public and private repositories. We're also free for small teams!

Close

Netezza Unicode Scrubber

Getting The Code

The most recent version should be available from bitbucket here:

https://www.bitbucket.org/lindenlab/nzscrub-utf8

Introduction

For inserting Unicode data into the Netezza database which has relatively limited unicode support, it became necessary to quickly scan and fix unicode text so that it could be safely imported into Netezza.

Specifically, this utility:

  • strips out any utf8 u+FEFF characters
  • turns any unicode version 5.0 and later into '?' (or the char of choice)
  • normalizes the unicode into Normalized Form C (NFC)

Since we usually operate on tons of data and potentially large unicode strings, efficiency is decently important. We wrote this in C to make it as fast as possible. (TODO: add python bindings?)

Quick Start

NOTE: this was built with linux in mind (though it has been compiled on MacOSX). it may or may not work on your machine without some porting.

  1. hg clone https://www.bitbucket.org/lindenlab/nzscrub-utf8
  2. icu-config --version

You should have version 4.4 or later. If you don't have icu-config installed, see BUILDING below for more info.

  1. cd src && make

Nitty Gritty

1. The Byte Order Mark, U+FEFF, is usually used at the beginning of unicode files to decide byte ordering for utf-16. However, it was also used in the early unicode standard as the 'ZERO WIDTH NO-BREAK SPACE' character. This use has been deprecated but we've noticed it scattered throughout our user data (possibly on purpose, possibly due to old/bad software.)

Netezza only supports U+FEFF as a byte order mark at the beginning of a string and balks if the BOM is somewhere else in the string.

2. Netezza 6.0 and earlier only supported the Unicode standard 4.1.0 and earlier. As a result, importing any string with a unicode ode point in Unicode 5.0 or later causes the record to error out.

3. For collation efficiency, Netezza requires Unicode data to be in the Normalized Form C (NFC). So, instead of Á being encoded as <LATIN CAPITAL LETTER A>,<COMBINING ACUTE ACCENT> (U+0041,U+0301), it is instead encoded as <LATIN CAPITAL LETTER A WITH ACUTE> (U+00C1).

Compatibility

This was developed in straight C using gcc on Debian Squeeze. It should work with most versions of Linux, and I've compiled it without issue on MacOSX. If you run into errors compiling on your nix flavor, please let me know what you needed to do to compile.

NOTE: I have NOT developed this for Windows at all, but I'd be happy to accept patches!

Building

This code uses IBM's libicu (version 4.4 or later) so you'll need that installed. You can install that using your linux distributions pacakage manager, or get it from source here: http://site.icu-project.org/download

To build,

  1. make sure libicu is installed and icu-config is available in your path.
  2. cd src && make

By default, this will create two binaries:

  1. nzscrub-utf8 (dynamically linked)
  2. nzscrub-utf8-static (icu-statically linked)

If you need a completely static binary for older machines, run, make nzscrub-static-full.

Installation

Copy the desired src/nzscrub-utf8 binary to wherever you want to use it.

Example Usage

$ cd src/
$ ./nzscrub-utf8 test.txt -o fixed.txt
$ ./nzscrub-utf8 -h
nzscrub-utf8 - scrub unicode data for safe import into Netezza

USAGE:
    nzscrub-utf8 [opts] infile

    --help|-h          show this screen
    --version|-v       print version information and exit
    --outfile|-o FILE  put scrubbed data in FILE instead of stdout.
    --char|-c CHAR     use CHAR as a replacement char for bad unicode chars.
                       set to 0 to delete bad characters.  (default = 0)
    --buff|-b BUFF     read files in chunks of BUFF characters. If this is not set
                       or is 0, the entire file (or NZS_MAX_READ_CHARS, whichever is smaller)
                       will be read into memory.  NZS_MAX_READ_CHARS = 1048576
    -x CONVERSION      specify one or more conversions to do.  If unespecified, ALL
                       conversions will be run.
                       CONVERSION must be one of
                           'unsupp'- filter out unsupported unicode versions (see -u)
                           'feff'  - filter out U+feff characters
                           'nfc'   - normalize into Normalization Form C
                       multiple options can be given -x feff -x nfc
    --uvers|-u VER     specify a cutoff for the 'unsupp' conversion above.
                       this should be a decimal delimited string of versions
                       representing the version from which you want to start
                       removing.  For example -u 5.0 will replace all characters
                       from Unicode 5.0 and later with the desired replacement
                       character.  (default = '5.0')
   --summary|-s        print summary of changes

    If infile is omitted, input is expected on STDIN
    All data is expected and written in UTF-8

License

This code was developed by Steve Layland (aka Cru Linden) for Linden Lab, makers of Second Life. It is released AS IS under a derivative of the MIT license. See LICENSE.txt for specifics.

Recent activity

Tip: Filter by directory path e.g. /media app.js to search for public/media/app.js.
Tip: Use camelCasing e.g. ProjME to search for ProjectModifiedEvent.java.
Tip: Filter by extension type e.g. /repo .js to search for all .js files in the /repo directory.
Tip: Separate your search with spaces e.g. /ssh pom.xml to search for src/ssh/pom.xml.
Tip: Use ↑ and ↓ arrow keys to navigate and return to view the file.
Tip: You can also navigate files with Ctrl+j (next) and Ctrl+k (previous) and view the file with Ctrl+o.
Tip: You can also navigate files with Alt+j (next) and Alt+k (previous) and view the file with Alt+o.