# Netezza Unicode Scrubber

## Getting The Code

https://www.bitbucket.org/lindenlab/nzscrub-utf8

## Introduction

For inserting Unicode data into the Netezza database which has relatively limited unicode support, it became necessary to quickly scan and fix unicode text so that it could be safely imported into Netezza.

Specifically, this utility:

• strips out any utf8 u+FEFF characters
• turns any unicode version 5.0 and later into '?' (or the char of choice)
• normalizes the unicode into Normalized Form C (NFC)

Since we usually operate on tons of data and potentially large unicode strings, efficiency is decently important. We wrote this in C to make it as fast as possible. (TODO: add python bindings?)

## Quick Start

NOTE: this was built with linux in mind (though it has been compiled on MacOSX). it may or may not work on your machine without some porting.

1. hg clone https://www.bitbucket.org/lindenlab/nzscrub-utf8
2. icu-config --version

You should have version 4.4 or later. If you don't have icu-config installed, see BUILDING below for more info.

1. cd src && make

## Nitty Gritty

1. The Byte Order Mark, U+FEFF, is usually used at the beginning of unicode files to decide byte ordering for utf-16. However, it was also used in the early unicode standard as the 'ZERO WIDTH NO-BREAK SPACE' character. This use has been deprecated but we've noticed it scattered throughout our user data (possibly on purpose, possibly due to old/bad software.)

Netezza only supports U+FEFF as a byte order mark at the beginning of a string and balks if the BOM is somewhere else in the string.

2. Netezza 6.0 and earlier only supported the Unicode standard 4.1.0 and earlier. As a result, importing any string with a unicode ode point in Unicode 5.0 or later causes the record to error out.

3. For collation efficiency, Netezza requires Unicode data to be in the Normalized Form C (NFC). So, instead of Á being encoded as <LATIN CAPITAL LETTER A>,<COMBINING ACUTE ACCENT> (U+0041,U+0301), it is instead encoded as <LATIN CAPITAL LETTER A WITH ACUTE> (U+00C1).

## Compatibility

This was developed in straight C using gcc on Debian Squeeze. It should work with most versions of Linux, and I've compiled it without issue on MacOSX. If you run into errors compiling on your nix flavor, please let me know what you needed to do to compile.

NOTE: I have NOT developed this for Windows at all, but I'd be happy to accept patches!

## Building

This code uses IBM's libicu (version 4.4 or later) so you'll need that installed. You can install that using your linux distributions pacakage manager, or get it from source here: http://site.icu-project.org/download

To build,

1. make sure libicu is installed and icu-config is available in your path.
2. cd src && make

By default, this will create two binaries:

If you need a completely static binary for older machines, run, make nzscrub-static-full.

## Installation

Copy the desired src/nzscrub-utf8 binary to wherever you want to use it.

## Example Usage

$cd src/$ ./nzscrub-utf8 test.txt -o fixed.txt
\$ ./nzscrub-utf8 -h
nzscrub-utf8 - scrub unicode data for safe import into Netezza

USAGE:
nzscrub-utf8 [opts] infile

--help|-h          show this screen
--version|-v       print version information and exit
--outfile|-o FILE  put scrubbed data in FILE instead of stdout.
--char|-c CHAR     use CHAR as a replacement char for bad unicode chars.
set to 0 to delete bad characters.  (default = 0)
--buff|-b BUFF     read files in chunks of BUFF characters. If this is not set
or is 0, the entire file (or NZS_MAX_READ_CHARS, whichever is smaller)
-x CONVERSION      specify one or more conversions to do.  If unespecified, ALL
conversions will be run.
CONVERSION must be one of
'unsupp'- filter out unsupported unicode versions (see -u)
'feff'  - filter out U+feff characters
'nfc'   - normalize into Normalization Form C
multiple options can be given -x feff -x nfc
--uvers|-u VER     specify a cutoff for the 'unsupp' conversion above.
this should be a decimal delimited string of versions
representing the version from which you want to start
removing.  For example -u 5.0 will replace all characters
from Unicode 5.0 and later with the desired replacement
character.  (default = '5.0')
--summary|-s        print summary of changes

If infile is omitted, input is expected on STDIN
All data is expected and written in UTF-8