P4D3::MacHg is forcing all files to utf8 before calling FileMerge

Issue #291 open
John Gee
created an issue

MacHg appears to be forcing the extended file attributes to utf8 encoding, whether or not the file is that encoding, and whether or not the file already has extended file attributes. This is breaking FileMerge on files which are not in that encoding, and (after MacHg changes file attributes) also breaking processing of utf16 string files.

I notice this problem because our existing codebase is a historical mixture of ascii, MacRoman, and utf16.

To demonstrate the problem I create three files on disk with Xcode, one in MacRoman encoding, one in utf16 encoding, and one in utf8 encoding. The files all have "splots" in the file contents, so not plain ascii.

{{{
$ ls -l@
total 24
-rw-r--r--@ 1 john staff 23 14 Oct 23:54 MacRoman
    com.apple.TextEncoding 11
-rw-r--r--@ 1 john staff 48 14 Oct 23:54 utf16
    com.apple.TextEncoding 10
-rw-r--r--@ 1 john staff 33 14 Oct 23:54 utf8
    com.apple.TextEncoding 15
}}}

At this point FileMerge works when invoked directly on all three files, albeit with a warning about the files not being ascii, e.g.

{{{
$ opendiff MacRoman utf16
}}}

I start up MacHg and add and commit the files, make a change to each in Xcode, and try double clicking on each of the files in MacHg to launch FileMerge and display the changes. The utf8 file displays ok, but the other two show no content in FileMerge. In addition, the extended file attributes have changed.

{{{
$ ls -l@
total 24
-rw-r--r--@ 1 john staff 23 14 Oct 23:54 MacRoman
    com.apple.TextEncoding 15
-rw-r--r--@ 1 john staff 48 14 Oct 23:54 utf16
    com.apple.TextEncoding 15
-rw-r--r--@ 1 john staff 33 14 Oct 23:54 utf8
    com.apple.TextEncoding 15
}}}

FileMerge is unable to display the files because their contents are inconsistent with a utf8 encoding. For the MacRoman and utf16 files I get errors like these:

2011-10-15 00:08:10.237 FileMerge[3616:1203] The file “MacRoman” couldn’t be opened using text encoding Unicode (UTF-8).

2011-10-15 00:08:10.239 FileMerge[3616:1203] Incorrect NSStringEncoding value 0x0000 detected. Assuming NSASCIIStringEncoding. Will stop this compatiblity mapping behavior in the near future.

Clearing the attributes entirely allows FileMerge to work again, e.g.

{{{
$ xattr -c *
$ opendiff MacRoman utf16
}}}

So in summary, MacHg appears to be forcing the extended attributes to utf8. This isn't obviously helping with files which are ascii or utf8, and is breaking files which are actually in other encodings.
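
One possible way to avoid the breakage summarised above is to label a file as UTF-8 only when it has no `com.apple.TextEncoding` attribute yet and its bytes actually decode as UTF-8. The following is a minimal sketch under stated assumptions, not MacHg's code: `is_valid_utf8` and `label_if_utf8` are hypothetical helper names, the check relies on `iconv` rejecting invalid UTF-8 input, and it assumes `xattr -p` exits non-zero when the attribute is absent.

```shell
#!/bin/bash

# Hypothetical helper: succeed only if the file's bytes decode as UTF-8.
# iconv exits non-zero when it meets an invalid input sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Hypothetical replacement for an unconditional "xattr -w":
# write the UTF-8 label only when no label exists yet AND the
# contents validate as UTF-8. Existing labels are never touched.
label_if_utf8() {
    local file=$1
    # Assumption: xattr -p fails when the attribute is not set.
    if xattr -p com.apple.TextEncoding "$file" >/dev/null 2>&1; then
        return 0
    fi
    if is_valid_utf8 "$file"; then
        xattr -w com.apple.TextEncoding "UTF-8;134217984" "$file"
    fi
}
```

With this guard the MacRoman and utf16 test files would keep whatever attribute Xcode wrote. Plain ascii also validates as UTF-8, so unlabelled ascii files would still get the UTF-8 label. The check is a heuristic: a file in another encoding can occasionally happen to be valid UTF-8.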

Comments (9)

  1. John Gee reporter

    Ah, probably all the action is happening in the helper script, and not in the MacHg code.

    MacHg.app/Contents/Resources/fmdiff.sh ends with this text:

    # Find the com.apple.TextEncoding extended attributes of the files
    leftattributes=`xattr -p com.apple.TextEncoding "$leftfile" 2>/dev/null`
    rightattributes=`xattr -p com.apple.TextEncoding "$rightfile" 2>/dev/null`
    
    # if the encodings are not UTF-8, then make them UTF-8
    shopt -s nocasematch
    if [ -z "$leftattributes" ] || [ "$leftattributes" != "UTF-8;134217984" ]; then
            xattr -w com.apple.TextEncoding "UTF-8;134217984" "$leftfile"
    fi
    if [ -z "$rightattributes" ] || [ "$rightattributes" != "UTF-8;134217984" ]; then
            xattr -w com.apple.TextEncoding "UTF-8;134217984" "$rightfile"
    fi
    shopt -u nocasematch
    
    exec /usr/bin/opendiff "$leftfile" "$rightfile" -merge "$rightfile"
    
    
  2. Jason Harris repo owner
    • changed status to open

    Ahh... Cool you found the script... Weellllll.... it just so happens that FileMerge with international characters without this setting was giving garbage-littered diffs. This *massively* cleaned things up.

    Happily, you sound like a very advanced OSX user / developer so it's probably no problem for you to just change this script inside the MacHg bundle itself.

    Moving forward we could have a preference option which would call one script or another, or pass an option to this script. (BTW if you do the latest checkout you will see 378b77f648e5 changes this script to look for FileMerge's opendiff a little harder.)

    So patches on your or someone else's part to add a preference item are extremely welcome! :)

    Patch instructions:

    Probably add an advanced preference item like: "FileMerge force UTF-8: <CheckBox>"

    This would go in the pane MacHg > Preferences > Advanced Options. Inside this pane we / you would add a new group box "FileMerge configuration:"

    We could even put another item in there: search path for OpenDiff to really improve 378b77f648e5 :)

    Thanks! Jas

  3. John Gee reporter

    Ideally I would like a solution that works robustly, but determining file encodings involves some guesswork, so it is not going to be bulletproof. In my simple tests I have not seen a file that forcing the utf8 encoding helps with, but I do not doubt they exist in other environments! Are they produced by a particular editor, or are they large, or do they contain certain characters that trigger the problem, or is the trigger for the FileMerge garbage display unknown? (There is some second-guessing involved, since we do not know how FileMerge guesses the file encoding.)

    I'll experiment with script ideas using the "other" support in MacHg.

    Side note: I wrote über into three files, one per encoding, and examined them with hexdump and "file" to confirm they are encoded differently.

    $ hexdump -C MacRoman
    00000000  9f 62 65 72 0a                                    |.ber.|
    00000005
    $ hexdump -C utf8
    00000000  c3 bc 62 65 72                                    |..ber|
    00000005
    $ hexdump -C utf16
    00000000  ff fe fc 00 62 00 65 00  72 00                    |....b.e.r.|
    0000000a
    
    $ file *
    MacRoman:      Non-ISO extended-ASCII text
    utf16:         Little-endian UTF-16 Unicode text, with no line terminators
    utf8:          UTF-8 Unicode text, with no line terminators
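
The hexdump output above shows that only the utf16 file starts with a byte-order mark (`ff fe`). That can be checked from a script before any attribute is written. This is a rough sketch, not part of fmdiff.sh: `first_bytes_hex` and `sniff_bom` are hypothetical helper names, and the check only recognises UTF-8 and UTF-16 marks (it does not try to tell UTF-32 apart from UTF-16).

```shell
#!/bin/bash

# Hypothetical helper: print the first N bytes of a file as lowercase hex,
# with od's spacing and newlines stripped (e.g. "fffefc").
first_bytes_hex() {
    head -c "$2" "$1" | od -An -tx1 | tr -d ' \n'
}

# Hypothetical helper: guess an encoding name from a leading byte-order mark.
# Prints "unknown" when the file does not start with a recognised mark.
sniff_bom() {
    case "$(first_bytes_hex "$1" 3)" in
        efbbbf) echo "UTF-8" ;;
        fffe*)  echo "UTF-16LE" ;;
        feff*)  echo "UTF-16BE" ;;
        *)      echo "unknown" ;;
    esac
}
```

Run against the three test files, `sniff_bom utf16` reports `UTF-16LE`, while the MacRoman and utf8 files report `unknown` because neither begins with a mark. A missing mark therefore says nothing about the encoding, so this only rules out the clearest mislabelling cases.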
    