Russ Cox / plan9port


Issue #122 new

awk: wrong length of utf-8 strings

Petr Shevtsov
created an issue

It seems that awk counts lengths in bytes, not in letters (or runes):

echo latin кириллица | awk '{printf("[%20s][%20s]\n", $1, $2)}'
[               latin][  кириллица]
echo latin кириллица | awk '{printf("%d %d\n", length($1), length($2))}'
5 18
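
The two numbers in the report correspond to two ways of measuring the same string. The following is only a minimal C sketch of the difference, assuming valid UTF-8 input; utf8_runelen is a hypothetical helper written for this illustration and is not part of awk or plan9port:

```c
#include <stddef.h>
#include <string.h>

/*
 * Count UTF-8 code points (runes) in a NUL-terminated string.
 * Continuation bytes always have the bit pattern 10xxxxxx, so every
 * byte that does not match that pattern starts a new rune.
 * Assumption: s holds valid UTF-8. Hypothetical helper, not awk code.
 */
size_t
utf8_runelen(const char *s)
{
	size_t n = 0;

	for(; *s != '\0'; s++)
		if(((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

For the word кириллица, strlen gives 18, the value awk printed, while utf8_runelen gives 9. The same byte count explains the printf output: %20s padded an 18-byte field with 2 spaces rather than a 9-character field with 11.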

Comments (6)

  1. Petr Shevtsov reporter

    I can't understand why counting bytes is reasonable. It leaves no way to format output with printf or to count text length, and substr also works incorrectly:

    echo latin кириллица | awk '{printf("%s %s\n", substr($1, 1, 3), substr($2, 1, 3))}'
    lat кириллица

    but in GNU awk:

    echo latin кириллица | gawk '{printf("%s %s\n", substr($1, 1, 3), substr($2, 1, 3))}'
    lat кир
  2. erik quanstrom

    russ, plan 9 awk counts runes, not bytes.

    here's the diff that i have. i haven't fully tested this, but i think it's about right:

    ; hg diff *.[chy] > /tmp/diff-o-rama
    diff -r 876a744bdedb src/cmd/awk/awk.h
    --- a/src/cmd/awk/awk.h Mon Sep 24 15:43:33 2012 -0400
    +++ b/src/cmd/awk/awk.h Wed Jan 23 00:26:06 2013 -0500
    @@ -181,5 +181,12 @@
     /* #define freeable(p) (!((p)->tval & DONTFREE)) */
     #define freeable(p)    ( ((p)->tval & (STR|DONTFREE)) == STR )
    +#define    wchar_t unsigned int
    +int    awkmblen(char *s, size_t n);
    +int    awkwctomb(char *s, wchar_t);
    +#define    mblen(x, y) awkmblen(x, y)
    +#define    wctomb(x, y)    awkwctomb(x, y)
     #include "proto.h"
    diff -r 876a744bdedb src/cmd/awk/re.c
    --- a/src/cmd/awk/re.c  Mon Sep 24 15:43:33 2012 -0400
    +++ b/src/cmd/awk/re.c  Wed Jan 23 00:26:06 2013 -0500
    @@ -323,3 +323,29 @@
        FATAL("%s", "regular expression too big");
    +/*
    + * i think these are necessary to ensure we use utf-8,
    + * and to play by the quite strange rules for these functions
    + */
    +int
    +awkmblen(char *s, size_t n)
    +{
    +   Rune r;
    +
    +   if(s == nil || *s == 0)
    +       return 0;
    +   if(!fullrune(s, n))
    +       return -1;
    +   return chartorune(&r, s);
    +}
    +
    +int
    +awkwctomb(char *s, wchar_t wc)
    +{
    +   Rune r;
    +
    +   if(s == nil)
    +       return 0;
    +   r = (Rune)wc;
    +   return runetochar(s, &r);
    +}
  3. Ethan Grammatikidis

    (Reply via eek...@fastmail.fm):

    Just yesterday I found that awk doesn't seem to be UTF-8-aware at all. I ran it on some text containing a dash that looked quite ordinary but was evidently a 3-byte character, and one or two of its bytes were special to regexps. In the end I used tr to change it to the ASCII all-purpose dash. That was on 9front, but awk on 9front and on original Plan 9 is not actually a Plan 9 program: it is (still) built on APE, the POSIX compatibility layer. I don't believe 9front has touched it apart from applying fixes to APE. Erik Quanstrom (quanstro on contrib) may have a fully ported, UTF-8-aware awk in 9atom; I'm not sure.

  4. erik quanstrom

    then 9front has broken awk. awk from the distribution and awk from 9atom both pass this test. i have only fixed one floating point bug with printing in awk that is somewhat plan9 specific. (unix doesn't seem to mind fp loss of precision errors.) otherwise it's a port of version 20070501 with the normal plan 9 changes applied.
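
The diff earlier in the thread wraps the Plan 9 rune functions so that awk's mblen calls see UTF-8. For readers without libutf at hand, the same idea can be sketched in portable C. This is an illustration only, not the plan9port code: utf8_mblen and utf8_prefixbytes are hypothetical names, and unlike the chartorune-based wrapper this version reports a malformed sequence as -1 and does not reject overlong forms or surrogates.

```c
#include <stddef.h>

/*
 * mblen-style probe: byte length of the UTF-8 sequence at s, reading
 * at most n bytes. Returns 0 for a null pointer or an empty string and
 * -1 for a truncated or malformed sequence.
 * Hypothetical helper; overlong forms and surrogates are not checked.
 */
int
utf8_mblen(const char *s, size_t n)
{
	unsigned char c;
	int len, i;

	if(s == NULL)
		return 0;
	if(n == 0)
		return -1;
	c = (unsigned char)s[0];
	if(c == 0)
		return 0;
	if(c < 0x80)
		len = 1;
	else if((c & 0xE0) == 0xC0)
		len = 2;
	else if((c & 0xF0) == 0xE0)
		len = 3;
	else if((c & 0xF8) == 0xF0)
		len = 4;
	else
		return -1;		/* stray continuation or invalid lead byte */
	if((size_t)len > n)
		return -1;		/* sequence is cut off */
	for(i = 1; i < len; i++)
		if(((unsigned char)s[i] & 0xC0) != 0x80)
			return -1;
	return len;
}

/*
 * Byte length of the first nrunes characters of the n-byte string s,
 * the quantity a rune-aware substr(s, 1, nrunes) would need.
 * Stops early at the end of the string or at a malformed sequence.
 */
size_t
utf8_prefixbytes(const char *s, size_t n, size_t nrunes)
{
	size_t off = 0;
	int len;

	while(nrunes > 0 && off < n){
		len = utf8_mblen(s + off, n - off);
		if(len <= 0)
			break;
		off += (size_t)len;
		nrunes--;
	}
	return off;
}
```

With the 18-byte word from the report, utf8_prefixbytes(s, 18, 3) is 6, so printing the first 6 bytes yields кир, the result gawk produced for substr($2, 1, 3). For latin the same call returns 3.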
