Zurmo email archiving chokes on certain (NON-UTF8) characters

Issue #65 on hold
Anonymous created an issue

When an email body contains a character that MySQL can't store as UTF-8, saving that email fails for the "textbody", "htmlBody" and "serializedbody" fields with a 'General error: 1366 Incorrect string value' PDOException. You should either sanitize the input with iconv or change the field types to BLOB. I'm going to implement proper sanitization now and may send a PR later.
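
As a rough illustration of the sanitizing option, here is a minimal sketch. It assumes the mbstring extension is available and uses mb_convert_encoding rather than iconv; `sanitizeUtf8` is a hypothetical helper name, not something in Zurmo. It is a lossy fallback: invalid bytes are substituted (with '?' under mbstring's default settings) rather than decoded.

```php
<?php
// Hypothetical helper, not part of Zurmo. Assumes the mbstring extension.
// Returns $text unchanged when it is already well-formed UTF-8; otherwise
// re-encodes it so that invalid byte sequences are substituted.
function sanitizeUtf8($text)
{
    if (mb_check_encoding($text, 'UTF-8'))
    {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', 'UTF-8');
}

// A lone 0x92 byte is not valid UTF-8, so this input gets sanitized.
$clean = sanitizeUtf8("don\x92t see");
var_dump(mb_check_encoding($clean, 'UTF-8'));
```

Running the body through a check like this just before it is saved would avoid the PDOException, at the cost of losing the original character.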

Comments (7)

  1. Anonymous

    UPDATE: Upon further inspection, it appears that your code is completely blind to character sets. This shouldn't be the case, especially since email gives you a convenient header describing the charset. Input in windows-1252, for example, will fail. You should convert all input to UTF-8.

    Note that this issue can be fixed by modifying the EmailArchivingJob's saveEmailMessage method to sanitize the string.

    8/17/12 9:43 AM America/New_York Starting job type: EmailArchiving `[HY000] - SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\x92t se e...' for column 'textcontent' at row 1`
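
    To illustrate the conversion, here is a minimal sketch that assumes the charset has already been read from the part's Content-Type header. `toUtf8` is a hypothetical helper, not Zurmo code. The sample input uses the `\x92` byte from the log line above, which is the right single quotation mark if the source was windows-1252.

    ```php
    <?php
    // Hypothetical helper (not part of Zurmo): convert a message body to UTF-8
    // using the charset declared in the part's Content-Type header.
    function toUtf8($body, $declaredCharset)
    {
        // Strip whitespace and surrounding quotes from the header value.
        $charset = strtoupper(trim($declaredCharset, " \t\r\n\"'"));
        if ($charset === '' || $charset === 'UTF-8')
        {
            return $body;
        }
        // @ silences the warning iconv raises for an unknown charset or bad
        // input; iconv returns false then, so fall back to the original body.
        $converted = @iconv($charset, 'UTF-8', $body);
        return ($converted === false) ? $body : $converted;
    }

    // 0x92 in windows-1252 is U+2019, which is E2 80 99 in UTF-8.
    $utf8 = toUtf8("don\x92t", 'windows-1252');
    var_dump($utf8 === "don\xE2\x80\x99t");
    ```

    Unlike blanket sanitizing, this keeps the original character instead of dropping or replacing it.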

  2. Anonymous

    I modified the ZurmoIMAP class to properly handle the content-type header. I will post this code on Monday. Note that there is another issue: for large emails, the data is too long for the "serializedData" column which appears to be in the auditevent table.

  3. Ivica Nedeljkovic

    Thanks for reporting and fixing this issue. It would be great if you could create a pull request for this fix.

    About the other issue: is the problem with the 'content' field of EmailMessage? We can specify a list of fields whose changes will not be logged in AuditEvent (search for noAudit to see how it works), but there must be a better solution. We will look into it.

  4. Anonymous

    You shouldn't log the content of the message in the audit events, and you should make the e-mail body field a longtext. Also, you shouldn't store attachments in the database. They should be stored on the file system with a reference to them, or possibly not stored at all as a user option. As for the charset handling, I'm not particularly familiar with Bitbucket and its forking, and I don't have time to learn it now, so I'll just post my modifications to ZurmoIMAP below [it was only one method, getPart]. It should replace the code beginning at the old `if( $mimeType`. Sorry for the inelegance of this submission, but you should at least get the idea from it (:

                if ($mimeType == $this->getMimeType($structure))
                {
                    $partNumber = ($partNumber > 0) ? $partNumber : 1;
                    $headers = imap_fetchmime($this->imapStream, $msgNumber, $partNumber);
                    $retval = imap_fetchbody($this->imapStream, $msgNumber, $partNumber);
                    // check for content-type header and convert, but only when a charset was actually declared
                    if (preg_match('/^Content-Type:\s+([^;]+)(?:;\s*charset=(.*))?/im', $headers, $match) && !empty($match[2]))
                    {
                        // strip surrounding whitespace (including a trailing CR) and quotes in one pass
                        $encoding = trim($match[2], " \t\r\n\"'");
                        $retval = iconv(strtoupper($encoding), 'UTF-8//IGNORE', $retval);
                    }
                    // try forcing to utf-8 no matter what
                    $retval = iconv('UTF-8', 'UTF-8//IGNORE', $retval);
                    // characters which are invalid in UTF-8 and might be leftover because of wacky windows character sets and an iconv issue
                    $badchars = array("\x21","\x22","\x23","\x24","\x25","\x26","\x27","\x28","\x29","\x2A","\x2B","\x2C","\x2D","\x2E","\x2F","\x3A","\x3B","\x3C","\x3D","\x3E","\x3F","\x30","\x31","\x32","\x33","\x34","\x35","\x36","\x37","\x38","\x39","\x5B","\x5C","\x5D","\x5E","\x5F","\x7B","\x7C","\x7D","\x7E","\x7F","\x91","\x92","\x93","\x94","\x95","\x20","\xA0");
                    return str_replace($badchars, ' ', $retval);
                }

  5. Ivica Nedeljkovic

    I started to investigate this issue, but can you give me more details about which characters are problematic? I tried inserting characters from a few languages, but I was not able to trigger the error. There is also an issue with your code: some of the bad characters you listed are the HTML tag delimiters ("<" and ">"), so tags were stripped from the email content, which doesn't make sense.

    Can you please send me content that is problematic, so I can try to fix this issue?

    Thanks, Ivica
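
    The point about stripped tags can be checked in isolation. The sketch below copies a handful of entries from the `$badchars` list in comment 4 (the sample markup is made up for illustration) and shows that these escapes are ordinary printable ASCII, so `str_replace` removes markup along with anything else.

    ```php
    <?php
    // A few entries copied from the $badchars list posted in comment 4.
    $badchars = array("\x2F", "\x3C", "\x3E", "\x30", "\x31");

    // Each hex escape is just a printable ASCII character.
    var_dump($badchars === array('/', '<', '>', '0', '1'));

    // Illustrative input only: the tag delimiters are replaced by spaces.
    $stripped = str_replace($badchars, ' ', '<b>');
    var_dump($stripped === ' b ');
    ```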
