No sphinx results with long email addresses

Issue #1082 closed
Martin Nadvornik created an issue

Hello Janos,

we found out that as soon as a piler account (via LDAP in our case) has access to an email address that is 42 characters or longer - including @ and the domain part - we are not able to see any results within the WebGUI even with an empty search.

One example for our problem:
The email address concerned is aaaaaaaa.bbbbbbbbbbbb-cccccccccccc@dddddddd.ee. I have anonymized the address keeping the amount of characters the same.
When logging in with this address I can see in the account settings within the piler UI that the account has permission to view it's own e-mails The specific account does not have access to any additional alias or mail groups. It does not matter which search query is beeing entered - wether it is advanced search or normal search - there are always no results. Looking at the log the following sphinx query is shown as beeing executed:

SELECT id FROM main1,dailydelta1,delta1
   WHERE MATCH('( @from aaaaaaaaXbbbbbbbbbbbbXccccccccccccXddddddddXee @to aaaaaaaaXbbbbbbbbbbbbXccccccccccccXddddddddXee ) & (@from aaaaaaaaXbbbbbbbbbbbbXccccccccccccXddddddddXee | @to aaaaaaaaXbbbbbbbbbbbbXccccccccccccXddddddddXee) ')
   ORDER BY `sent` DESC
   LIMIT 0,20
   OPTION max_matches=1000
   ;

This is the result:
in 0.00 s, 0 hits, 0 total found

However when looking for mails that were sent from the concerned address to the concerned address while being logged in as auditor@local the following sphinx query is being executed

SELECT id FROM main1,dailydelta1,delta1
   WHERE MATCH('@from aaaaaaaaXbbbbbbbbbbbbXccccccccccccXddddddddXee @to aaaaaaaaXbbbbbbbbbbbbXccccccccccccXddddddddXee ')
   ORDER BY `sent` DESC
   LIMIT 0,20
   OPTION max_matches=1000
   ;

This is the result:
in 0.00 s, 20 hits, 1835 total found

So the mails are correctly archived, otherwise I would not find them as auditor.

It does not matter if the query is being executed by piler or directly via the mysql command line connected to sphinx.
I have tried many combinations of e-mail addresses and my testing concluded that the bug arises as soon as the account has access to a address longer than 41 characters. The query length itself does not seem to be an issue because everything is working fine if I give an account access to 3 addresses with a combined length of far more than 41 characters.

I have looked through our sphinx.conf file but I did not find any limitation that would obviously cause this issue. As you can see below we are using the default values of sphinx.conf as shipped with piler. We did not adjust any config values besides enabling Sphinx 3.1.X.
We are currently running Sphinx 3.1.1 and already tried Sphinx 3.2.1 but it did not resolve our problem and lead to a deprecation warning so we downgraded to 3.1.1 again.

Here is the Sphinx config we are using:

#!/usr/bin/php
<?php

define('SPHINX_VERSION', 311); // If you have sphinx-3.1.1, then set SPHINX_VERSION to 311
define('LOCALSTATEDIR', '/var');
define('NGRAM_CONFIG', "        #ngram_len               = 1\n        #ngram_chars             = U+3000..U+2FA1F\n");

# See http://sphinxsearch.com/wiki/doku.php?id=charset_tables for more on the charset_table settings
# The following settings contains English and some Latin extras
define('SPHINX_CHARSET_TABLE', "0..9, english, _, \
                                  U+C1->U+E1, U+C4->U+E4, U+C5->U+E5, U+C6->U+E6, U+C9->U+E9, U+CD->U+ED, U+D3->U+F3, U+D6->U+F6, U+D8->U+F8, \
                                  U+DA->U+FA, U+DC->U+FC, U+0150->U+0151, U+0152->U+0153, U+0170->U+0171, U+01E2->U+E6, U+01E3->U+E6, U+01FC->U+E6, \
                                  U+01FD->U+E6, U+1D01->U+E6, U+1D02->U+E6, U+1D2D->U+E6, U+1D46->U+E6, \
                                  U+DF, U+E1, U+E4, U+E5, U+E6, U+E9, U+ED, U+00F3, U+F6, U+F8, U+FA, U+FC, U+0151, U+0153, U+0171\n");

?>

#
# minimal sphinx configuration suited to piler
#

source base
{
   type = mysql
   sql_host = localhost
   sql_db = piler
   sql_user = piler
   sql_pass = XXXXXXXXXXXXXXXXXXXXXXXXX

   sql_attr_uint = size
   sql_attr_uint = sent
   sql_attr_uint = attachments
}

source delta : base
{
   sql_query_pre = SET NAMES utf8mb4
   sql_query_pre  = REPLACE INTO sph_counter SELECT 1, IFNULL(MAX(id), 0) FROM sph_index
   sql_query_post_index  = DELETE FROM sph_index WHERE id<=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
   sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attach
ment_types` FROM sph_index \
        WHERE id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id=1)

   sql_query_killlist = SELECT `id` FROM `metadata` WHERE `deleted`=1
}

source main1 : base
{
   sql_query_pre = SET NAMES utf8mb4
   sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attach
ment_types` FROM sph_index WHERE id=-1
}

source main2 : base
{
   sql_query_pre = SET NAMES utf8mb4
   sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attach
ment_types` FROM sph_index WHERE id=-1
}

source main3 : base
{
   sql_query_pre = SET NAMES utf8mb4
   sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attach
ment_types` FROM sph_index WHERE id=-1
}

source main4 : base
{
   sql_query_pre = SET NAMES utf8mb4
   sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attach
ment_types` FROM sph_index WHERE id=-1
}

source dailydelta : base
{
   sql_query_pre = SET NAMES utf8mb4
   sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attach
ment_types` FROM sph_index WHERE id=-1
}

source tag : base
{
   sql_query_pre = SET NAMES utf8mb4
   sql_query  = SELECT `_id`, `id` AS iid, `uid`, `tag` FROM `tag`

   sql_attr_uint = iid
   sql_attr_uint = uid

}

source note : base
{
   sql_query_pre = SET NAMES utf8mb4
   sql_query  = SELECT `_id`, `id` AS iid, `uid`, `note` FROM `note`

   sql_attr_uint = iid
   sql_attr_uint = uid

}


index main1
{
        source                  = main1
        path                    = <?php print LOCALSTATEDIR; ?>/piler/sphinx/main1
<?php if(SPHINX_VERSION < 300) { ?>
        docinfo                 = extern
        dict                    = keywords
<?php } ?>
        min_prefix_len          = 5
        min_word_len            = 1
        charset_table           = <?php print SPHINX_CHARSET_TABLE; ?>
<?php print NGRAM_CONFIG; ?>
}

index main2
{
        source                  = main2
        path                    = <?php print LOCALSTATEDIR; ?>/piler/sphinx/main2
<?php if(SPHINX_VERSION < 300) { ?>
        docinfo                 = extern
        dict                    = keywords
<?php } ?>
        min_prefix_len          = 5
        min_word_len            = 1
        charset_table           = <?php print SPHINX_CHARSET_TABLE; ?>
<?php print NGRAM_CONFIG; ?>
}

index main3
{
        source                  = main3
        path                    = <?php print LOCALSTATEDIR; ?>/piler/sphinx/main3
<?php if(SPHINX_VERSION < 300) { ?>
        docinfo                 = extern
        dict                    = keywords
<?php } ?>
        min_prefix_len          = 5
        min_word_len            = 1
        charset_table           = <?php print SPHINX_CHARSET_TABLE; ?>
<?php print NGRAM_CONFIG; ?>
}

index main4
{
        source                  = main4
        path                    = <?php print LOCALSTATEDIR; ?>/piler/sphinx/main4
<?php if(SPHINX_VERSION < 300) { ?>
        docinfo                 = extern
        dict                    = keywords
<?php } ?>
        min_prefix_len          = 5
        min_word_len            = 1
        charset_table           = <?php print SPHINX_CHARSET_TABLE; ?>
<?php print NGRAM_CONFIG; ?>
}

index dailydelta1
{
        source                  = dailydelta
        path                    = <?php print LOCALSTATEDIR; ?>/piler/sphinx/dailydelta1
<?php if(SPHINX_VERSION < 300) { ?>
        docinfo                 = extern
        dict                    = keywords
<?php } ?>
        min_prefix_len          = 5
        min_word_len            = 1
        charset_table           = <?php print SPHINX_CHARSET_TABLE; ?>
<?php print NGRAM_CONFIG; ?>
}

index delta1
{
        source                  = delta
        path                    = <?php print LOCALSTATEDIR; ?>/piler/sphinx/delta1
<?php if(SPHINX_VERSION < 300) { ?>
        docinfo                 = extern
        dict                    = keywords
<?php } ?>
        min_prefix_len          = 5
        min_word_len            = 1
        charset_table           = <?php print SPHINX_CHARSET_TABLE; ?>
<?php print NGRAM_CONFIG; ?>
<?php if(SPHINX_VERSION >= 310) { ?>
        kbatch = main1, main2, main3, main4, dailydelta1
<?php } ?>
}


index tag1
{
        source                  = tag
        path                    = <?php print LOCALSTATEDIR; ?>/piler/sphinx/tag1
<?php if(SPHINX_VERSION < 300) { ?>
        docinfo                 = extern
        dict                    = keywords
<?php } ?>
        min_prefix_len          = 5
        min_word_len            = 1
        charset_table           = <?php print SPHINX_CHARSET_TABLE; ?>
<?php print NGRAM_CONFIG; ?>
}


index note1
{
        source                  = note
        path                    = <?php print LOCALSTATEDIR; ?>/piler/sphinx/note1
<?php if(SPHINX_VERSION < 300) { ?>
        docinfo                 = extern
        dict                    = keywords
<?php } ?>
        min_prefix_len          = 5
        min_word_len            = 1
        charset_table           = <?php print SPHINX_CHARSET_TABLE; ?>
<?php print NGRAM_CONFIG; ?>
}


indexer
{
        mem_limit               = 256M
}


searchd
{
        listen                  = 127.0.0.1:9312
        listen                  = 127.0.0.1:9306:mysql41
        log                     = /dev/null
        binlog_path             =
        ##query_log             =
        read_timeout            = 5
        max_children            = 30
        pid_file                = /var/run/piler/searchd.pid
        seamless_rotate         = 1
        preopen_indexes         = 1
        unlink_old              = 1
        thread_stack            = 512k
        workers                 = threads # for RT to work
}

It seems to me that piler might not be the problem but rather sphinx itself Do you know of any sphinx limitation we could run into with our queries?

Since I did not find any Bitbucket issues for our problem yet I hope you are happy with me opening this one 🙂

Comments (14)

  1. Janos SUTO repo owner

    Hello Martin, I’m happy with the opened issue. And I especially like that you used the formatting options of bitbucket. 🙂

    Please try an empty search criteria for this user and for another user with a shorter email address, and show me the sphinx queries. Though I don’t think it depends on the length of the email.

  2. Martin Nadvornik reporter

    Hello Janos,

    as requested here are the search queries. In the table every line is an empty search request I made with the WebGUI.

    For the tests I created mail accounts on our mailserver, logged in via the WebGUI (which accesses the accounts via LDAP) and verified in the access settings that everything was correctly recognized. Before testing I sent e-mails to every address, waited for the sphinx cron job and verified with the auditor account that they are indeed correctly indexed.

    The domain part of the addresses is of course anonymized, but the length is kept identical. Please note that the many digits in the addresses are not the issue altough such address are of course not really found in the real world. We do have the same issue also with addresses containing just characters from the latin alphabet, I just wasn’t very creative while testing 🙂

    primary e-mail address aliases sphinx query successful? longest e-mail address
    sys-26696@abcdefgh.ij 12345678.12345678@abcdefgh.ij, 12345678.123456789012@abcdefgh.ij, 12345678.123456789012-123456789012@abcdefgh.ij Query 1 No 46 chars
    sys-26696@abcdefgh.ij 12345678.12345678@abcdefgh.ij, 12345678.123456789012@abcdefgh.ij, 12345678.123456789012-12345678@abcdefgh.ij Query 2 No 42 chars
    sys-26696@abcdefgh.ij 12345678.123456789012-12345678@abcdefgh.ij Query 3 No 42 chars
    sys-26696@abcdefgh.ij 12345678.123456789012-1234567@abcdefgh.ij Query 4 Yes 41 chars
    sys-26696@abcdefgh.ij 1234567890.1234567890.1234567@abcdefgh.ij Query 5 Yes 41 chars
    26696@abcdefgh.ij 1234567890.1234567890.1234567890@abcdefgh.ij Query 6 No 44 chars
    26696@abcdefgh.ij 1234567890.1234567890.1234567@abcdefgh.ij Query 7 Yes 41 chars
    sys-26696@abcdefgh.ij 12345678.123456789012.12345678@abc.def Query 8 Yes 38 chars

    Within the file mail.log I did not find any error messages below or above the logged queries.

    These are the queries:

    Query 1:

    Jun 29 18:54:19 mailarchiv-2019 piler-webui[15815]: sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE        MATCH(' (@from sysX26696XabcdefghXij| 12345678X12345678XabcdefghXij| 12345678X123456789012XabcdefghXij| 12345678X123456789012X123456789012XabcdefghXij | @to sysX26696XabcdefghXij| 12345678X12345678XabcdefghXij| 12345678X123456789012XabcdefghXij| 12345678X123456789012X123456789012XabcdefghXij) ') ORDER BY sent DESC LIMIT 0,20 OPTION max_matches=1000' in 0.00 s, 0 hits, 0 total found
    

    Query 2:

    Jun 29 18:54:59 mailarchiv-2019 piler-webui[23832]: sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE        MATCH(' (@from sysX26696XabcdefghXij| 12345678X12345678XabcdefghXij| 12345678X123456789012XabcdefghXij| 12345678X123456789012X1234567890XabcdefghXij | @to sysX26696XabcdefghXij| 12345678X12345678XabcdefghXij| 12345678X123456789012XabcdefghXij| 12345678X123456789012X1234567890XabcdefghXij) ') ORDER BY sent DESC LIMIT 0,20 OPTION max_matches=1000' in 0.00 s, 0 hits, 0 total found
    

    Query 3:

    Jun 29 18:58:06 mailarchiv-2019 piler-webui[23833]: sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE        MATCH(' (@from sysX26696XabcdefghXij| 12345678X123456789012X12345678XabcdefghXij | @to sysX26696XabcdefghXij| 12345678X123456789012X12345678XabcdefghXij) ') ORDER BY sent DESC LIMIT 0,20 OPTION max_matches=1000' in 0.00 s, 0 hits, 0 total found
    

    Query 4:

    Jun 29 19:01:49 mailarchiv-2019 piler-webui[23831]: sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE        MATCH(' (@from sysX26696XabcdefghXij| 12345678X123456789012X1234567XabcdefghXij | @to sysX26696XabcdefghXij| 12345678X123456789012X1234567XabcdefghXij) ') ORDER BY `sent` DESC LIMIT 0,20 OPTION max_matches=1000' in 0.00 s, 1 hits, 1 total found
    

    Query 5:

    Jun 29 19:02:47 mailarchiv-2019 piler-webui[23837]: sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE        MATCH(' (@from sysX26696XabcdefghXij| 1234567890X1234567890X1234567XabcdefghXij | @to sysX26696XabcdefghXij| 1234567890X1234567890X1234567XabcdefghXij) ') ORDER BY `sent` DESC LIMIT 0,20 OPTION max_matches=1000' in 0.00 s, 1 hits, 1 total found
    

    Query 6:

    Jun 29 19:08:44 mailarchiv-2019 piler-webui[23836]: sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE        MATCH(' (@from 26696XabcdefghXij| 1234567890X1234567890X1234567890XabcdefghXij | @to 26696XabcdefghXij| 1234567890X1234567890X1234567890XabcdefghXij) ') ORDER BY `sent` DESC LIMIT 0,20 OPTION max_matches=1000' in 0.00 s, 0 hits, 0 total found
    

    Query 7:

    Jun 29 19:10:17 mailarchiv-2019 piler-webui[23837]: sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE        MATCH(' (@from 26696XabcdefghXij| 1234567890X1234567890X1234567XabcdefghXij | @to 26696XabcdefghXij| 1234567890X1234567890X1234567XabcdefghXij) ') ORDER BY `sent` DESC LIMIT 0,20 OPTION max_matches=1000' in 0.00 s, 1 hits, 1 total found
    

    Query 8:

    Jun 29 19:13:20 mailarchiv-2019 piler-webui[15815]: sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE        MATCH(' (@from sysX26696XabcdefghXij| 12345678X123456789012X12345678XabcXdef | @to sysX26696XabcdefghXij| 12345678X123456789012X12345678XabcXdef) ') ORDER BY `sent` DESC LIMIT 0,20 OPTION max_matches=1000' in 0.00 s, 1 hits, 1 total found
    

    I know this is a realy weird issue. But as you can see in the table the queries magically work as soon as soon as the longest address is below 42 characters.

  3. Martin Nadvornik reporter

    Hello Janos,

    sorry to bug you, but did you have time to review the issue based on my answer? Let me know if you need any further information.

    Thanks, Martin

  4. Janos SUTO repo owner

    Sorry Martin, I was working on other stuff. I need more time to figure out what’s wrong.

  5. Patrick W

    as mentioned on the mailing list, we’ve got the exact same issue - Ubuntu 18.04, Sphinx 3.1.1, minimal sphinx config as shipped with the piler 1.3.5 DEB

    #
    # minimal sphinx configuration suited to piler
    #
    
    source base
    {
       type = mysql
       sql_host = localhost
       sql_db = pilerdb
       sql_user = pileruser
       sql_pass = pilerpw
       sql_sock = /var/run/mysqld/mysqld.sock
    
       sql_attr_uint = size
       sql_attr_uint = sent
       sql_attr_uint = attachments
    }
    
    source delta : base
    {
       sql_query_pre = SET NAMES utf8mb4
       sql_query_pre  = REPLACE INTO sph_counter SELECT 1, IFNULL(MAX(id), 0) FROM sph_index
       sql_query_post_index  = DELETE FROM sph_index WHERE id<=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
       sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attachment_types` FROM sph_index \
            WHERE id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
    
       sql_query_killlist = SELECT `id` FROM `metadata` WHERE `deleted`=1
    }
    
    source main1 : base
    {
       sql_query_pre = SET NAMES utf8mb4
       sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attachment_types` FROM sph_index WHERE id=-1
    }
    
    source main2 : base
    {
       sql_query_pre = SET NAMES utf8mb4
       sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attachment_types` FROM sph_index WHERE id=-1
    }
    
    source main3 : base
    {
       sql_query_pre = SET NAMES utf8mb4
       sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attachment_types` FROM sph_index WHERE id=-1
    }
    
    source main4 : base
    {
       sql_query_pre = SET NAMES utf8mb4
       sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attachment_types` FROM sph_index WHERE id=-1
    }
    
    source dailydelta : base
    {
       sql_query_pre = SET NAMES utf8mb4
       sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attachment_types` FROM sph_index WHERE id=-1
    }
    
    source tag : base
    {
       sql_query_pre = SET NAMES utf8mb4
       sql_query  = SELECT `_id`, `id` AS iid, `uid`, `tag` FROM `tag`
    
       sql_attr_uint = iid
       sql_attr_uint = uid
    
    }
    
    source note : base
    {
       sql_query_pre = SET NAMES utf8mb4
       sql_query  = SELECT `_id`, `id` AS iid, `uid`, `note` FROM `note`
    
       sql_attr_uint = iid
       sql_attr_uint = uid
    
    }
    
    index main1
    {
            source                  = main1
            path                    = /var/piler/sphinx/main1
            #docinfo                        = extern
            #dict                   = keywords
            min_prefix_len          = 5
            min_word_len            = 1
            #ngram_len               = 1
            #ngram_chars             = U+3000..U+2FA1F
    }
    
    index main2
    {
            source                  = main2
            path                    = /var/piler/sphinx/main2
            #docinfo                 = extern
            #dict                   = keywords
            min_prefix_len          = 5
            min_word_len            = 1
            #ngram_len               = 1
            #ngram_chars             = U+3000..U+2FA1F
    }
    
    index main3
    {
            source                  = main3
            path                    = /var/piler/sphinx/main3
            #docinfo                 = extern
            #dict                   = keywords
            min_prefix_len          = 5
            min_word_len            = 1
            #ngram_len               = 1
            #ngram_chars             = U+3000..U+2FA1F
    }
    
    index main4
    {
            source                  = main4
            path                    = /var/piler/sphinx/main4
            #docinfo                 = extern
            #dict                   = keywords
            min_prefix_len          = 5
            min_word_len            = 1
            #ngram_len               = 1
            #ngram_chars             = U+3000..U+2FA1F
    }
    
    index dailydelta1
    {
            source                  = dailydelta
            path                    = /var/piler/sphinx/dailydelta1
            #docinfo                 = extern
            #dict                   = keywords
            min_prefix_len          = 5
            min_word_len            = 1
            #ngram_len               = 1
            #ngram_chars             = U+3000..U+2FA1F
    }
    
    index delta1
    {
            source                  = delta
            path                    = /var/piler/sphinx/delta1
            #docinfo                 = extern
            #dict                   = keywords
            min_prefix_len          = 5
            min_word_len            = 1
            #ngram_len               = 1
            #ngram_chars             = U+3000..U+2FA1F
    }
    
    
    index tag1
    {
            source                  = tag
            path                    = /var/piler/sphinx/tag1
            #docinfo                 = extern
            #dict                   = keywords
            min_prefix_len          = 5
            min_word_len            = 1
            #ngram_len               = 1
            #ngram_chars             = U+3000..U+2FA1F
    }
    
    
    index note1
    {
            source                  = note
            path                    = /var/piler/sphinx/note1
            #docinfo                 = extern
            #dict                   = keywords
            min_prefix_len          = 5
            min_word_len            = 1
            #ngram_len               = 1
            #ngram_chars             = U+3000..U+2FA1F
    }
    
    
    indexer
    {
            mem_limit               = 256M
    }
    
    
    searchd
    {
            listen                  = 127.0.0.1:9312
            listen                  = 127.0.0.1:9306:mysql41
            log                     = /dev/null
            binlog_path             = 
            ##query_log             =
            read_timeout            = 5
            max_children            = 30
            pid_file                = /var/run/piler/searchd.pid
            seamless_rotate         = 1
            preopen_indexes         = 1
            unlink_old              = 1
            thread_stack            = 512k
            workers                 = threads # for RT to work
    }
    

  6. Janos SUTO repo owner

    I can confirm that the issue exists even with sphinx-3.3.1. It seems that it might be a sphinx parser issue.

    I tried the following query:

    SELECT * FROM fictive WHERE MATCH('(  test ) &  (@sender sjXactsXhu| thisisareallylongaddresXsomedomainnameXcom | @rcpt sjXactsXhu| thisisareallylongaddresXsomedomainnameXcom) ');
    ERROR 1064 (42000): index 'fictive': agent worker0:9312: remote query error: index fictive_dailydelta1,fictive_delta1,fictive_main1: syntax error, unexpected $end near '';
    index 'fictive': agent worker1:9312: remote query error: index fictive_dailydelta1,fictive_delta1,fictive_main1: syntax error, unexpected $end near ''
    

    However, if I reduce the long address in the @rcpt (or @to field in your case) to 41 characters, the query works. I’ll check the docs to see if piler’s sphinx config can be improved.

    Interesting that if I remove the parenthesis from the user’s email addresses, then the query works with the 42+ email address:

    SELECT * FROM fictive WHERE MATCH('(  test ) &  @sender sjXactsXhu| thisisareallylongaddresXsomedomainnameXcom | @rcpt sjXactsXhu| thisisareallylongaddresXsomedomainnameXcom ');
    +------+------------+------+-------------+
    | id   | sent       | size | attachments |
    +------+------------+------+-------------+
    | 1429 | 1460476202 | 5641 |           0 |
    +------+------------+------+-------------+
    

    Note that it’s up to the sphinx parser how to evaluate condition1 AND condition2 OR condition3. I believe that it should be condition1 AND (condition2 OR condition3)

  7. Janos SUTO repo owner

    I’ve opened a topic on the sphinx forum. I’ll keep you posted if anything usable comes up.

  8. Martin Nadvornik reporter

    Hello Janos, thank you for taking this to the sphinx forum. I already thought that error would lie within sphinx itself and is not something piler could cause.

  9. Janos SUTO repo owner

    If there’s no timely response from sphinx developers, then it might be worth to give a shot to the manticore project. It was forked from sphinx 2.x, and should be compatible with sphinx 3. If you have the resources, be sure to check out https://manticoresearch.com/downloads/

    Note that in src/sphinx.h SPH_MAX_WORD_LEN is still 42. Their reasoning is that any UTF-8 word fits in 127 bytes. I’m not sure if it can be safely increased to 45 or 50. I may ask on the manticore forums. I hope they have one, too.

  10. Martin Nadvornik reporter

    Manticore sounds interesting. Since I have no expertise in search engines I am unfortunately currently not able to invest the necessary time to look into it. Manticore Search does also have a community forum here: https://forum.manticoresearch.com/

    However regarding the character limit, looking at RFC 3696 section 3 (https://tools.ietf.org/html/rfc3696), an e-mail address can be as long as 320 characters. If we want Piler to be compliant / compatible to all possibilites you could encounter based on the current RFCs I think the goal should be to support all valid e-mail addresses.

  11. Janos SUTO repo owner

    I see that a valid email address can be up to 320 characters. However, for sphinx there is no such concept of email addresses, only tokenized words. Sphinx and manticore apparently have their internal limitations on the token size to <42. Unfortunately piler can’t do much other than to exclude such long email addresses from the non-auditor users' detected email addresses.

  12. Log in to comment