No sphinx results with long email addresses
Hello Janos,
we found out that as soon as a piler account (via LDAP in our case) has access to an email address that is 42 characters or longer - including @
and the domain part - we are not able to see any results within the WebGUI even with an empty search.
One example for our problem:
The email address concerned is aaaaaaaa.bbbbbbbbbbbb-cccccccccccc@dddddddd.ee
. I have anonymized the address keeping the amount of characters the same.
When logging in with this address I can see in the account settings within the piler UI that the account has permission to view it's own e-mails The specific account does not have access to any additional alias or mail groups. It does not matter which search query is beeing entered - wether it is advanced search or normal search - there are always no results. Looking at the log the following sphinx query is shown as beeing executed:
SELECT id FROM main1,dailydelta1,delta1
WHERE MATCH('( @from aaaaaaaaXbbbbbbbbbbbbXccccccccccccXddddddddXee @to aaaaaaaaXbbbbbbbbbbbbXccccccccccccXddddddddXee ) & (@from aaaaaaaaXbbbbbbbbbbbbXccccccccccccXddddddddXee | @to aaaaaaaaXbbbbbbbbbbbbXccccccccccccXddddddddXee) ')
ORDER BY `sent` DESC
LIMIT 0,20
OPTION max_matches=1000
;
This is the result:
in 0.00 s, 0 hits, 0 total found
However when looking for mails that were sent from the concerned address to the concerned address while being logged in as auditor@local
the following sphinx query is being executed
SELECT id FROM main1,dailydelta1,delta1
WHERE MATCH('@from aaaaaaaaXbbbbbbbbbbbbXccccccccccccXddddddddXee @to aaaaaaaaXbbbbbbbbbbbbXccccccccccccXddddddddXee ')
ORDER BY `sent` DESC
LIMIT 0,20
OPTION max_matches=1000
;
This is the result:
in 0.00 s, 20 hits, 1835 total found
So the mails are correctly archived, otherwise I would not find them as auditor.
It does not matter if the query is being executed by piler or directly via the mysql command line connected to sphinx.
I have tried many combinations of e-mail addresses and my testing concluded that the bug arises as soon as the account has access to a address longer than 41 characters. The query length itself does not seem to be an issue because everything is working fine if I give an account access to 3 addresses with a combined length of far more than 41 characters.
I have looked through our sphinx.conf
file but I did not find any limitation that would obviously cause this issue. As you can see below we are using the default values of sphinx.conf
as shipped with piler. We did not adjust any config values besides enabling Sphinx 3.1.X
.
We are currently running Sphinx 3.1.1
and already tried Sphinx 3.2.1
but it did not resolve our problem and lead to a deprecation warning so we downgraded to 3.1.1
again.
Here is the Sphinx config we are using:
#!/usr/bin/php
<?php
define('SPHINX_VERSION', 311); // If you have sphinx-3.1.1, then set SPHINX_VERSION to 311
define('LOCALSTATEDIR', '/var');
define('NGRAM_CONFIG', " #ngram_len = 1\n #ngram_chars = U+3000..U+2FA1F\n");
# See http://sphinxsearch.com/wiki/doku.php?id=charset_tables for more on the charset_table settings
# The following settings contains English and some Latin extras
define('SPHINX_CHARSET_TABLE', "0..9, english, _, \
U+C1->U+E1, U+C4->U+E4, U+C5->U+E5, U+C6->U+E6, U+C9->U+E9, U+CD->U+ED, U+D3->U+F3, U+D6->U+F6, U+D8->U+F8, \
U+DA->U+FA, U+DC->U+FC, U+0150->U+0151, U+0152->U+0153, U+0170->U+0171, U+01E2->U+E6, U+01E3->U+E6, U+01FC->U+E6, \
U+01FD->U+E6, U+1D01->U+E6, U+1D02->U+E6, U+1D2D->U+E6, U+1D46->U+E6, \
U+DF, U+E1, U+E4, U+E5, U+E6, U+E9, U+ED, U+00F3, U+F6, U+F8, U+FA, U+FC, U+0151, U+0153, U+0171\n");
?>
#
# minimal sphinx configuration suited to piler
#
source base
{
type = mysql
sql_host = localhost
sql_db = piler
sql_user = piler
sql_pass = XXXXXXXXXXXXXXXXXXXXXXXXX
sql_attr_uint = size
sql_attr_uint = sent
sql_attr_uint = attachments
}
source delta : base
{
sql_query_pre = SET NAMES utf8mb4
sql_query_pre = REPLACE INTO sph_counter SELECT 1, IFNULL(MAX(id), 0) FROM sph_index
sql_query_post_index = DELETE FROM sph_index WHERE id<=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attach
ment_types` FROM sph_index \
WHERE id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
sql_query_killlist = SELECT `id` FROM `metadata` WHERE `deleted`=1
}
source main1 : base
{
sql_query_pre = SET NAMES utf8mb4
sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attach
ment_types` FROM sph_index WHERE id=-1
}
source main2 : base
{
sql_query_pre = SET NAMES utf8mb4
sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attach
ment_types` FROM sph_index WHERE id=-1
}
source main3 : base
{
sql_query_pre = SET NAMES utf8mb4
sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attach
ment_types` FROM sph_index WHERE id=-1
}
source main4 : base
{
sql_query_pre = SET NAMES utf8mb4
sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attach
ment_types` FROM sph_index WHERE id=-1
}
source dailydelta : base
{
sql_query_pre = SET NAMES utf8mb4
sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attach
ment_types` FROM sph_index WHERE id=-1
}
source tag : base
{
sql_query_pre = SET NAMES utf8mb4
sql_query = SELECT `_id`, `id` AS iid, `uid`, `tag` FROM `tag`
sql_attr_uint = iid
sql_attr_uint = uid
}
source note : base
{
sql_query_pre = SET NAMES utf8mb4
sql_query = SELECT `_id`, `id` AS iid, `uid`, `note` FROM `note`
sql_attr_uint = iid
sql_attr_uint = uid
}
index main1
{
source = main1
path = <?php print LOCALSTATEDIR; ?>/piler/sphinx/main1
<?php if(SPHINX_VERSION < 300) { ?>
docinfo = extern
dict = keywords
<?php } ?>
min_prefix_len = 5
min_word_len = 1
charset_table = <?php print SPHINX_CHARSET_TABLE; ?>
<?php print NGRAM_CONFIG; ?>
}
index main2
{
source = main2
path = <?php print LOCALSTATEDIR; ?>/piler/sphinx/main2
<?php if(SPHINX_VERSION < 300) { ?>
docinfo = extern
dict = keywords
<?php } ?>
min_prefix_len = 5
min_word_len = 1
charset_table = <?php print SPHINX_CHARSET_TABLE; ?>
<?php print NGRAM_CONFIG; ?>
}
index main3
{
source = main3
path = <?php print LOCALSTATEDIR; ?>/piler/sphinx/main3
<?php if(SPHINX_VERSION < 300) { ?>
docinfo = extern
dict = keywords
<?php } ?>
min_prefix_len = 5
min_word_len = 1
charset_table = <?php print SPHINX_CHARSET_TABLE; ?>
<?php print NGRAM_CONFIG; ?>
}
index main4
{
source = main4
path = <?php print LOCALSTATEDIR; ?>/piler/sphinx/main4
<?php if(SPHINX_VERSION < 300) { ?>
docinfo = extern
dict = keywords
<?php } ?>
min_prefix_len = 5
min_word_len = 1
charset_table = <?php print SPHINX_CHARSET_TABLE; ?>
<?php print NGRAM_CONFIG; ?>
}
index dailydelta1
{
source = dailydelta
path = <?php print LOCALSTATEDIR; ?>/piler/sphinx/dailydelta1
<?php if(SPHINX_VERSION < 300) { ?>
docinfo = extern
dict = keywords
<?php } ?>
min_prefix_len = 5
min_word_len = 1
charset_table = <?php print SPHINX_CHARSET_TABLE; ?>
<?php print NGRAM_CONFIG; ?>
}
index delta1
{
source = delta
path = <?php print LOCALSTATEDIR; ?>/piler/sphinx/delta1
<?php if(SPHINX_VERSION < 300) { ?>
docinfo = extern
dict = keywords
<?php } ?>
min_prefix_len = 5
min_word_len = 1
charset_table = <?php print SPHINX_CHARSET_TABLE; ?>
<?php print NGRAM_CONFIG; ?>
<?php if(SPHINX_VERSION >= 310) { ?>
kbatch = main1, main2, main3, main4, dailydelta1
<?php } ?>
}
index tag1
{
source = tag
path = <?php print LOCALSTATEDIR; ?>/piler/sphinx/tag1
<?php if(SPHINX_VERSION < 300) { ?>
docinfo = extern
dict = keywords
<?php } ?>
min_prefix_len = 5
min_word_len = 1
charset_table = <?php print SPHINX_CHARSET_TABLE; ?>
<?php print NGRAM_CONFIG; ?>
}
index note1
{
source = note
path = <?php print LOCALSTATEDIR; ?>/piler/sphinx/note1
<?php if(SPHINX_VERSION < 300) { ?>
docinfo = extern
dict = keywords
<?php } ?>
min_prefix_len = 5
min_word_len = 1
charset_table = <?php print SPHINX_CHARSET_TABLE; ?>
<?php print NGRAM_CONFIG; ?>
}
indexer
{
mem_limit = 256M
}
searchd
{
listen = 127.0.0.1:9312
listen = 127.0.0.1:9306:mysql41
log = /dev/null
binlog_path =
##query_log =
read_timeout = 5
max_children = 30
pid_file = /var/run/piler/searchd.pid
seamless_rotate = 1
preopen_indexes = 1
unlink_old = 1
thread_stack = 512k
workers = threads # for RT to work
}
It seems to me that piler might not be the problem but rather sphinx itself Do you know of any sphinx limitation we could run into with our queries?
Since I did not find any Bitbucket issues for our problem yet I hope you are happy with me opening this one
Comments (14)
-
repo owner -
reporter Hello Janos,
as requested here are the search queries. In the table every line is an empty search request I made with the WebGUI.
For the tests I created mail accounts on our mailserver, logged in via the WebGUI (which accesses the accounts via LDAP) and verified in the access settings that everything was correctly recognized. Before testing I sent e-mails to every address, waited for the sphinx cron job and verified with the auditor account that they are indeed correctly indexed.
The domain part of the addresses is of course anonymized, but the length is kept identical. Please note that the many digits in the addresses are not the issue altough such address are of course not really found in the real world. We do have the same issue also with addresses containing just characters from the latin alphabet, I just wasn’t very creative while testing
primary e-mail address aliases sphinx query successful? longest e-mail address sys-26696@abcdefgh.ij
12345678.12345678@abcdefgh.ij, 12345678.123456789012@abcdefgh.ij, 12345678.123456789012-123456789012@abcdefgh.ij
Query 1 No 46 chars sys-26696@abcdefgh.ij
12345678.12345678@abcdefgh.ij
,12345678.123456789012@abcdefgh.ij
,12345678.123456789012-12345678@abcdefgh.ij
Query 2 No 42 chars sys-26696@abcdefgh.ij
12345678.123456789012-12345678@abcdefgh.ij
Query 3 No 42 chars sys-26696@abcdefgh.ij
12345678.123456789012-1234567@abcdefgh.ij
Query 4 Yes 41 chars sys-26696@abcdefgh.ij
1234567890.1234567890.1234567@abcdefgh.ij
Query 5 Yes 41 chars 26696@abcdefgh.ij
1234567890.1234567890.1234567890@abcdefgh.ij
Query 6 No 44 chars 26696@abcdefgh.ij
1234567890.1234567890.1234567@abcdefgh.ij
Query 7 Yes 41 chars sys-26696@abcdefgh.ij
12345678.123456789012.12345678@abc.def
Query 8 Yes 38 chars Within the file
mail.log
I did not find any error messages below or above the logged queries.These are the queries:
Query 1:
Jun 29 18:54:19 mailarchiv-2019 piler-webui[15815]: sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE MATCH(' (@from sysX26696XabcdefghXij| 12345678X12345678XabcdefghXij| 12345678X123456789012XabcdefghXij| 12345678X123456789012X123456789012XabcdefghXij | @to sysX26696XabcdefghXij| 12345678X12345678XabcdefghXij| 12345678X123456789012XabcdefghXij| 12345678X123456789012X123456789012XabcdefghXij) ') ORDER BY sent DESC LIMIT 0,20 OPTION max_matches=1000' in 0.00 s, 0 hits, 0 total found
Query 2:
Jun 29 18:54:59 mailarchiv-2019 piler-webui[23832]: sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE MATCH(' (@from sysX26696XabcdefghXij| 12345678X12345678XabcdefghXij| 12345678X123456789012XabcdefghXij| 12345678X123456789012X1234567890XabcdefghXij | @to sysX26696XabcdefghXij| 12345678X12345678XabcdefghXij| 12345678X123456789012XabcdefghXij| 12345678X123456789012X1234567890XabcdefghXij) ') ORDER BY sent DESC LIMIT 0,20 OPTION max_matches=1000' in 0.00 s, 0 hits, 0 total found
Query 3:
Jun 29 18:58:06 mailarchiv-2019 piler-webui[23833]: sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE MATCH(' (@from sysX26696XabcdefghXij| 12345678X123456789012X12345678XabcdefghXij | @to sysX26696XabcdefghXij| 12345678X123456789012X12345678XabcdefghXij) ') ORDER BY sent DESC LIMIT 0,20 OPTION max_matches=1000' in 0.00 s, 0 hits, 0 total found
Query 4:
Jun 29 19:01:49 mailarchiv-2019 piler-webui[23831]: sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE MATCH(' (@from sysX26696XabcdefghXij| 12345678X123456789012X1234567XabcdefghXij | @to sysX26696XabcdefghXij| 12345678X123456789012X1234567XabcdefghXij) ') ORDER BY `sent` DESC LIMIT 0,20 OPTION max_matches=1000' in 0.00 s, 1 hits, 1 total found
Query 5:
Jun 29 19:02:47 mailarchiv-2019 piler-webui[23837]: sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE MATCH(' (@from sysX26696XabcdefghXij| 1234567890X1234567890X1234567XabcdefghXij | @to sysX26696XabcdefghXij| 1234567890X1234567890X1234567XabcdefghXij) ') ORDER BY `sent` DESC LIMIT 0,20 OPTION max_matches=1000' in 0.00 s, 1 hits, 1 total found
Query 6:
Jun 29 19:08:44 mailarchiv-2019 piler-webui[23836]: sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE MATCH(' (@from 26696XabcdefghXij| 1234567890X1234567890X1234567890XabcdefghXij | @to 26696XabcdefghXij| 1234567890X1234567890X1234567890XabcdefghXij) ') ORDER BY `sent` DESC LIMIT 0,20 OPTION max_matches=1000' in 0.00 s, 0 hits, 0 total found
Query 7:
Jun 29 19:10:17 mailarchiv-2019 piler-webui[23837]: sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE MATCH(' (@from 26696XabcdefghXij| 1234567890X1234567890X1234567XabcdefghXij | @to 26696XabcdefghXij| 1234567890X1234567890X1234567XabcdefghXij) ') ORDER BY `sent` DESC LIMIT 0,20 OPTION max_matches=1000' in 0.00 s, 1 hits, 1 total found
Query 8:
Jun 29 19:13:20 mailarchiv-2019 piler-webui[15815]: sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE MATCH(' (@from sysX26696XabcdefghXij| 12345678X123456789012X12345678XabcXdef | @to sysX26696XabcdefghXij| 12345678X123456789012X12345678XabcXdef) ') ORDER BY `sent` DESC LIMIT 0,20 OPTION max_matches=1000' in 0.00 s, 1 hits, 1 total found
I know this is a realy weird issue. But as you can see in the table the queries magically work as soon as soon as the longest address is below 42 characters.
-
reporter Hello Janos,
sorry to bug you, but did you have time to review the issue based on my answer? Let me know if you need any further information.
Thanks, Martin
-
repo owner Sorry Martin, I was working on other stuff. I need more time to figure out what’s wrong.
-
reporter no worries
-
as mentioned on the mailing list, we’ve got the exact same issue - Ubuntu 18.04, Sphinx 3.1.1, minimal sphinx config as shipped with the piler 1.3.5 DEB
# # minimal sphinx configuration suited to piler # source base { type = mysql sql_host = localhost sql_db = pilerdb sql_user = pileruser sql_pass = pilerpw sql_sock = /var/run/mysqld/mysqld.sock sql_attr_uint = size sql_attr_uint = sent sql_attr_uint = attachments } source delta : base { sql_query_pre = SET NAMES utf8mb4 sql_query_pre = REPLACE INTO sph_counter SELECT 1, IFNULL(MAX(id), 0) FROM sph_index sql_query_post_index = DELETE FROM sph_index WHERE id<=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1) sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attachment_types` FROM sph_index \ WHERE id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id=1) sql_query_killlist = SELECT `id` FROM `metadata` WHERE `deleted`=1 } source main1 : base { sql_query_pre = SET NAMES utf8mb4 sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attachment_types` FROM sph_index WHERE id=-1 } source main2 : base { sql_query_pre = SET NAMES utf8mb4 sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attachment_types` FROM sph_index WHERE id=-1 } source main3 : base { sql_query_pre = SET NAMES utf8mb4 sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attachment_types` FROM sph_index WHERE id=-1 } source main4 : base { sql_query_pre = SET NAMES utf8mb4 sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attachment_types` FROM sph_index WHERE id=-1 } source dailydelta : base { sql_query_pre = SET NAMES utf8mb4 sql_query = SELECT id, `from`, `to`, `fromdomain`, `todomain`, `subject`, `arrived`, `sent`, `body`, `size`, `direction`, `folder`, `attachments`, `attachment_types` FROM sph_index WHERE id=-1 } source tag : base { sql_query_pre = SET NAMES utf8mb4 sql_query = SELECT `_id`, `id` AS iid, `uid`, `tag` FROM `tag` sql_attr_uint = iid sql_attr_uint = uid } source note : base { sql_query_pre = SET NAMES utf8mb4 sql_query = SELECT `_id`, `id` AS iid, `uid`, `note` FROM `note` sql_attr_uint = iid sql_attr_uint = uid } index main1 { source = main1 path = /var/piler/sphinx/main1 #docinfo = extern #dict = keywords min_prefix_len = 5 min_word_len = 1 #ngram_len = 1 #ngram_chars = U+3000..U+2FA1F } index main2 { source = main2 path = /var/piler/sphinx/main2 #docinfo = extern #dict = keywords min_prefix_len = 5 min_word_len = 1 #ngram_len = 1 #ngram_chars = U+3000..U+2FA1F } index main3 { source = main3 path = /var/piler/sphinx/main3 #docinfo = extern #dict = keywords min_prefix_len = 5 min_word_len = 1 #ngram_len = 1 #ngram_chars = U+3000..U+2FA1F } index main4 { source = main4 path = /var/piler/sphinx/main4 #docinfo = extern #dict = keywords min_prefix_len = 5 min_word_len = 1 #ngram_len = 1 #ngram_chars = U+3000..U+2FA1F } index dailydelta1 { source = dailydelta path = /var/piler/sphinx/dailydelta1 #docinfo = extern #dict = keywords min_prefix_len = 5 min_word_len = 1 #ngram_len = 1 #ngram_chars = U+3000..U+2FA1F } index delta1 { source = delta path = /var/piler/sphinx/delta1 #docinfo = extern #dict = keywords min_prefix_len = 5 min_word_len = 1 #ngram_len = 1 #ngram_chars = U+3000..U+2FA1F } index tag1 { source = tag path = /var/piler/sphinx/tag1 #docinfo = extern #dict = keywords min_prefix_len = 5 min_word_len = 1 #ngram_len = 1 #ngram_chars = U+3000..U+2FA1F } index note1 { source = note path = /var/piler/sphinx/note1 #docinfo = extern #dict = keywords min_prefix_len = 5 min_word_len = 1 #ngram_len = 1 #ngram_chars = U+3000..U+2FA1F } indexer { mem_limit = 256M } searchd { listen = 127.0.0.1:9312 listen = 127.0.0.1:9306:mysql41 log = /dev/null binlog_path = ##query_log = read_timeout = 5 max_children = 30 pid_file = /var/run/piler/searchd.pid seamless_rotate = 1 preopen_indexes = 1 unlink_old = 1 thread_stack = 512k workers = threads # for RT to work }
-
repo owner I can confirm that the issue exists even with sphinx-3.3.1. It seems that it might be a sphinx parser issue.
I tried the following query:
SELECT * FROM fictive WHERE MATCH('( test ) & (@sender sjXactsXhu| thisisareallylongaddresXsomedomainnameXcom | @rcpt sjXactsXhu| thisisareallylongaddresXsomedomainnameXcom) '); ERROR 1064 (42000): index 'fictive': agent worker0:9312: remote query error: index fictive_dailydelta1,fictive_delta1,fictive_main1: syntax error, unexpected $end near ''; index 'fictive': agent worker1:9312: remote query error: index fictive_dailydelta1,fictive_delta1,fictive_main1: syntax error, unexpected $end near ''
However, if I reduce the long address in the @rcpt (or @to field in your case) to 41 characters, the query works. I’ll check the docs to see if piler’s sphinx config can be improved.
Interesting that if I remove the parenthesis from the user’s email addresses, then the query works with the 42+ email address:
SELECT * FROM fictive WHERE MATCH('( test ) & @sender sjXactsXhu| thisisareallylongaddresXsomedomainnameXcom | @rcpt sjXactsXhu| thisisareallylongaddresXsomedomainnameXcom '); +------+------------+------+-------------+ | id | sent | size | attachments | +------+------------+------+-------------+ | 1429 | 1460476202 | 5641 | 0 | +------+------------+------+-------------+
Note that it’s up to the sphinx parser how to evaluate condition1 AND condition2 OR condition3. I believe that it should be condition1 AND (condition2 OR condition3)
-
repo owner I’ve opened a topic on the sphinx forum. I’ll keep you posted if anything usable comes up.
-
reporter Hello Janos, thank you for taking this to the sphinx forum. I already thought that error would lie within sphinx itself and is not something piler could cause.
-
repo owner Just found a historic issue with sphinx: http://sphinxsearch.com/bugs/view.php?id=1113
Indeed it looks like SPH_MAX_WORD_LEN is set to 42. Since sphinx 3.x is not open source apparently, we just can’t recompile sphinx with the adjusted value.
-
repo owner If there’s no timely response from sphinx developers, then it might be worth to give a shot to the manticore project. It was forked from sphinx 2.x, and should be compatible with sphinx 3. If you have the resources, be sure to check out https://manticoresearch.com/downloads/
Note that in src/sphinx.h SPH_MAX_WORD_LEN is still 42. Their reasoning is that any UTF-8 word fits in 127 bytes. I’m not sure if it can be safely increased to 45 or 50. I may ask on the manticore forums. I hope they have one, too.
-
reporter Manticore sounds interesting. Since I have no expertise in search engines I am unfortunately currently not able to invest the necessary time to look into it. Manticore Search does also have a community forum here: https://forum.manticoresearch.com/
However regarding the character limit, looking at RFC 3696 section 3 (https://tools.ietf.org/html/rfc3696), an e-mail address can be as long as 320 characters. If we want Piler to be compliant / compatible to all possibilites you could encounter based on the current RFCs I think the goal should be to support all valid e-mail addresses.
-
repo owner I see that a valid email address can be up to 320 characters. However, for sphinx there is no such concept of email addresses, only tokenized words. Sphinx and manticore apparently have their internal limitations on the token size to <42. Unfortunately piler can’t do much other than to exclude such long email addresses from the non-auditor users' detected email addresses.
-
repo owner - changed status to closed
I've added this unfortunate finding to the sphinx 3 docs, see http://www.mailpiler.org/wiki/current:sphinx3 fore more.
- Log in to comment
Hello Martin, I’m happy with the opened issue. And I especially like that you used the formatting options of bitbucket.
Please try an empty search criteria for this user and for another user with a shorter email address, and show me the sphinx queries. Though I don’t think it depends on the length of the email.