Open Bug 774188 Opened 12 years ago Updated 2 years ago

gloda global search for parts of attachment file names with underscores doesn't work correctly (e.g. file_name.xyz)

Categories

(Thunderbird :: Search, defect)

13 Branch
x86_64
All
defect

Tracking

(Not tracked)

REOPENED

People

(Reporter: mozilla, Unassigned)

References

(Blocks 1 open bug)

Details

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:13.0) Gecko/20100101 Firefox/13.0.1
Build ID: 20120615085229

Steps to reproduce:

I searched for "20111205_133456"


Actual results:

No messages match your search


Expected results:

Find the email with this attachment file name:
somefile.at_20111205_133456.pdf
subset of bug 523183
Status: UNCONFIRMED → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
What I wanted is to find words that are separated by an underscore.

The search works fine for attachment names whose parts are separated by hyphens, dots, or other word boundaries.

so this search 
"20111205_133456"

would find 
somefile.at-20111205_133456.pdf
but not 
somefile.at_20111205_133456.pdf
andrew, is underscore considered to be punctuation for gloda?  http://en.wikipedia.org/wiki/Underscore

reference: https://developer.mozilla.org/en/Thunderbird/gloda

"Tokens are broken on whitespace and punctuation.  
High-unicode punctuation that does not get folded may be misinterpreted as part of a string.
Tokens that are made up of entirely ASCII characters (after folding) are run through the Porter stemmer.  
Tokens must be at least 2 characters long to be emitted.  We are thinking of upping this limit because of it being dangerously low and not requiring too much work."

I confirm this behavior with underscore
OS: Linux → All
It's news to me, but underscore is not punctuation.

If you take a look at porterIdChar
http://mxr.mozilla.org/comm-central/source/mailnews/extensions/fts3/src/fts3_porter.c#781
specifically, this table:

/**
 * Indicate whether characters in the 0x30 - 0x7f region can be part of a token.
 * Letters and numbers can; punctuation (and 'del') can't.
 */
static const char porterIdChar[] = {
/* x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 xA xB xC xD xE xF */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,  /* 3x */
    0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  /* 4x */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1,  /* 5x */
    0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  /* 6x */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,  /* 7x */
};

and correlate with http://en.wikipedia.org/wiki/Ascii#ASCII_printable_characters

We find that '_' = 0x5f, and has a value of 1 in the table, which means it is treated as part of the string.  This is inherited from the original FTS3 porter implementation in SQLite and was not a design decision on our part.

So in the (excellent!) example cases, the tokens are:
["somefile", "at", "20111205_133456", "pdf"]
and
["somefile", "at_20111205_133456", "pdf"]

If underscores were not treated as part of an identifier, then for the latter case we would have:
["somefile", "at", "20111205", "133456", "pdf"]

although the "at" is less than 3 characters so it will be gobbled.
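
The splitting described above can be sketched in C. This is a minimal illustration, not the actual fts3_porter.c code: the table is copied from the comment above, while `is_token_char`, `tokenize`, `TOKEN_MAX` and the `underscore_is_delim` switch are hypothetical names introduced here. It assumes plain ASCII input, treats bytes at or above 0x80 as token characters, and skips case folding, stemming and the minimum token length check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOKEN_MAX 64  /* hypothetical cap on token length for this sketch */

/* Table quoted in this bug: 1 = may be part of a token (0x30 - 0x7f). */
static const char porterIdChar[] = {
/* x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 xA xB xC xD xE xF */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,  /* 3x */
    0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  /* 4x */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1,  /* 5x */
    0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  /* 6x */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,  /* 7x */
};

/* Hypothetical helper: underscore_is_delim models the proposed change. */
static int is_token_char(unsigned char c, int underscore_is_delim)
{
    if (c == '_' && underscore_is_delim) return 0;
    if (c >= 0x80) return 1;  /* assumption: non-ASCII bytes stay in the token */
    return c >= 0x30 && porterIdChar[c - 0x30];
}

/* Copy up to max_tokens tokens of text into out; return how many were stored.
 * Tokens longer than TOKEN_MAX - 1 bytes are truncated. */
static size_t tokenize(const char *text, int underscore_is_delim,
                       char out[][TOKEN_MAX], size_t max_tokens)
{
    const char *p = text;
    size_t n = 0;

    assert(text != NULL && out != NULL);
    while (*p != '\0' && n < max_tokens) {
        /* Skip any run of delimiter characters. */
        while (*p != '\0' && !is_token_char((unsigned char)*p, underscore_is_delim))
            p++;
        const char *start = p;
        /* Consume the run of token characters that follows. */
        while (*p != '\0' && is_token_char((unsigned char)*p, underscore_is_delim))
            p++;
        size_t len = (size_t)(p - start);
        if (len > 0) {
            if (len >= TOKEN_MAX) len = TOKEN_MAX - 1;
            memcpy(out[n], start, len);
            out[n][len] = '\0';
            n++;
        }
    }
    return n;
}
```

Calling `tokenize` with `underscore_is_delim` set to 0 on the two example file names gives the first two token lists above. Setting it to 1 gives the third list, before any short tokens are dropped.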


I'm fine with making the underscore be treated as punctuation, but the change needs to be accompanied by bumping the gloda schema revision in order to trigger reindexing, and I'm less fine about that since that tends to suck for users.
agreed, schema bump should be minimized.  perhaps we could coalesce it with something else worthwhile, like bug 681754  (unfortunately still no takers there)
Status: RESOLVED → REOPENED
Ever confirmed: true
Resolution: DUPLICATE → ---
Summary: search for parts of file names of an attachment doesn't work correctly → search for parts of file names of an attachment doesn't work correctly for some seperators
so, is there a workaround for me then to find my (precious) pdf file with the name "somefile.at_20111205_133456.pdf"
although i only remember the date part "20111205" of the filename?
(In reply to Ruben from comment #6)
> so, is there a workaround for me then to find my (precious) pdf file with
> the name "somefile.at_20111205_133456.pdf"
> although i only remember the date part "20111205" of the filename?

Unfortunately, not without an extension helping you out.  A gloda plugin could transform the attachment names.

Or you could use an extension that will let you just search for messages with (pdf) attachments, and then visually filter those down.  An example of such an extension is :squib's https://bitbucket.org/squib/attachment-tab/ that unfortunately is not up on AMO currently...

The way the token index works is that we have a sorted list of all the tokens, and for each token, we have a list of all the documents containing that token.  While we can do prefix searches with a trailing wildcard (ex: "at_*"), there is no way to do a search with a leading wildcard (ex: "*20111205") in SQLite (other than having a tokenizer that emits its tokens backwards).
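
To illustrate why the position of the wildcard matters, here is a rough C sketch over a sorted array of tokens. It is not how SQLite stores its index (that is a b-tree with document lists attached to each term), and `lower_bound`, `count_prefix` and `count_containing` are hypothetical names used only for this example. In a sorted list, tokens sharing a prefix sit next to each other, so a binary search finds them. Tokens that merely contain or end with a string are scattered, so every token has to be inspected.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Index of the first token that compares >= key in a strcmp-sorted array. */
static size_t lower_bound(const char *const *tokens, size_t n, const char *key)
{
    size_t lo = 0, hi = n;

    assert(tokens != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(tokens[mid], key) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* "prefix*": binary search to the first candidate, then walk the
 * contiguous run of tokens that start with the prefix. */
static size_t count_prefix(const char *const *tokens, size_t n, const char *prefix)
{
    size_t plen = strlen(prefix);
    size_t i = lower_bound(tokens, n, prefix);
    size_t count = 0;

    while (i < n && strncmp(tokens[i], prefix, plen) == 0) {
        count++;
        i++;
    }
    return count;
}

/* "*needle*": matches are not adjacent in sort order, so the whole
 * list has to be scanned. */
static size_t count_containing(const char *const *tokens, size_t n, const char *needle)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (strstr(tokens[i], needle) != NULL)
            count++;
    }
    return count;
}
```

With the sorted tokens "133456", "20111205_133456", "at_20111205_133456", "pdf", "somefile", a prefix lookup for "at_" lands directly on the one matching token. A lookup for "20111205" as a prefix cannot reach "at_20111205_133456", which only the full scan finds.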


We could also just play this as 'things with underscores in them already being broken' and the mismatch between what has already been indexed and what is being searched for being acceptable, and not bump the schema.  So in this specific example, you would have to delete your gloda database manually to trigger a reindexing.  Otherwise, if you searched for "20111205" it would not work because the token would still be "at_20111205".  But then if you searched for "at_20111205" that wouldn't work either because it would still end up searching for "20111205".

I'll send an e-mail to tb-planning about this to see if there's strong opposition, because I don't see any upcoming schema bumps.
I sent an e-mail to tb-planning:
https://mail.mozilla.org/pipermail/tb-planning/2012-July/001832.html

If someone would care to ping me in a week or so to remind me, I can create a fix following discussion on that assuming no one has a huge problem with it.
Andrew, with testcase somefile.at_20111205_133456.pdf, would you know why the following gloda search fails?

*20111205*
Summary: search for parts of file names of an attachment doesn't work correctly for some seperators → gloda global search for parts of attachment file names with underscores doesn't work correctly (e.g. file_name.xyz)
(In reply to Thomas D. from comment #9)
> Andrew, with testcase somefile.at_20111205_133456.pdf, would you know why
> the following gloda search fails?
> 
> *20111205*

The SQLite query engine doesn't support "*suffix", just "prefix*".  "*suffix" would require traversing the entire b-tree and is rather inefficient, which is why they decided not to do that.  If you have queries that look like they work, you are probably just witnessing the tokenizer ignoring your asterisk.
by the way:
the search isn't failing only on attachment filenames, but also in subject and body

for example an email like:
"look at this cool video at http://bla.com/video_20120707_132456.mov"
wouldn't be found if you search for 
"video_20120707"
Severity: normal → S3