Closed Bug 1358347 Opened 7 years ago Closed 5 years ago

[jsplugins] Implement back-end part of find whole-word function

Categories

(Core Graveyard :: Plug-ins, enhancement, P3)

enhancement

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: lochang, Assigned: lochang)

References

Details

Attachments

(1 file, 1 obsolete file)

This bug implements the back-end part of find whole-word function. Either just bypass the string that is not whole-word or find the similar algorithm of finding whole-word in Gecko code base, e.g. pdf.js.
Assignee: nobody → lochang
Group: mozilla-employee-confidential
Attached patch [WIP] entire_word.patch (obsolete) — Splinter Review
I am still investigating the implementation of finding entire word in gecko. It is a bit more complicated than I think originally, since there are many different languages and side cases that we need to consider. However, I have a WIP patch of finding entire word by using regexp. Maybe we can use it if we do not have any concern with regexp.
I think the problem here is about performance and, as you mentioned, multiple language support. How does the WIP work? Do you feel significant delay on searching? Will it work on non-English text?

On the other hand, could we use ICU library to help on string matching?
(In reply to Evelyn Hung [:evelyn] from comment #2)
> I think the problem here is about performance and, as you mentioned,
> multiple language support. How does the WIP work? Do you feel significant
> delay on searching? Will it work on non-English text?
> 
> On the other hand, could we use ICU library to help on string matching?

I have tested find string (non-entire word) on optimized, non-debug build. And the performance with or without regex seems no significant difference [1]. So I will first keep working on WIP patch (regex) to make it supports finding Chinese.

About ICU library, I do find that twitter has a open source ICU library [2] that we might be able to use. But I am not sure if we could just check them into m-c and use it, or how it will improve our string matching.

[1] Both take 78 sec to find "for" on 300+ pages pdf:
https://doc.lagout.org/programmation/C/Addison.Wesley.Effective.CPP.3rd.Edition.May.2005.pdf
[2] https://github.com/twitter/twitter-cldr-js
It seems current Firefox doesn't support finding entire word in non-English language [1]. I think we should first clarify the coverage we want to support here.


[1] Find for entire word "委" in pdf http://www.ia.nctu.edu.tw//ezfiles/0/1000/attach/94/pta_921_9439026_07440.pdf
Hi Evelyn,

Could you feedback the patch for me? Thanks.

Since currently FireFox does not support finding entire word in Chinese. I guess maybe the patch is good enough for now? I simply check the character before and after the search string to see if they are ASCII digit or ASCII alphabet when finding entire word.

BTW do we have any spec for finding entire word, such as which punctuation should we escape?
Attachment #8865331 - Attachment is obsolete: true
Attachment #8868058 - Flags: feedback?(ehung)
(In reply to Louis Chang [:louis] from comment #5)
> Since currently FireFox does not support finding entire word in Chinese. I
> guess maybe the patch is good enough for now? 

The patch can support finding entire word in English.
Comment on attachment 8868058 [details] [diff] [review]
[WIP] entire_word.patch

If I understand your logic correctly, this approach is wrong. It's possible that the user try to find a string that completely non-alphabet non-digit. Try to use the text in your patch and find ' == ', then you will see a match in |if (n == 0) {| expression, which won't be detected by your way.

Compare to regular expression, this approach is less efficient. You can use \b in regular expression for detecting word boundary without making your own heuristic(alphabet/digit).
Attachment #8868058 - Flags: feedback?(ehung) → feedback-
Per our offline discussion, it's worth studying how icu library can be leveraged to solve this problem for multi-languages, and how we do word match for web pages in Firefox. If it's useful, we should consider exposing it to chrome's scripts. I'm fine to firstly address just English, but English-only doesn't meet our shipping criteria.

I'm closing this bug as WONTFIX per:

"The Mortar experiment has concluded. Mozilla does not consider the PDF use case justifies the burden of implementing and maintaining PDFium and a Pepper API implementation in Gecko."

Source: https://wiki.mozilla.org/Mortar_Project

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: