[jsplugins] Implement back-end part of find whole-word function

NEW
Assigned to

Status

()

Core
Plug-ins
P3
normal
10 months ago
9 months ago

People

(Reporter: louis, Assigned: louis)

Tracking

(Depends on: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 1 obsolete attachment)

(Assignee)

Description

10 months ago
This bug implements the back-end part of find whole-word function. Either just bypass the string that is not whole-word or find the similar algorithm of finding whole-word in Gecko code base, e.g. pdf.js.
(Assignee)

Updated

10 months ago
Assignee: nobody → lochang

Updated

10 months ago
Group: mozilla-employee-confidential
(Assignee)

Comment 1

10 months ago
Created attachment 8865331 [details] [diff] [review]
[WIP] entire_word.patch

I am still investigating the implementation of finding entire word in gecko. It is a bit more complicated than I think originally, since there are many different languages and side cases that we need to consider. However, I have a WIP patch of finding entire word by using regexp. Maybe we can use it if we do not have any concern with regexp.

Comment 2

10 months ago
I think the problem here is about performance and, as you mentioned, multiple language support. How does the WIP work? Do you feel significant delay on searching? Will it work on non-English text?

On the other hand, could we use ICU library to help on string matching?
(Assignee)

Comment 3

9 months ago
(In reply to Evelyn Hung [:evelyn] from comment #2)
> I think the problem here is about performance and, as you mentioned,
> multiple language support. How does the WIP work? Do you feel significant
> delay on searching? Will it work on non-English text?
> 
> On the other hand, could we use ICU library to help on string matching?

I have tested find string (non-entire word) on optimized, non-debug build. And the performance with or without regex seems no significant difference [1]. So I will first keep working on WIP patch (regex) to make it supports finding Chinese.

About ICU library, I do find that twitter has a open source ICU library [2] that we might be able to use. But I am not sure if we could just check them into m-c and use it, or how it will improve our string matching.

[1] Both take 78 sec to find "for" on 300+ pages pdf:
https://doc.lagout.org/programmation/C/Addison.Wesley.Effective.CPP.3rd.Edition.May.2005.pdf
[2] https://github.com/twitter/twitter-cldr-js
(Assignee)

Comment 4

9 months ago
It seems current Firefox doesn't support finding entire word in non-English language [1]. I think we should first clarify the coverage we want to support here.


[1] Find for entire word "委" in pdf http://www.ia.nctu.edu.tw//ezfiles/0/1000/attach/94/pta_921_9439026_07440.pdf
(Assignee)

Comment 5

9 months ago
Created attachment 8868058 [details] [diff] [review]
[WIP] entire_word.patch

Hi Evelyn,

Could you feedback the patch for me? Thanks.

Since currently FireFox does not support finding entire word in Chinese. I guess maybe the patch is good enough for now? I simply check the character before and after the search string to see if they are ASCII digit or ASCII alphabet when finding entire word.

BTW do we have any spec for finding entire word, such as which punctuation should we escape?
Attachment #8865331 - Attachment is obsolete: true
Attachment #8868058 - Flags: feedback?(ehung)
(Assignee)

Comment 6

9 months ago
(In reply to Louis Chang [:louis] from comment #5)
> Since currently FireFox does not support finding entire word in Chinese. I
> guess maybe the patch is good enough for now? 

The patch can support finding entire word in English.

Comment 7

9 months ago
Comment on attachment 8868058 [details] [diff] [review]
[WIP] entire_word.patch

If I understand your logic correctly, this approach is wrong. It's possible that the user try to find a string that completely non-alphabet non-digit. Try to use the text in your patch and find ' == ', then you will see a match in |if (n == 0) {| expression, which won't be detected by your way.

Compare to regular expression, this approach is less efficient. You can use \b in regular expression for detecting word boundary without making your own heuristic(alphabet/digit).
Attachment #8868058 - Flags: feedback?(ehung) → feedback-

Comment 8

9 months ago
Per our offline discussion, it's worth studying how icu library can be leveraged to solve this problem for multi-languages, and how we do word match for web pages in Firefox. If it's useful, we should consider exposing it to chrome's scripts. I'm fine to firstly address just English, but English-only doesn't meet our shipping criteria.
You need to log in before you can comment on or make changes to this bug.