Closed Bug 157534 Opened 22 years ago Closed 16 years ago

Edit->Find in Page found substring in Thai display cell, but it shouldn't be

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: arthit, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intl)

Attachments

(1 file)

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.1a+) Gecko/20020714
BuildID:    202071422

in Thai writing system, there're 4 display-levels of character.
from highest to lowest: top, above, base, below.

several Thai characters with different display-levels can be composed together.
we call the composed unit a "display cell".

when do searching, this "display cell" should be treated as an atomic unit,
searching any seperated "substring" within "display cell" is not allowed.

for example,
in "U+0E2A U+0E35 U+0E48" string,
U+0E48 can't be seperated from U+0E2A U+0E35,
all of them should be treated as only one single atom.

so, the search of "U+0E2A U+0E35" or "U+0E35 U+0E48" or any one character alone
is expected to be NOT FOUND.

Reproducible: Always

Steps to Reproduce:
1. open attached test case (findthai.html) in browser,
   we will see only one string,  "Ъеш"  (U+0E2A U+0E35 U+0E48)
2. select Edit -> Find in this page
3. in "Find text" field,
   type "Ъе" (U+0E2A U+0E35) and click "Find"

Actual Results:
found a result.
it founds "Ъе" (U+0E2A U+0E35).

Expected Results:
found no result.
"Ъе" (U+0E2A U+0E35) should not be found.


fyi,
from the testcase

"Ъеш"  (U+0E2A U+0E35 U+0E48), means "four" in Thai.
"Ъе" (U+0E2A U+0E35), means "color" in Thai.

they have different meaning.
search for one, should not found another.
note: attachment is tis-620 encoded html.

tested and found it occurs with
Mozilla 1.0, 1.1alpha, build 2002071422
NS7 PR1 on Windows, NS7.0_03 on Solaris
Blocks: thai
-->akkana
Assignee: sgehani → akkana
It sounds like solving this will require quite a bit of knowledge of Thai (as
well as the ability to type it in order to test).  I don't think my mozilla
build (on linux) is even showing me the Thai characters -- in your examples I'm
seeing one character per unicode number, and all the characters look the same,
while from your description it sounds like they should probably each be one
character, and each different.

I can't see this getting fixed unless we make a push to get search (that would
be me) and I18n people together, discuss the issues and procedures for how to
test it, and come up with a design for how it should work.  If anyone who
understands the Thai issue wants to work on this, I'll be glad to help with it,
explain how find works or some tricks for debugging it, or meet to discuss
solutions.

For now, I'm reassigning to I18n in the hope that they'll have a better handle
on it.  If that's not appropriate, send it back -- I'm willing to own it but
it'll probably have to be Future/helpwanted.
Assignee: akkana → yokoyama
Component: Search → Internationalization
QA Contact: claudius → ruixu
Keywords: intl
QA Contact: ruixu → ylong
hi akkanna,

Could you please explain how search works and tricks to debug? This is not just
a Thai bug but also Arabic. Indian scripts will have the same problem. For all
CTL languages searching needs to be by display cell or cluster. A simple code
point based search will not work since characters are shaped when they are
displayed.
I am aware of Thai and Indic search issues but do not know how mozilla handles it.
prabhat.
Here's a start.  I'll also html-ify this description (probably with the
references to this specific bug edited out) and put it up at mozilla.org,
probably under editor.

-----------------------
Understanding the Find algorithm

The algorithm is located in embedding/components/find/src/nsFind.cpp, and the
core of it is in nsFind::Find.  I'm going to assume that these clusters are
entirely located within one text node, so you shouldn't need to worrk about
routines like NextNode or SkipNode -- those should just work.

Find() searches forward or backward through the tree, looking at nodes (almost
always a text node, though there's an RFE to be able to search in other types of
nodes) and comparing character by character with our pattern string.  In
reverse, we compare characters starting from the end of the pattern string, not
the beginning (that might be important for this bug).

When we start a match (i.e. match one or more characters), we save the point
where the match started then continue through the string.  If we reach the end
of the pattern string and we're still matching, we have succeeded; if we hit a
non-match before the end of the pattern string, then we have to go back to the
next character in the tree after the place where this match started, and reset
our position in the pattern string to the beginning (or end, if backwards).

ome important variables within Find():

mIterNode is the current node we're examining.

mIterOffset is the current offset into the node: the place at which the current
match started, or, if we're not in a partial match, the character we're
comparing right now.

pindex is the current offset into the pattern string, and findex is the current
offset into the current node (so these two point to the characters we're
currently comparing).

You may have noticed that findex and mIterOffset seem to do the same thing. 
They basically do, but findex points to a character, whereas mIterOffset is a
range offset that points between characters: see the note around line 680.

Since a match may (and often does) span multiple nodes in the tree,
matchAnchorNode and matchAnchorOffset store the node and offset (range offset)
where the current partial match began.

You can #define DEBUG_find 1 and get lots of spew on standard output about what
the algorithm is doing; I initially put these in because gdb doesn't always work
reliably, but it's also useful in cases where you have to search a lot of text
before the first match, so stopping/continuing past a breakpoint would take a
long time.

Fragments of text nodes in the dom can be either single or double byte: see the
use of the pointers t1b and t2b.  This may be the right place to add support for
languages which need more than two bytes -- perhaps we need a tMultiByte option
everywhere we now deal with t1b and t2b.  In fact, that might be all you need,
assuming it's possible to iterate backwards through such a multibyte DOM text
node fragment.  If that's not possible, then you might have to define iterators
which do what we currently do now in the single and double byte case, but do
more work to implement forward and backward iteration in the multibyte case.

Given this info, the comments in the code may be enough to take you through the
algorithm.  Most of it is fairly straightforward; some of the tricky logic has
to do with matching whitespace (any space in the pattern string matches one or
more spaces in the document, so for example a double space in the pattern string
must match two or more characters of whitespace -- which includes space, dom
newline (\n), and non-breaking space ( ).  There's also some tricky logic
regarding when to advance findex and pindex.  With any luck you won't have to
change any of that.

A slightly more general version of that last long comment is now up at:
http://www.mozilla.org/editor/find-algorithm.html
If you have other questions or anything needs clarifying, please ask!
I think this is an invalid bug
>when do searching, this "display cell" should be treated as an atomic unit,
Who said that? I don't think WTT2.0 specify that. Unicode 3.0 does not mention
that also. I think that is your personal opinion. 
Assignee: yokoyama → ftang
> I think this is an invalid bug

No. And this is not just a bug for Thai but an i18n bug.

>>when do searching, this "display cell" should be treated as an atomic unit,
>Who said that? I don't think WTT2.0 specify that. Unicode 3.0 does not mention
that also. I think that is your personal opinion. 

WTT2.0 didn't said that. It's mentioned in Unicode 3.0, in the section on
text boundaries. It's said that everything must be done on "grapheme cluster" 
boundary, i.e. caret movement and string search MUST take cluster boundary
into account. 

A simple example will make it easier to understand:-
If you want to search for an 'A', and the text is "B~A^D"
(where ~ and ^ are non-spacing diacritics).
If you consider that 'A' (which is a part of 'A^') is a match
and make selection around it, users can just delete that
'A' (by BS or DEL) and make 'B~^D', a bug.
Or if you want to search for an 'A' and replace any
occurence with 'A~', the result will be "B~A~^D", another
kind of bug.
So selection that doesn't fit in cluster boundaries must
be forbidden. This is why the Unicode 3.0 standard says so.

----
Back to our bug, the idea is easy to implement :-
1) when a substring is found
2) see if the "end" of match is on cluster boundary
3) if no, consider it a non-match
>It's mentioned in Unicode 3.0, in the section on
>text boundaries. It's said that everything must be done on "grapheme cluster" 
>boundary, i.e. caret movement and string search MUST take cluster boundary
>into account. 

I don't think that is what the Unicode 3.0 say about.
Read page 116 session 5.12 Editin and Selection. It mention the following:

Three types of boundaries are gnerally useful in editing and selecting within words.

Cluster Boundaries
Stacked Boundaries
Atomic Chacter Boundaries
Linear Bundaries
Nolinear Boundaries

Status: NEW → ASSIGNED
Right, I simply used the term (new in Unicode 3.2) ambiguously to define
the default unit for caret movement.

Unicode 3.0 say in page 118, section 5.13, heading "Other Processes" :
"When searching string, remember to check for additional nonspacing marks
in the target string that may affect the interpretation of the last matching
character."

Isn't this bug the exact case that the sentence was written to prevent?
Blocks: 283271
what a hack. I have not touch mozilla code for 2 years. I didn't read these bugs
for 2 years. And they are still there. Just close them as won't fix to clean up.
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → WONTFIX
Mass Bug Re-Open of bugs Frank Tang Closed with no good reason. Spam is his
fault not my own
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Mass Re-assinging Frank Tangs old bugs that he closed won't fix and had to be
re-open. Spam is his fault not my own
Assignee: ftang → nobody
Status: REOPENED → NEW
BugAThon Bangkok

We (7-8 native Thais) discussed about this and all agreed
that while searching substring is expected behavior,
searching substring starting with non-base Thai character is a non-expected behavior.


As a result, we should not allow user to type non-base Thai character at the start of string (e.g. search word, url) at the first place.
i.e. we should do character sequence checking in find and address bar

If we can strict that input, anomalies in Bug 157534 and Bug 421275 will no longer appear. So both will be closed.


New Bug 425900 open "Should not allow non-base Thai character as first character in textfield/textarea".
Status: NEW → RESOLVED
Closed: 19 years ago16 years ago
Resolution: --- → WONTFIX
I think it's not a bug but it's a feature.
I use Firefox 4b6 on Windows 7. This bug has not fix.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: