157534 - Edit->Find in Page found substring in Thai display cell, but it shouldn't be

Reporter

Description

23 years ago

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.1a+) Gecko/20020714
BuildID: 202071422

In the Thai writing system there are four display levels of characters, from highest to lowest: top, above, base, below. Several Thai characters at different display levels can be composed together; we call the composed unit a "display cell". When searching, a display cell should be treated as an atomic unit: matching a separate "substring" inside a display cell should not be allowed. For example, in the string "U+0E2A U+0E35 U+0E48", U+0E48 cannot be separated from U+0E2A U+0E35; all three should be treated as a single atom. So a search for "U+0E2A U+0E35", for "U+0E35 U+0E48", or for any one of the characters alone is expected to return NOT FOUND.

Reproducible: Always

Steps to Reproduce:
1. Open the attached test case (findthai.html) in the browser. It shows a single string, "สี่" (U+0E2A U+0E35 U+0E48).
2. Select Edit -> Find in this page.
3. In the "Find text" field, type "สี" (U+0E2A U+0E35) and click "Find".

Actual Results: A result is found: "สี" (U+0E2A U+0E35).

Expected Results: No result. "สี" (U+0E2A U+0E35) should not be found.

FYI: the test case string "สี่" (U+0E2A U+0E35 U+0E48) means "four" in Thai, while "สี" (U+0E2A U+0E35) means "color". They have different meanings, so a search for one should not find the other.
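To make the "display cell" idea concrete, here is a minimal Python sketch that groups code points into cells. It is not Mozilla code; the helper names `is_mark` and `display_cells` are hypothetical, and it assumes that a cell is a base character followed by any Unicode nonspacing or enclosing marks (general category Mn or Me), which is only an approximation of the Thai display-level rules.

```python
import unicodedata


def is_mark(ch):
    """True for nonspacing/enclosing marks (assumed to be the non-base characters)."""
    return unicodedata.category(ch) in ("Mn", "Me")


def display_cells(text):
    """Split text into cells: each mark is attached to the preceding character."""
    cells = []
    for ch in text:
        if cells and is_mark(ch):
            cells[-1] += ch      # mark joins the current cell
        else:
            cells.append(ch)     # any other character starts a new cell
    return cells


four = "\u0e2a\u0e35\u0e48"      # the test case string, U+0E2A U+0E35 U+0E48
color = "\u0e2a\u0e35"           # the search string, U+0E2A U+0E35
print(len(display_cells(four)), len(display_cells(color)))
```

Under this assumption both strings are a single cell each, so a cell-by-cell comparison would treat them as different atoms rather than as string and prefix.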

Arthit Suriyawongkul

Reporter

Comment 1

23 years ago

Attached file testcase#1: for reproduce the bug

Arthit Suriyawongkul

Reporter

Comment 2

23 years ago

Note: the attachment is TIS-620 encoded HTML. Tested, and found that it occurs with Mozilla 1.0, 1.1alpha, build 2002071422 NS7 PR1 on Windows, NS7.0_03 on Solaris.

Arthit Suriyawongkul

Reporter

Updated

23 years ago

Blocks: thai

Kathleen :Brade

Comment 3

23 years ago

-->akkana

Assignee: sgehani → akkana

Akkana Peck

Comment 4

23 years ago

It sounds like solving this will require quite a bit of knowledge of Thai (as well as the ability to type it in order to test). I don't think my mozilla build (on linux) is even showing me the Thai characters -- in your examples I'm seeing one character per unicode number, and all the characters look the same, while from your description it sounds like they should probably each be one character, and each different. I can't see this getting fixed unless we make a push to get search (that would be me) and I18n people together, discuss the issues and procedures for how to test it, and come up with a design for how it should work. If anyone who understands the Thai issue wants to work on this, I'll be glad to help with it, explain how find works or some tricks for debugging it, or meet to discuss solutions. For now, I'm reassigning to I18n in the hope that they'll have a better handle on it. If that's not appropriate, send it back -- I'm willing to own it but it'll probably have to be Future/helpwanted.

Assignee: akkana → yokoyama

Component: Search → Internationalization

QA Contact: claudius → ruixu

Rui Xu

Updated

23 years ago

Keywords: intl

QA Contact: ruixu → ylong

Prabhat Hegde

Comment 5

23 years ago

Hi akkana, could you please explain how search works and the tricks for debugging it? This is not just a Thai bug but also an Arabic one, and Indian scripts will have the same problem. For all CTL (complex text layout) languages, searching needs to be by display cell or cluster. A simple code-point-based search will not work, since characters are shaped when they are displayed. I am aware of the Thai and Indic search issues but do not know how Mozilla handles them. prabhat.

Akkana Peck

Comment 6

23 years ago

Here's a start. I'll also html-ify this description (probably with the references to this specific bug edited out) and put it up at mozilla.org, probably under editor.

-----------------------

Understanding the Find algorithm

The algorithm is located in embedding/components/find/src/nsFind.cpp, and the core of it is in nsFind::Find. I'm going to assume that these clusters are entirely located within one text node, so you shouldn't need to worry about routines like NextNode or SkipNode -- those should just work.

Find() searches forward or backward through the tree, looking at nodes (almost always a text node, though there's an RFE to be able to search in other types of nodes) and comparing character by character with our pattern string. In reverse, we compare characters starting from the end of the pattern string, not the beginning (that might be important for this bug). When we start a match (i.e. match one or more characters), we save the point where the match started, then continue through the string. If we reach the end of the pattern string and we're still matching, we have succeeded; if we hit a non-match before the end of the pattern string, then we have to go back to the next character in the tree after the place where this match started, and reset our position in the pattern string to the beginning (or end, if backwards).

Some important variables within Find(): mIterNode is the current node we're examining. mIterOffset is the current offset into the node: the place at which the current match started, or, if we're not in a partial match, the character we're comparing right now. pindex is the current offset into the pattern string, and findex is the current offset into the current node (so these two point to the characters we're currently comparing). You may have noticed that findex and mIterOffset seem to do the same thing. They basically do, but findex points to a character, whereas mIterOffset is a range offset that points between characters: see the note around line 680. Since a match may (and often does) span multiple nodes in the tree, matchAnchorNode and matchAnchorOffset store the node and offset (range offset) where the current partial match began.

You can #define DEBUG_find 1 and get lots of spew on standard output about what the algorithm is doing; I initially put these in because gdb doesn't always work reliably, but it's also useful in cases where you have to search a lot of text before the first match, so stopping/continuing past a breakpoint would take a long time.

Fragments of text nodes in the DOM can be either single or double byte: see the use of the pointers t1b and t2b. This may be the right place to add support for languages which need more than two bytes -- perhaps we need a tMultiByte option everywhere we now deal with t1b and t2b. In fact, that might be all you need, assuming it's possible to iterate backwards through such a multibyte DOM text node fragment. If that's not possible, then you might have to define iterators which do what we currently do in the single and double byte case, but do more work to implement forward and backward iteration in the multibyte case.

Given this info, the comments in the code may be enough to take you through the algorithm. Most of it is fairly straightforward; some of the tricky logic has to do with matching whitespace (any space in the pattern string matches one or more spaces in the document, so for example a double space in the pattern string must match two or more characters of whitespace -- which includes space, DOM newline (\n), and non-breaking space (&nbsp;)). There's also some tricky logic regarding when to advance findex and pindex. With any luck you won't have to change any of that.
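The matching loop described in this comment can be sketched in Python roughly as follows. This is only an illustration of the description, not a transcription of nsFind.cpp: the function name `find_in_node` is hypothetical, the text is assumed to live in a single node (a plain string), and whitespace folding and matches spanning several nodes are left out. The names `findex` and `pindex` are borrowed from the comment; `anchor` plays the role of the saved match start.

```python
def find_in_node(text, pattern, backwards=False):
    """Return (start, end) range offsets of the first match, or None.

    Range offsets point between characters, so text[start:end] == pattern.
    """
    if not pattern:
        return None
    n, m = len(text), len(pattern)
    step = -1 if backwards else 1
    pattern_start = m - 1 if backwards else 0   # reverse search starts at the pattern's end
    anchor = n - 1 if backwards else 0          # character where the current attempt began
    while 0 <= anchor < n:
        findex, pindex = anchor, pattern_start
        # Compare character by character while both indexes are in range.
        while 0 <= findex < n and 0 <= pindex < m and text[findex] == pattern[pindex]:
            findex += step
            pindex += step
        if pindex < 0 or pindex >= m:
            # Ran off the pattern while still matching: success.
            if backwards:
                return (findex + 1, anchor + 1)
            return (anchor, findex)
        # Non-match: go to the next character after the anchor and
        # reset the pattern position on the next loop iteration.
        anchor += step
    return None


print(find_in_node("abcabd", "abd"))
print(find_in_node("abab", "ab", backwards=True))
```

Note that this loop compares one code unit at a time and knows nothing about clusters, which is why searching for U+0E2A U+0E35 succeeds inside U+0E2A U+0E35 U+0E48.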

Akkana Peck

Comment 7

23 years ago

A slightly more general version of that last long comment is now up at: http://www.mozilla.org/editor/find-algorithm.html If you have other questions or anything needs clarifying, please ask!

Frank Tang

Comment 8

23 years ago

I think this is an invalid bug.

> when do searching, this "display cell" should be treated as an atomic unit,

Who said that? I don't think WTT 2.0 specifies that. Unicode 3.0 does not mention it either. I think that is your personal opinion.

Assignee: yokoyama → ftang

samphan

Comment 9

23 years ago

> I think this is an invalid bug

No. And this is not just a bug for Thai but an i18n bug.

>> when do searching, this "display cell" should be treated as an atomic unit,
> Who said that? I don't think WTT2.0 specify that. Unicode 3.0 does not mention that also. I think that is your personal opinion.

WTT 2.0 didn't say that. It is mentioned in Unicode 3.0, in the section on text boundaries. It says that everything must be done on "grapheme cluster" boundaries, i.e. caret movement and string search MUST take cluster boundaries into account.

A simple example will make it easier to understand. Suppose you want to search for an 'A', and the text is "B~A^D" (where ~ and ^ are non-spacing diacritics). If you consider that 'A' (which is part of 'A^') a match and make a selection around it, users can delete just that 'A' (with BS or DEL) and get "B~^D", a bug. Or if you search for 'A' and replace every occurrence with 'A~', the result will be "B~A~^D", another kind of bug. So a selection that doesn't fit cluster boundaries must be forbidden. This is why the Unicode 3.0 standard says so.

----

Back to our bug, the idea is easy to implement:
1) when a substring is found,
2) see if the "end" of the match is on a cluster boundary;
3) if not, consider it a non-match.
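The three steps above can be sketched in Python as a filter on top of an ordinary substring search. This is a hedged illustration, not the proposed Mozilla patch: the function names are hypothetical, and "cluster boundary" is approximated as "the next character is not a nonspacing or enclosing mark" (Unicode general category Mn or Me).

```python
import unicodedata


def ends_on_cluster_boundary(text, end):
    """True if no combining mark follows the range offset `end` (assumed boundary test)."""
    return end >= len(text) or unicodedata.category(text[end]) not in ("Mn", "Me")


def find_cluster_aware(text, pattern):
    """Return the index of the first match that ends on a cluster boundary, or -1."""
    if not pattern:
        return -1
    start = text.find(pattern)                     # step 1: a substring is found
    while start != -1:
        end = start + len(pattern)
        if ends_on_cluster_boundary(text, end):    # step 2: check the end of the match
            return start
        start = text.find(pattern, start + 1)      # step 3: treat as non-match, keep looking
    return -1


text = "\u0e2a\u0e35\u0e48"                        # U+0E2A U+0E35 U+0E48
print(find_cluster_aware(text, "\u0e2a\u0e35"))    # the search from the bug report
print(find_cluster_aware("B\u0303A\u0302D", "A"))  # the "B~A^D" example
```

As written in the three steps, only the end of the match is checked. A pattern that begins with a mark, such as U+0E35 U+0E48, would still match in this sketch; rejecting it would need a similar check at the start of the match.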

Frank Tang

Comment 10

23 years ago

> It's mentioned in Unicode 3.0, in the section on
> text boundaries. It's said that everything must be done on "grapheme cluster"
> boundary, i.e. caret movement and string search MUST take cluster boundary
> into account.

I don't think that is what Unicode 3.0 says. Read page 116, section 5.12, "Editing and Selection". It mentions the following:

Three types of boundaries are generally useful in editing and selecting within words.
Cluster Boundaries
Stacked Boundaries
Atomic Character Boundaries
Linear Boundaries
Nonlinear Boundaries

Status: NEW → ASSIGNED

samphan

Comment 11

23 years ago

Right, I simply used the term (new in Unicode 3.2) ambiguously to define the default unit for caret movement. Unicode 3.0 says on page 118, section 5.13, under the heading "Other Processes": "When searching string, remember to check for additional nonspacing marks in the target string that may affect the interpretation of the last matching character." Isn't this bug the exact case that the sentence was written to prevent?

Samphan Raruenrom

Updated

20 years ago

Blocks: 283271

Frank Tang

Comment 12

20 years ago

What a hack. I have not touched Mozilla code for 2 years. I didn't read these bugs for 2 years. And they are still there. Just close them as won't fix to clean up.

Status: ASSIGNED → RESOLVED

Closed: 20 years ago

Resolution: --- → WONTFIX

Travis Chase

Comment 13

20 years ago

Mass re-open of bugs that Frank Tang closed with no good reason. Spam is his fault, not my own.

Status: RESOLVED → REOPENED

Resolution: WONTFIX → ---

Travis Chase

Comment 14

20 years ago

Mass reassigning of Frank Tang's old bugs that he closed as WONTFIX and that had to be reopened. Spam is his fault, not my own.

Assignee: ftang → nobody

Status: REOPENED → NEW

Arthit Suriyawongkul

Reporter

Comment 15

17 years ago

BugAThon Bangkok

We (7-8 native Thais) discussed this and all agreed that while matching a substring is expected behavior, matching a substring that starts with a non-base Thai character is not. As a result, we should not allow the user to type a non-base Thai character at the start of a string (e.g. search word, URL) in the first place; i.e. we should do character sequence checking in Find and in the address bar.

If we can restrict that input, the anomalies in Bug 157534 and Bug 421275 will no longer appear, so both will be closed. New Bug 425900 opened: "Should not allow non-base Thai character as first character in textfield/textarea".
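A minimal Python sketch of the input check proposed here, for illustration only. The function name is hypothetical, and "non-base Thai character" is assumed to mean a nonspacing mark (Unicode general category Mn) in the Thai block U+0E00 to U+0E7F; a real sequence checker for Thai input would apply more rules than this.

```python
import unicodedata


def starts_with_non_base_thai(s):
    """True if the first character is a Thai nonspacing mark (assumed definition of non-base)."""
    if not s:
        return False
    first = s[0]
    in_thai_block = 0x0E00 <= ord(first) <= 0x0E7F
    return in_thai_block and unicodedata.category(first) == "Mn"


print(starts_with_non_base_thai("\u0e35\u0e48"))   # starts with the above-vowel U+0E35
print(starts_with_non_base_thai("\u0e2a\u0e35"))   # starts with the base consonant U+0E2A
```

A Find field or address bar could call such a check on each edit and refuse input for which it returns True.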

Status: NEW → RESOLVED

Closed: 20 years ago → 17 years ago

Resolution: --- → WONTFIX

Kamonpop Jarujit

Comment 16

15 years ago

I think it's not a bug but it's a feature.

Pongsak Sarapukdee

Comment 17

15 years ago

I use Firefox 4b6 on Windows 7. This bug has not been fixed.