Closed Bug 122584 Opened 23 years ago Closed 22 years ago

editor problem with surrogate pair

Categories

(Core :: DOM: Editor, defect, P4)

x86
Windows 2000
defect

Tracking

()

VERIFIED FIXED
mozilla1.4beta

People

(Reporter: ftang, Assigned: mozeditor)

References

Details

(Keywords: intl, topembed+, Whiteboard: [adt3])

Attachments

(3 obsolete files)

When we have surrogate pair or unicode character from plan 1 to 16.
We should process that two unicode code point as a single entity in
1. cursor movement (move forward and backward)
2. selection (select forward and backward)
3. delection (backspace and delete)
how to reproduce
1. visit http://bugzilla.mozilla.org/attachment.cgi?id=67079&action=view
the source of this file is
"𠀀𠀁𠀂𠀃"
it show 8 ? mark because you don't have a font for that AND bug 122582.
2. selection forward problem:
2.a click before the first "?"
2.b select toward the right, 
expect result:
select the first two "?" together before bug 122582 got fixed. 
actual result:
it could select only the first "?"
this is wrong because it only represent half of the character.
3. selection backward problem, try same as 2 but select from the last forward

4. cursor movement problem
4.1 select "File:Edit Page" to bring that page into composer
4.2 move your cursor  by using <- or -> key
expect result- move across two "?" every time you hit untill bug 122582 got fix
actual result- move across only one "?", which mean only half of the unicode
plane 1-16  character

5. deletion problem
click "Backspace" or "delete" it should remove the whole character (two
PRUnichar for plan 1-16)

here are more information about plan 1-16 characters or ncr >= &#x10000;
one plan 1-16 character is encoded by two PRUnichar (two 16 bits)
this is call surrogate pair in unicode.
The first PRUnichar is called high surrogate and is range from 0xD800-DBFF
and the second PRUnichar is called low surroaget and is range from 
0xDC00-0xDFFF

in the layout or editor, the surrogate pair should be treat as one single
character 
I know the cursor movement and selection code is in
nsTextFrame.cpp PeekOffset
probably the case for
case eSelectCharacter:

Not sure about the delection code. Maybe we should file a seperate bug for
delection and jfrancis can tell us where is that code. 
This problem exists even where bug 122582 doesn't apply. Steps to reproduce:

1) open http://www.geocities.com/i18nguy/unicode-example-plane1.html
2) click next to one of the Gothic or Deseret characters in the table.
3) drag across the character.
Expected result: the character will be selected
Actual result: as the highlight moves across the character it first becomes ??
with one selected and one not, and then returns to the correct glyph.

Similar behaviour with deletion.
should we fix this for GB18030 support?
Keywords: intl, nsbeta1
nsbeta1+ low priority
Assignee: smontagu → ftang
Keywords: nsbeta1nsbeta1+
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.0
P4
Priority: -- → P4
give to shanjian
Assignee: ftang → shanjian
Status: ASSIGNED → NEW
Whiteboard: adt3
reassign to ftang
Assignee: shanjian → ftang
it looks like the cursor movement code and selection code are in different
place. but looking at the SUN_CTL may help us find all of them
Status: NEW → ASSIGNED
Attached patch patch fix editor cursor movement (obsolete) — Splinter Review
this fix the cursor movement issue with surrogate. We still need to fix
1. mouse selection issue
2. backsapce and detele key should remove the whole surrogate pair
Attached patch v2 patch (obsolete) — Splinter Review
patch handle editor cursor movement AND mouse selection.
Attachment #76119 - Attachment is obsolete: true
smontagu, can you review the current patch. I still need to talk to jfrancis
about delete and backspace. But I may spin that off to a seperate bug. 
Now why would they go and do a thing like that to unicode?  Unicode was supposed 
to make things simpler, not harder.  Yuck.

Editor will have to change in a few places.  And that's assuming we fix selection 
to guarantee that selection boundaries are never "between" one of these funky 
characters.  

You can file a bug against me for that.

Are the 0xD800-DBFF and 0xDC00-0xDFFF unicode ranges used only for the pairs?  In 
other words, could you encounter one such unicode character (say, 0xDC00) without 
the other half of the pair being around?  Would that mean something different?  
Or is that just illegal?  I want to avoid having to walk around the string to 
figure out what is going on.
>Now why would they go and do a thing like that to unicode?  Unicode was supposed 
>to make things simpler, not harder.  Yuck.
Because 16 bits cannot encode all the characters people NEEDs and 32 bits is too
BIG for most of the people. 

>Editor will have to change in a few places.  And that's assuming we fix selection 
>to guarantee that selection boundaries are never "between" one of these funky 
>characters.  
>You can file a bug against me for that.
I don't want to increase your work load. I rather fix that by ourself and ask
for your review.

>Are the 0xD800-DBFF and 0xDC00-0xDFFF unicode ranges used only for the pairs?  
Yes. They are presereved by Unicode in the very early day. DB800-DBff is called
high surrogate and DC00-DFFF is called low surrogate. They have no other usage
of these values except surrogate.

>In other words, could you encounter one such unicode character (say, 0xDC00)
without 
>the other half of the pair being around? 
Not in Unicode standard. If such thing happen, the result is undefined. 

>Would that mean something different?  
No

>Or is that just illegal?  
YES
>I want to avoid having to walk around the string to figure out what is going on.

Comment on attachment 76133 [details] [diff] [review]
v2 patch

I think you should use |mContentOffset| instead of |ip[0]| throughout.
Attached patch v3 of patch (obsolete) — Splinter Review
Attachment #76133 - Attachment is obsolete: true
Comment on attachment 76276 [details] [diff] [review]
v3 of patch

The patch as is doesn't build, because IS_LOW_SURROGATE is not defined.
Assuming that you are aware of that and it is defined in another patch that you
will be checking in before this one, r=smontagu
Attachment #76276 - Flags: review+
I think I will have to do the editor work involved.  Deletion is the very hardest 
thing in the editor.  
Comment on attachment 76276 [details] [diff] [review]
v3 of patch

sr=kin@netscape.com

Assuming the file that defines IS_LOW_SURROGATE() macro (part of patch in bug
130441) is included in nsTextFrame.cpp.
Attachment #76276 - Flags: superreview+
Blocks: 104060
Comment on attachment 76276 [details] [diff] [review]
v3 of patch

a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #76276 - Flags: approval+
fixed and check in 
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
*** Bug 131066 has been marked as a duplicate of this bug. ***
Yuying, please verify...thanks..changing qa_contact to ylong
QA Contact: sujay → ylong
On 03-27 trunk build / WinXP-SimpChinese:
Fixed with cursor movement and selection.
However, when press backspace or delete key to do deleting, only a half of one
character is deleted, and the rest of part is showed "?". 

Re-open it for now.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment on attachment 76276 [details] [diff] [review]
v3 of patch

patch landed. bug still open. obsoleting patch so it doesn't look like an
approved change waiting to land.
Attachment #76276 - Attachment is obsolete: true
I'm assuming this bug is only open for remaining editor work, so I'm taking 
ownership.  
Assignee: ftang → jfrancis
Status: REOPENED → NEW
Clearing nsbeta1+ and moving to Mozilla1.01 for remaining work
Keywords: nsbeta1+nsbeta1
Target Milestone: mozilla1.0 → mozilla1.0.1
Keywords: nsbeta1nsbeta1-
I agree the remaining issue are nsbeta1- bug. 
Blocks: 157673
The days of having a half dozen milestones out in front of us to divide bugs 
between seem to be gone, though I dont know why.  Lumping everything together as 
far out as I can.  I'll pull back things that I am working on as I go.
Target Milestone: mozilla1.0.1 → mozilla1.2beta
Keywords: nsbeta1-nsbeta1
[ushing these out as far as bugzilla will let me.  I'll pull them back as I work
on them.
Target Milestone: mozilla1.2beta → mozilla1.4beta
batch: adding topembed per Gecko2 document
http://rocknroll.mcom.com/users/marek/publish/Gecko/Gecko2Tasks.html
Keywords: topembed
nsbeta1- for the remaining work.
Keywords: nsbeta1nsbeta1-
can we split the remaining work out into another bug and close this one?
Keywords: topembedtopembed+
jfrancis: pls resolve this one as fixed.

ylong: can we split the remaining work out into another bug and close this one?
thanks!
Whiteboard: adt3 → [adt3]
Sorry I forgot the add comment after I filed a bug 173541 a few days ago for the
remain problem.

Mark as fixed for this one.
Status: NEW → RESOLVED
Closed: 22 years ago22 years ago
Resolution: --- → FIXED
Verified.
Status: RESOLVED → VERIFIED
Blocks: grouper
Depends on: 180372
No longer blocks: 157673
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: