Generic UCV buffering scheme byte misalignments

RESOLVED FIXED in mozilla56

Status

Core
Internationalization
P3
normal
RESOLVED FIXED
18 years ago
8 months ago

People

(Reporter: jbetak@netscape.com (away - not reading bugmail), Assigned: hsivonen)

Tracking

({intl})

Trunk
mozilla56
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [fixed by encoding_rs], URL)

Attachments

(1 attachment)

If you use an old build (e.g. 2000020310 or earlier), then you will notice that 
we "eat" some parts of the HTML and the file rendering is disrupted. Although 
this buffering scheme is not used in the UTF-8 decoder anymore, it is still 
needed for other decoders and should be revisited since there might be problems 
with buffer alignment. 

I looked at it in the debugger and my impression was just that - the buffer 
comes back with inappropriately aligned bytes and we lose some of them in the 
process...
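
To make the suspected failure mode concrete, here is a minimal sketch in Python. It is not the Mozilla converter code. The helper names (`split_complete`, `decode_naive`, `decode_with_carry`) are hypothetical, and the sketch assumes a simplified EUC-KR in which any byte >= 0x80 starts a two-byte character. It contrasts decoding each buffer on its own with holding back an unpaired lead byte until the next buffer arrives.

```python
from typing import Iterable, Tuple


def split_complete(buf: bytes) -> Tuple[bytes, bytes]:
    """Split buf into (complete characters, trailing incomplete bytes).

    Simplifying assumption: any byte >= 0x80 starts a two-byte character.
    """
    i = 0
    while i < len(buf):
        step = 2 if buf[i] >= 0x80 else 1
        if i + step > len(buf):
            break  # a lead byte whose trail byte has not arrived yet
        i += step
    return buf[:i], buf[i:]


def decode_naive(chunks: Iterable[bytes]) -> str:
    """Decode every buffer on its own; a character split across buffers is mangled."""
    return "".join(c.decode("euc_kr", errors="replace") for c in chunks)


def decode_with_carry(chunks: Iterable[bytes]) -> str:
    """Hold back an unpaired lead byte and prepend it to the next buffer."""
    pending = b""
    out = []
    for chunk in chunks:
        complete, pending = split_complete(pending + chunk)
        out.append(complete.decode("euc_kr", errors="replace"))
    if pending:  # the input ended in the middle of a character
        out.append(pending.decode("euc_kr", errors="replace"))
    return "".join(out)


text = "\uac83\uc73c\ub85c"       # three Hangul syllables
data = text.encode("euc_kr")      # six bytes, two per syllable
chunks = [data[:1], data[1:]]     # buffer boundary falls inside the first syllable

print(decode_naive(chunks) == text)       # the naive path does not round-trip
print(decode_with_carry(chunks) == text)  # the carry-over path does
```

Once the naive path pairs a trail byte with the following lead byte, every later two-byte character in that buffer is read one byte off, which would match the "eaten" content described above.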

Comment 1

18 years ago
jbetak - Can this bug be reproduced with the current build using other charsets?
Assignee: ftang → cata

Updated

18 years ago
Status: NEW → ASSIGNED

Comment 2

18 years ago
Since this is a rather random and difficult-to-reproduce bug, I'm pushing it far 
out. No need to worry until it bites us harder or we have extra time.
Target Milestone: M20

Comment 3

18 years ago
Cata and Juraj, can you better characterize the nature of this bug?
Then IQA can do some testing to assure us that this is indeed a rare
case.  I don't want to find out that this is not rare after Beta1.

Comment 4

18 years ago
No specific information has been provided. Marking it as invalid.
Status: ASSIGNED → RESOLVED
Last Resolved: 18 years ago
Resolution: --- → INVALID

Comment 5

18 years ago
After talking with Bob, I think we should reopen this bug. Reopening it and marking it as Future.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Target Milestone: M20 → Future

Comment 6

18 years ago
Are you sure, guys? For all I know, Juraj was the only one who saw this, once or 
twice a long time ago. I never heard of other occurrences, and I cannot 
reproduce it...

Why reopen?

Comment 7

18 years ago
Have you carefully reviewed the code where Juraj thought he saw the loss of
misaligned bytes?  How did you try to reproduce this?  What test cases do you
have?  Can we do any instrumentation (e.g., ASSERTs) to try to catch this?

Losing random bytes can be very hard to find.  Let's reassure ourselves that
there are no edge cases where bytes are being lost before invalidating this.
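
One way to instrument this is sketched below in Python. It only illustrates the kind of ASSERT-based check suggested here and is not a test against the Mozilla converters: it uses Python's standard `codecs` incremental decoder for `euc_kr` as a stand-in, and the function names `decode_in_two_buffers` and `check_all_splits` are hypothetical. The idea is to feed the same byte stream in two buffers, try every possible boundary, and assert that no split loses or misaligns bytes.

```python
import codecs


def decode_in_two_buffers(data: bytes, split: int) -> str:
    """Decode data as two consecutive buffers with the boundary at `split`."""
    decoder = codecs.getincrementaldecoder("euc_kr")(errors="strict")
    first = decoder.decode(data[:split])
    second = decoder.decode(data[split:], final=True)
    return first + second


def check_all_splits(data: bytes) -> int:
    """Assert that every buffer boundary gives the same text as one-shot decoding."""
    expected = data.decode("euc_kr")
    for split in range(len(data) + 1):
        got = decode_in_two_buffers(data, split)
        assert got == expected, f"bytes lost or misaligned at split {split}"
    return len(data) + 1  # number of boundaries checked


sample = "a \uac83\uc73c\ub85c b".encode("euc_kr")
print(check_all_splits(sample), "split points checked")
```

A check of this shape covers boundaries that fall between a lead byte and its trail byte, which is the edge case in question.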

Comment 8

18 years ago
I tried to reproduce this with the URL from the bug report: 
http://people/ftang/demo/utf8all.html. That's the only test case I know of. And 
it worked just fine.

The reason I do not believe this bug is valid is the nature of that code. It is 
byte-processing code. Very deterministic. If there's a problem once, I expect it 
to be always there, every time we process that bytestream. Also, that code is 
shared by *all* converters. It is exercised for every multibyte page. Any "byte 
eating" should be *very* obvious (pretty much garbage on the rest of the whole 
page...). And yet this is the only report we have and I can't reproduce it.

So, I'll leave it up to you if you want to close the bug or leave it open.

Updated

17 years ago
Keywords: intl

Comment 9

17 years ago
Moving all of cata's bugs to ftang.
Assignee: cata → ftang
Status: REOPENED → NEW

Updated

17 years ago
Status: NEW → ASSIGNED

Comment 10

17 years ago
I hadn't seen the symptom of this bug before m 0.9.1, but with
m 0.9.1 I came across a lot of pages (Korean in EUC-KR) manifesting
what I believe to be a symptom of this bug. Sometimes the misalignment
in the converter seems to get fixed by reloading, but most
of the time, reloading the page doesn't help. 
This bug is very very serious. For instance, look at
http://www.hani.co.kr/section-005000000/2001/07/005000000200107021018305.html

Around the end of the 4th paragraph, misalignment occurred and
what should be three Hangul syllables (U+AC83, U+C73C, U+B85C) 
is rendered as UNKNOWN ("?" inside a diamond), U+75FC, U+B9C9, UNKNOWN.
The sequence (in EUC-KR) is 

  (20) (B0,CD) (C0,B8) (B7,CE) (20)       : EUC-KR 
  (which should be converted to 

    U+0020 U+AC83 U+C73C U+B85C U+0020   : UCS-2 converted from 
                                            correct EUC-KR
   )
is interpreted as 

   (20) (B0) (CD,C0), (B8,B7), (CE) (20)   : misaligned EUC-KR

which, in turn, is converted to
   
   U+0020, UNKNOWN, U+75FC, U+B9C9, UNKNOWN, U+0020 : UCS-2 (converted
                                                   from misaligned EUC-KR)


where a pair of parentheses denotes a sequence of octet(s)
for a single character. 

   I encountered this problem on one in *every few* Korean pages, but it doesn't
seem to follow any pattern (at least my casual inspection
hasn't revealed any regularity). Even when there are two identical
strings in a single page, one of them gets corrupted while
the other doesn't. 
   
   As I wrote above, I guess this is very serious and fixing
this cannot be put off any longer. 
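
The regrouping of the EUC-KR sequence quoted above can be reproduced with a short Python sketch. It assumes that Python's built-in `euc_kr` codec is an acceptable stand-in for the Mozilla converter. The helper `decode_units` is hypothetical and decodes each parenthesized group on its own, so the byte stream is identical in both cases and only the character boundaries differ.

```python
from typing import List

# Byte groups as written in the comment: one list entry per parenthesized group.
CORRECT = [b"\x20", b"\xb0\xcd", b"\xc0\xb8", b"\xb7\xce", b"\x20"]
MISALIGNED = [b"\x20", b"\xb0", b"\xcd\xc0", b"\xb8\xb7", b"\xce", b"\x20"]


def decode_units(units: List[bytes]) -> List[str]:
    """Decode each group separately and report the resulting code points."""
    text = "".join(u.decode("euc_kr", errors="replace") for u in units)
    return ["U+%04X" % ord(ch) for ch in text]


# Same bytes on the wire, different character boundaries.
assert b"".join(CORRECT) == b"".join(MISALIGNED)

print(decode_units(CORRECT))
print(decode_units(MISALIGNED))  # lone lead bytes come out as U+FFFD
```

In this sketch the lone `B0` and `CE` bytes become the replacement character (U+FFFD), which corresponds to the UNKNOWN glyphs described above.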

Comment 11

17 years ago
Created attachment 40858 [details]
screenshot of Mozilla displaying the URL in my prev. comment

Comment 12

17 years ago
The page I gave in my prev. comment may not get corrupted if you 
try to reproduce it. When I revisited the page after quitting and
restarting Mozilla (MS-Windows ME) 0.9.1, the page rendered all right
on my first attempt. However, when I reloaded the page, I was able
to reproduce the problem. Under Linux, there was no problem with the page.
This does not mean that there's no problem under Linux; it just means
that it's very hard to find any regularity in this bug. There
are web pages (where I don't see any problem in MS-Windows) with this symptom
in the Linux version of Mozilla m0.9.1. One such page is

http://www.hani.co.kr/section-003000000/2001/05/003000000200105131503349.html

where '(20) (C0,CF) (C1,A4) (C0,BB) (20)' in EUC-KR is misaligned
and is treated as '(20) (C0,CF) (C1) (A4,C0), (BB) (20)'. 

 

Comment 13

17 years ago
Moving it to m0.9.3.
Target Milestone: Future → mozilla0.9.3
ftang: if you want me to, I could do some investigation on this bug...
Per ftang's comment, this has recently been improved, but is still not completely 
fixed. Keeping it open and pushing it out to 0.9.4.
Target Milestone: mozilla0.9.3 → mozilla0.9.4

Comment 16

17 years ago
I think we fixed a lot of issues in m0.9.2 already. Moving this one to m0.9.7.
Target Milestone: mozilla0.9.4 → mozilla0.9.7

Comment 17

17 years ago
Moving it to Future for now.
Target Milestone: mozilla0.9.7 → Future

Updated

15 years ago
Blocks: 187812

Comment 18

13 years ago
What a hack. I have not touched Mozilla code for 2 years. I haven't read these bugs
for 2 years. And they are still there. Just closing them as WONTFIX to clean up.
Status: ASSIGNED → RESOLVED
Last Resolved: 18 years ago → 13 years ago
Resolution: --- → WONTFIX

Comment 19

13 years ago
Mass reopening of bugs that Frank Tang closed with no good reason. The spam is his
fault, not my own.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---

Comment 20

13 years ago
Mass reassigning Frank Tang's old bugs that he closed as WONTFIX and that had to be
reopened. The spam is his fault, not my own.
Assignee: ftang → nobody
Status: REOPENED → NEW
Filter on "Nobody_NScomTLD_20080620"
Assignee: nobody → smontagu
QA Contact: teruko → i18n
I believe this was already fixed earlier; in any case, it was fixed by bug 1261841 at the latest.
Status: NEW → RESOLVED
Last Resolved: 13 years ago → 8 months ago
Resolution: --- → FIXED
Whiteboard: [fixed by encoding_rs]
Assignee: smontagu → hsivonen
Target Milestone: Future → mozilla56