Closed Bug 26920 Opened 25 years ago Closed 7 years ago

Generic UCV buffering scheme byte misalignments

Categories

(Core :: Internationalization, defect, P3)

defect

Tracking


RESOLVED FIXED
mozilla56

People

(Reporter: jbetak, Assigned: hsivonen)

References


Details

(Keywords: intl, Whiteboard: [fixed by encoding_rs])

Attachments

(1 file)

If you use an old build (e.g. 2000020310 or earlier), then you will notice that 
we "eat" some parts of the HTML and the file rendering is disrupted. Although 
this buffering scheme is not used in the UTF-8 decoder anymore, it is still 
needed for other decoders and should be revisited since there might be problems 
with buffer alignment. 

I looked at it in the debugger and my impression was just that - the buffer 
comes back with inappropriately aligned bytes and we lose some of them in the 
process...
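For reference, a minimal sketch of the carry-over state a streaming two-byte decoder has to keep between buffers (hypothetical code, not the actual UCV converter implementation): if the pending lead byte is dropped or misplaced at a buffer boundary, every following character is shifted by one octet, which is exactly this kind of misalignment.

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct TwoByteStreamDecoder {
  uint8_t mPendingLead = 0;   // lead byte left over from the previous buffer
  bool mHasPending = false;

  // Returns the characters assembled from this buffer as (lead, trail) pairs;
  // a single-byte character is reported as (0, byte).
  std::vector<std::pair<uint8_t, uint8_t>> Feed(const uint8_t* aSrc, size_t aLen) {
    std::vector<std::pair<uint8_t, uint8_t>> chars;
    size_t i = 0;
    if (mHasPending && aLen > 0) {
      chars.push_back({mPendingLead, aSrc[0]});   // finish the character split across buffers
      mHasPending = false;
      i = 1;
    }
    while (i < aLen) {
      if (aSrc[i] < 0x80) {
        chars.push_back({0, aSrc[i]});            // single-byte (ASCII) character
        ++i;
      } else if (i + 1 < aLen) {
        chars.push_back({aSrc[i], aSrc[i + 1]});  // complete two-byte character
        i += 2;
      } else {
        mPendingLead = aSrc[i];                   // buffer ends in the middle of a character:
        mHasPending = true;                       // this byte MUST survive to the next Feed()
        ++i;
      }
    }
    return chars;
  }
};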
jbetak - Can this bug be reproduced with the current build and other charsets?
Assignee: ftang → cata
Status: NEW → ASSIGNED
Since this is a rather random and difficult-to-reproduce bug, I'm moving it far out. No
need to worry until it bites us harder or we have extra time.
Target Milestone: M20
Cata and Juraj, can you better characterize the nature of this bug?
Then IQA can do some testing to assure us that this is indeed a rare
case.  I don't want to find out that this is not rare after Beta1.
No specific information was provided. Marking it as invalid.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → INVALID
After talking with Bob, we should reopen this bug. I'm reopening it and marking it as future.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Target Milestone: M20 → Future
Are you sure, guys? For all I know, Juraj was the only one seeing this once or 
twice a long time ago. I never heard of other occurrences, and I cannot 
reproduce it...

Why reopen?
Have you carefully reviewed the code where Juraj thought he saw the loss of
misaligned bytes?  How did you try to reproduce this?  What test cases do you
have?  Can we do any instrumentation (e.g., ASSERTs) to try to catch this?

Losing random bytes can be very hard to find.  Let's reassure ourselves that
there are no edge cases where bytes are being lost before invalidating this.
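As a sketch of the kind of instrumentation meant here (a hypothetical audit helper, not the existing converter interface), an assertion like the following would catch bytes that are silently dropped across Convert() calls: every byte offered must either be consumed into output or still be pending as an incomplete character.

#include <cassert>
#include <cstddef>

struct ConversionAudit {
  size_t mBytesOffered = 0;
  size_t mBytesConsumed = 0;
  size_t mBytesPending = 0;

  // Call once after each conversion call with the number of source bytes
  // offered, the number actually consumed into output, and the number the
  // decoder claims to still be holding as an incomplete character.
  void RecordCall(size_t aOffered, size_t aConsumed, size_t aPendingAfter) {
    mBytesOffered += aOffered;
    mBytesConsumed += aConsumed;
    mBytesPending = aPendingAfter;
    // If this ever fires, the buffering scheme dropped (or double-counted) bytes.
    assert(mBytesConsumed + mBytesPending == mBytesOffered);
  }
};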
I tried to reproduce this with the URL from the bug report: 
http://people/ftang/demo/utf8all.html. That's the only test case I know of. And 
it worked just fine.

The reason I do not believe this bug is valid is the nature of that code. It is 
byte-processing code. Very deterministic. If there's a problem once, I expect it 
to always be there, every time we process that byte stream. Also, that code is 
shared by *all* converters. It is exercised for every multibyte page. Any "byte 
eating" should be *very* obvious (pretty much garbage on the rest of the whole 
page...). And yet this is the only report we have and I can't reproduce it.

So, I'll leave it up to you if you want to close the bug or leave it open.
Keywords: intl
Moving all of cata's bugs to ftang.
Assignee: cata → ftang
Status: REOPENED → NEW
Status: NEW → ASSIGNED
I hadn't seen the symptom of this bug before 0.9.1, but with
0.9.1 I came across a lot of pages (Korean in EUC-KR) manifesting
what I believe to be a symptom of this bug. Sometimes the misalignment
in the converter seems to get fixed by reloading, but most
of the time reloading pages doesn't help. 
This bug is very very serious. For instance, look at
http://www.hani.co.kr/section-005000000/2001/07/005000000200107021018305.html

Around the end of the 4th paragraph, misalignment occurs, and what should be
three Hangul syllables (U+AC83, U+C73C, U+B85C)
is rendered as UNKNOWN ("?" inside a diamond), U+75FC, U+B9C9, UNKNOWN.
The sequence (in EUC-KR) is 

  (20) (B0,CD) (C0,B8) (B7,CE) (20)        : EUC-KR

which should be converted to

  U+0020 U+AC83 U+C73C U+B85C U+0020       : UCS-2 converted from correct EUC-KR

but is instead interpreted as 

  (20) (B0) (CD,C0) (B8,B7) (CE) (20)      : misaligned EUC-KR

which, in turn, is converted to

  U+0020, UNKNOWN, U+75FC, U+B9C9, UNKNOWN, U+0020 : UCS-2 (converted from misaligned EUC-KR)

where a pair of parentheses denotes the sequence of octet(s)
for a single character. 

   I run into this problem on *every few* Korean pages, but it doesn't
seem to have any pattern (at least my casual inspection
hasn't given me any regularity). Even when there are two identical
strings in a single page, one of them gets corrupted while
the other doesn't. 
   
   As I wrote above, I think this is very serious, and fixing
this cannot be put off any longer. 
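To make the failure mode concrete, here is a small standalone sketch (hypothetical code; the lookup table contains only the byte-to-code-point mappings quoted above, not the real EUC-KR table) showing how a pending lead byte dropped at a buffer boundary yields exactly the corrupted sequence described:

#include <cstdint>
#include <cstdio>

// Only the mappings quoted in this comment; NOT the real EUC-KR table.
static unsigned Lookup(uint8_t lead, uint8_t trail) {
  if (lead == 0xB0 && trail == 0xCD) return 0xAC83;
  if (lead == 0xC0 && trail == 0xB8) return 0xC73C;
  if (lead == 0xB7 && trail == 0xCE) return 0xB85C;
  if (lead == 0xCD && trail == 0xC0) return 0x75FC;
  if (lead == 0xB8 && trail == 0xB7) return 0xB9C9;
  return 0xFFFD;  // UNKNOWN / replacement character
}

int main() {
  // The stream 20 B0 CD C0 B8 B7 CE 20 arrives split across two buffers,
  // with the first buffer ending on the lead byte B0.
  const uint8_t buf1[] = {0x20, 0xB0};
  const uint8_t buf2[] = {0xCD, 0xC0, 0xB8, 0xB7, 0xCE, 0x20};

  // Correct decoding: the pending lead byte 0xB0 is carried into buf2.
  printf("correct:    U+0020 U+%04X U+%04X U+%04X U+0020\n",
         Lookup(buf1[1], buf2[0]), Lookup(buf2[1], buf2[2]),
         Lookup(buf2[3], buf2[4]));

  // Buggy decoding: the pending lead byte is forgotten, so buf2 is re-paired
  // from its first byte onward: (CD,C0) (B8,B7) and an orphaned CE.
  printf("misaligned: U+0020 UNKNOWN U+%04X U+%04X UNKNOWN U+0020\n",
         Lookup(buf2[0], buf2[1]), Lookup(buf2[2], buf2[3]));
  return 0;
}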
The page I gave in my previous comment may not get corrupted if you 
try to reproduce it. When I revisited the page after quitting and
restarting Mozilla (MS-Windows ME) 0.9.1, the page rendered all right
on my first attempt. However, when I reloaded the page, I was able
to reproduce the problem. Under Linux, there was no problem on that page.
This does not mean that there's no problem under Linux, but just means
that it's very hard to find any regularity in this bug. There
are web pages (where I don't see any problem on MS-Windows) with this symptom
in the Linux version of Mozilla 0.9.1. One such page is

http://www.hani.co.kr/section-003000000/2001/05/003000000200105131503349.html

where '(20) (C0,CF) (C1,A4) (C0,BB) (20)' in EUC-KR is misaligned
and is treated as '(20) (C0,CF) (C1) (A4,C0) (BB) (20)'. 

 
move it to m0.9.3 
Target Milestone: Future → mozilla0.9.3
ftang: if you want me to, I could do some investigation on this bug...
Per ftang's comment, this has recently been improved, but it is still not completely 
fixed. Keeping open and pushing out to 0.9.4.
Target Milestone: mozilla0.9.3 → mozilla0.9.4
I think we already fixed a lot of issues in m0.9.2. Moving this one to m0.9.7.
Target Milestone: mozilla0.9.4 → mozilla0.9.7
future it for now.
Target Milestone: mozilla0.9.7 → Future
Blocks: 187812
What a hack. I have not touched Mozilla code for 2 years. I haven't read these bugs
for 2 years, and they are still there. Just closing them as won't fix to clean up.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago → 19 years ago
Resolution: --- → WONTFIX
Mass reopen of bugs Frank Tang closed with no good reason. The spam is his
fault, not my own.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Mass reassigning Frank Tang's old bugs that he closed as won't fix and that had to be
reopened. The spam is his fault, not my own.
Assignee: ftang → nobody
Status: REOPENED → NEW
Filter on "Nobody_NScomTLD_20080620"
Assignee: nobody → smontagu
QA Contact: teruko → i18n
Depends on: encoding_rs
I believe this was already fixed earlier, but in any case it was fixed by bug 1261841 at the latest.
Status: NEW → RESOLVED
Closed: 19 years ago → 7 years ago
Resolution: --- → FIXED
Whiteboard: [fixed by encoding_rs]
Assignee: smontagu → hsivonen
Target Milestone: Future → mozilla56