Closed Bug 67520 Opened 24 years ago Closed 24 years ago

EUC-TW converersion lost data, probably in block boundary

Tracking

()

Status:

VERIFIED FIXED

Milestone:

mozilla0.9

People

(Reporter: bstell, Assigned: tetsuroy)

References

(
URL
)

Details

Attachments

(4 files)

screen shot with a block missing at e1b0 24 years ago kill this account 34.30 KB, image/gif		Details
The same implementation was repeated throughout the UCONV. They may or may not be used; but nonetheless should be fixed. 24 years ago Roy Yokoyama 4.80 KB, patch		Details \| Diff \| Splinter Review
Adding printf() msg in case of unexpected errors. 24 years ago Roy Yokoyama 10.11 KB, patch		Details \| Diff \| Splinter Review
Using of Assert instead of printf(). Fixing the msg typo at the same time. 24 years ago Roy Yokoyama 10.32 KB, patch		Details \| Diff \| Splinter Review

kill this account

Reporter

Description

•

24 years ago

block of characters seem to be lost by the converter

kill this account

Reporter

Comment 1

•

24 years ago

Attached image screen shot with a block missing at e1b0 — Details

Frank Tang

Comment 2

•

24 years ago

I think the problem is our buffering mechanism have some problem . This should be a P1 bug. schedule moz0.9 Change the summary to "EUC-TW converersion lost data, probably in block boundary"

Priority: -- → P1

Summary: convert loses characters → EUC-TW converersion lost data, probably in block boundary

Frank Tang

Comment 3

•

24 years ago

I think this is a block bundary problem. In my window, I have to click super reload (Shift Reload) to recreate the problem. If I don't superreload but just reload, it looks fine (probably because the data is from the cache so the block is bigger). But if I do a super reload , it chunk after c7e0

Frank Tang

Comment 4

•

24 years ago

roy- I mention this to you last time. Can you work on this ASAP. THanks.

Assignee: ftang → yokoyama

Roy Yokoyama

Assignee

Updated

•

24 years ago

Status: NEW → ASSIGNED

msanz

Updated

•

24 years ago

QA Contact: teruko → ji

msanz

Comment 5

•

24 years ago

In order for QA to work on this, can you attach a test case or tell us how to reproduce it? Reassigning QA to ji

kill this account

Reporter

Comment 6

•

24 years ago

I get lost data when I shift-reload the http://village.infoweb.ne.jp/~katakai/mozilla/zh_TW_EUC_1.txt link. Here is a sample of what it should look like (do a cache reload): Note the lines are continous from c7c0 thru c8f0. c7c0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c7d0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c7e0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c7f0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c8a0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c8b0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c8c0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c8d0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c8e0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c8f0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? Here is the bad version. Note that we jump form line c7e0 to d5c0. c7c0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c7d0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c7e0 ? ? ? ? ? ? ? ? ? ? ? ? ?? ? ? ? ? ? d5c0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? d5d0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? d5e0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?

kill this account

Reporter

Comment 7

•

24 years ago

interesting, the data displayed correctly when I submitted the bug but when I redisplay the bug I see question marks (?) for all the asian characters. still one can see the jump in line numbers.

Roy Yokoyama

Assignee

Comment 8

•

24 years ago

bstell: thanks I was able to see the jump in my Win32 as well. :)

QA Contact: ji → teruko

msanz

Updated

•

24 years ago

QA Contact: teruko → ji

Comment 9

•

24 years ago

Frank, is there Korean code table somewhere so I can check this bug with Korean characters?

kill this account

Reporter

Comment 10

•

24 years ago

from a separate email from ftang: > Notice it apply to Traditional Chinese, Simplified Chinese > AND Korean but not Japanese.

bobj

Comment 11

•

24 years ago

Don't know if this is related, but the bug sounds similar: bug 26920 "Generic UCV buffering scheme byte misalignments"

Roy Yokoyama

Assignee

Comment 12

•

24 years ago

bstell: I see ? mark for xDEFF and other code points. Is this also a bug? This looks recursing every 96 points ( 16 x 6 )

kill this account

Reporter

Comment 13

•

24 years ago

this encoding is a 94x94 encoding (7bit, less control chars 00-1F, space 20, delete 7F) the 8 bit is set so each set goes from xxA1 to xxFE (without 8th bit = xx21 to xx7F) thus the xxFF should not be mapped.

Roy Yokoyama

Assignee

Updated

•

24 years ago

Target Milestone: --- → mozilla0.9

Comment 14

•

24 years ago

Checked linux 02/19 mtrunk build with euc-kr. The characters between 0xB080-0xB2FF and 0xBC80-0xBFFF are lost.

Roy Yokoyama

Assignee

Comment 15

•

24 years ago

I think I have a fix for this (I will post a patch in few minutes). Basically, the problem happens when the conveter decodes the internal buffer into Unicode and some extra data in the internal buffer was left unconverted. In this particular case, the original code puts back the unconverted byte(s) into the source buffer and set the internal buffer count to be 0. What the original code didn't do is to process the source buffer after. It moved on and processed the next chunk (2048 bytes) resulting the missing some data.

Masaki Katakai

Updated

•

24 years ago

Blocks: 67520

Depends on: 67520

Masaki Katakai

Updated

•

24 years ago

Blocks: 60916

No longer depends on: 67520

Roy Yokoyama

Assignee

Comment 16

•

24 years ago

Attached patch The same implementation was repeated throughout the UCONV. They may or may not be used; but nonetheless should be fixed. — Details — Splinter Review

bobj

Comment 17

•

24 years ago

If this was "repeated throughout the UCONV", why didn't this break for Japanese? From 2001-02-16 18:12 Comments: from a separate email from ftang: > Notice it apply to Traditional Chinese, Simplified Chinese > AND Korean but not Japanese.

Roy Yokoyama

Assignee

Comment 18

•

24 years ago

Some of the converters were re-written by Frank awhile back, and Japanese was one of them. I believe he didn't remove the old-unused code. He can identify list of coverters which no longer uses the same method.

Roy Yokoyama

Assignee

Comment 19

•

24 years ago

Attached patch Adding printf() msg in case of unexpected errors. — Details — Splinter Review

Roy Yokoyama

Assignee

Comment 20

•

24 years ago

Attached patch Using of Assert instead of printf(). Fixing the msg typo at the same time. — Details — Splinter Review

Roy Yokoyama

Assignee

Comment 21

•

24 years ago

Received /r=ftang.

Brendan Eich [:brendan]

Comment 22

•

24 years ago

This is a nit I always pick: +#endif break; } else { else after break (or after return, goto, continue, throw) is a non-sequitur. Just remove the else and its braces, unindenting its code. The file generally uses two spaces per indentation level, but the code at the same statement level just above the break, and the break; itself, use four spaces of indentation from the previous level. Unintentional style glitch? There's an earlier if-then whose then block ends with a break, but there is no else with a block enclosing the rest of the for(;;) loop, so if you think this else after break (the one I excerpted above) is justified, why not do it earlier too? One (bad) reason is that that earlier break (the one at line 140 of nsUCvMinSupport.cpp) exists only to justify the for(;;) loop at line 136. The failure to indent nested control structures (if-if on line 134, else-for on line 136 of nsUCvMinSupport.cpp, e.g.) makes things harder to see. But once you reformat things, isn't the reduced control structure this? if (buffer not empty) { if (no room to write) { res = NS_OK_UDEC_MOREOUTPUT; } else { for (;;) { if (no more input) { res = NS_OK_UDEC_MOREINPUT; break; } res = decode some bytes; if (res == NS_OK_UDEC_MOREINPUT && nothing decoded) { res = NS_ERROR_UNEXPECTED; break; } if (we didn't convert all the input) { unfill the buffer; } else { unget the extra input we converted; res = NS_OK; } break; } // for } // if else } // if Notice that the for loop never iterates more than once! It exists only so that the first break can avoid the rest of the code in its body. That calls for an else (properly, no non-sequitur), not a loop with a break: if (buffer not empty) { if (no room to write) { res = NS_OK_UDEC_MOREOUTPUT; } else if (no more input) { res = NS_OK_UDEC_MOREINPUT; } else { res = decode some bytes; if (res == NS_OK_UDEC_MOREINPUT && nothing decoded) { res = NS_ERROR_UNEXPECTED; } else if (we didn't convert all the input) { unfill the buffer; } else { unget the extra input we converted; res = NS_OK; } } // if else } // if Please clarify the control flow and comment it better. I'd like to see a patch that avoids the unnecessary loop and breaks before stamping sr=brendan, because after studying the surrounding code for twenty minutes, I'm *still* not sure all cases are covered correctly. The DEBUG_yokoyama etc. #if's are your business -- just curious whether those were asserting for you guys, in which case there are buffer corruption bugs biting? /be

Roy Yokoyama

Assignee

Comment 23

•

24 years ago

After spending few days on these old codes and tried to understand what's going on, I felt exactly the same way. There are tons of codes in this module that need to be following the coding standard and improving the readability. I have filed a bug with regard to this (#70270). Whomever assigned this bug may need to rewrite the existing code entirely. I am getting the feeling it will eventually be me :) As for the #ifdef's, the Assertion is not happening for us and it should never happen. However, if it ever happens, then it will produce an unpredictable behaivuor and we would like to catch it. /yokoyama

Brendan Eich [:brendan]

Comment 24

•

24 years ago

Ok, if you testify that the patch improves the observed behavior of the code enough to check in, even if neither of us entirely understands the control flow, I can rubber-stamp sr= (rs=, heh; seriously, let's not do to much more of this!). sr=brendan@mozilla.org, thanks for reporting the follow-on bug. /be

kill this account

Reporter

Comment 25

•

24 years ago

As Brendan points out the code is nasty (and the developer is no longer with us ...). The effect of the "for (;;)" on line 136 and the breaks (eg line 140) is a "goto" (an obfuscated goto none the less!!). 136 } else for (;;) { 137 // we need new data to add to the buffer 138 if (src == srcEnd) { 139 res = NS_OK_UDEC_MOREINPUT; 140 break; 141 } We discussed at length how to handle this and the conclusion was to separate the fix and the "de-nastification". This way future developers can see the fix separately from the cleanup.

Roy Yokoyama

Assignee

Comment 26

•

24 years ago

checked-in.

Status: ASSIGNED → RESOLVED

Closed: 24 years ago

Resolution: --- → FIXED

Comment 27

•

24 years ago

Checked with 03/02 mtrunk build. The problems with EUC-TW and EUC-KR have been fixed.

Status: RESOLVED → VERIFIED

You need to log in before you can comment on or make changes to this bug.