Closed Bug 67520 Opened 24 years ago Closed 24 years ago

EUC-TW converersion lost data, probably in block boundary

Categories

(Core :: Internationalization, defect, P1)

x86
Linux
defect

Tracking

()

VERIFIED FIXED
mozilla0.9

People

(Reporter: bstell, Assigned: tetsuroy)

References

()

Details

Attachments

(4 files)

block of characters seem to be lost by the converter
I think the problem is our buffering mechanism have some problem . This should be a P1 bug. schedule moz0.9 Change the summary to "EUC-TW converersion lost data, probably in block boundary"
Priority: -- → P1
Summary: convert loses characters → EUC-TW converersion lost data, probably in block boundary
I think this is a block bundary problem. In my window, I have to click super reload (Shift Reload) to recreate the problem. If I don't superreload but just reload, it looks fine (probably because the data is from the cache so the block is bigger). But if I do a super reload , it chunk after c7e0
roy- I mention this to you last time. Can you work on this ASAP. THanks.
Assignee: ftang → yokoyama
Status: NEW → ASSIGNED
QA Contact: teruko → ji
In order for QA to work on this, can you attach a test case or tell us how to reproduce it? Reassigning QA to ji
I get lost data when I shift-reload the http://village.infoweb.ne.jp/~katakai/mozilla/zh_TW_EUC_1.txt link. Here is a sample of what it should look like (do a cache reload): Note the lines are continous from c7c0 thru c8f0. c7c0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c7d0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c7e0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c7f0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c8a0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c8b0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c8c0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c8d0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c8e0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c8f0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? Here is the bad version. Note that we jump form line c7e0 to d5c0. c7c0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c7d0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? c7e0 ? ? ? ? ? ? ? ? ? ? ? ? ?? ? ? ? ? ? d5c0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? d5d0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? d5e0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
interesting, the data displayed correctly when I submitted the bug but when I redisplay the bug I see question marks (?) for all the asian characters. still one can see the jump in line numbers.
bstell: thanks I was able to see the jump in my Win32 as well. :)
QA Contact: ji → teruko
QA Contact: teruko → ji
Frank, is there Korean code table somewhere so I can check this bug with Korean characters?
from a separate email from ftang: > Notice it apply to Traditional Chinese, Simplified Chinese > AND Korean but not Japanese.
Don't know if this is related, but the bug sounds similar: bug 26920 "Generic UCV buffering scheme byte misalignments"
bstell: I see ? mark for xDEFF and other code points. Is this also a bug? This looks recursing every 96 points ( 16 x 6 )
this encoding is a 94x94 encoding (7bit, less control chars 00-1F, space 20, delete 7F) the 8 bit is set so each set goes from xxA1 to xxFE (without 8th bit = xx21 to xx7F) thus the xxFF should not be mapped.
Target Milestone: --- → mozilla0.9
Checked linux 02/19 mtrunk build with euc-kr. The characters between 0xB080-0xB2FF and 0xBC80-0xBFFF are lost.
I think I have a fix for this (I will post a patch in few minutes). Basically, the problem happens when the conveter decodes the internal buffer into Unicode and some extra data in the internal buffer was left unconverted. In this particular case, the original code puts back the unconverted byte(s) into the source buffer and set the internal buffer count to be 0. What the original code didn't do is to process the source buffer after. It moved on and processed the next chunk (2048 bytes) resulting the missing some data.
Blocks: 67520
Depends on: 67520
Blocks: 60916
No longer depends on: 67520
If this was "repeated throughout the UCONV", why didn't this break for Japanese? From 2001-02-16 18:12 Comments: from a separate email from ftang: > Notice it apply to Traditional Chinese, Simplified Chinese > AND Korean but not Japanese.
Some of the converters were re-written by Frank awhile back, and Japanese was one of them. I believe he didn't remove the old-unused code. He can identify list of coverters which no longer uses the same method.
Received /r=ftang.
This is a nit I always pick: +#endif break; } else { else after break (or after return, goto, continue, throw) is a non-sequitur. Just remove the else and its braces, unindenting its code. The file generally uses two spaces per indentation level, but the code at the same statement level just above the break, and the break; itself, use four spaces of indentation from the previous level. Unintentional style glitch? There's an earlier if-then whose then block ends with a break, but there is no else with a block enclosing the rest of the for(;;) loop, so if you think this else after break (the one I excerpted above) is justified, why not do it earlier too? One (bad) reason is that that earlier break (the one at line 140 of nsUCvMinSupport.cpp) exists only to justify the for(;;) loop at line 136. The failure to indent nested control structures (if-if on line 134, else-for on line 136 of nsUCvMinSupport.cpp, e.g.) makes things harder to see. But once you reformat things, isn't the reduced control structure this? if (buffer not empty) { if (no room to write) { res = NS_OK_UDEC_MOREOUTPUT; } else { for (;;) { if (no more input) { res = NS_OK_UDEC_MOREINPUT; break; } res = decode some bytes; if (res == NS_OK_UDEC_MOREINPUT && nothing decoded) { res = NS_ERROR_UNEXPECTED; break; } if (we didn't convert all the input) { unfill the buffer; } else { unget the extra input we converted; res = NS_OK; } break; } // for } // if else } // if Notice that the for loop never iterates more than once! It exists only so that the first break can avoid the rest of the code in its body. That calls for an else (properly, no non-sequitur), not a loop with a break: if (buffer not empty) { if (no room to write) { res = NS_OK_UDEC_MOREOUTPUT; } else if (no more input) { res = NS_OK_UDEC_MOREINPUT; } else { res = decode some bytes; if (res == NS_OK_UDEC_MOREINPUT && nothing decoded) { res = NS_ERROR_UNEXPECTED; } else if (we didn't convert all the input) { unfill the buffer; } else { unget the extra input we converted; res = NS_OK; } } // if else } // if Please clarify the control flow and comment it better. I'd like to see a patch that avoids the unnecessary loop and breaks before stamping sr=brendan, because after studying the surrounding code for twenty minutes, I'm *still* not sure all cases are covered correctly. The DEBUG_yokoyama etc. #if's are your business -- just curious whether those were asserting for you guys, in which case there are buffer corruption bugs biting? /be
After spending few days on these old codes and tried to understand what's going on, I felt exactly the same way. There are tons of codes in this module that need to be following the coding standard and improving the readability. I have filed a bug with regard to this (#70270). Whomever assigned this bug may need to rewrite the existing code entirely. I am getting the feeling it will eventually be me :) As for the #ifdef's, the Assertion is not happening for us and it should never happen. However, if it ever happens, then it will produce an unpredictable behaivuor and we would like to catch it. /yokoyama
Ok, if you testify that the patch improves the observed behavior of the code enough to check in, even if neither of us entirely understands the control flow, I can rubber-stamp sr= (rs=, heh; seriously, let's not do to much more of this!). sr=brendan@mozilla.org, thanks for reporting the follow-on bug. /be
As Brendan points out the code is nasty (and the developer is no longer with us ...). The effect of the "for (;;)" on line 136 and the breaks (eg line 140) is a "goto" (an obfuscated goto none the less!!). 136 } else for (;;) { 137 // we need new data to add to the buffer 138 if (src == srcEnd) { 139 res = NS_OK_UDEC_MOREINPUT; 140 break; 141 } We discussed at length how to handle this and the conclusion was to separate the fix and the "de-nastification". This way future developers can see the fix separately from the cleanup.
checked-in.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Checked with 03/02 mtrunk build. The problems with EUC-TW and EUC-KR have been fixed.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: