Closed
Bug 67520
Opened 24 years ago
Closed 24 years ago
EUC-TW converersion lost data, probably in block boundary
Categories
(Core :: Internationalization, defect, P1)
Tracking
()
VERIFIED
FIXED
mozilla0.9
People
(Reporter: bstell, Assigned: tetsuroy)
References
()
Details
Attachments
(4 files)
34.30 KB,
image/gif
|
Details | |
4.80 KB,
patch
|
Details | Diff | Splinter Review | |
10.11 KB,
patch
|
Details | Diff | Splinter Review | |
10.32 KB,
patch
|
Details | Diff | Splinter Review |
block of characters seem to be lost by the converter
Reporter | ||
Comment 1•24 years ago
|
||
Comment 2•24 years ago
|
||
I think the problem is our buffering mechanism have some problem . This should
be a P1 bug. schedule moz0.9
Change the summary to "EUC-TW converersion lost data, probably in block boundary"
Priority: -- → P1
Summary: convert loses characters → EUC-TW converersion lost data, probably in block boundary
Comment 3•24 years ago
|
||
I think this is a block bundary problem. In my window, I have to click super
reload (Shift Reload) to recreate the problem. If I don't superreload but just
reload, it looks fine (probably because the data is from the cache so the block
is bigger). But if I do a super reload , it chunk after c7e0
Comment 4•24 years ago
|
||
roy- I mention this to you last time. Can you work on this ASAP. THanks.
Assignee: ftang → yokoyama
Assignee | ||
Updated•24 years ago
|
Status: NEW → ASSIGNED
In order for QA to work on this, can you attach a test case or tell us how to
reproduce it? Reassigning QA to ji
Reporter | ||
Comment 6•24 years ago
|
||
I get lost data when I shift-reload the
http://village.infoweb.ne.jp/~katakai/mozilla/zh_TW_EUC_1.txt
link.
Here is a sample of what it should look like (do a cache reload):
Note the lines are continous from c7c0 thru c8f0.
c7c0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
c7d0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
c7e0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
c7f0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
c8a0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
c8b0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
c8c0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
c8d0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
c8e0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
c8f0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
Here is the bad version. Note that we jump form
line c7e0 to d5c0.
c7c0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
c7d0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
c7e0 ? ? ? ? ? ? ? ? ? ? ? ? ?? ? ? ? ? ?
d5c0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
d5d0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
d5e0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
Reporter | ||
Comment 7•24 years ago
|
||
interesting, the data displayed correctly when I submitted the bug but when I
redisplay the bug I see question marks (?) for all the asian characters.
still one can see the jump in line numbers.
Assignee | ||
Comment 8•24 years ago
|
||
bstell: thanks I was able to see the jump in my Win32 as well. :)
QA Contact: ji → teruko
Frank, is there Korean code table somewhere so I can check this bug with Korean
characters?
Reporter | ||
Comment 10•24 years ago
|
||
from a separate email from ftang:
> Notice it apply to Traditional Chinese, Simplified Chinese
> AND Korean but not Japanese.
Comment 11•24 years ago
|
||
Don't know if this is related, but the bug sounds similar:
bug 26920 "Generic UCV buffering scheme byte misalignments"
Assignee | ||
Comment 12•24 years ago
|
||
bstell: I see ? mark for xDEFF and other code points. Is this also a bug? This
looks recursing every 96 points ( 16 x 6 )
Reporter | ||
Comment 13•24 years ago
|
||
this encoding is a 94x94 encoding (7bit, less control chars 00-1F, space 20,
delete 7F)
the 8 bit is set so each set goes from xxA1 to xxFE (without 8th bit = xx21 to
xx7F)
thus the xxFF should not be mapped.
Assignee | ||
Updated•24 years ago
|
Target Milestone: --- → mozilla0.9
Comment 14•24 years ago
|
||
Checked linux 02/19 mtrunk build with euc-kr.
The characters between 0xB080-0xB2FF and 0xBC80-0xBFFF are lost.
Assignee | ||
Comment 15•24 years ago
|
||
I think I have a fix for this (I will post a patch in few minutes).
Basically, the problem happens when the conveter decodes the internal buffer
into Unicode and some extra data in the internal buffer
was left unconverted. In this particular case, the original code puts
back the unconverted byte(s) into the source buffer and set the internal buffer
count to be 0. What the original code didn't do is
to process the source buffer after. It moved on and processed the next chunk
(2048 bytes) resulting the missing some data.
Updated•24 years ago
|
Updated•24 years ago
|
Assignee | ||
Comment 16•24 years ago
|
||
Comment 17•24 years ago
|
||
If this was "repeated throughout the UCONV", why didn't this break for
Japanese?
From 2001-02-16 18:12 Comments:
from a separate email from ftang:
> Notice it apply to Traditional Chinese, Simplified Chinese
> AND Korean but not Japanese.
Assignee | ||
Comment 18•24 years ago
|
||
Some of the converters were re-written by Frank awhile back, and Japanese was
one of them. I believe he didn't remove the old-unused
code. He can identify list of coverters which no longer uses the same method.
Assignee | ||
Comment 19•24 years ago
|
||
Assignee | ||
Comment 20•24 years ago
|
||
Assignee | ||
Comment 21•24 years ago
|
||
Received /r=ftang.
Comment 22•24 years ago
|
||
This is a nit I always pick:
+#endif
break;
} else {
else after break (or after return, goto, continue, throw) is a non-sequitur.
Just remove the else and its braces, unindenting its code.
The file generally uses two spaces per indentation level, but the code at the
same statement level just above the break, and the break; itself, use four
spaces of indentation from the previous level. Unintentional style glitch?
There's an earlier if-then whose then block ends with a break, but there is no
else with a block enclosing the rest of the for(;;) loop, so if you think this
else after break (the one I excerpted above) is justified, why not do it earlier
too? One (bad) reason is that that earlier break (the one at line 140 of
nsUCvMinSupport.cpp) exists only to justify the for(;;) loop at line 136.
The failure to indent nested control structures (if-if on line 134, else-for on
line 136 of nsUCvMinSupport.cpp, e.g.) makes things harder to see. But once you
reformat things, isn't the reduced control structure this?
if (buffer not empty) {
if (no room to write) {
res = NS_OK_UDEC_MOREOUTPUT;
} else {
for (;;) {
if (no more input) {
res = NS_OK_UDEC_MOREINPUT;
break;
}
res = decode some bytes;
if (res == NS_OK_UDEC_MOREINPUT && nothing decoded) {
res = NS_ERROR_UNEXPECTED;
break;
}
if (we didn't convert all the input) {
unfill the buffer;
} else {
unget the extra input we converted;
res = NS_OK;
}
break;
} // for
} // if else
} // if
Notice that the for loop never iterates more than once! It exists only so that
the first break can avoid the rest of the code in its body. That calls for an
else (properly, no non-sequitur), not a loop with a break:
if (buffer not empty) {
if (no room to write) {
res = NS_OK_UDEC_MOREOUTPUT;
} else if (no more input) {
res = NS_OK_UDEC_MOREINPUT;
} else {
res = decode some bytes;
if (res == NS_OK_UDEC_MOREINPUT && nothing decoded) {
res = NS_ERROR_UNEXPECTED;
} else if (we didn't convert all the input) {
unfill the buffer;
} else {
unget the extra input we converted;
res = NS_OK;
}
} // if else
} // if
Please clarify the control flow and comment it better. I'd like to see a patch
that avoids the unnecessary loop and breaks before stamping sr=brendan, because
after studying the surrounding code for twenty minutes, I'm *still* not sure all
cases are covered correctly.
The DEBUG_yokoyama etc. #if's are your business -- just curious whether those
were asserting for you guys, in which case there are buffer corruption bugs biting?
/be
Assignee | ||
Comment 23•24 years ago
|
||
After spending few days on these old codes and tried to understand
what's going on, I felt exactly the same way.
There are tons of codes in this module that need to be following the coding
standard and improving the readability.
I have filed a bug with regard to this (#70270).
Whomever assigned this bug may need to rewrite the existing code entirely. I am
getting the feeling it will eventually be me :)
As for the #ifdef's, the Assertion is not happening for us and
it should never happen. However, if it ever happens, then it will
produce an unpredictable behaivuor and we would like to catch it.
/yokoyama
Comment 24•24 years ago
|
||
Ok, if you testify that the patch improves the observed behavior of the code
enough to check in, even if neither of us entirely understands the control flow,
I can rubber-stamp sr= (rs=, heh; seriously, let's not do to much more of
this!). sr=brendan@mozilla.org, thanks for reporting the follow-on bug.
/be
Reporter | ||
Comment 25•24 years ago
|
||
As Brendan points out the code is nasty (and the developer is no longer
with us ...).
The effect of the "for (;;)" on line 136 and the breaks (eg line 140)
is a "goto" (an obfuscated goto none the less!!).
136 } else for (;;) {
137 // we need new data to add to the buffer
138 if (src == srcEnd) {
139 res = NS_OK_UDEC_MOREINPUT;
140 break;
141 }
We discussed at length how to handle this and the conclusion was
to separate the fix and the "de-nastification". This way future
developers can see the fix separately from the cleanup.
Assignee | ||
Comment 26•24 years ago
|
||
checked-in.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Comment 27•24 years ago
|
||
Checked with 03/02 mtrunk build. The problems with EUC-TW and EUC-KR have been fixed.
Status: RESOLVED → VERIFIED
You need to log in
before you can comment on or make changes to this bug.
Description
•