Closed
Bug 25037
Opened 25 years ago
Closed 23 years ago
illegal 0xA0 code point in Multibyte charset break parser
Categories
(Core :: DOM: HTML Parser, defect, P3)
Tracking
(VERIFIED FIXED, mozilla0.9)
People
(Reporter: teruko, Assigned: shanjian)
Attachments
(7 files)
1.66 KB, text/html
178 bytes, text/html
1.37 KB, patch
754 bytes, patch
627 bytes, patch
1.20 KB, patch
1.34 KB, patch
When you see the above page, the last table on the right displays "/TD".
Steps to reproduce:
1. Go to the above URL.
2. Look at the table on the right. Under "DAVOS 2000", "/TD" is displayed.
3. Select menu View|Page Source and look at the line
<TD VALIGN=TOP><LI> /TD>
The source does not have "<" before "/TD>". However, when I look at the source of this page in Communicator 4.x, I see:
<TD VALIGN=TOP><LI></TD>
Tested on the 2000012515 Win32 and Linux builds.
This works perfectly for me. Petersen, can you see the problem? Also -- a reduced test case would be very helpful.
Assignee: rickg → petersen
I reproduced this with the 1/25/2000 M14 build running on Win95. The strange thing is that when I saved the page to the blues server, intending to make a simpler test case, it worked! I looked at the saved, unmodified file on blues, and it looked the same. Here is my saved file: http://blues/users/bobj/publish/test/zh-tw-index.html
Comment 3•25 years ago
With the Feb 07 build (20000020608), I can't reproduce the problem described.
Assignee: petersen → rickg
Harish -- Petersen and I don't see this, but you might give it a try.
Assignee: rickg → harishd
Reporter
Comment 5•25 years ago
Since the content of http://home.netscape.com/zh/cn/ has changed, I copied a simple page to http://jazz/users/teruko/tests/cntest1.html
In my 2000-01-26 14:39 comment, I had strange results. I could reproduce it, but when I copied the page to another server, it worked...
Reporter
Comment 9•25 years ago
Correction to my previous comment: since the content of http://home.netscape.com/zh/cn/ has changed, I copied a simple page to http://jazz/users/teruko/publish/tests/cntest1.html
harishd, did you try http://jazz/users/teruko/publish/tests/cntest1.html?
Reporter
Comment 10•25 years ago
Comment 11•25 years ago
Comment 12•25 years ago
Could be a bug in ftang's code in nsScanner::Append(). Frank??
Comment 13•25 years ago
Can you reproduce the problem w/ the "Simpler test case" http://bugzilla.mozilla.org/showattachment.cgi?attach_id=5105 ?
Comment 14•25 years ago
harishd is right. This is a converter problem. Somehow there is a 0xA0 byte after <LI>, but 0xA0 is not a legal code point in GB2312. Our error-handling code eats two bytes instead of one for it, which causes the '<' to get eaten. Reassigning this back to ftang. Good catch, harishd. jbetak- is this similar to the one you found in the old UTF-8 code?
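The failure mode described in this comment can be reproduced with a toy double-byte decoder. This is only a sketch, not Mozilla's converter: `decode` and its `skip_on_error` parameter are hypothetical names, and the byte ranges assume the usual EUC-CN layout (0xA1..0xFE for both lead and trail bytes). With a two-byte skip on error, the '<' after the stray 0xA0 is swallowed; with a one-byte skip, it survives.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, emitted for any undecodable byte


def in_gb2312_range(b):
    # Assumption: GB2312 in its EUC-CN form uses 0xA1..0xFE for lead and trail bytes.
    return 0xA1 <= b <= 0xFE


def decode(data, skip_on_error):
    """Toy GB2312 decoder; skip_on_error is how many bytes an error consumes."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII passes through unchanged.
            out.append(chr(b))
            i += 1
        elif (in_gb2312_range(b) and i + 1 < len(data)
              and in_gb2312_range(data[i + 1])):
            # Structurally valid pair: let Python's codec map it.
            out.append(data[i:i + 2].decode("gb2312", errors="replace"))
            i += 2
        else:
            # Illegal byte such as 0xA0: emit a replacement and skip.
            out.append(REPLACEMENT)
            i += skip_on_error
    return "".join(out)


page = b"<LI>\xa0</TD>"
buggy = decode(page, skip_on_error=2)  # the '<' of </TD> is lost
fixed = decode(page, skip_on_error=1)  # only the stray byte is consumed
```

The only difference between the two calls is how far the error path advances, which is the behaviour this comment attributes to the converter.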
Assignee: harishd → ftang
Comment 15•25 years ago
I am not sure how much content out there has this kind of illegal code point issue. Marking it M18.
Status: NEW → ASSIGNED
Target Milestone: M18
Comment 16•25 years ago
ftang: the problem with UTF-8 before Feb 4 was very similar: we were eating 2 bytes instead of one. My impression was that it was happening in the buffering / error handling code. I was not able to find a regular pattern like the one in this 0xA0 problem, though. I will have another look later this week; maybe they have more in common. See bug 8702: http://bugzilla.mozilla.org/show_bug.cgi?id=8702
Comment 17•25 years ago
Changed the summary from "</TD> parsing problem" to "illegal 0xA0 code point in Multibyte charset break parser".
In 4.x, we silently support the undefined 0xA0 code point for CJK multibyte code pages; in SeaMonkey, we don't. This causes a backward compatibility issue. Some pages accidentally contain this character, and the webmaster does not spot the problem because the page displays correctly in 4.x. When SeaMonkey hits it, it can cause a parser problem. This happens especially if the 0xA0 comes right before an open tag such as <TABLE>: currently SeaMonkey takes the 0xA0 and the next character to form an undefined character, so the '<' of the <TABLE> (or other tag) is eaten by the converter.
We have the following options:
1. Ignore this bug and let webmasters fix their pages, since this character is not defined in the standards for these charsets.
2. Add 0xA0 to the to-Unicode converters for all the multibyte charsets so that it is converted to U+00A0.
Marking it M16.
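Option 2 could look roughly like the sketch below. It is illustrative only, not the actual converter change: `decode_with_nbsp` is a hypothetical name, and the code assumes a charset such as GB2312 (EUC-CN form) in which 0xA0 is never a lead byte. A lone 0xA0 is mapped to U+00A0 and consumes exactly one byte, so a following tag stays intact.

```python
def decode_with_nbsp(data):
    """Toy GB2312 decoder that maps a stray 0xA0 byte to U+00A0 (NO-BREAK SPACE)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # ASCII passes through unchanged.
            out.append(chr(b))
            i += 1
        elif b == 0xA0:
            # Nav4-compatible special case: one byte in, U+00A0 out.
            out.append("\u00a0")
            i += 1
        elif (0xA1 <= b <= 0xFE and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            # Structurally valid double-byte pair.
            out.append(data[i:i + 2].decode("gb2312", errors="replace"))
            i += 2
        else:
            # Any other illegal byte: replace it and consume one byte only.
            out.append("\ufffd")
            i += 1
    return "".join(out)


text = decode_with_nbsp(b"\xd6\xd0\xa0<TABLE>")
```

Here the two bytes 0xD6 0xD0 decode as U+4E2D, the stray 0xA0 becomes a no-break space, and `<TABLE>` reaches the parser untouched.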
Summary: </TD> parsing problem → illegal 0xA0 code point in Multibyte charset break parser
Target Milestone: M18 → M16
Comment 18•25 years ago
*** Bug 27704 has been marked as a duplicate of this bug. ***
Comment 19•25 years ago
Mozilla has worked very hard to be compatible with Nav4. This 0xA0 issue should not be an exception. We should be compatible with Nav4, at least in NavQuirks mode, and probably even in Strict mode. Even in Strict mode, if there is an A0 byte, we should not eat the next byte, I think. Is this bug serious? Is M16 the right milestone for this?
Updated•24 years ago
Target Milestone: M16 → M18
Comment 20•24 years ago
Converter-related issue; reassigning to cata.
Assignee: ftang → cata
Status: ASSIGNED → NEW
Comment 21•24 years ago
*** Bug 27454 has been marked as a duplicate of this bug. ***
Updated•24 years ago
Target Milestone: M18 → M21
Comment 23•24 years ago
We need to add 0xa0 for Big5, gb2312, EUC-KR, GBK, EUC-JP. Mark this as moz0.9 P3
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9
Comment 24•24 years ago
shanjian- can you help to take a look at this?
Assignee: ftang → shanjian
Status: ASSIGNED → NEW
Assignee
Comment 25•24 years ago
Assignee
Comment 26•24 years ago
Assignee
Comment 27•24 years ago
Assignee
Comment 28•24 years ago
Assignee
Updated•24 years ago
Status: NEW → ASSIGNED
Assignee
Comment 29•24 years ago
In GBK, 0xA0 is a legal lead byte. Patch 3 (wrongly marked as 1) should be dropped.
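The difference between the two charsets can be checked quickly with Python's built-in codecs, used here only as a stand-in for the charset tables and not as Mozilla's converters. `decodes_as_pair` is a hypothetical helper that reports whether a lead/trail pair decodes to a single character.

```python
def decodes_as_pair(lead, trail, codec):
    """Return True if the two bytes decode as one character in the given codec."""
    try:
        text = bytes([lead, trail]).decode(codec)
    except UnicodeDecodeError:
        # The codec rejected the sequence, so the lead byte is not usable here.
        return False
    return len(text) == 1


# 0xA0 followed by 0x40 is a legal pair in GBK but not in GB2312.
gbk_ok = decodes_as_pair(0xA0, 0x40, "gbk")
gb2312_ok = decodes_as_pair(0xA0, 0x40, "gb2312")
```

This is why a blanket "0xA0 means U+00A0" rule is unsafe for GBK: the byte may begin a real character there.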
Assignee
Comment 30•24 years ago
Assignee
Comment 31•24 years ago
When I took a look at bug 64235, I revised fix part 2, so the problem in 64235 will be taken care of here. For people reviewing the fix, the complete fix includes:
fix part 1 (the first one, not the one wrongly marked), nsUnicodeDecodeHelper.cpp
new part 2 fix, fix GB2312
fix part 4, fix Japanese (eucjp and sjis)
Comment 32•24 years ago
Bug 64235 is a superset of this bug, IMHO. It has to do with not just stand-alone 0xA0 but also with any stand-alone byte/octet with MSB=1 in various CJK encodings. I added a new patch to take care of it for GBK and GB2312 (to bug 64235)
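The broader case, a byte with the high bit set that is a legal lead byte but is followed by an invalid trail byte, can be illustrated the same way. The sketch below uses Python's built-in `gb2312` codec with `errors="replace"` as a stand-in decoder (an assumption for illustration, not the patched Mozilla code); `decode_lossy` is a hypothetical helper name. In current Python versions the codec replaces only the offending byte, so the following markup is preserved.

```python
def decode_lossy(data, codec="gb2312"):
    """Decode bytes, substituting U+FFFD for anything the codec rejects."""
    return data.decode(codec, errors="replace")


# 0xB0 is a legal GB2312 lead byte, but '<' (0x3C) is not a legal trail byte.
# Only the lone 0xB0 should be replaced; the tag must survive.
result = decode_lossy(b"\xb0<P>")
tag_survives = result.endswith("<P>")
```

The same check can be repeated with other stand-alone high bytes to confirm that the decoder resynchronizes on the very next byte.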
Comment 34•24 years ago
shanjian- this is hard to review. Why don't you produce a new patch that includes all the necessary parts?
Comment 35•24 years ago
Is it true that the patch in 64235 covers the fix for this?
Assignee
Comment 36•24 years ago
yes, 64235 covered this one.
Assignee
Comment 37•23 years ago
The fix has been checked in.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Comment 38•23 years ago
Verified on build: 2001-06-20-04-Trunk, platform: Win NT. I do not see "/TD".
Status: RESOLVED → VERIFIED