Closed Bug 25037 Opened 25 years ago Closed 23 years ago

illegal 0xA0 code point in Multibyte charset break parser

Categories

(Core :: DOM: HTML Parser, defect, P3)

Tracking

VERIFIED FIXED
mozilla0.9

People

(Reporter: teruko, Assigned: shanjian)

Attachments

(7 files)

When you see the above page, the last table on the right displays /TD.

Steps to reproduce
1. Go to the above URL
2. Look at the table on the right 
   Under "DAVOS 2000", "/TD" is displayed.

3. Select menu View|Page Source

Look at the line

<TD VALIGN=TOP><LI> /TD>
The source does not have "<" before "/TD>".

However, when I look at the source of this page in Communicator 4.x, it shows:
<TD VALIGN=TOP><LI></TD>

Tested 2000012515 Win32 and Linux build.
This works perfectly for me. Petersen, can you see the problem? Also -- a 
reduced test case would be very helpful.
I reproduced this with the 1/25/2000 M14 build running on Win95.

The strange thing is that when I saved it to the blues server with the 
intention to try and make a simpler test case, it worked!  I looked at the
saved unmodified file on blues, and it looked the same.  Here is my saved file:
   http://blues/users/bobj/publish/test/zh-tw-index.html
With the Feb 07 build (20000020608), I can't reproduce the problem described.

Assignee: petersen → rickg
Harish -- Petersen and I don't see this, but you might give it a try.
Assignee: rickg → harishd
Not able to reproduce!!!  Teruko, is this bug still valid?
In my 2000-01-26 14:39 comment, I had strange results.  I could reproduce it,
but when I copied the page to another server, it worked...
Which implies that the problem could be server related..right?? 
Correction of my previous comment.
Since the content of http://home.netscape.com/zh/cn/ has been changed, I copied a simple page to 
http://jazz/users/teruko/publish/tests/cntest1.html

harishd,
Did you try http://jazz/users/teruko/publish/tests/cntest1.html?
Attached file test case
Attached file Simpler test case
Could be a bug in ftang's code in nsScanner::Append().  Frank??
Can you reproduce the problem w/ the "Simpler test case" 
http://bugzilla.mozilla.org/showattachment.cgi?attach_id=5105 ?
harishd is right. This is a converter problem. What happens is that there is 
somehow a 0xA0 byte after <LI>, but 0xA0 is not a legal code point in GB2312. 
Our error handling code eats two bytes instead of one for this, which causes 
the < to get eaten. Reassigning this back to ftang. 
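
To make the failure mode concrete, here is a minimal Python sketch of the behavior described above. It is not the Mozilla converter code; the helper names (`decode_buggy`, `decode_fixed`, `_decode`) are hypothetical, and it assumes the EUC-CN form of GB2312, where lead bytes are 0xA1-0xF7 and trail bytes are 0xA1-0xFE. The only difference between the two decoders is how many bytes they skip after an illegal byte.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, emitted for anything that cannot be decoded


def _is_valid_pair(pair: bytes) -> bool:
    """True if `pair` is structurally a GB2312 (EUC-CN) two-byte sequence."""
    return len(pair) == 2 and 0xA1 <= pair[0] <= 0xF7 and 0xA1 <= pair[1] <= 0xFE


def _decode(data: bytes, skip_on_error: int) -> str:
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:  # plain ASCII passes through unchanged
            out.append(chr(b))
            i += 1
            continue
        pair = data[i:i + 2]
        if _is_valid_pair(pair):
            try:
                out.append(pair.decode("gb2312"))
            except UnicodeDecodeError:  # well-formed but unassigned pair
                out.append(REPLACEMENT)
            i += 2
        else:
            out.append(REPLACEMENT)
            i += skip_on_error  # 2 reproduces the bug, 1 is the safe choice
    return "".join(out)


def decode_buggy(data: bytes) -> str:
    """Hypothetical decoder that skips two bytes after an illegal byte."""
    return _decode(data, skip_on_error=2)


def decode_fixed(data: bytes) -> str:
    """Hypothetical decoder that skips only the illegal byte itself."""
    return _decode(data, skip_on_error=1)
```

With the input from this report, `decode_buggy(b"<LI>\xa0</TD>")` loses the "<" of the closing tag and leaves "/TD>" behind, while `decode_fixed` keeps "</TD>" intact after the replacement character.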

Good catch, harishd. jbetak- is this similar to the one you found in the old 
UTF-8 code ?
Assignee: harishd → ftang
I am not sure how much content out there has this kind of illegal code point 
issue. Marking it M18.
Status: NEW → ASSIGNED
Target Milestone: M18
ftang: the problem with UTF-8 before Feb 4 was very similar - we were eating 2 
bytes instead of one. 

My impression was that it was happening in the buffering / error handling code. 
I was not able to find a regular pattern there like the one in this 0xA0 
problem, though. I will have another look later this week; maybe they have more 
in common. 

Referencing bug 8702: http://bugzilla.mozilla.org/show_bug.cgi?id=8702



Changed the summary from "</TD> parsing problem" to "illegal 0xA0 code point in
Multibyte charset break parser".

In 4.x, we silently supported the undefined 0xA0 code point for CJK multibyte
code pages; however, in SeaMonkey, we don't. This causes a backward compatibility
issue. Some pages accidentally contain this character, and the webmaster does not
spot the problem since the page displays correctly in 4.x. When SeaMonkey hits
it, it can cause a parser problem. This happens especially if the 0xA0 comes
before an open tag such as <TABLE>. Currently SeaMonkey takes the 0xA0 and the
next character to form an undefined character, so the '<' of the <TABLE> (or
other tag) is eaten by the converter.

We have the following options:
1. Ignore this bug and let webmasters fix their pages, since this character
is not defined in the standards for these charsets.
2. Add 0xA0 to the to-Unicode converters for all the multibyte charsets
so that it is converted to U+00A0.
Marking it M16.
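
Option 2 can be sketched as follows. This is only an illustration in Python under stated assumptions, not the patch that was checked in: the function name `decode_gb2312_nav4_compat` is hypothetical, and valid two-byte sequences are delegated to Python's own `gb2312` codec in strict mode. A stand-alone 0xA0 is mapped to U+00A0 and consumes exactly one byte, so a following "<" survives.

```python
NBSP = "\u00a0"         # U+00A0 NO-BREAK SPACE
REPLACEMENT = "\ufffd"  # U+FFFD for other undecodable bytes


def decode_gb2312_nav4_compat(data: bytes) -> str:
    """Hypothetical GB2312 decoder that treats a lone 0xA0 as U+00A0."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # ASCII bytes map to themselves.
            out.append(chr(b))
            i += 1
        elif b == 0xA0:
            # Option 2: 0xA0 is undefined in GB2312, so map it to U+00A0
            # and consume only this one byte.
            out.append(NBSP)
            i += 1
        else:
            pair = data[i:i + 2]
            try:
                # Strict decoding raises on illegal or truncated pairs.
                out.append(pair.decode("gb2312"))
                i += 2
            except UnicodeDecodeError:
                # Illegal sequence: drop only the offending byte.
                out.append(REPLACEMENT)
                i += 1
    return "".join(out)
```

For example, `decode_gb2312_nav4_compat(b"<LI>\xa0</TD>")` keeps the closing tag whole, with a no-break space where the 0xA0 byte was. Note that this special case is only safe for charsets in which 0xA0 can never start a valid sequence.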

Summary: </TD> parsing problem → illegal 0xA0 code point in Multibyte charset break parser
Target Milestone: M18 → M16
*** Bug 27704 has been marked as a duplicate of this bug. ***
Mozilla has worked very hard to be compatible with Nav4. This 0xA0 issue should
not be an exception. We should be compatible with Nav4, at least in NavQuirks
mode, and probably even in Strict mode. Even in Strict mode, if there is an A0
byte, we should not eat the next byte, I think. Is this bug serious? Is M16 the
right milestone for this?
Target Milestone: M16 → M18
Converter-related issue; reassigning to cata.
Assignee: ftang → cata
Status: ASSIGNED → NEW
Status: NEW → ASSIGNED
*** Bug 27454 has been marked as a duplicate of this bug. ***
Target Milestone: M18 → M21
Moving all of cata's bugs to ftang.
Assignee: cata → ftang
Status: ASSIGNED → NEW
We need to add 0xa0 for Big5, gb2312, EUC-KR, GBK, EUC-JP.
Mark this as moz0.9 P3
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9
shanjian- can you help to take a look at this?
Assignee: ftang → shanjian
Status: ASSIGNED → NEW
Status: NEW → ASSIGNED
In GBK, 0xA0 is a legal lead byte. Patch 3 (wrongly marked as 1) should be 
dropped.
Attached patch new part2 fix
When I took a look at bug 64235, I revised fix part2, so the problem in 64235 
will be taken care of here.  
For people reviewing the fix, the complete fix includes:
fix part 1 (the first one, not the one wrongly marked), nsUnicodeDecodeHelper.cpp
new part2 fix, which fixes GB2312
fix part 4, which fixes Japanese (eucjp and sjis)
Bug 64235 is a superset of this bug, IMHO. It has to do with not
just stand-alone 0xA0 but also with any stand-alone byte/octet
with MSB=1 in various CJK encodings. I added a new patch to
take care of it for GBK and GB2312 (to bug 64235)
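
The more general condition, a stand-alone byte with MSB=1, depends on each charset's lead and trail byte ranges. The Python sketch below is an illustration only: the `RANGES` table and `is_stray_high_byte` are hypothetical names, and only GB2312 (EUC-CN) and GBK are listed. It also shows why 0xA0 cannot simply be special-cased for GBK, where it is a legal lead byte.

```python
# Structural byte ranges (inclusive). GB2312 here means its EUC-CN form.
RANGES = {
    "gb2312": {"lead": (0xA1, 0xF7), "trail": [(0xA1, 0xFE)]},
    "gbk": {"lead": (0x81, 0xFE), "trail": [(0x40, 0x7E), (0x80, 0xFE)]},
}


def is_stray_high_byte(charset: str, data: bytes, i: int) -> bool:
    """True if data[i] has MSB=1 but does not start a well-formed pair.

    A decoder should consume only this one byte in that case, so that the
    byte after it (for example "<") is still seen by the parser.
    """
    b = data[i]
    if b < 0x80:
        return False  # ASCII is never a stray high byte
    lead_lo, lead_hi = RANGES[charset]["lead"]
    if not lead_lo <= b <= lead_hi:
        return True   # not a legal lead byte at all (0xA0 in GB2312)
    if i + 1 >= len(data):
        return True   # lead byte at end of input, no trail byte follows
    t = data[i + 1]
    # Legal lead byte, but the next byte must also be a legal trail byte.
    return not any(lo <= t <= hi for lo, hi in RANGES[charset]["trail"])
```

Under these ranges, 0xA0 followed by "<" is stray in both charsets, but 0xA0 followed by "@" (0x40) is a well-formed GBK pair and must be decoded as two bytes.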
updated qa contact.
QA Contact: janc → bsharma
shanjian- this is hard to review. Why don't you produce a new patch which 
includes all the necessary parts? 
Is it true that the patch in 64235 covers the fix for this?
yes, 64235 covered this one.
Fix has been checked in. 
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Verified on
build: 2001-06-20-04-Trunk
platform: Win NT

I do not see "/TD".
Status: RESOLVED → VERIFIED