Closed Bug 218915 Opened 21 years ago Closed 21 years ago

[FIX]UTF16-LE stylesheet with BOM but no @charset is not used (Latin1 document)

Categories

(Core :: CSS Parsing and Computation, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla1.6alpha

People

(Reporter: mcbridematt, Assigned: bzbarsky)

References

()

Details

Attachments

(2 files, 6 obsolete files)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6 Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6 In the file I'm about to attach, IE applies the CSS styles, but Mozilla doesn't at all. This is a cut-down testcase based on the internal intranet at school. I am unable to evagelize this case; their a pure M$ shop. Reproducible: Always Steps to Reproduce: 1. Go to page Actual Results: CSS Styles not applied Expected Results: applied CSS styles.
Changing url to keep it atleast valid.
The reason that the css is not applied is because you have all those spaces: T D { C O L O R : b l a c k ; F O N T - F A M I L Y : V e r d a n a ; F O N T - S I Z E : 8 p t ; F O N T - W E I G H T : n o r m a l ; T E X T - D E C O R A T I O N : n o r m a l ; B A C K G R O U N D - C O L O R : } Remove the spaced and it will work just fine. --> INVALID
Status: UNCONFIRMED → RESOLVED
Closed: 21 years ago
Resolution: --- → INVALID
Reopening. The html file is encoded in Latin-1. The css file is encoded in UTF16-LE. Convert one file to the encoding of the other and the css will be applied.
Status: RESOLVED → UNCONFIRMED
Component: Layout: Tables → Style System
Resolution: INVALID → ---
.
Assignee: table → dbaron
QA Contact: madhur → ian
According to this post: http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=utf-8&safe=off&selm=slrnapusdb.pbe.choess%40force.stwing.upenn.edu this is the intended behaviour. The <link> does not have a charset defined thus the stylesheet is parsed as Latin-1 and, with that encoding, invalid.
Status: UNCONFIRMED → RESOLVED
Closed: 21 years ago21 years ago
Resolution: --- → INVALID
Hmm... this is not quite as simple as it seems. The sheet _does_ have a BOM indicating that it's encoded in some 16-bit little-endian encoding. It does not follow from this, necessarily, that this encoding is UTF-16LE, of course, and at the moment we do nothing with this information. Perhaps we should assume UTF-16(LE|BE) in the presence of a suitable BOM and absence of an @charset rule in the sheet instead of just using the document encoding? Reopening, ccing some people who know intl stuff and can tell me how crack-potty an idea this is.
Status: RESOLVED → UNCONFIRMED
Resolution: INVALID → ---
I think what Boris is saying is reasonable. RFC 2616 says (paragraph 3.7.1) 'Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value.' I assume this means 'labeled in the HTTP headers' but having a BOM comes to the same thing IMO.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Seconded. Currently, Mozilla does detect BOM in HTML/XML assuming the first couple of characters in html/xml are either '<!' or '<?' [1]. When it comes to CSS(or other text/*), it's a bit trickier. Perhaps, we have to assume nobody uses UTF-32*.... With UTF-16LE being the default encoding for Unicode text on Windows, we will see more UTF-16LE than any other UTF-16/UTF-32 variants. . [1] bug 68738 and bug 183354. http://www.w3.org/TR/REC-xml#sec-guessing
I think the text within http://www.w3.org/TR/2003/WD-css3-syntax-20030813/#css-style represents the result of a good bit of discussion within the CSS working group. That said, the variation you describe seems reasonable, although an additional complication, and it might be worth discussing in the group...
Summary: CSS not applied to table → UTF16-LE stylesheet with BOM but no @charset is not used (Latin1 document)
I was about to ask for the opinion of the CSS WG. Are you gonna bring up the issue in the WG (or www-style)? BTW, IMHO, step #1 in the cited document has to mention (explicitly) that it's also possible to specify 'charset' in 'type' attribute of 'link' tag in (x)html as following: <link rel="stylesheet" type="text/css; charset=XXX" href="/StyleSheets/home.css" /> Another attempt that can be made (before falling back to UTF-8) is to assume 'charset' of the 'container' (x)html document (in the case reported here, that doesn't work, though) is used for stylesheets. re comment #9, pls, disregard my remark on UTF-16* BOM vs UTF-32* BOM. They can be reliably identified(I don't know what I was thinking) unless U+0000 comes right after a BOM in UTF-16*. Tough cases are when a sheet begins with neither '@charset=' nor a BOM. We can do some more even in those cases, but I guess it's not worth going that far.
> <link rel="stylesheet" type="text/css; charset=XXX" href="/StyleSheets/home.css" /> Youe mean: <link rel="stylesheet" type="text/css" charset="XXX" href="/StyleSheets/home.css" /> In any case, the real issue I have is that given a BOM I can't, eg, tell apart UCS-2LE and UTF-16LE and some other little-endian encoding which represents a non-breaking space as two bytes. So I have to assume UTF-16 and run with it, I guess... That's what I really wanted advice on -- is that a reasonable thing to do?
I learned a thing :-). Thanks. I always thought 'type' attribute of link tag and script tag is like 'Content-Type' header in mail, news and http header so that it can have a 'charset' parameter. As for distinguishing between UCS-2LE and UTF-16LE, there's no general way and no need because they're virtually identical. In comment #9, I was worried that we couldn't tell UTF-16LE from UTF-32LE (or UTF-16BE from UTF-32BE), but as I took it back in comment #11, there's little worry there either as long as there's a BOM or '@charset=' at the beginning. See http://lxr.mozilla.org/seamonkey/source/htmlparser/src/nsParser.cpp#2003 Perhaps, we have to refactor the code, move it out of nsParser.cpp and generalize it so that it can be used for html/xml, text/* and the universal charset detector.
>In any case, the real issue I have is that given a BOM I can't, eg, tell apart >UCS-2LE and UTF-16LE and some other little-endian encoding which represents a >non-breaking space as two bytes. Since UCS2 is a subset of UTF-16, why do you care? If an author is using "some other" little-endian encoding (do you have a specific one in mind or are you just trying to cover all the bases?) then I think the onus is on him or her to specify it in a non-ambiguous way. Identifying a stream which begins 0xFF 0xFE 0x## 0x## (where not both ##s are zero) as UTF-16LE is standard practice, see the link in comment 9, RFC 3023 and Section 15.9 of http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf
> step #1 in the cited document has to > mention (explicitly) that it's also possible to specify 'charset' in 'type' > attribute of 'link' tag in (x)html as following Has to mention? That was step #3 before, but now it's been removed. It *could* be inserted between rules 2 and 3, but I tend to prefer leaving it out and not using it at all. The document really shouldn't have to know or care about the stylesheet's character encoding.
OK. Since we are more or less agreed that we should handle this case, taking bug. I'll work on this when I get back.
Assignee: dbaron → bz-vacation
OS: Windows XP → All
Priority: -- → P2
Hardware: PC → All
Target Milestone: --- → mozilla1.6alpha
Attached patch Patch (obsolete) — Splinter Review
Thoughts? Should I use UTF-32 instead of UCS-4 for the 4-byte encodings?
Attachment #132066 - Flags: superreview?(dbaron)
Attachment #132066 - Flags: review?(jshin)
Summary: UTF16-LE stylesheet with BOM but no @charset is not used (Latin1 document) → [FIX]UTF16-LE stylesheet with BOM but no @charset is not used (Latin1 document)
Attachment #132066 - Attachment is obsolete: false
Attachment #132066 - Flags: superreview?(dbaron)
Attachment #132066 - Flags: review?(jshin)
Attachment #132081 - Flags: superreview?(dbaron)
Attachment #132081 - Flags: review?(smontagu)
Comment on attachment 132081 [details] [diff] [review] Fix some issues smontagu pointed out (incorrect comments, redundant code, etc) >+ // Determine the encoding type. If we have a BOM, set aCharset to the >+ // charset listed for that BOM in http://www.w3.org/TR/REC-xml#sec-guessing; >+ // then way even if we don't have a valid @charset rule we can use the BOM to s/then/that/ ? > else if (*aStyleSheetData == 0xFE && *(aStyleSheetData+1) == 0xFF) { > // big-endian 2-byte encoding BOM > step = 2; > pos = 3; s/3/2/ I would use "UTF-32" rather than "UCS-4", all other things being equal. Does either or both work in practice? I don't know where the result of this stuff is getting passed to, which doesn't help. r=smontagu with the typos corrected as noted above. Some indication that this has been tested with at least the most likely cases would be nice too.
Attachment #132081 - Flags: review?(smontagu) → review+
Comment on attachment 132081 [details] [diff] [review] Fix some issues smontagu pointed out (incorrect comments, redundant code, etc) Are you sure that case you removed wasn't supposed to have been the 4-byte big-endian case (pos=0, step=4)? Also, if you don't mind taking the CVS blame, how about using |aStyleSheetData[0]| instead of |*aStyleSheetData|, |aStyleSheetData[1]| instead of |*(aStyleSheetData+1)|, etc.?
> s/then/that/ ? Yes. > s/3/2/ Why? We have two chars of BOM, then a null byte (if this is big-endian), so the first real data char is the one in position 3 (zero-based). > I would use "UTF-32" rather than "UCS-4" Done. The result of this code is passed to GetUnicodeDecoder(). This patch passes the tests at http://web.mit.edu/bzbarsky/www/testcases/charset/ (including the two I just added just to test whether it works).
> Are you sure that case you removed wasn't supposed to have been the 4-byte > big-endian case (pos=0, step=4)? Yep. That case is still in the code with the patch applied (line 449 in the patched file). > Also, if you don't mind taking the CVS blame Heh. I already have blame for the whole function. I'll make that change.
Attached patch Patch updated to those comments (obsolete) — Splinter Review
Attachment #132066 - Attachment is obsolete: true
Attachment #132081 - Attachment is obsolete: true
Attachment #132117 - Flags: superreview?(dbaron)
Attachment #132081 - Flags: superreview?(dbaron)
Attachment #132117 - Attachment is obsolete: true
Attachment #132117 - Flags: superreview?(dbaron)
Attached patch Patch that actually builds (obsolete) — Splinter Review
Attachment #132118 - Flags: superreview?(dbaron)
>Why? We have two chars of BOM, then a null byte (if this is big-endian), so the >first real data char is the one in position 3 (zero-based). Sorry, my bad.
Comment on attachment 132118 [details] [diff] [review] Patch that actually builds >+ else if (aStyleSheetData[0] == 0xFF && aStyleSheetData[1] == 0xFE) { > // little-endian 2-byte encoding BOM >- NS_WARNING("Our unicode decoders aren't likely to deal with this one"); > step = 2; > pos = 2; >+ aCharset = "UTF-16LE"; > } ...... >+ else if (aStyleSheetData[0] == 0xFF && >+ aStyleSheetData[1] == 0xFE && >+ aStyleSheetData[2] == 0x00 && >+ aStyleSheetData[3] == 0x00) { > // little-endian 4-byte encoding BOM ... >+ aCharset = "UTF-32LE"; if-clause for UTF-32LE should come before that for UTF-16LE. otherwise, UTF-32LE could be mistaken for UTF-16LE, couldn't it? Alternatively, if-clause for UTF-16LE has to make sure that the 3rd and 4th bytes are not 0. >+ else if (aStyleSheetData[0] == 0x00 && >+ aStyleSheetData[1] == 0x00 && >+ aStyleSheetData[2] == 0xFF && >+ aStyleSheetData[3] == 0xFE) { > // 4-byte encoding BOM in 2143 order >- NS_WARNING("Our unicode decoders aren't likely to deal with this one"); > step = 4; > pos = 6; >+ aCharset = "UTF-32"; We still need NS_WARNING because we don't support '2143' order. >+ else if (aStyleSheetData[0] == 0xFE && >+ aStyleSheetData[1] == 0xFF && >+ aStyleSheetData[2] == 0x00 && >+ aStyleSheetData[3] == 0x00) { > // 4-byte encoding BOM in 3412 order >- NS_WARNING("Our unicode decoders aren't likely to deal with this one"); > step = 4; > pos = 5; >+ aCharset = "UTF-32"; ditto.
Comment on attachment 132118 [details] [diff] [review] Patch that actually builds I'm a little skeptical about the "UTF-32" bits, but sure.
Attachment #132118 - Flags: superreview?(dbaron) → superreview+
> if-clause for UTF-32LE should come before that for UTF-16LE. Good catch. Fixed. > We still need NS_WARNING Indeed. Left the warning there.
Attachment #132118 - Attachment is obsolete: true
Checked in.
Status: NEW → RESOLVED
Closed: 21 years ago21 years ago
Resolution: --- → FIXED
-> Reopening I just looked at the site agin. Dammit, its back to it's old tricks. w/ build 2003092404
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached file New CSS file used in the site (obsolete) —
This CSS file is identical to the one in the first attachment. Was it working and then stopped working? If so, could you attach the HTML that loads the CSS? Note that the original bug this showed is still fixed (I just retested), so what you are seeing is a different issue that will need a different bug, most likely.
Attached file Page + all deps.
Sorry, I didn't really have time to save it and zip it up in a hurry, and I noticed that the stylesheet filename changed. Here is the page plus all deps. This has to be a change in the page, since I retested the old file myself to see if the patch regressed.
Attachment #131220 - Attachment is obsolete: true
Attachment #132977 - Attachment is obsolete: true
Comment on attachment 132984 [details] Page + all deps. I never asked Bugzilla to mark this file as a patch. Something must of gone funny in Bugzilla's MIME autodetection. Dropping patch flag.
Attachment #132984 - Attachment is patch: false
Attachment #132984 - Attachment mime type: text/plain → application/zip
When I load the HTML page in that zipfile, the stylesheet is applied correctly...
WFM, too, with my debug build on Linux (the source pulled yesterday)
Using Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6a) Gecko/20030924. I'll have to wait until my Linux machine finishes compiling Mozilla until I do any platform parity testing. There doesn't seem to be a lot of difference between from when the patch was checked in and now: http://bonsai.mozilla.org/cvsview2.cgi?command=DIFF&subdir=mozilla%2Fcontent%2Fhtml%2Fstyle%2Fsrc&file=nsCSSLoader.cpp&rev1=3.176&rev2=3.175&whitespace_mode=show&diff_mode=context Perhaps it broke in 3.176 (Oct 2) and rev 3.117 (October 6) fixed it?
Wait. "w/ build 2003092404"? That's about 19 hours before I checked in this patch; see date/time of comment 29.
Status: REOPENED → RESOLVED
Closed: 21 years ago21 years ago
Resolution: --- → FIXED
Yep, you are right. Even though I downloaded it on the 6th, it was a build from the 24th of last month. Must of been PlanetMirror reverting to backup.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: