Closed
Bug 218915
Opened 21 years ago
Closed 21 years ago
[FIX]UTF16-LE stylesheet with BOM but no @charset is not used (Latin1 document)
Categories
(Core :: CSS Parsing and Computation, defect, P2)
Core
CSS Parsing and Computation
Tracking
()
RESOLVED
FIXED
mozilla1.6alpha
People
(Reporter: mcbridematt, Assigned: bzbarsky)
References
()
Details
Attachments
(2 files, 6 obsolete files)
10.49 KB,
patch
|
Details | Diff | Splinter Review | |
15.40 KB,
application/zip
|
Details |
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6
In the file I'm about to attach, IE applies the CSS styles, but Mozilla doesn't
at all.
This is a cut-down testcase based on the internal intranet at school. I am
unable to evagelize this case; their a pure M$ shop.
Reproducible: Always
Steps to Reproduce:
1. Go to page
Actual Results:
CSS Styles not applied
Expected Results:
applied CSS styles.
Reporter | ||
Comment 1•21 years ago
|
||
Reporter | ||
Comment 2•21 years ago
|
||
Changing url to keep it atleast valid.
Comment 3•21 years ago
|
||
The reason that the css is not applied is because you have all those spaces:
T D
{
C O L O R : b l a c k ;
F O N T - F A M I L Y : V e r d a n a ;
F O N T - S I Z E : 8 p t ;
F O N T - W E I G H T : n o r m a l ;
T E X T - D E C O R A T I O N : n o r m a l ;
B A C K G R O U N D - C O L O R :
}
Remove the spaced and it will work just fine.
--> INVALID
Status: UNCONFIRMED → RESOLVED
Closed: 21 years ago
Resolution: --- → INVALID
Comment 4•21 years ago
|
||
Reopening.
The html file is encoded in Latin-1.
The css file is encoded in UTF16-LE.
Convert one file to the encoding of the other and the css will be applied.
Status: RESOLVED → UNCONFIRMED
Component: Layout: Tables → Style System
Resolution: INVALID → ---
Comment 6•21 years ago
|
||
According to this post:
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=utf-8&safe=off&selm=slrnapusdb.pbe.choess%40force.stwing.upenn.edu
this is the intended behaviour.
The <link> does not have a charset defined thus the stylesheet is parsed as
Latin-1 and, with that encoding, invalid.
Status: UNCONFIRMED → RESOLVED
Closed: 21 years ago → 21 years ago
Resolution: --- → INVALID
![]() |
Assignee | |
Comment 7•21 years ago
|
||
Hmm... this is not quite as simple as it seems. The sheet _does_ have a BOM
indicating that it's encoded in some 16-bit little-endian encoding. It does not
follow from this, necessarily, that this encoding is UTF-16LE, of course, and at
the moment we do nothing with this information.
Perhaps we should assume UTF-16(LE|BE) in the presence of a suitable BOM and
absence of an @charset rule in the sheet instead of just using the document
encoding?
Reopening, ccing some people who know intl stuff and can tell me how crack-potty
an idea this is.
Status: RESOLVED → UNCONFIRMED
Resolution: INVALID → ---
Comment 8•21 years ago
|
||
I think what Boris is saying is reasonable. RFC 2616 says (paragraph 3.7.1)
'Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled
with an appropriate charset value.' I assume this means 'labeled in the HTTP
headers' but having a BOM comes to the same thing IMO.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 9•21 years ago
|
||
Seconded.
Currently, Mozilla does detect BOM in HTML/XML assuming the first couple of
characters in html/xml are either '<!' or '<?' [1]. When it comes to CSS(or other text/*),
it's a bit trickier. Perhaps, we have to assume nobody uses UTF-32*....
With UTF-16LE being the default encoding for Unicode text on Windows, we will see
more UTF-16LE than any other UTF-16/UTF-32 variants. .
[1] bug 68738 and bug 183354.
http://www.w3.org/TR/REC-xml#sec-guessing
I think the text within
http://www.w3.org/TR/2003/WD-css3-syntax-20030813/#css-style represents the
result of a good bit of discussion within the CSS working group. That said, the
variation you describe seems reasonable, although an additional complication,
and it might be worth discussing in the group...
Summary: CSS not applied to table → UTF16-LE stylesheet with BOM but no @charset is not used (Latin1 document)
Comment 11•21 years ago
|
||
I was about to ask for the opinion of the CSS WG. Are you gonna bring up the
issue in the WG (or www-style)? BTW, IMHO, step #1 in the cited document has to
mention (explicitly) that it's also possible to specify 'charset' in 'type'
attribute of 'link' tag in (x)html as following:
<link rel="stylesheet" type="text/css; charset=XXX" href="/StyleSheets/home.css" />
Another attempt that can be made (before falling back to UTF-8) is to assume
'charset' of the 'container' (x)html document (in the case reported here, that
doesn't work, though) is used for stylesheets.
re comment #9, pls, disregard my remark on UTF-16* BOM vs UTF-32* BOM. They can
be reliably identified(I don't know what I was thinking) unless U+0000 comes
right after a BOM in UTF-16*. Tough cases are when a sheet begins with neither
'@charset=' nor a BOM. We can do some more even in those cases, but I guess it's
not worth going that far.
![]() |
Assignee | |
Comment 12•21 years ago
|
||
> <link rel="stylesheet" type="text/css; charset=XXX"
href="/StyleSheets/home.css" />
Youe mean:
<link rel="stylesheet" type="text/css" charset="XXX" href="/StyleSheets/home.css" />
In any case, the real issue I have is that given a BOM I can't, eg, tell apart
UCS-2LE and UTF-16LE and some other little-endian encoding which represents a
non-breaking space as two bytes. So I have to assume UTF-16 and run with it, I
guess... That's what I really wanted advice on -- is that a reasonable thing to do?
Comment 13•21 years ago
|
||
I learned a thing :-). Thanks. I always thought 'type' attribute of link tag and
script tag is like 'Content-Type' header in mail, news and http header so that
it can have a 'charset' parameter.
As for distinguishing between UCS-2LE and UTF-16LE, there's no general way and
no need because they're virtually identical. In comment #9, I was worried that
we couldn't tell UTF-16LE from UTF-32LE (or UTF-16BE from UTF-32BE), but as I
took it back in comment #11, there's little worry there either as long as
there's a BOM or '@charset=' at the beginning. See
http://lxr.mozilla.org/seamonkey/source/htmlparser/src/nsParser.cpp#2003
Perhaps, we have to refactor the code, move it out of nsParser.cpp and
generalize it so that it can be used for html/xml, text/* and the universal
charset detector.
Comment 14•21 years ago
|
||
>In any case, the real issue I have is that given a BOM I can't, eg, tell apart
>UCS-2LE and UTF-16LE and some other little-endian encoding which represents a
>non-breaking space as two bytes.
Since UCS2 is a subset of UTF-16, why do you care? If an author is using "some
other" little-endian encoding (do you have a specific one in mind or are you
just trying to cover all the bases?) then I think the onus is on him or her to
specify it in a non-ambiguous way. Identifying a stream which begins 0xFF 0xFE
0x## 0x## (where not both ##s are zero) as UTF-16LE is standard practice, see
the link in comment 9, RFC 3023 and Section 15.9 of
http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf
> step #1 in the cited document has to
> mention (explicitly) that it's also possible to specify 'charset' in 'type'
> attribute of 'link' tag in (x)html as following
Has to mention? That was step #3 before, but now it's been removed. It *could*
be inserted between rules 2 and 3, but I tend to prefer leaving it out and not
using it at all. The document really shouldn't have to know or care about the
stylesheet's character encoding.
![]() |
Assignee | |
Comment 16•21 years ago
|
||
OK. Since we are more or less agreed that we should handle this case, taking
bug. I'll work on this when I get back.
Assignee: dbaron → bz-vacation
OS: Windows XP → All
Priority: -- → P2
Hardware: PC → All
Target Milestone: --- → mozilla1.6alpha
![]() |
Assignee | |
Comment 17•21 years ago
|
||
Thoughts? Should I use UTF-32 instead of UCS-4 for the 4-byte encodings?
![]() |
Assignee | |
Updated•21 years ago
|
Attachment #132066 -
Flags: superreview?(dbaron)
Attachment #132066 -
Flags: review?(jshin)
![]() |
Assignee | |
Updated•21 years ago
|
Summary: UTF16-LE stylesheet with BOM but no @charset is not used (Latin1 document) → [FIX]UTF16-LE stylesheet with BOM but no @charset is not used (Latin1 document)
![]() |
Assignee | |
Comment 18•21 years ago
|
||
Attachment #132066 -
Attachment is obsolete: true
![]() |
Assignee | |
Updated•21 years ago
|
Attachment #132066 -
Attachment is obsolete: false
Attachment #132066 -
Flags: superreview?(dbaron)
Attachment #132066 -
Flags: review?(jshin)
![]() |
Assignee | |
Updated•21 years ago
|
Attachment #132081 -
Flags: superreview?(dbaron)
Attachment #132081 -
Flags: review?(smontagu)
Comment 19•21 years ago
|
||
Comment on attachment 132081 [details] [diff] [review]
Fix some issues smontagu pointed out (incorrect comments, redundant code, etc)
>+ // Determine the encoding type. If we have a BOM, set aCharset to the
>+ // charset listed for that BOM in http://www.w3.org/TR/REC-xml#sec-guessing;
>+ // then way even if we don't have a valid @charset rule we can use the BOM to
s/then/that/ ?
> else if (*aStyleSheetData == 0xFE && *(aStyleSheetData+1) == 0xFF) {
> // big-endian 2-byte encoding BOM
> step = 2;
> pos = 3;
s/3/2/
I would use "UTF-32" rather than "UCS-4", all other things being equal. Does
either or both work in practice? I don't know where the result of this stuff is
getting passed to, which doesn't help.
r=smontagu with the typos corrected as noted above. Some indication that this
has been tested with at least the most likely cases would be nice too.
Attachment #132081 -
Flags: review?(smontagu) → review+
Comment on attachment 132081 [details] [diff] [review]
Fix some issues smontagu pointed out (incorrect comments, redundant code, etc)
Are you sure that case you removed wasn't supposed to have been the 4-byte
big-endian case (pos=0, step=4)?
Also, if you don't mind taking the CVS blame, how about using
|aStyleSheetData[0]| instead of |*aStyleSheetData|, |aStyleSheetData[1]|
instead of |*(aStyleSheetData+1)|, etc.?
![]() |
Assignee | |
Comment 21•21 years ago
|
||
> s/then/that/ ?
Yes.
> s/3/2/
Why? We have two chars of BOM, then a null byte (if this is big-endian), so the
first real data char is the one in position 3 (zero-based).
> I would use "UTF-32" rather than "UCS-4"
Done. The result of this code is passed to GetUnicodeDecoder().
This patch passes the tests at
http://web.mit.edu/bzbarsky/www/testcases/charset/ (including the two I just
added just to test whether it works).
![]() |
Assignee | |
Comment 22•21 years ago
|
||
> Are you sure that case you removed wasn't supposed to have been the 4-byte
> big-endian case (pos=0, step=4)?
Yep. That case is still in the code with the patch applied (line 449 in the
patched file).
> Also, if you don't mind taking the CVS blame
Heh. I already have blame for the whole function. I'll make that change.
![]() |
Assignee | |
Comment 23•21 years ago
|
||
Attachment #132066 -
Attachment is obsolete: true
Attachment #132081 -
Attachment is obsolete: true
![]() |
Assignee | |
Updated•21 years ago
|
Attachment #132117 -
Flags: superreview?(dbaron)
![]() |
Assignee | |
Updated•21 years ago
|
Attachment #132081 -
Flags: superreview?(dbaron)
![]() |
Assignee | |
Updated•21 years ago
|
Attachment #132117 -
Attachment is obsolete: true
Attachment #132117 -
Flags: superreview?(dbaron)
![]() |
Assignee | |
Comment 24•21 years ago
|
||
![]() |
Assignee | |
Updated•21 years ago
|
Attachment #132118 -
Flags: superreview?(dbaron)
Comment 25•21 years ago
|
||
>Why? We have two chars of BOM, then a null byte (if this is big-endian), so the
>first real data char is the one in position 3 (zero-based).
Sorry, my bad.
Comment 26•21 years ago
|
||
Comment on attachment 132118 [details] [diff] [review]
Patch that actually builds
>+ else if (aStyleSheetData[0] == 0xFF && aStyleSheetData[1] == 0xFE) {
> // little-endian 2-byte encoding BOM
>- NS_WARNING("Our unicode decoders aren't likely to deal with this one");
> step = 2;
> pos = 2;
>+ aCharset = "UTF-16LE";
> }
......
>+ else if (aStyleSheetData[0] == 0xFF &&
>+ aStyleSheetData[1] == 0xFE &&
>+ aStyleSheetData[2] == 0x00 &&
>+ aStyleSheetData[3] == 0x00) {
> // little-endian 4-byte encoding BOM
...
>+ aCharset = "UTF-32LE";
if-clause for UTF-32LE should come before that for UTF-16LE. otherwise,
UTF-32LE could be mistaken for UTF-16LE, couldn't it? Alternatively, if-clause
for UTF-16LE has to make sure that the 3rd and 4th bytes are not 0.
>+ else if (aStyleSheetData[0] == 0x00 &&
>+ aStyleSheetData[1] == 0x00 &&
>+ aStyleSheetData[2] == 0xFF &&
>+ aStyleSheetData[3] == 0xFE) {
> // 4-byte encoding BOM in 2143 order
>- NS_WARNING("Our unicode decoders aren't likely to deal with this one");
> step = 4;
> pos = 6;
>+ aCharset = "UTF-32";
We still need NS_WARNING because we don't support '2143' order.
>+ else if (aStyleSheetData[0] == 0xFE &&
>+ aStyleSheetData[1] == 0xFF &&
>+ aStyleSheetData[2] == 0x00 &&
>+ aStyleSheetData[3] == 0x00) {
> // 4-byte encoding BOM in 3412 order
>- NS_WARNING("Our unicode decoders aren't likely to deal with this one");
> step = 4;
> pos = 5;
>+ aCharset = "UTF-32";
ditto.
Comment on attachment 132118 [details] [diff] [review]
Patch that actually builds
I'm a little skeptical about the "UTF-32" bits, but sure.
Attachment #132118 -
Flags: superreview?(dbaron) → superreview+
![]() |
Assignee | |
Comment 28•21 years ago
|
||
> if-clause for UTF-32LE should come before that for UTF-16LE.
Good catch. Fixed.
> We still need NS_WARNING
Indeed. Left the warning there.
![]() |
Assignee | |
Updated•21 years ago
|
Attachment #132118 -
Attachment is obsolete: true
![]() |
Assignee | |
Comment 29•21 years ago
|
||
Checked in.
Status: NEW → RESOLVED
Closed: 21 years ago → 21 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 30•21 years ago
|
||
-> Reopening
I just looked at the site agin. Dammit, its back to it's old tricks.
w/ build 2003092404
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter | ||
Comment 31•21 years ago
|
||
![]() |
Assignee | |
Comment 32•21 years ago
|
||
This CSS file is identical to the one in the first attachment. Was it working
and then stopped working? If so, could you attach the HTML that loads the CSS?
Note that the original bug this showed is still fixed (I just retested), so what
you are seeing is a different issue that will need a different bug, most likely.
Reporter | ||
Comment 33•21 years ago
|
||
Sorry, I didn't really have time to save it and zip it up in a hurry, and I
noticed that the stylesheet filename changed.
Here is the page plus all deps. This has to be a change in the page, since I
retested the old file myself to see if the patch regressed.
Reporter | ||
Updated•21 years ago
|
Attachment #131220 -
Attachment is obsolete: true
Attachment #132977 -
Attachment is obsolete: true
Reporter | ||
Comment 34•21 years ago
|
||
Comment on attachment 132984 [details]
Page + all deps.
I never asked Bugzilla to mark this file as a patch. Something must of gone
funny in Bugzilla's MIME autodetection.
Dropping patch flag.
Attachment #132984 -
Attachment is patch: false
Updated•21 years ago
|
Attachment #132984 -
Attachment mime type: text/plain → application/zip
![]() |
Assignee | |
Comment 35•21 years ago
|
||
When I load the HTML page in that zipfile, the stylesheet is applied correctly...
Comment 36•21 years ago
|
||
WFM, too, with my debug build on Linux (the source pulled yesterday)
Reporter | ||
Comment 37•21 years ago
|
||
Using Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6a) Gecko/20030924.
I'll have to wait until my Linux machine finishes compiling Mozilla until I do
any platform parity testing.
There doesn't seem to be a lot of difference between from when the patch was
checked in and now:
http://bonsai.mozilla.org/cvsview2.cgi?command=DIFF&subdir=mozilla%2Fcontent%2Fhtml%2Fstyle%2Fsrc&file=nsCSSLoader.cpp&rev1=3.176&rev2=3.175&whitespace_mode=show&diff_mode=context
Perhaps it broke in 3.176 (Oct 2) and rev 3.117 (October 6) fixed it?
![]() |
Assignee | |
Comment 38•21 years ago
|
||
Wait. "w/ build 2003092404"? That's about 19 hours before I checked in this
patch; see date/time of comment 29.
Status: REOPENED → RESOLVED
Closed: 21 years ago → 21 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 39•21 years ago
|
||
Yep, you are right.
Even though I downloaded it on the 6th, it was a build from the 24th of last
month. Must of been PlanetMirror reverting to backup.
You need to log in
before you can comment on or make changes to this bug.
Description
•