Closed
Bug 1058021
Opened 10 years ago
Closed 10 years ago
Make undefined codepoints from 0x80-0x9f in windows-* encodings decode to Unicode U+0080-U-009F instead of U+FFFD
Categories
(Core :: Internationalization, defect)
Core
Internationalization
Tracking
()
RESOLVED
FIXED
mozilla35
People
(Reporter: GPHemsley, Assigned: smontagu)
References
()
Details
Attachments
(4 files)
63.00 KB,
patch
|
annevk
:
review+
|
Details | Diff | Splinter Review |
2.16 KB,
patch
|
jfkthame
:
review+
|
Details | Diff | Splinter Review |
2.21 KB,
patch
|
jfkthame
:
review+
|
Details | Diff | Splinter Review |
93.46 KB,
patch
|
jfkthame
:
review+
|
Details | Diff | Splinter Review |
According to the W3C's test results, windows-1250 and windows-1251 are decoding a handful of characters incorrectly:
http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases#windows-1250
http://www.w3.org/International/tests/repository/run?manifest=encoding/indexes&test=windows-1250_test
http://www.w3.org/International/tests/repository/run?manifest=encoding/indexes&test=windows-1251_test
(It looks like they're actually not being decoded at all: They all result in U+FFFD.)
Not sure how this relates to bug 379840, if at all.
Reporter | ||
Updated•10 years ago
|
Reporter | ||
Comment 1•10 years ago
|
||
It turns out there are also issues with windows-1253, windows-1254, windows-1255, windows-1257, windows-1258, and windows-874.
http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases#windows-1253
http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases#windows-1254
http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases#windows-1255
http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases#windows-1257
http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases#windows-1258
http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases#windows-874
Summary: Fix support for windows-1250 and windows-1251 encodings → Fix support for windows-* encodings
Assignee | ||
Comment 2•10 years ago
|
||
So from comparing http://encoding.spec.whatwg.org/index-windows-1250.txt with ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1250.TXT and parallels, it appears that where codepoints are listed as "UNDEFINED" in the mapping tables, the Encoding Standard's tables give the same numeric codepoint as passthru output? Anne, is that a correct summary?
Personally, I think our behaviour makes more sense, since "undefined" means "undefined", but since we aready do the same thing with windows-1252, we might as well change to become consistent with ourselves and other browsers.
Assignee: nobody → smontagu
Flags: needinfo?(annevk)
Comment 3•10 years ago
|
||
These tests might be wrong and the standard might be wrong. Did anyone actually do proper analysis? If not, I will do so later.
Flags: needinfo?(annevk)
Assignee | ||
Comment 4•10 years ago
|
||
I can't answer comment 3 directly, because I have no clue what constitutes "proper analysis". The information I based my comment on is as follows:
The failures in http://www.w3.org/International/tests/repository/run?manifest=encoding/indexes&test=windows-1250_test are:
Cell Expected Got
81 0x0081 FFFD �
83 0x0083 FFFD �
88 0x0088 FFFD �
90 0x0090 FFFD �
98 0x0098 FFFD �
The corresponding lines in http://encoding.spec.whatwg.org/index-windows-1250.txt are:
1 0x0081 (<control>)
3 0x0083 (<control>)
8 0x0088 (<control>)
16 0x0090 (<control>)
24 0x0098 (<control>)
and the corresponding lines in ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1250.TXT are
0x81 #UNDEFINED
0x83 #UNDEFINED
0x88 #UNDEFINED
0x90 #UNDEFINED
0x98 #UNDEFINED
Bug 116143 changed windows-1252 to make UNDEFINED codepoints pass-thru to Unicode codepoints.
Comment 5•10 years ago
|
||
Sorry, I meant checking with other browsers. But it does seem both IE10 and Chrome pass that test so that seems a bug in our code, indeed. I hope to review those tests within the next couple of weeks.
Assignee | ||
Updated•10 years ago
|
Assignee | ||
Comment 6•10 years ago
|
||
Just let me refine what I said in comment 2: it only applies to codepoints from 0x80 to 0x9F. Other undefined codepoints are left undefined in the Encoding Standard and decoded as U+FFFD by all browsers AFAICT.
See also bug 577945.
Summary: Fix support for windows-* encodings → Make undefined codepoints from 0x80-0x9f in windows-* encodings decode to Unicode U+0080-U-009F instead of U+FFFD
Comment 7•10 years ago
|
||
The tests and results have been updated to check what happens if there is no line for a pointer in the index file. According to the single-byte decoding algorithm, this should produce U+FFFD. See the updated results at
http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases
I have tried to indicate, where the pass is only partial, how many errors were due to U+FFFD not being served, vs. how many were due to unexpected characters being served that are not those in the tables. I did that in the summary. For details, open the test in the relevant browser (by clicking on the link to the left of the row). See for example
http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases#iso-8859-6
The main differences are for windows-1253 and windows-874 and Chrome/Safari/Opera, but also 6 more IE boxes turned orange.
Assignee | ||
Comment 8•10 years ago
|
||
The decoding and encoding tests used to just skip over undefined codepoints. I've added them here on the principle described above: 0x80 - 0x9f pass thru to Unicode (and therefore round-trip); other undefined codepoints decode as U+FFFD
Attachment #8482403 -
Flags: review?(annevk)
Assignee | ||
Comment 9•10 years ago
|
||
Attachment #8482404 -
Flags: review?(jfkthame)
Assignee | ||
Comment 10•10 years ago
|
||
Attachment #8482407 -
Flags: review?(jfkthame)
Assignee | ||
Comment 11•10 years ago
|
||
Attachment #8482408 -
Flags: review?(jfkthame)
Reporter | ||
Comment 12•10 years ago
|
||
Comment on attachment 8482407 [details] [diff] [review]
Change the mapfile generator to pass thru undefined codepoints from 0x80-0x9F and update the comments
Review of attachment 8482407 [details] [diff] [review]:
-----------------------------------------------------------------
::: intl/uconv/tools/umaptable.c
@@ +173,5 @@
> printf(" You can find this tool under mozilla/intl/uconv/tools/umaptable.c.\n");
>
> + printf(" If you have any problems with this file, please file a bug\n");
> + printf(" under the \"Internationalization\" component in\n");
> + printf(" https://bugzilla.mozilla.org/enter_bug.cgi?product=Core\n");
You can link directly to https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&component=Internationalization
Any particular reason you didn't?
Comment 13•10 years ago
|
||
Comment on attachment 8482403 [details] [diff] [review]
Bring our tests into line with the w3.org tests
Review of attachment 8482403 [details] [diff] [review]:
-----------------------------------------------------------------
That looks fine. Thanks for getting to this so quickly.
Attachment #8482403 -
Flags: review?(annevk) → review+
Updated•10 years ago
|
Attachment #8482404 -
Flags: review?(jfkthame) → review+
Comment 14•10 years ago
|
||
Comment on attachment 8482407 [details] [diff] [review]
Change the mapfile generator to pass thru undefined codepoints from 0x80-0x9F and update the comments
Review of attachment 8482407 [details] [diff] [review]:
-----------------------------------------------------------------
::: intl/uconv/tools/umaptable.c
@@ +437,2 @@
> sscanf(buf,"%hx %hx",&c,&u);
> + if (u == -1 && 0x80 <= c && c <=0x9f)
I guess this is fine for now, though the temptation to re-write a bunch of this code is almost overwhelming! **must resist**
(How was this loop ever OK previously, if there were lines with a missing unicode value in the input? Or was the tool run on mapping files that didn't have such lines? I'm assuming we run this on the mapping files exactly as downloaded from the Unicode site, but perhaps we used another source, or preprocessed them somehow? I'm not seeing a lot of documentation....)
Attachment #8482407 -
Flags: review?(jfkthame) → review+
Comment 15•10 years ago
|
||
Comment on attachment 8482408 [details] [diff] [review]
Generated files from the patched umaptable.c
Review of attachment 8482408 [details] [diff] [review]:
-----------------------------------------------------------------
rs=me; I haven't tried to check all the changes here, but (a) they're generated files, so provided the source data and tool are OK, we should be fine; (b) a few spot-checks look reasonable; and (c) I presume you've made sure the tests still pass. :)
Attachment #8482408 -
Flags: review?(jfkthame) → review+
Assignee | ||
Comment 16•10 years ago
|
||
(In reply to Jonathan Kew (:jfkthame) from comment #14)
> I guess this is fine for now, though the temptation to re-write a bunch of
> this code is almost overwhelming! **must resist**
Tell me about it!!
> (How was this loop ever OK previously, if there were lines with a missing
> unicode value in the input? Or was the tool run on mapping files that didn't
> have such lines? I'm assuming we run this on the mapping files exactly as
> downloaded from the Unicode site, but perhaps we used another source, or
> preprocessed them somehow? I'm not seeing a lot of documentation....)
I can only assume that either earlier versions of the files on the Unicode site didn't have the "UNDEFINED" lines, or that ftang deleted such lines manually before running the tool.
Assignee | ||
Comment 17•10 years ago
|
||
Assignee | ||
Comment 18•10 years ago
|
||
(In reply to Gordon P. Hemsley [:GPHemsley] from comment #12)
> You can link directly to
> https://bugzilla.mozilla.org/enter_bug.
> cgi?product=Core&component=Internationalization
Oops, I forgot to make this change before checking in. I'll do that later.
Comment 19•10 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/aba0740931a9
https://hg.mozilla.org/mozilla-central/rev/8569d3c584c2
https://hg.mozilla.org/mozilla-central/rev/f3375e1c8736
https://hg.mozilla.org/mozilla-central/rev/2cbf3ad53beb
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
Reporter | ||
Comment 20•10 years ago
|
||
(In reply to Simon Montagu :smontagu from comment #18)
> (In reply to Gordon P. Hemsley [:GPHemsley] from comment #12)
> > You can link directly to
> > https://bugzilla.mozilla.org/enter_bug.
> > cgi?product=Core&component=Internationalization
>
> Oops, I forgot to make this change before checking in. I'll do that later.
When/where?
You need to log in
before you can comment on or make changes to this bug.
Description
•