Closed Bug 1058021 Opened 10 years ago Closed 10 years ago

Make undefined codepoints from 0x80-0x9f in windows-* encodings decode to Unicode U+0080-U-009F instead of U+FFFD

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla35

People

(Reporter: GPHemsley, Assigned: smontagu)

References

()

Details

Attachments

(4 files)

According to the W3C's test results, windows-1250 and windows-1251 are decoding a handful of characters incorrectly:

http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases#windows-1250

http://www.w3.org/International/tests/repository/run?manifest=encoding/indexes&test=windows-1250_test

http://www.w3.org/International/tests/repository/run?manifest=encoding/indexes&test=windows-1251_test

(It looks like they're actually not being decoded at all: They all result in U+FFFD.)

Not sure how this relates to bug 379840, if at all.
So from comparing http://encoding.spec.whatwg.org/index-windows-1250.txt with ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1250.TXT and parallels, it appears that where codepoints are listed as "UNDEFINED" in the mapping tables, the Encoding Standard's tables give the same numeric codepoint as passthru output? Anne, is that a correct summary? 

Personally, I think our behaviour makes more sense, since "undefined" means "undefined", but since we aready do the same thing with windows-1252, we might as well change to become consistent with ourselves and other browsers.
Assignee: nobody → smontagu
Flags: needinfo?(annevk)
These tests might be wrong and the standard might be wrong. Did anyone actually do proper analysis? If not, I will do so later.
Flags: needinfo?(annevk)
I can't answer comment 3 directly, because I have no clue what constitutes "proper analysis". The information I based my comment on is as follows:

The failures in http://www.w3.org/International/tests/repository/run?manifest=encoding/indexes&test=windows-1250_test are:

Cell	Expected	Got
81 	0x0081  	FFFD �
83	0x0083          FFFD �
88	0x0088	        FFFD �
90	0x0090   	FFFD �
98	0x0098  	FFFD �

The corresponding lines in http://encoding.spec.whatwg.org/index-windows-1250.txt are:
1       0x0081   (<control>)
3       0x0083   (<control>)
8       0x0088   (<control>)
16      0x0090   (<control>)
24      0x0098   (<control>)

and the corresponding lines in ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1250.TXT are
0x81	      	#UNDEFINED
0x83	      	#UNDEFINED
0x88	      	#UNDEFINED
0x90	      	#UNDEFINED
0x98	      	#UNDEFINED

Bug 116143 changed windows-1252 to make UNDEFINED codepoints pass-thru to Unicode codepoints.
Sorry, I meant checking with other browsers. But it does seem both IE10 and Chrome pass that test so that seems a bug in our code, indeed. I hope to review those tests within the next couple of weeks.
Just let me refine what I said in comment 2: it only applies to codepoints from 0x80 to 0x9F. Other undefined codepoints are left undefined in the Encoding Standard and decoded as U+FFFD by all browsers AFAICT.

See also bug 577945.
Summary: Fix support for windows-* encodings → Make undefined codepoints from 0x80-0x9f in windows-* encodings decode to Unicode U+0080-U-009F instead of U+FFFD
The tests and results have been updated to check what happens if there is no line for a pointer in the index file. According to the single-byte decoding algorithm, this should produce U+FFFD. See the updated results at
http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases

I have tried to indicate, where the pass is only partial, how many errors were due to U+FFFD not being served, vs. how many were due to unexpected characters being served that are not those in the tables. I did that in the summary. For details, open the test in the relevant browser (by clicking on the link to the left of the row). See for example
http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases#iso-8859-6

The main differences are for windows-1253 and windows-874 and Chrome/Safari/Opera, but also 6 more IE boxes turned orange.
The decoding and encoding tests used to just skip over undefined codepoints. I've added them here on the principle described above: 0x80 - 0x9f pass thru to Unicode (and therefore round-trip); other undefined codepoints decode as U+FFFD
Attachment #8482403 - Flags: review?(annevk)
Attachment #8482404 - Flags: review?(jfkthame)
Attachment #8482408 - Flags: review?(jfkthame)
Comment on attachment 8482407 [details] [diff] [review]
Change the mapfile generator to pass thru undefined codepoints from 0x80-0x9F and update the comments

Review of attachment 8482407 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/uconv/tools/umaptable.c
@@ +173,5 @@
>  	printf("  You can find this tool under mozilla/intl/uconv/tools/umaptable.c.\n");
>  
> +	printf("  If you have any problems with this file, please file a bug\n");
> +	printf("  under the \"Internationalization\" component in\n");
> +	printf("  https://bugzilla.mozilla.org/enter_bug.cgi?product=Core\n");

You can link directly to https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&component=Internationalization

Any particular reason you didn't?
Comment on attachment 8482403 [details] [diff] [review]
Bring our tests into line with the w3.org tests

Review of attachment 8482403 [details] [diff] [review]:
-----------------------------------------------------------------

That looks fine. Thanks for getting to this so quickly.
Attachment #8482403 - Flags: review?(annevk) → review+
Attachment #8482404 - Flags: review?(jfkthame) → review+
Comment on attachment 8482407 [details] [diff] [review]
Change the mapfile generator to pass thru undefined codepoints from 0x80-0x9F and update the comments

Review of attachment 8482407 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/uconv/tools/umaptable.c
@@ +437,2 @@
>            sscanf(buf,"%hx %hx",&c,&u);
> +          if (u == -1 && 0x80 <= c && c <=0x9f)

I guess this is fine for now, though the temptation to re-write a bunch of this code is almost overwhelming! **must resist**

(How was this loop ever OK previously, if there were lines with a missing unicode value in the input? Or was the tool run on mapping files that didn't have such lines? I'm assuming we run this on the mapping files exactly as downloaded from the Unicode site, but perhaps we used another source, or preprocessed them somehow? I'm not seeing a lot of documentation....)
Attachment #8482407 - Flags: review?(jfkthame) → review+
Comment on attachment 8482408 [details] [diff] [review]
Generated files from the patched umaptable.c

Review of attachment 8482408 [details] [diff] [review]:
-----------------------------------------------------------------

rs=me; I haven't tried to check all the changes here, but (a) they're generated files, so provided the source data and tool are OK, we should be fine; (b) a few spot-checks look reasonable; and (c) I presume you've made sure the tests still pass. :)
Attachment #8482408 - Flags: review?(jfkthame) → review+
(In reply to Jonathan Kew (:jfkthame) from comment #14)
> I guess this is fine for now, though the temptation to re-write a bunch of
> this code is almost overwhelming! **must resist**

Tell me about it!!

> (How was this loop ever OK previously, if there were lines with a missing
> unicode value in the input? Or was the tool run on mapping files that didn't
> have such lines? I'm assuming we run this on the mapping files exactly as
> downloaded from the Unicode site, but perhaps we used another source, or
> preprocessed them somehow? I'm not seeing a lot of documentation....)

I can only assume that either earlier versions of the files on the Unicode site didn't have the "UNDEFINED" lines, or that ftang deleted such lines manually before running the tool.
(In reply to Gordon P. Hemsley [:GPHemsley] from comment #12)

> You can link directly to
> https://bugzilla.mozilla.org/enter_bug.
> cgi?product=Core&component=Internationalization

Oops, I forgot to make this change before checking in. I'll do that later.
(In reply to Simon Montagu :smontagu from comment #18)
> (In reply to Gordon P. Hemsley [:GPHemsley] from comment #12)
> > You can link directly to
> > https://bugzilla.mozilla.org/enter_bug.
> > cgi?product=Core&component=Internationalization
> 
> Oops, I forgot to make this change before checking in. I'll do that later.

When/where?
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: