Add &Szlig; entity
Categories: Core :: DOM: HTML Parser, enhancement
People: Reporter: marius.spix, Unassigned
Steps to reproduce:
The Latin small letter ß has a named entity, &szlig;. The capital version ẞ is currently missing a named entity.
For consistency, I propose adding the named entity &Szlig;.
According to DIN 5008:2020, when using capital letters, the capital ẞ is preferable to the resolution in SS or SZ. This is also backed by the Leibniz Institute for the German Language and the Duden (German standard dictionary).
Actual results:
While the entity &szlig; is supported by Firefox, Thunderbird, SeaMonkey and other products, its uppercase equivalent &Szlig; is not. This case is different from bug #1625458, because the lowercase ß already has a named entity and there is no logical reason why the uppercase letter should not have one as well.
Expected results:
Many keyboard layouts and character encodings (e.g. the common Windows-1252) do not support uppercase ẞ, and lowercase ß already has a named entity. Mozilla should therefore follow the recommendation of DIN 5008:2020 and support that character by adding the &Szlig; entity.
Updated•3 months ago (Reporter)
Comment 1•3 months ago
The set of supported entities comes from the HTML spec.
I'm marking this as WONTFIX in the sense of not adding a named character reference that's not in the spec. A change can be made to Firefox's parser on this point if the spec changes. That is, the correct venue for discussing this isn't this bug tracker but the issue tracker for the spec.
For setting expectations: The set of named character references is the union of names from HTML 4 and MathML circa 2007. There's reluctance to add items to the list, since adding items to the list does not expand the expressiveness of HTML (that is, you can already express ẞ as UTF-8 bytes) and in browsers that implement the current list, the user-visible outcome of &Szlig; would be worse than the user-visible outcome of ẞ as UTF-8 bytes. In that sense, adding named characters to HTML is a poor substitute for updates to actual system-level input methods.
(Not for the reporter but other folks who might be reading this: The orthography board changed its opinion to make the capital ẞ preferable only very recently.)
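As an illustration of the user-visible outcome described above, here is a minimal Python sketch. It assumes that the standard library's `html.unescape`, which uses Python's copy of the HTML5 named character reference table, is an acceptable stand-in for a parser that implements the current list; it is not Firefox's parser.

```python
import html
from html.entities import html5

# Python's table of HTML5 named character references (keys omit the leading "&").
has_lower = "szlig;" in html5
has_upper = "Szlig;" in html5

# A known reference is decoded; an unknown one is left in the text as typed.
lower = html.unescape("&szlig;")
upper = html.unescape("&Szlig;")

print(has_lower, has_upper)
print(lower, upper)
```

Under this assumption, a reader of a page containing &Szlig; would see the literal reference text rather than ẞ, whereas ẞ written as UTF-8 bytes displays as intended.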
Comment 2•3 months ago
It turns out that this was already filed at https://github.com/whatwg/html/issues/11782
Comment 3•3 months ago (Reporter)
(In reply to Henri Sivonen (:hsivonen) from comment #1)
> The set of supported entities comes from the HTML spec.
> I'm marking this as WONTFIX in the sense of not adding a named character reference that's not in the spec. A change can be made to Firefox's parser on this point if the spec changes. That is, the correct venue for discussing this isn't this bug tracker but the issue tracker for the spec.
Dear Henri,
thank you for pointing that out.
I think we have a chicken-and-egg problem here. I already reported that issue to WHATWG here, but they would only add this to the standard if implementers show interest. According to the Mozilla Manifesto, Mozilla stands for openness and a belief in the ability of the internet to enrich the lives of people. This also includes supporting languages like German, which is spoken by over 155 million people on earth.
This feature request is a special case, because the lowercase letter already has a named entity, so an uppercase version would be logical and obvious.
So I kindly ask you to reopen that issue and consider it for discussion.
Best regards,
Marius
Comment 4•3 months ago
> This also includes supporting languages like German
ẞ is already supported as evidenced by it showing up on this page. There are plenty of characters in a wide variety of scripts used by many more languages that are supported by browsers without HTML-level named characters.
> This feature request is a special case, because the lowercase letter already has a named entity, so an uppercase version would be logical and obvious.
As I'm sure you are aware, this isn't an oversight and the reason for the asymmetry is that ẞ in Unicode post-dates the HTML 4 entity set. The history around the official stance on whether ß has an upper-case form has much more problematic effects on the technology stack, including the default language-untailored Unicode case mapping in the context of the stability policy that it is under.
Comment 5•3 months ago (Reporter)
(In reply to Henri Sivonen (:hsivonen) from comment #4)
> This also includes supporting languages like German
> ẞ is already supported as evidenced by it showing up on this page. There are plenty of characters in a wide variety of scripts used by many more languages that are supported by browsers without HTML-level named characters.
> This feature request is a special case, because the lowercase letter already has a named entity, so an uppercase version would be logical and obvious.
> As I'm sure you are aware, this isn't an oversight and the reason for the asymmetry is that ẞ in Unicode post-dates the HTML 4 entity set. The history around the official stance on whether ß has an upper-case form has much more problematic effects on the technology stack, including the default language-untailored Unicode case mapping in the context of the stability policy that it is under.
Thank you again for this valuable information.
ẞ is showing on this page because I copied it from a character table and this page is encoded in UTF-8. There are plenty of web pages using encodings like Windows-1252 or ISO-8859-1 (aka Latin-1), which do not support the uppercase sharp s. In these encodings ẞ leads to mojibake, so you have to use the hard-to-remember codes &#7838; or &#x1E9E;. Adding an additional named SGML entity to MathML and HTML would be an elegant way to provide support for this character and complete the set, because the lowercase variant already has a named entity.
Comment 6•3 months ago (Reporter)
Regarding the case mapping, there are other examples such as the Turkish dotless i, the Greek sigma and the Dutch ij (which becomes IJ in title case). These are handled specially at the language level.
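The case-mapping point can be made concrete with a small Python sketch. It relies on the fact that Python's `str.upper()` and `str.lower()` apply the default, language-untailored Unicode mappings; the variable names are only illustrative.

```python
lower_sharp_s = "\u00df"  # ß, LATIN SMALL LETTER SHARP S
upper_sharp_s = "\u1e9e"  # ẞ, LATIN CAPITAL LETTER SHARP S

# The default uppercase mapping of ß is the two-letter sequence "SS", not ẞ.
upper_of_lower = lower_sharp_s.upper()

# ẞ does lowercase to ß, so the default mappings are asymmetric.
lower_of_upper = upper_sharp_s.lower()

# Round-tripping ß through upper() and lower() therefore yields "ss".
round_trip = lower_sharp_s.upper().lower()

# Without Turkish tailoring, dotted i and dotless ı both uppercase to "I".
dotted_upper = "i".upper()
dotless_upper = "\u0131".upper()

print(upper_of_lower, lower_of_upper, round_trip, dotted_upper, dotless_upper)
```

Getting ẞ from ß, or İ from i, requires language-specific tailoring on top of these defaults, which is the kind of special treatment mentioned above.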
Comment 7•3 months ago
(In reply to marius.spix from comment #5)
> There are plenty of web pages using encodings like Windows-1252 or ISO-8859-1 (aka Latin-1), which do not support the uppercase sharp s. In these encodings ẞ leads to mojibake, so you have to use the hard-to-remember codes &#7838; or &#x1E9E;.
As you note, in the case of legacy encodings, it is possible to express the character using numeric character references, even if it is not particularly ergonomic to write numeric character references by hand.
The HTML spec requires authors to use UTF-8 for newly-created documents, so improving the ergonomics of using legacy encodings when writing HTML source in a text editor (as opposed to using a serializer) isn't a persuasive reason to change the spec.
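The difference between legacy encodings, numeric character references and UTF-8 discussed in this thread can be sketched in Python. This is only an illustration using the standard codecs; `cp1252` is Python's name for Windows-1252, and the `xmlcharrefreplace` error handler stands in for what a serializer would do.

```python
import html

capital = "\u1e9e"  # ẞ
small = "\u00df"    # ß

# Lowercase ß has a single-byte representation in Windows-1252.
small_bytes = small.encode("cp1252")

# Capital ẞ has none, so a strict encode fails.
try:
    capital.encode("cp1252")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# A serializer can fall back to a decimal numeric character reference.
ncr = capital.encode("cp1252", errors="xmlcharrefreplace")

# Both the decimal and the hexadecimal reference decode back to ẞ.
decoded = html.unescape(ncr.decode("cp1252"))
hex_decoded = html.unescape("&#x1E9E;")

# In UTF-8 the character is simply three bytes, with no reference needed.
utf8_bytes = capital.encode("utf-8")

# Mojibake: the UTF-8 bytes misread as Windows-1252.
mojibake = utf8_bytes.decode("cp1252")

print(small_bytes, encodable, ncr, decoded, hex_decoded, utf8_bytes, mojibake)
```

The sketch shows why a serializer removes most of the ergonomic burden: the numeric reference is generated automatically whenever the target encoding cannot represent the character.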