Closed
Bug 212547
Opened 22 years ago
Closed 22 years ago
Saving a Web Page as Text Yields Unwanted "Enhancements" for Styles, Links, and Special Characters
Categories
(Core :: DOM: Serializers, defect)
RESOLVED
DUPLICATE
of bug 135239
People
(Reporter: david, Unassigned)
Details
User-Agent: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko/20030624
Build Identifier: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko/20030624
When I open an HTML file in the browser and then save it as text, special
characters are saved as question marks, links are expanded and shown, italic
text is bracketed with virgules, bold text is bracketed with asterisks, etc.
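To make the complaint concrete, here is a minimal sketch of a text converter with and without such "enhancements". It is not Mozilla's serializer: the class TextSerializer, the function to_text, the " <url>" link format, and the example.org URL are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextSerializer(HTMLParser):
    """Hypothetical HTML-to-text converter (not Mozilla's implementation)."""

    # Assumed markers: virgules for italic, asterisks for bold.
    MARKS = {"i": "/", "em": "/", "b": "*", "strong": "*"}

    def __init__(self, formatted):
        super().__init__()
        self.formatted = formatted  # True: add markers; False: running text only
        self.out = []
        self.hrefs = []             # stack of href values for open <a> elements

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))
        elif self.formatted and tag in self.MARKS:
            self.out.append(self.MARKS[tag])

    def handle_endtag(self, tag):
        if tag == "a":
            href = self.hrefs.pop() if self.hrefs else None
            if self.formatted and href:
                # Assumed format: link text followed by the target in angle brackets.
                self.out.append(" <" + href + ">")
        elif self.formatted and tag in self.MARKS:
            self.out.append(self.MARKS[tag])

    def handle_data(self, data):
        self.out.append(data)


def to_text(html, formatted):
    parser = TextSerializer(formatted)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = '<p>See <a href="http://example.org/">the <i>old</i> and <b>new</b> pages</a>.</p>'
print(to_text(page, formatted=True))   # See the /old/ and *new* pages <http://example.org/>.
print(to_text(page, formatted=False))  # See the old and new pages.
```

The second call corresponds to what this report asks for: running text with all stylistic markup suppressed.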
Reproducible: Always
Steps to Reproduce:
1. Display the Web page at the cited URL.
2. From the menu bar, select File > Save Page As.
3. In the Save As window, select Text Files for Save as type.
4. In the Save As window, change the .html extension to .txt (another problem)
5. In the Save As window, select the Save button.
Actual Results:
The file was saved with "enhancements" as described above. Em-dashes (&#151;)
were saved as question marks. (Ellipses (&#133;) on a different page were also
saved as question marks.) Text within a link was followed by the link itself.
Expected Results:
I expected a file that showed running text as it would be seen if all stylistic
markups had been suppressed but with the hex bytes for special characters
preserved (so that they could be viewed via Wordpad).
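One common way for question marks to appear is an encoder substituting "?" for characters that the target encoding cannot represent. Whether that is what happened here is an assumption; the sketch below only contrasts that behavior with the byte-preserving result expected above. The helper save_as_text and the sample string are hypothetical.

```python
def save_as_text(text, encoding):
    """Hypothetical helper: encode text, replacing unencodable characters with '?'."""
    return text.encode(encoding, errors="replace")


# A real em-dash (U+2014) and a real ellipsis (U+2026).
sample = "wait\u2014what\u2026"

# ISO-8859-1 has neither character, so both become question marks.
print(save_as_text(sample, "latin-1"))  # b'wait?what?'

# Windows-1252 has both, at bytes 0x97 (151) and 0x85 (133).
print(save_as_text(sample, "cp1252"))   # b'wait\x97what\x85'
```

The second result is the kind of output that a Windows-1252 viewer such as Wordpad would display as the intended characters.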
Workaround:
Display the page. Select the entire page. Copy it and then paste it into a
Notepad or Wordpad window.
This, however, shows the effects of bug #99159.
Comment 1•22 years ago
Status: UNCONFIRMED → RESOLVED
Closed: 22 years ago
Component: Browser-General → DOM to Text Conversion
QA Contact: general → sujay
Resolution: --- → DUPLICATE
Comment 2•22 years ago (Reporter)
See bug #212554, which restates this after eliminating those parts that
duplicate bug #135239. It was easier to write a new bug than to resurrect and
revise this one.
Comment 3•22 years ago
One thing you should note: &#151; is not an em-dash, and &#133; is not an
ellipsis, regardless of what the original reporter said. Numeric character
references, according to the HTML standards, are with respect to the Unicode
character code positions, not any particular character encoding. While the
proprietary Windows-1252 encoding does indeed have an em-dash at position 151
(decimal) and an ellipsis at position 133, the Unicode character set has control
characters in these positions (as does ISO-8859-1). Since this range of control
characters is specifically disallowed in HTML according to the standards, their
meaning in an HTML document is undefined (and a validator will reject a page
that contains them).
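The distinction drawn in this comment can be checked with a short sketch that looks up each number twice: once as a Unicode code position (what a numeric character reference means) and once as a Windows-1252 byte. The function name describe is an assumption for illustration.

```python
import unicodedata


def describe(code):
    """Return (Unicode category of code point `code`, name of Windows-1252 byte `code`)."""
    as_reference = chr(code)                    # meaning of &#code; per the HTML standards
    as_cp1252 = bytes([code]).decode("cp1252")  # meaning of byte `code` in Windows-1252
    return unicodedata.category(as_reference), unicodedata.name(as_cp1252)


for code in (151, 133):
    print(code, describe(code))
# 151 ('Cc', 'EM DASH')
# 133 ('Cc', 'HORIZONTAL ELLIPSIS')
```

Category "Cc" marks a control character. The references that do denote these characters are &#8212; (U+2014, em-dash) and &#8230; (U+2026, ellipsis).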