Closed Bug 212547 Opened 22 years ago Closed 22 years ago

Saving a Web Page as Text Yields Unwanted "Enhancements" for Styles, Links, and Special Characters

Categories

(Core :: DOM: Serializers, defect)

Platform: x86, Windows 98
Type: defect
Priority: Not set
Severity: minor

RESOLVED DUPLICATE of bug 135239

People

(Reporter: david, Unassigned)

Details

User-Agent: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko/20030624
Build Identifier: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko/20030624

When I open an HTML file in the browser and then save it as text, special characters are saved as question marks, links are expanded and shown, Italic text is bracketed with virgules, bold text is bracketed with asterisks, etc.

Reproducible: Always

Steps to Reproduce:
1. Display the Web page at the cited URL.
2. From the menu bar, select File > Save Page As.
3. In the Save As window, select Text Files for Save as type.
4. In the Save As window, change the .html extension to .txt (another problem).
5. In the Save As window, select the Save button.

Actual Results:
The file was saved with "enhancements" as described above. Em-dashes (—) were saved as question marks. (Ellipses (…) on a different page were also saved as question marks.) Text within a link was followed by the link itself.

Expected Results:
I expected a file that showed running text as it would be seen if all stylistic markups had been suppressed but with the hex bytes for special characters preserved (so that they could be viewed via Wordpad).

Workaround:
Display the page. Select the entire page. Copy it and then paste it into a Notepad or Wordpad window. This, however, shows the effects of bug #99159.
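As an illustration of the expected result described above, here is a minimal Python sketch that keeps only the running text of a page and drops all markup, including link targets. It is not Mozilla's serializer; the class `PlainTextExtractor`, the function `html_to_plain_text`, and the sample markup are hypothetical names made up for this example. The last line shows one way a question mark can arise (encoding to a character set that lacks the character); whether that is what happened in this bug is an assumption, not something the report confirms.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Hypothetical helper: collects character data and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True turns references such as &mdash; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; attributes such as href never reach here
        self.parts.append(data)


def html_to_plain_text(markup):
    """Return the running text of `markup` with no style or link decoration."""
    parser = PlainTextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# Sample markup invented for this sketch
sample = (
    '<p>See <a href="http://example.com/">this <i>fine</i> page</a> '
    "&mdash; it is <b>bold</b>.</p>"
)

text = html_to_plain_text(sample)

# UTF-8 keeps the dash; a lossy encode replaces it with a question mark
utf8_bytes = text.encode("utf-8")
ascii_bytes = text.encode("ascii", errors="replace")
```

In this sketch the italic and bold words carry no virgules or asterisks, and the link text is not followed by its URL.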
See also bug 131166 and bug 138568.

*** This bug has been marked as a duplicate of 135239 ***
Status: UNCONFIRMED → RESOLVED
Closed: 22 years ago
Component: Browser-General → DOM to Text Conversion
QA Contact: general → sujay
Resolution: --- → DUPLICATE
See bug #212554, which restates this after eliminating those parts that duplicate bug #135239. It was easier to write a new bug than to resurrect and revise this one.
One thing you should note: &#151; is not an em-dash, and &#133; is not an ellipsis, regardless of what the original reporter said. Numeric character references, according to the HTML standards, are with respect to the Unicode character code positions, not any particular character encoding. While the proprietary Windows-1252 encoding does indeed have an em-dash at position 151 (decimal) and an ellipsis at position 133, the Unicode character set has control characters in these positions (as does ISO-8859-1). Since this range of control characters is specifically disallowed in HTML according to the standards, their meaning in an HTML document is undefined (and a validator will reject a page that contains them).
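The distinction in the comment above can be checked with a short Python sketch. It reads the values 151 and 133 two ways: as Unicode code positions (which is how a numeric character reference is defined) and as single bytes in Windows-1252. The helper name `describe` is made up for this example.

```python
import unicodedata


def describe(code):
    """Return (Unicode general category, Windows-1252 character name) for `code`."""
    as_code_point = chr(code)                        # the Unicode code position
    as_cp1252_byte = bytes([code]).decode("cp1252")  # the same number as a cp1252 byte
    return (
        unicodedata.category(as_code_point),
        unicodedata.name(as_cp1252_byte),
    )


readings = {code: describe(code) for code in (151, 133)}

for code, (category, cp1252_name) in readings.items():
    print(
        f"{code}: Unicode U+{code:04X} has category {category}; "
        f"Windows-1252 byte 0x{code:02X} is {cp1252_name}"
    )
```

Category `Cc` means a control character. The characters the reporter meant live at U+2014 and U+2026, so the corresponding numeric references would be `&#8212;` and `&#8230;`.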