Closed
Bug 131166
Opened 22 years ago
Closed 22 years ago
[4xp] FileSaveAs .TXT Does Not Strip All Markup
Categories
(Core :: DOM: Serializers, defect)
Tracking
()
VERIFIED
WONTFIX
People
(Reporter: mrmazda, Assigned: t_mutreja)
References
()
Details
Attachments
(3 files)
2002031416 OS/2
URL is just one example. OS and Platform as per bug 70045 comment #17, which has a different example URL.
Steps to reproduce: Go to a site using ASP and/or JavaScript and choose to save the page as text.
Actual behavior: Some markup tags are stripped.
Expected behavior: All markup tags are stripped.
Comment 1•22 years ago
Confirmed. Over to DOM-to-text conversion. I see this on current tip linux.
Assignee: law → harishd
Status: UNCONFIRMED → NEW
Component: File Handling → DOM to Text Conversion
Ever confirmed: true
QA Contact: sairuh → sujay
Comment 3•22 years ago
*** Bug 132943 has been marked as a duplicate of this bug. ***
Assignee | ||
Comment 4•22 years ago
My observation here is that markup tags are stripped out but the value of "href" attributes for *some* of the <A> tags is displayed as "<...>" which gives the feeling that some tags are still there. Mrmazda, would you please confirm this? If the above is true then we need to decide what behavior we expect in such cases. To me it looks like bug#66035, which was related to images and we decided that in plain text mode, img src URL should not be pasted.
Status: NEW → ASSIGNED
Reporter | ||
Comment 5•22 years ago
I'm not sure what you are describing or asking. Here's what Mozilla saved as text from http://us.imdb.com/Name?Bax,+Kylie.
Reporter | ||
Comment 6•22 years ago
Same http://us.imdb.com/Name?Bax,+Kylie saved by Netscape 4.x. <...> is absent, though I wouldn't like to see Mozilla add the [image]'s all over.
Reporter | ||
Comment 7•22 years ago
On further study it appears <...> is left in place only when it contains one or more slash characters.
Assignee | ||
Comment 8•22 years ago
Thanks for the attachments. It's exactly what I mentioned above. If you view the source for this page you will see that things like "</Top/>" in your saved .txt file are not tags but the values of href in <a href=...> elements, which we display between a pair of '<' & '>'. It seems that at present we retain this information in the serialized text. I need to check whether this is the decided behavior or whether we should treat it as a bug. I feel this is a bug, and the fix would be very simple.
Reporter | ||
Comment 9•22 years ago
In the absence of a complete URL, I can't imagine what value href attributes would provide in a text file. This looks to me like a bug.
Comment 10•22 years ago
Since I am one of the ones who reported this behavior as a bug, let me tell you why it is seen as a bug: One of the uses of saving a URL as text-only is to allow data to be saved as whitespace-delimited ASCII (stock data from Yahoo, for example). All of the partial HTML tags left hanging around the ASCII file that Mozilla creates garbage up the file sufficiently that manual editing is necessary to clean it up. This is clearly not desired behavior. I suggest that you use Netscape 4.x as a definition of design requirement for this feature. --Doug
Comment 11•22 years ago
There were long discussions of this, and I believe this is what we eventually agreed was best. BenB was active in the discussions too (and also in the recent discussions of whether img tags should output their src attributes in copy or in save) -- cc'ing him. The argument for preserving the hrefs is that you lose a lot of data otherwise, since they're often an important part of the content (e.g. mail messages where someone sends you a link to check out, but the link text just says "here"). We don't output the link in unformatted mode (e.g. copy/paste), only in formatted mode (e.g. save as text). I could easily see an argument for having two different "save as text" options, e.g. "Save as text with formatting" (or something, need help from tech pubs here), "Save as simple text", where the difference would be setting the formatted flag.
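The formatted/unformatted distinction described above can be illustrated with a small sketch. This is not Mozilla's serializer; it is a minimal Python approximation built on the standard `html.parser` module, and the `formatted` flag, the class name, and `to_text` are hypothetical names chosen for the illustration. With the flag set, each link target is written after the link text inside angle brackets; with it cleared, only the visible text survives.

```python
from html.parser import HTMLParser


class LinkAwareText(HTMLParser):
    """Collect visible text; in formatted mode, append each href as <href>."""

    def __init__(self, formatted=True):
        super().__init__()
        self.formatted = formatted  # hypothetical flag, mirrors the comment above
        self.parts = []             # output fragments, in document order
        self.open_links = []        # href values of <a> elements not yet closed

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.open_links.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self.open_links:
            href = self.open_links.pop()
            if self.formatted and href:
                self.parts.append(f" <{href}>")

    def handle_data(self, data):
        self.parts.append(data)


def to_text(html, formatted=True):
    """Convert an HTML fragment to plain text under the assumptions above."""
    parser = LinkAwareText(formatted)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


page = 'See <a href="/Top/">the top list</a> today.'
print(to_text(page, formatted=True))   # See the top list </Top/> today.
print(to_text(page, formatted=False))  # See the top list today.
```

The first output line reproduces the "</Top/>" pattern seen in the attachments: the angle brackets come from the link target, not from a leftover tag.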
Reporter | ||
Comment 12•22 years ago
Discussions where? This what? Who is BenB? Check the description. This bug is about saving files from the browser, not from email. Bug 66035 doesn't apply to this bug. Text means no formatting. UgaddaBkidding if you think </tiger_redirect?FT_LIC&/Licensing/> (from first attachment) in a "text" file meets an ordinary user's expectations or serves any useful purpose. It is not a complete URL, not that even a complete URL would meet a user's expectations in any way. In display of an HTML file, that gibberish is hidden from the user unless made explicit outside of '<' & '>'. What possible reason could there be for it to magically appear in a text file? If the user wants HTML, she should save as HTML. TXT should mean a file that matches as closely as possible the text that is actually displayed in the browser view pane.
Keywords: 4xp
Comment 13•22 years ago
> Discussions where? This what? Who is BenB? Check the description.
Phew. I am tempted to just close as WONTFIX. I am BenB, I wrote part of the converter in question here.
> saving files from the browser, not from email. Bug 66035 doesn't apply to
> this bug.
Bug 66035 is also about the browser.
> </tiger_redirect?FT_LIC&/Licensing/> [...] is not a complete URL
Right, that is a bug, but a different one. Filed bug 134457.
> not that even a complete URL would meet a user's expectations in any way.
How are you to say that? I do expect to see URLs in the output, and consider 4.x's failure to do so a severe bug, which made the converter useless for me in many cases. No need to copy bugs.
> In display of an HTML file, that gibberish is hidden from the user
Yes, that's why HTML is superior to plaintext. But that doesn't mean that we should drop that info in plaintext. People do add URLs to hand-written documents.
> TXT should mean a file that matches as closely as possible the text that is
> actually displayed in the browser view pane.
No, the output should preserve all important information in the document and write it out the way a human writer would do.
> On further study it appears <...> is left in place only when it contains one
> or more slash characters.
No, but you probably see bug 122877.
I think that URLs are an essential part of a document. Sometimes (e.g. on Telepolis <http://www.heise.de/tp/>), they are the only hint to further information or sources that an author inserted. With many HTML documents, this creates very readable and useful output. In fact, in some cases, you might not even know why some text is there, if you don't see that there was a URL. Removing the URLs would leave no trails of the information in the original document and thus be dataloss.
IMO, this is WONTFIX or a dup of bug 46990.
Keywords: 4xp
Comment 14•22 years ago
> I could easily see an argument for having two different "save as text"
> options, e.g. "Save as text with formatting" (or something, need help from tech
> pubs here), "Save as simple text", where the difference would be setting
> the formatted flag.
That would work for me. I could also see a special mode that skips (only) URLs.
Comment 15•22 years ago
I'm with mrmazda@atlantic.net on this. If you want a mode that allows you to save HTML tags (or some portion thereof) with plain text, fine. But I don't need it. When I want to save a file as plain text, I want PLAIN TEXT. No HTML. Period. If you insist that saving text with partial HTML tags is essential, then please make a mode that does it. But don't bugger up the mode that is supposed to save PLAIN TEXT. Some of us users out here (DING! User Speaking Here!) need it. --Doug
Comment 16•22 years ago
It *is* plain text, just like this comment and comment 13 with the URL is plaintext. There are no HTML tags. The <> is the official and correct way to insert URLs into plaintext, see RFC2396, Appendix E:
| there are many occasions when URI are included in plain text [...]
| In practice, URI are delimited in a variety of ways, but usually
| within double-quotes "http://test.com/", angle brackets
| <http://test.com/>, or just using whitespace [...]
| Using <> angle brackets around each URI is especially recommended
| as a delimiting style for URI that contain whitespace.
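To make the angle-bracket convention concrete, here is a rough Python sketch of how a reader of such a text file might pull the delimited URIs back out. The function name and the regular expression are assumptions made for this illustration, not part of any Mozilla code. The pattern treats anything between '<' and '>' as a URI, so it would also pick up other bracketed strings; whitespace inside the brackets is dropped on the assumption that it was only added to wrap a long URI.

```python
import re

# Assumption: any run of characters between '<' and '>' is a delimited URI.
ANGLE_DELIMITED = re.compile(r"<([^<>]+)>")


def extract_uris(text):
    """Return the angle-bracket-delimited strings in text, whitespace removed."""
    return [re.sub(r"\s+", "", match) for match in ANGLE_DELIMITED.findall(text)]


sample = "Sometimes (e.g. on Telepolis <http://www.heise.de/tp/>), they are the only hint."
print(extract_uris(sample))  # ['http://www.heise.de/tp/']
```

A relative target such as </Top/> is extracted the same way, which is why it can be mistaken for a leftover tag.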
Comment 17•22 years ago
<http://www.heise.de/tp/english/inhalt/te/12163/1.html> Imagine the "[Local Link]" etc. were not there - that's a glitch in the page (there should be no alt text for images intended as visual clues). You see that there are several embedded links, which are very much part of the content. Not saving them would lose significant data, namely practically all references. Imagine a scientific article without references... I see that this doesn't work very nicely with overloaded commercial websites which have link bars on the left and right of each page. That's why I proposed the additional mode.
Assignee | ||
Comment 18•22 years ago
> I could easily see an argument for having two different "save as text"
> options, e.g. "Save as text with formatting" (or something, need help from
> tech pubs here), "Save as simple text", where the difference would be setting
> the formatted flag.
I agree with this but "Save as text with formatting" may create a wider
expectation.
For me, having the URLs in the text is meaningful only if we change all of them to
absolute ones. Retaining relative URIs creates nothing but confusion.
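Resolving a relative href against the address of the saved page is a small step. As an illustration only (this is Python's standard `urllib.parse.urljoin`, not the code path Mozilla uses), the relative target from the attachment becomes absolute like this:

```python
from urllib.parse import urljoin

# Address of the page that was saved, taken from comment 5.
page_url = "http://us.imdb.com/Name?Bax,+Kylie"

# Relative href as it appears in the saved text, without the angle brackets.
relative_href = "/Top/"

absolute = urljoin(page_url, relative_href)
print(f"<{absolute}>")  # <http://us.imdb.com/Top/>
```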
Reporter | ||
Comment 19•22 years ago
>> not that even a complete URL would meet a user's expectations in any way.
> How are you to say that? I do expect to see URLs in the output, and consider
> 4.x's failure to do so a severe bug, which made the converter useless for me
> in many cases. No need to copy bugs.
Embedded material is just that, embedded. It is not something seen, so it is not something expected from a save as text. If embedded material is essential to the page you are saving you have the option to save as HTML. Those whose expectation is to save no more than what is actually displayed on screen, under your definition of plain text, would have no option to save only their expectation. Text is readable words and numbers with intrinsic meaning. A URL has no intrinsic meaning, and I defy anyone to pronounce one. A URL has value only as an index for a browser to use to reach a web destination. Outside an HTML file displayed by a browser, it is just so much gibberish.
>> In display of an HTML file, that gibberish is hidden from the user
> Yes, that's why HTML is superior to plaintext. But that doesn't mean that we
> should drop that info in plaintext. People do add URLs to hand-written
> documents.
If HTML is superior for your purpose, then choose that format when you save, and don't foul up plain text for others who expect to save only the text that the browser is displaying. The browser is not displaying embedded URLs. When a human writes out a URL by hand, he does not use invisible ink. When a human writes out a URL by hand, it is an intentional act.
>> TXT should mean a file that matches as closely as possible the text that is
>> actually displayed in the browser view pane.
> No, the output should preserve all important information in the document and
> write it out the way a human writer would do.
If saving what is not displayed is that important to some, then formatted text is not the appropriate save format for them. It is not dataloss to not save what is not expected.
As its assignee, ben.bucksch@beonex.com should be familiar with bug 30888 and http://bugzilla.mozilla.org/show_bug.cgi?id=30888#c126, which addresses email message display, plain text vs. HTML. If including undisplayed material is so important to some, then the right solution is to add a new hybrid save format, much like "simple HTML" for displaying an email.
Keywords: 4xp
Comment 20•22 years ago
You defy yourself in your own post:
> Text is readable words and numbers with intrinsic meaning. A URL has no
> intrinsic meaning, and I defy anyone to pronounce one. A URL has value only
> as an index for a browser to use to reach a web destination. Outside a HTML
> file displayed by a browser, it is just so much gibberish.
But:
> As its assignee, ben.bucksch@beonex.com should be familiar with bug 30888
> and http://bugzilla.mozilla.org/show_bug.cgi?id=30888#c126, which addresses
> email message display
What the converter does is *exactly* to generate plaintext similar to the one you just wrote yourself manually. And that's why I strongly think that this bug is WONTFIX or even INVALID, at least the way it's formulated currently.
> When a human writes out a URL by hand, it is an intentional act.
And when they include a link in an HTML document, it isn't? In HTML, your comment would probably have looked like
[You] should be familiar with <a href="http://bugzilla.mozilla.org/show_bug.cgi?id=30888#c126">Jennifer's comment</a>, which addresses email message display
(At least, that's how I would have written it.) With this bug being "fixed", it would look like
[You] should be familiar with Jennifer's comment, which addresses email message display
which is *useless*. Currently, the converter outputs:
[You] should be familiar with Jennifer's comment <http://bugzilla.mozilla.org/show_bug.cgi?id=30888#c126>, which addresses email message display
which looks a lot like your hand-written comment (I'd argue, it's even better).
Comment 21•22 years ago
I'm curious. What part of the following user requirement is not understood by the development team?
Description: File->Save As->Text
Definition: The ability to save a URL as plain text, stripping out all html, retaining to the extent possible all text and tabular data
If there is another requirement to save a URL retaining some portion of the html, fine. But that is another requirement.
Reporter | ||
Comment 22•22 years ago
BenB: "What the converter does is *exactly* to generate plaintext similar to the one you just wrote yourself manually." There's a big difference. I was filling in a form in which I knew the form result would be HTML output that made that URL a clickable link in an HTML page. The URL itself had no intrinsic meaning. My intent was for it to be there exclusively as a link, a shortcut to click on so that you could conveniently get to the other bug with a mouse click. If I actually knew how (and it was not more work to do) to make the form cause the display of some other text and hide the actual URL, that is what I would have done. BenB: "And when they include a link in an HTML document, it isn't?" Not normally. Sometimes an HTML page author will put the URL in both as href content and as ordinary text, but the general rule is that the reader has no interest in the actual URL, only the ability to click the underlined blue word(s) to reach the described destination, so the author puts appropriate descriptive words in between the markup tags and leaves the URL itself hidden. BTW, as of 2002040216 OS/2, files saved as text from an asp are still littered with useless relative URL's.
Summary: [4xp] FileSaveAs .TXT Does Not Strip All Markup Tags → [4xp] FileSaveAs .TXT Does Not Strip All Markup
Comment 23•22 years ago
Oh, I understand you very well, I just disagree with your definitions. If you'd say that you would like to have *a* mode (in *addition* to the current one), where no URLs are inserted, I could subscribe to that (not that I'd implement it). Your claim about what plaintext is and why the current behaviour is wrong and a bug is what I oppose. This bug is not so overwhelmed with discussion that I'd suggest you file a new bug, if you want to suggest a new mode in addition to the current one.
Comment 24•22 years ago
Sorry, I missed the latest comment.
> I was filling in a form in which I knew the form result would be HTML output
> that made that URL a clickable link in an HTML page.
But you were wrong with that assumption. What you write is plaintext, and bugzilla sends it out as bugmail in plaintext. The fact that Bugzilla's HTML interface recognizes the URL is irrelevant, otherwise I could argue that you, when you view the saved plaintext file (result from SaveAs .txt) in Mozilla, have clickable URLs (currently, you don't, but that's definitely planned).
> The URL itself had no intrinsic meaning.
No, but I can copy it from a plaintext document and paste it in a browser urlbar (which is exactly the reason why they are there). So what?
> the reader has no interest in the actual URL, only the ability to click the
> underlined blue word(s) to reach the described destination
Of course, the reader is interested in the linked document only. Just that there is no other way in plaintext - other than including the URL - to get there. If you remove the URL, you also removed "the ability [...] to reach the described destination".
We can go on this way forever. I think that omitting URLs is dataloss, I am thus closing this as WONTFIX. Possible actions now:
- tmutreja or some other developer disagrees and reopens.
- You hope for bug 46990 to be fixed. (My favourite)
- You file a new bug about *adding* a mode that does what you want. I'd still think that that bug would be a dup of bug 46990, but if someone wants to implement it, then fine with me, since no harm is done (other than a more complicated UI).
Reporter | ||
Comment 25•22 years ago
In re http://bugzilla.mozilla.org/show_bug.cgi?id=131166#c23: My definition of plain text is the traditional one, that is, 100% lacking HTML markup, unlike that of BenB, who is trying to do a M$ and redefine a word with longstanding meaning, leaving the original meaning without a word to ascribe to it. Currently, Mozilla leaves markup in supposed plain text files. That means Mozilla is broken, leaving it deficient with respect to Netscape 4 behavior, which leaves zero markup in plain text file saves. There is no need for me to file a bug to ask for another save mode, since the definition for plain text save is already provided, and I am perfectly capable of saving in HTML mode if I need the markup.
Reporter | ||
Comment 26•22 years ago
In re: http://bugzilla.mozilla.org/show_bug.cgi?id=131166#c24:
"But you were wrong with that assumption. What you write is plaintext, and bugzilla sends it out as bugmail in plaintext."
Bugmail is irrelevant. Bugmail is a CC, not primary. The bugzilla website is the repository of this dialog, and the behavior there is relevant as primary. You can't reply to bugmail to update the bug or add comments.
"The fact that Bugzilla's HTML interface recognizes the URL is irrelevant. . ."
Anything but. It is very relevant. It automatically created the desired link for display in a web page (HTML, not plain text).
". . . .(currently, you don't, but that's definitely planned)."
If so, it needs a new name, as it would not be plain text.
"No, but I can copy it from a plaintext document and paste it in a browser urlbar (which is exactly the reason why they are there). So what?"
You can also right click the link when the HTML page is displayed by the web browser, and paste that. So what? If you want HTML, use HTML. That way, you don't need to copy and paste, and people wanting the use of plain text, not using a browser to display it or otherwise use it, do not have to stumble over it, or edit it away to get what they wanted in the first place.
"Of course, the reader is interested in the linked document only. Just that there is no other way in plaintext - other than including the URL - to get there. If you remove the URL, you also removed "the ability [...] to reach the described destination"."
That ability is not expected from PLAIN text. That is why there is HTML. You can select HTML in your file save and get that ability when that is what you require. Removing HTML markup is not dataloss in a plain text save. If you need the non-explicit (HTML markup), save as HTML.
Comment 27•22 years ago
From ben.bucksch@beonex.com
>Oh, I understand you very well, I just disagree with your definitions. If you'd
>say that you would like to have *a* mode (in *addition* to the current one),
>where no URLs are inserted, I could subscribe to that (not that I'd implement
>it). Your claim about what plaintext is and why the current behaviour is wrong
>and a bug is what I oppose to.
It is our opinion that Mozilla broke a feature that was in use and accepted in Netscape 4.x, i.e. the ability to save a URL as plain text without any HTML tags, partial or not. You can fix your broken implementation of File->Save As->Text or not. If you decide not to fix it, then Mozilla is of less use to us.
>This bug is not so overwhelmed with discussion that I'd suggest you file a new
>bug, if you want to suggest a new mode in addition to the current one.
You can keep your broken text save mode.
Comment 28•22 years ago
> My definition of plain text is the tradition one
This is exactly why I dislike this bug. Because you are wrong with that claim. URLs have been in plaintext since URLs exist. Do a search on Usenet, in posts and FAQs (both written in plaintext). Plus your very own example.
> Mozilla leaves markup in supposed plain text files.
Is *this* plaintext? Oh, save me! It contains markup!
> That means Mozilla is broken
I think your argumentation is broken.
> There is no need for me to file bug to ask for another save mode
Fine, you want no compromise, you get none.
> Bugmail is a CC, not primary.
Another definition of yours?
> It automatically created the the desired link
> for display in a web page (HTML, not plain text).
Fine, and Mozilla will eventually display the same links, if you load the saved plaintext file. Mind you, even a |cat file.txt| "linkifies" the URLs in my terminal.
> > . . . .(currently, you don't, but that's definitely planned).
>
> If so, it needs a new name, as it would not be plain text.
I guess you think that email software that links URLs can't display plaintext? Then, there's not much plaintext-capable software left (in your definition, of course).
> If you want HTML, use HTML.
Did I say that? I just want all the content, as plaintext.
> the ability save a URL as plan text without any html tags, partial or not
ARG! This <my://url> is *not* an HTML tag. Not everything in angle brackets is an HTML tag or even a tag.
This is getting more and more absurd. Actually closing WONTFIX (forgot it the last time). Again, see bug 46990.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → WONTFIX
Reporter | ||
Comment 30•22 years ago
This bug appears to be assigned to tmutreja@netscape.com. Maybe BenB won't fix it, but that doesn't mean it shouldn't be fixed. Content is text the browser displays. Unless a URL is displayed on the page (explicit), it is not content. Sure, there have long been explicit URLs in FAQs and on Usenet. That they ever have been explicitly used is all the more reason not to assume them when the page author has not made them explicit. If you think my argument is broken it is only because you are doing as M$ and redefining words with plain, long-established meaning. res ipsa loquitur. You cannot update or provide additional comments via bugmail, only via the web form. Bugmail is secondary. What happens in email display has no material relevance to converting a web page into plain text. What normally happens in email is that the sender has supplied an explicit URL as text, and the email reader is merely adding the ability to use that text as a link without pasting it into a browser. Yes, you said you wanted a URL that was not displayed by the browser. That means you want something from the underlying HTML code, and therefore you should save as HTML. Bug 46990 is about HTML. Save as text is supposed to strip HTML. I don't see any connection between that bug and this. Please see bug 135220.
Status: VERIFIED → REOPENED
Resolution: WONTFIX → ---
Comment 31•22 years ago
I don't understand *any* of the recent arguments in this bug.
* A URI (or URL if you prefer) is NOT HTML nor HTML markup.
* A URI is a string (or text if you prefer) which someone types.
Saving a file as "text" is a conversion process. For Mozilla, converting html to text includes as much information as possible so very little information is lost. It is easier for a conversion process to include as much information as possible and let others strip it down rather than not put it in and make it hard for users to get the information they need. For example, I consider it a bug if my hrule is not converted to a line of dashes. (I don't recall if this is in bugzilla or if it works as I expect.)
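The "let others strip it down" step could look roughly like the following Python sketch, which post-processes a file saved in the current mode. The function name and the pattern are assumptions made for this illustration: the pattern removes a space followed by any bracketed run that contains no whitespace, so it would also delete ordinary text written that way.

```python
import re

# Assumption: a link target appears as " <target>" with no whitespace inside.
LINK_TARGET = re.compile(r" <[^<>\s]+>")


def strip_link_targets(saved_text):
    """Remove angle-bracketed link targets from text saved with URLs included."""
    return LINK_TARGET.sub("", saved_text)


line = ("[You] should be familiar with Jennifer's comment "
        "<http://bugzilla.mozilla.org/show_bug.cgi?id=30888#c126>, which addresses "
        "email message display")
print(strip_link_targets(line))
```

Applied to the converter output quoted in comment 20, this yields the link-free sentence from the same comment.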
Status: REOPENED → RESOLVED
Closed: 22 years ago → 22 years ago
Resolution: --- → WONTFIX
Comment 32•22 years ago
> Bug 46990 is about HTML.
No, it is about saving HTML as plaintext, with the links in footnotes.
brade wrote:
> For Mozilla, converting html to text includes as much information as possible
> so very little information is lost.
Thank you!
> For example, I consider it a bug if my hrule is not converted to a line of
> dashes.
(That should work, IIRC.)
verified again.
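The footnote idea can be sketched as follows. This is only a guess at the general shape of such output, written in Python with the standard `html.parser` module; the class name, function name, and the "[n]" numbering style are assumptions, not the design tracked in bug 46990. Link text stays inline with a numeric marker, and the targets are listed after the body.

```python
from html.parser import HTMLParser


class FootnoteLinks(HTMLParser):
    """Collect visible text, replacing each link target with a [n] marker."""

    def __init__(self):
        super().__init__()
        self.parts = []       # body fragments, in document order
        self.links = []       # link targets, in order of appearance
        self.open_links = []  # href values of <a> elements not yet closed

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.open_links.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self.open_links:
            href = self.open_links.pop()
            if href:
                self.links.append(href)
                self.parts.append(f" [{len(self.links)}]")

    def handle_data(self, data):
        self.parts.append(data)


def to_text_with_footnotes(html):
    """Return the body text followed by a numbered list of link targets."""
    parser = FootnoteLinks()
    parser.feed(html)
    parser.close()
    body = "".join(parser.parts)
    notes = "\n".join(f"[{n}] {url}" for n, url in enumerate(parser.links, 1))
    return body + ("\n\n" + notes if notes else "")


html = ('[You] should be familiar with '
        '<a href="http://bugzilla.mozilla.org/show_bug.cgi?id=30888#c126">'
        "Jennifer's comment</a>, which addresses email message display")
print(to_text_with_footnotes(html))
```

Such output keeps the body close to what the browser displays while still recording where each link pointed.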
Status: RESOLVED → VERIFIED
Comment 33•22 years ago
*** Bug 136038 has been marked as a duplicate of this bug. ***
Comment 34•22 years ago
As the user who submitted bug 136038 [1] I'd like to say I do consider this implementation *is* broken. I understand Ben's view and can see why it might be useful to have the conversion in some situations, but the fact remains it didn't work that way before and there are legitimate reasons for wanting it to work the original way. My legitimate reason was wanting to get the hex dump from http://www.jwz.org/ so I could convert it to a binary file. What I got was not particularly useful (see bug 136038), and non-trivial to convert to the form I expected. Fortunately I still had NN4.75 hanging around and it did the job correctly... [1] How can anybody *find* dupes around here? I searched and still managed to miss this...
Comment 35•22 years ago
Our 'good friend' mrmazda filed a "continuation" bug: bug 135239.
Reporter | ||
Comment 36•20 years ago
*** Bug 243183 has been marked as a duplicate of this bug. ***
Comment 37•19 years ago
*** Bug 296045 has been marked as a duplicate of this bug. ***
Comment 38•19 years ago
*** Bug 288261 has been marked as a duplicate of this bug. ***