Closed Bug 131166 Opened 22 years ago Closed 22 years ago

[4xp] FileSaveAs .TXT Does Not Strip All Markup

Categories

(Core :: DOM: Serializers, defect)

defect
Not set
normal

VERIFIED WONTFIX

People

(Reporter: mrmazda, Assigned: t_mutreja)

Details

Attachments

(3 files)

2002031416 OS/2

URL is just one example. OS and Platform as per bug 70045 comment #17, which has
a different example URL.

Go to site using asp and/or javascript and choose to save page as text.

Actual behavior:
Some markup tags are stripped.

Expected behavior:
All markup tags are stripped.
Confirmed.  Over to DOM-to-text conversion. I see this on current tip linux.
Assignee: law → harishd
Status: UNCONFIRMED → NEW
Component: File Handling → DOM to Text Conversion
Ever confirmed: true
QA Contact: sairuh → sujay
--> Tanu.
Assignee: harishd → tmutreja
*** Bug 132943 has been marked as a duplicate of this bug. ***
My observation here is that markup tags are stripped out but the value 
of "href" attributes for *some* of the <A> tags is displayed as "<...>" which 
gives the feeling that some tags are still there. Mrmazda, would you please 
confirm this?

If the above is true then we need to decide what behavior we expect in such 
cases. To me it looks like bug#66035, which was related to images and we 
decided that in plain text mode, img src URL should not be pasted.
Status: NEW → ASSIGNED
I'm not sure what you are describing or asking. Here's what Mozilla saved as
text from http://us.imdb.com/Name?Bax,+Kylie.
Same http://us.imdb.com/Name?Bax,+Kylie saved by Netscape 4.x. <...> is absent,
though I wouldn't like to see Mozilla add the [image]'s all over.
On further study it appears <...> is left in place only when it contains one or
more slash characters.
Thanks for the attachments. It's exactly what I mentioned above. If you view the 
source for this page you will realize that things like "</Top/>" in your saved 
.txt file are not the tags but it's the value of href in an <a href=...> which 
we are displaying with a pair of '<' & '>'. 
Seems that at present we are retaining this information in the serialized text. 
I need to check whether this is the decided behavior or whether we should treat it as a bug. I 
feel this is a bug, and the fix would be very simple. 
In the absence of a complete URL, I can't imagine what value href attributes
would provide in a text file. This looks to me like a bug.
Since I am one of the ones who reported this behavior as a bug, let me tell you why
it is seen as a bug:

One of the uses of saving a URL as text-only is to allow data to be saved as
whitespace delimited ASCII (stock data from yahoo, for example).  All of the
partial html tags left hanging around the ASCII file that Mozilla creates garbage
up the file sufficiently that manual editing is necessary to clean it up.  This
is clearly not desired behavior.  I suggest that you use Netscape4.X as a
definition of design requirement for this feature.

--Doug
There were long discussions of this, and I believe this is what we eventually
agreed was best.  BenB was active in the discussions too (and also in the recent
discussions of whether img tags should output their src attributes in copy or in
save) -- cc'ing him.

The argument for preserving the hrefs is that you lose a lot of data otherwise,
since they're often an important part of the content (e.g. mail messages where
someone sends you a link to check out, but the link text just says "here").

We don't output the link in unformatted mode (e.g. copy/paste), only in
formatted mode (e.g. save as text).

I could easily see an argument for having two different "save as text" options,
e.g. "Save as text with formatting" (or something, need help from tech pubs
here), "Save as simple text", where the difference would be setting the
formatted flag.
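To make the formatted/unformatted distinction above concrete, here is a minimal sketch in Python. It is not Mozilla's serializer; the class name `TextSerializer`, the helper `html_to_text`, and the `formatted` parameter are hypothetical stand-ins for the "formatted flag" mentioned in the comment. It only shows the idea of emitting the href in angle brackets after the link text in one mode and dropping it in the other.

```python
from html.parser import HTMLParser


class TextSerializer(HTMLParser):
    """Hypothetical HTML-to-text converter; an illustration only."""

    def __init__(self, formatted=True):
        super().__init__()
        self.formatted = formatted  # assumed equivalent of the "formatted flag"
        self.parts = []             # collected output fragments
        self.href_stack = []        # hrefs of currently open <a> elements

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href_stack.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self.href_stack:
            href = self.href_stack.pop()
            # Formatted mode keeps the link target, delimited by <>.
            if self.formatted and href:
                self.parts.append(f" <{href}>")

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)


def html_to_text(html, formatted=True):
    parser = TextSerializer(formatted)
    parser.feed(html)
    parser.close()
    return parser.text()


sample = 'See <a href="/Top/">here</a> for details.'
print(html_to_text(sample))                   # See here </Top/> for details.
print(html_to_text(sample, formatted=False))  # See here for details.
```

The first output line resembles what the reporter saw in the saved .txt file; the second is what a "simple text" mode might produce under these assumptions.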
Discussions where? This what? Who is BenB? Check the description. This bug is
about saving files from the browser, not from email. Bug 66035 doesn't apply to
this bug. Text means no formatting.

UgaddaBkidding if you think </tiger_redirect?FT_LIC&/Licensing/> (from first
attachment) in a "text" file meets an ordinary user's expectations or serves any
useful purpose. It is not a complete URL, not that even a complete URL would
meet a user's expectations in any way. In display of an HTML file, that
gibberish is hidden from the user unless made explicit outside of '<' & '>'. What
possible reason for it to magically appear in a text file could there be? If the
user wants HTML, she should save as HTML. TXT should mean a file that matches as
closely as possible the text that is actually displayed in the browser view pane.
Keywords: 4xp
> Discussions where? This what? Who is BenB? Check the description.

Phew. I am tempted to just close as WONTFIX.
I am BenB, I wrote part of the converter in question here.

> saving files from the browser, not from email. Bug 66035 doesn't apply to
> this bug.

Bug 66035 is also about the browser.

> </tiger_redirect?FT_LIC&/Licensing/> [...] is not a complete URL

Right, that is a bug, but a different one. Filed bug 134457.

> not that even a complete URL would meet a user's expectations in any way.

How are you to say that? I do expect to see URLs in the output, and consider
4.x's failure to do so a severe bug, which made the converter useless for me in
many cases. No need to copy bugs.

> In display of an HTML file, that gibberish is hidden from the user 

Yes, that's why HTML is superior to plaintext. But that doesn't mean that we
should drop that info in plaintext. People do add URLs to hand-written documents.

> TXT should mean a file that matches as closely as possible the text that is
> actually displayed in the browser view pane.

No, the output should preserve all important information in the document and
write it out the way a human writer would do.

> On further study it appears <...> is left in place only when it contains one
> or more slash characters.

No, but you probably see bug 122877.


I think that URLs are an essential part of a document. Sometimes (e.g. on
Telepolis <http://www.heise.de/tp/>), they are the only hint to further
information or sources that an author inserted. With many HTML documents, this
creates very readable and useful output. In fact, in some cases, you might not
even know, why some text is there, if you don't see that there was an URL.

Removing the URLs would leave no trails of the information in the original
document and thus be dataloss.


IMO, this is WONTFIX or a dup of bug 46990.
Keywords: 4xp
> I could easily see an argument for having two different "save as text"
> options, e.g. "Save as text with formatting" (or something, need help from tech
> pubs here), "Save as simple text", where the difference would be setting
> the formatted flag.

That would work for me. I could also see a special mode that skips (only) URLs.
I'm with mrmazda@atlantic.net on this.  If you want a mode that allows you to
save html tags (or some portion thereof), with plain text, fine.

But I don't need it.  When I want to save a file as plain text, I want PLAIN
TEXT. No html.  Period.

If you insist that saving text with partial html tags is essential, than please
make a mode that does it.  But don't bugger up the mode that is supposed to save
PLAIN TEXT.  Some of us users out here (DING! User Speaking Here!) need it.

--Doug
It *is* plain text, just like this comment and comment 13 with the URL is
plaintext. There are no HTML tags. The <> is the official and correct way to
insert URLs into plaintext, see RFC2396, Appendix E:

| there are many occasions when URI are included in plain text
[...]
| In practice, URI are delimited in a variety of ways, but usually
| within double-quotes "http://test.com/", angle brackets
| <http://test.com/>, or just using whitespace
[...]
| Using <> angle brackets around each URI is especially recommended
| as a delimiting style for URI that contain whitespace.
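As an illustration of why the angle-bracket convention quoted above is useful, the following Python sketch pulls absolute URIs back out of plain text. The pattern `ANGLE_URI` and the function `extract_uris` are hypothetical names, and dropping whitespace found inside the brackets is an assumption of this sketch. Only bracketed strings that begin with a scheme such as `http:` are matched, so relative references like `</Top/>` are not picked up.

```python
import re

# "<" + scheme + ":" + anything except angle brackets + ">"
ANGLE_URI = re.compile(r"<([A-Za-z][A-Za-z0-9+.-]*:[^<>]*)>")


def extract_uris(text):
    """Return the URIs delimited by <> in text, in order of appearance."""
    # Whitespace inside the brackets (e.g. from line wrapping) is removed.
    return ["".join(match.split()) for match in ANGLE_URI.findall(text)]


sample = "Telepolis <http://www.heise.de/tp/> and <http://test.com/\n  path>"
print(extract_uris(sample))
# ['http://www.heise.de/tp/', 'http://test.com/path']
```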
<http://www.heise.de/tp/english/inhalt/te/12163/1.html>

Imagine the "[Local Link]" etc. were not there - that's a glitch in the page
(there should be no alt text for images intended as visual clues).

You see that there are several embedded links, which are very much part of the
content. Not saving them would lose significant data, namely practically all
references. Imagine a scientific article without references...

I see that this doesn't work very nicely with overloaded commercial websites
which have link bars on the left and right of each page. That's why I proposed
the additional mode.
> I could easily see an argument for having two different "save as text"
> options, e.g. "Save as text with formatting" (or something, need help from 
> tech pubs here), "Save as simple text", where the difference would be setting
> the formatted flag.

I agree with this but "Save as text with formatting" may create a wider 
expectation. 

For me having the URL's in text is meaningful only if we change all of them to 
the absolute ones. Retaining relative URI's creates nothing but confusion.  
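A rough sketch of the "change all of them to the absolute ones" idea, using Python's standard `urllib.parse.urljoin`. The function name `format_link` is hypothetical, and using the page's own URL as the base is an assumption (a real converter would also have to honor any `<base>` element in the page).

```python
from urllib.parse import urljoin


def format_link(link_text, href, base_url):
    """Return link text followed by the absolute URL in angle brackets."""
    # urljoin resolves a relative href against the URL of the saved page;
    # an href that is already absolute is returned unchanged.
    return f"{link_text} <{urljoin(base_url, href)}>"


page = "http://us.imdb.com/Name?Bax,+Kylie"
print(format_link("Top", "/Top/", page))
# Top <http://us.imdb.com/Top/>
```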
>> not that even a complete URL would meet a user's expectations in any way.

> How are you to say that? I do expect to see URLs in the output, and consider
> 4.x's failure to do so a severe bug, which made the converter useless for me
> in many cases. No need to copy bugs.

Embedded material is just that, embedded. It is not something seen, so it is not
something expected from a save as text. If embedded material is essential to the
page you are saving you have the option to save as HTML. Those whose expectation
is to save no more than what is actually displayed on screen, under your
definition of plain text, would have no option to save only their expectation.

Text is readable words and numbers with intrinsic meaning. A URL has no
intrinsic meaning, and I defy anyone to pronounce one. A URL has value only as
an index for a browser to use to reach a web destination. Outside a HTML file
displayed by a browser, it is just so much gibberish.

>> In display of an HTML file, that gibberish is hidden from the user 

> Yes, that's why HTML is superior to plaintext. But that doesn't mean that we
> should drop that info in plaintext. People do add URLs to hand-written
> documents.

If HTML is superior for your purpose, then choose that format when you save, and
don't foul up plain text for others who expect to save only the text that the
browser is displaying. The browser is not displaying embedded URL's. When a
human writes out a URL by hand, he does not use invisible ink. When a human
writes out a URL by hand, it is an intentional act.

>> TXT should mean a file that matches as closely as possible the text that is
>> actually displayed in the browser view pane.

> No, the output should preserve all important information in the document and
> write it out the way a human writer would do.

If saving what is not displayed is that important to some, then text formatted
is not the appropriate save format for them. It is not dataloss to not save what
is not expected.

As its assignee, ben.bucksch@beonex.com should be familiar with bug 30888 and
http://bugzilla.mozilla.org/show_bug.cgi?id=30888#c126, which addresses email
message display, plain text vs. HTML. If including undisplayed material is so
important to some, then the right solution is to add a new hybrid save format,
much like "simple HTML" for displaying an email.
Keywords: 4xp
You defy yourself in your own post:
> Text is readable words and numbers with intrinsic meaning. A URL has no
> intrinsic meaning, and I defy anyone to pronounce one. A URL has value only
> as an index for a browser to use to reach a web destination. Outside a HTML
> file displayed by a browser, it is just so much gibberish.

But:

> As its assignee, ben.bucksch@beonex.com should be familiar with bug 30888
> and http://bugzilla.mozilla.org/show_bug.cgi?id=30888#c126, which addresses
> email message display

What the converter does is *exactly* to generate plaintext similar to the one
you just wrote yourself manually. And that's why I strongly think that this bug
is WONTFIX or even INVALID, at least the way it's formulated currently.

> When a human writes out a URL by hand, it is an intentional act.

And when they include a link in an HTML document, it isn't? In HTML, your
comment would probably have looked like

[You] should be familiar with <a
href="http://bugzilla.mozilla.org/show_bug.cgi?id=30888#c126">Jennifer's
comment</a>, which addresses email message display

(At least, that's how I would have written it.) With this bug being "fixed", it
would look like

[You] should be familiar with Jennifer's comment, which addresses email message
display

which is *useless*. Currently, the converter outputs:

[You] should be familiar with Jennifer's comment
<http://bugzilla.mozilla.org/show_bug.cgi?id=30888#c126>, which addresses email
message display

which looks a lot like your hand-written comment (I'd argue, it's even better).
I'm curious.  What part of the following user requirement is not understood by
the development team?

Description: File->Save As->Text 

Definition:
The ability to save a URL as plain text, stripping out all html, retaining to
the extent possible all text and tabular data

If there is another requirement to save a URL retaining some portion of the
html, fine.  But that is another requirement.
BenB:
"What the converter does is *exactly* to generate plaintext similar to the one
you just wrote yourself manually."

There's a big difference. I was filling in a form in which I knew the form
result would be HTML output that made that URL a clickable link in an HTML page.
The URL itself had no intrinsic meaning. My intent was for it to be there
exclusively as a link, a shortcut to click on so that you could conveniently get
to the other bug with a mouse click. If I actually knew how (and it was not more
work to do) to make the form cause the display of some other text and hide the
actual URL, that is what I would have done.

BenB:
"And when they include a link in an HTML document, it isn't?"

Not normally. Sometimes an HTML page author will put the URL in both as href
content and as ordinary text, but the general rule is that the reader has no
interest in the actual URL, only the ability to click the underlined blue
word(s) to reach the described destination, so the author puts appropriate
descriptive words in between the markup tags and leaves the URL itself hidden.

BTW, as of 2002040216 OS/2, files saved as text from an asp are still littered
with useless relative URL's.
Summary: [4xp] FileSaveAs .TXT Does Not Strip All Markup Tags → [4xp] FileSaveAs .TXT Does Not Strip All Markup
Oh, I understand you very well, I just disagree with your definitions. If you'd
say that you would like to have *a* mode (in *addition* to the current one),
where no URLs are inserted, I could subscribe to that (not that I'd implement
it). Your claim about what plaintext is and why the current behaviour is wrong
and a bug is what I oppose to.

This bug is not so overwhelmed with discussion that I'd suggest you file a new
bug, if you want to suggest a new mode in addition to the current one.
Sorry, I missed the latest comment.

> I was filling in a form in which I knew the form result would be HTML output
> that made that URL a clickable link in an HTML page.

But you were wrong with that assumption. What you write is plaintext, and
bugzilla sends it out as bugmail in plaintext.
The fact that Bugzilla's HTML interface recognizes the URL is irrelevant,
otherwise I could argue that you, when you view the saved plaintext file (result
from SaveAs .txt) in Mozilla, have clickable URLs (currently, you don't, but
that's definitely planned).

> The URL itself had no intrinsic meaning.

No, but I can copy it from a plaintext document and paste it in a browser urlbar
(which is exactly the reason why they are there). So what?

> the reader has no interest in the actual URL, only the ability to click the
> underlined blue word(s) to reach the described destination

Of course, the reader is interested in the linked document only. Just that there
is no other way in plaintext - other than including the URL - to get there. If
you remove the URL, you also removed "the ability [...] to reach the described
destination".

We can go on this way forever. I think that omitting URLs is dataloss, I am thus
closing this as WONTFIX.
Possible actions now:
- tmutreja or some other developer disagrees and reopens.
- You hope for bug 46990 to be fixed. (My favourite)
- You file a new bug about *adding* a mode that does what you want. I'd still
think that that bug would be a dup of bug 46990, but if someone wants to
implement it, then fine with me, since no harm is done (other than a more
complicated UI).
In re http://bugzilla.mozilla.org/show_bug.cgi?id=131166#c23:

My definition of plain text is the traditional one, that is, 100% lacking of HTML
markup, unlike that of BenB, who is trying to do a M$ and redefine a word with
longstanding meaning, leaving the original meaning lacking of a word to
ascribe to it.

Currently, Mozilla leaves markup in supposed plain text files. That means
Mozilla is broken, leaving it deficient with respect to Netscape 4 behavior,
which leaves zero markup in plain text file saves.

There is no need for me to file bug to ask for another save mode, since the
definition for plain plain text save is already provided, and I am perfectly
capable of saving in HTML mode if I need the markup.
In re: http://bugzilla.mozilla.org/show_bug.cgi?id=131166#c24:

"But you were wrong with that assumption. What you write is plaintext, and
bugzilla sends it out as bugmail in plaintext."

Bugmail is irrelevant. Bugmail is a CC, not primary. The bugzilla website is the
repository of this dialog, and the behavior there is relevant as primary. You
can't reply to bugmail to update the bug or add comments.

"The fact that Bugzilla's HTML interface recognizes the URL is irrelevant. . ."

Anything but. It is very relevant. It automatically created the desired link
for display in a web page (HTML, not plain text).

". . . .(currently, you don't, but
that's definitely planned)."

If so, it needs a new name, as it would not be plain text.

"No, but I can copy it from a plaintext document and paste it in a browser
urlbar (which is exactly the reason why they are there). So what?"

You can also right click the link when the HTML page is displayed by the web
browser, and paste that. So what? If you want HTML, use HTML. That way, you
don't need to copy and paste, and people wanting the use of plain text, not
using a browser to display it or otherwise use it, do not have to stumble over
it, or edit it away to get what they wanted in the first place.

"Of course, the reader is interested in the linked document only. Just that there
is no other way in plaintext - other than including the URL - to get there. If
you remove the URL, you also removed "the ability [...] to reach the described
destination"."

That ability is not expected from PLAIN text. That is why there is HTML. You can
select HTML in your file save and get that ability when that is what you require.

Removing HTML markup is not dataloss in a plain text save. If you need
non-explicit (HTML markup), save as HTML.
From ben.bucksch@beonex.com 
>Oh, I understand you very well, I just disagree with your definitions. If you'd
>say that you would like to have *a* mode (in *addition* to the current one),
>where no URLs are inserted, I could subscribe to that (not that I'd implement
>it). Your claim about what plaintext is and why the current behaviour is wrong
>and a bug is what I oppose to.

It is our opinion that Mozilla broke a feature that was in use and accepted in
Netscape4.x, i.e. the ability to save a URL as plain text without any html tags,
partial or not.  You can fix your broken implementation of File->Save As->Text or
not. If you decide not to fix it, then Mozilla is of less use to us.

>This bug is not so overwhelmed with discussion that I'd suggest you file a new
>bug, if you want to suggest a new mode in addition to the current one.

You can keep your broken text save mode.  
> My definition of plain text is the traditional one

This is exactly why I dislike this bug. Because you are wrong with that claim.
URLs have been in plaintext since URLs exist. Do a search on Usenet, in posts
and FAQs (both written in plaintext). Plus your very own example.

> Mozilla leaves markup in supposed plain text files.

Is *this* plaintext? Oh, save me! It contains markup!

> That means Mozilla is broken

I think your argumentation is broken.

> There is no need for me to file bug to ask for another save mode

Fine, you want no compromise, you get none.

> Bugmail is a CC, not primary.

Another definition of yours?

> It automatically created the desired link
> for display in a web page (HTML, not plain text).

Fine, and Mozilla will eventually display the same links, if you load the saved
plaintext file. Mind you, even a |cat file.txt| "linkifies" the URLs in my terminal.

> > . . . .(currently, you don't, but that's definitely planned).
>
> If so, it needs a new name, as it would not be plain text.

I guess you think that email software that links URLs can't display plaintext?
Then, there's not much plaintext-capable software left (in your definition, of
course).

> If you want HTML, use HTML.

Did I say that? I just want all the content, as plaintext.

> the ability to save a URL as plain text without any html tags, partial or not

ARG! This <my://url> is *not* an HTML tag. Not everything in angle brackets is
an HTML tag or even a tag.


This is getting more and more absurd. Actually closing WONTFIX (forgot it the
last time). Again, see bug 46990.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → WONTFIX
verified
Status: RESOLVED → VERIFIED
This bug appears to be assigned to tmutreja@netscape.com. Maybe BenB won't fix
it, but that doesn't mean it shouldn't be fixed.

Content is text the browser displays. Unless a URL is displayed on the page
(explicit), it is not content. Sure, there have long been explicit URL's in
FAQ's and on Usenet. That they ever have been explicitly used is all the more
reason not to assume them when the page author has not made them explicit.

If you think my argument is broken it is only because you are doing as M$ and
redefining words with plain, long-established meaning.

res ipsa loquitur. You cannot update or provide additional comments via bugmail,
only via the web form. Bugmail is secondary.

What happens in email display has no material relevance to converting a web page
into plain text. What normally happens in email is that the sender has supplied
an explicit URL as text, and the email reader is merely adding the ability to
use that text as a link without pasting it into a browser.

Yes, you said you wanted a URL that was not displayed by the browser. That means
you want something from the underlying HTML code, and therefore you should save
as HTML.

Bug 46990 is about HTML. Save as text is supposed to strip HTML. I don't see any
connection between that bug and this.

Please see bug 135220.
Status: VERIFIED → REOPENED
Resolution: WONTFIX → ---
I don't understand *any* of the recent arguments in this bug.
  * A URI (or URL if you prefer) is NOT HTML nor HTML markup.
  * A URI is a string (or text if you prefer) which someone types.

Saving a file as "text" is a conversion process.  For Mozilla, converting html
to text includes as much information as possible so very little information is
lost.  It is easier for a conversion process to include as much information as
possible and let others strip it down rather than not put it in and make it hard
for users to get the information they need.  

For example, I consider it a bug if my hrule is not converted to a line of
dashes.  (I don't recall if this is in bugzilla or if it works as I expect.)
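For users who do want the links gone, the "let others strip it down" step could look roughly like the Python sketch below. This is a heuristic under stated assumptions, not anything Mozilla ships: the pattern `BRACKETED_URL` and the function `strip_bracketed_urls` are hypothetical, and the rule (an angle-bracketed token with no whitespace that contains a slash or a colon) is only a guess at what distinguishes inserted URLs from other uses of `<` and `>`.

```python
import re

# Optional leading spaces, then <token> where token has no whitespace
# and contains at least one "/" or ":".
BRACKETED_URL = re.compile(r"[ \t]*<[^<>\s]*[/:][^<>\s]*>")


def strip_bracketed_urls(text):
    """Remove angle-bracketed URL-like tokens from saved plain text."""
    return BRACKETED_URL.sub("", text)


saved = "Licensing </tiger_redirect?FT_LIC&/Licensing/> terms"
print(strip_bracketed_urls(saved))  # Licensing terms
```

Comparisons such as `x < y and y > z` are left alone because the text between the brackets contains whitespace.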
Status: REOPENED → RESOLVED
Closed: 22 years ago
Resolution: --- → WONTFIX
> Bug 46990 is about HTML.

No, it is about saving HTML as plaintext, with the links in footnotes.

brade wrote:
> For Mozilla, converting html to text includes as much information as possible
> so very little information is lost.

Thank you!

> For example, I consider it a bug if my hrule is not converted to a line of
> dashes.

(That should work, IIRC.)

verified again.
Status: RESOLVED → VERIFIED
*** Bug 136038 has been marked as a duplicate of this bug. ***
As the user who submitted bug 136038 [1] I'd like to say I do consider this
implementation *is* broken.

I understand Ben's view and can see why it might be useful to have the
conversion in some situations, but the fact remains it didn't work that way
before and there are legitimate reasons for wanting it to work the original way.

My legitimate reason was wanting to get the hex dump from http://www.jwz.org/ so
I could convert it to a binary file. What I got was not particularly useful (see
bug 136038), and non-trivial to convert to the form I expected.

Fortunately I still had NN4.75 hanging around and it did the job correctly...

[1] How can anybody *find* dupes around here? I searched and still managed to
miss this...
Our 'good friend' MzMazda filed a "continuation" bug: bug 135239.
*** Bug 243183 has been marked as a duplicate of this bug. ***
*** Bug 296045 has been marked as a duplicate of this bug. ***
*** Bug 288261 has been marked as a duplicate of this bug. ***