Closed Bug 30888 Opened 25 years ago Closed 23 years ago

Want "View as plaintext" (version/conversion)

Categories

(MailNews Core :: MIME, enhancement, P3)

enhancement

Tracking

(Not tracked)

VERIFIED FIXED
mozilla1.0

People

(Reporter: akkzilla, Assigned: BenB)

References

(Blocks 1 open bug)

Details

(Whiteboard: Checked into trunk. Should go in 1.0?)

Attachments

(7 files, 20 obsolete files)

4.60 KB, message/rfc822
Details
28.69 KB, image/gif
Details
5.25 KB, text/plain
Details
45.56 KB, patch
BenB
: review+
Details | Diff | Splinter Review
65.07 KB, patch
Bienvenu
: superreview+
Details | Diff | Splinter Review
117.68 KB, patch
Details | Diff | Splinter Review
4.20 KB, text/plain
Details
I'm forever getting messages from people who use Outlook and put everything in
a 1mm yellow font on a purple-and-pink image background.

Having something in the View menu that said "View as Plaintext" and ran the
message through the plaintext converter (or, if the message is already
multipart/alternative, just take the text part) would totally make my day, and
would be a huge incentive to use mozilla mail over 4.x.
I'd also like to figure out what broken mechanism MS is using to make their
messages look bad in communicator...
Oh, that's easy: here are some samples from two recent messages:
<FONT size=3D2>
<FONT face=3DArial size=3D2>
<FONT color=3D#00ff00 face=3D"Comic Sans MS" size=3D2>

The common element is that a huge proportion of OE users seem to set the font
size to "size=3D2" (which apparently is a quoted-printable encoding for
"size=2"), and many of them also set the face to something that looks even worse
at size 2 than my default fonts.

It's nothing nefarious to kill Netscape specifically (in fact, these messages
usually look fine on Netscape on Windows, modulo poor choices of colors and
backgrounds); it's just that something in the UI seems to encourage them to use
font sizes and faces that many unix and mac users can't read easily.
-> helpwanted
Assignee: sspitzer → nobody
Keywords: helpwanted
Bug #18427 "Non-presentational HTML mail" is related.
BTW: I use view source, if I want to see the plain text part of an
multipart/alternative msg.
BTW, if a message has been sent in text and html, I find that on Communicator
4.x "View Attachments As Links" (which IIRC has not been implemented in Mozilla
yet) replaces the message with a link to the text alternative.

Re "broken MS mechanism", I find that some versions of OE won't accept HTML
messages without a <BODY> tag which 4.x removes if there are no attributes.
moving way out there...nobody is a busy person.
Target Milestone: --- → Future
If we have a HTML+plaintext multipart msg, should we
- display the plaintext part
- convert the HTML part to plaintext and display that
The formaer would be The Right Things, since the sender can theoretically
convert best, but pratically, Mozilla might be better, e.g. not lose links or
structs (if I like).
> If we have a HTML+plaintext multipart msg, should we
> - display the plaintext part

Yes.

> - convert the HTML part to plaintext and display that
> The formaer would be The Right Things,
> since the sender can theoretically
> convert best, but pratically, Mozilla might be better,
>  e.g. not lose links or structs (if I like).

You have a point. How about two buttons/tab/options: 'View plaintext version' 
and 'Convert to plaintext'?
My vote: Display the plaintext part, if it's multipart/alternative, but display
it the same way we display other plaintext messages, i.e. run it through the
converter to add links and so forth.

If we take the html part and convert it to plaintext, then we run the risk that
there might be something else in their html (MS extensions, complex tables,
etc.) which we can't handle.  A user who asks for the plaintext wants something
fairly simple.
| My vote: Display the plaintext part, if it's multipart/
| alternative, but display
| it the same way we display other plaintext messages,
| i.e. run it through the converter to add links and so forth.

The problem with this approach, is that some links will be lost in the 
plaintext version *before* Mozilla converts it to HTML. E.g., Outlook Express 
removes all links in plaintext version (the link text is shown -- the actual 
link/URI is not).
> Outlook Express removes all links in plaintext version 

As does 4.x. (Interestingly, Exchange Server seems to put URLs in, exactly the
same way Mozilla does now.
Again: Isn't bug 18427 what you really want? Doing HTML->TXT->HTML just to loose
some presentation formatting doesn't make too much sense to me.
Sometimes -- but there are cases where I just want the text, with no html crap
at all.  When I get messages like this now, sometimes I read them by replying in
plaintext.  Sure, I lose formatting and emphasis and linkification and
everything else, but I don't care -- just show me the text and let me get on
with life.
Are the HTML options
* Always use my fonts, overriding document
* Always use my colors
* Always use my style sheet
applying here?

Wouldn't they (combined with text zoom) solve this problem?
Do we have "always use my style sheet"?  I want it!  That would help immensely
with readability of lots of web pages (including our own documentation pages).
I can't find that pref in the pref dialog yet, though -- if we have it as a
hidden pref, what's it called?

"Always use my fonts" helps font face, but it doesn't help font size, the #1
most common problem with html readability.  And I'm not willing to turn off
colors on all pages just so that I can read mail messages sent by Outlook users
(though if I could turn off colors just for mail, I'd consider doing that; 4.x
didn't offer that as an option and we don't seem to either).

There are also issues involving incompatible html.  Lots of mailers send bad
html; even if our layout engine is perfect and doesn't have the sorts of
problems that 4.x had (e.g. headers of Eudora html messages drawing on top of
each other because 4.x didn't understand the div tag), the mozilla parser is
known to be intolerant of bad html, and sometimes just drops those parts it
doesn't like.  Dropping into plaintext avoids this sort of problem.
The point of bug #18427 is just to use your style sheet.
Anyone got any more comments as to why this bug isn't really a dupe?
> the mozilla parser is
> known to be intolerant of bad html, and sometimes just drops those parts it
> doesn't like.  Dropping into plaintext avoids this sort of problem.

Why? The HTML->TXT conversion will use the "mozilla parser".

> Anyone got any more comments as to why this bug isn't really a dupe?

We should consider security. There are security/privacy problems coming up with
HTML mail from time to time (e.g. the "webbug in mail" and "email wiretapping"
problems). Having the rendering engine never see the original HTML will surely
help (though not garantee protection).

But then again, it might be good to have this implemented on another level (e.g.
extract only those parts of the msg that we are known to be interested in and
recreate a new HTML document from that info - this will save the (useful part of
the) richer info HTML provides. (Does anybody want to file a bug about that?)
> There are security/privacy problems coming up with HTML mail from time to time

But we already have a pref to turn mailnews JS off!
OK, but the "webbug" problem has noting to do with JS.
Not just have a "View as plain text" option to textify HTML emails, but have an
Edit | Preferences option that all email with f*#$% HTML is displayed as plain
text (source) - never allow images, webbugs, invisible forms... to do their
nefarious deeds.
that's another bug, go file it.. dont' polute this one.
*** Bug 69529 has been marked as a duplicate of this bug. ***
Alecf? I consider a preference (not just a temporary(?) menu option) very much
part of this bug.
Ben? Are you just trying to pick fights?
No, surely not. I always thought that this bug was a user preference. Some
people just dislike HTML and never want to see it, not even a reduced version
(bug 18427 or similar). For them, a pref is the logical thing to do.
If you strongly oppose it, we can surely separate both bugs and make the bug for
the pref dependant on this one.
I am one of those people.  If I don't have to see another HTML mail for the rest
of my life I'll be happy as a clam.  Seriously.

So a preference would be *nice*.
Bug 18427 bug 69529 bug 72920 & bug 102511, if not others I've not yet found, at
least in part appear to cover the same ground as here. I would be ecstatic to
never see HTML formatted email ever again.  Now if Mozilla had some UI to easily
build me the non-programmer a stylesheet that would prevent the problems HTML
imposes I suppose that would be OK, but I can't see how implementing that would
be anything less than grossly more difficult than the prefs proposal by
fdjsouthey@uwaterloo.ca on 2001-05-01 in bug 69529. 
modifying summary to more clearly distinguish between this bug and bug 69529.
The current summaries are almost reversed.
Summary: [FEATURE] Want "View as plaintext" → [FEATURE] Want "View as plaintext" (strip HTML formatting)
*** Bug 72920 has been marked as a duplicate of this bug. ***
More SUMMARY: This bug is not about stripping, but using the plaintext
alternative. If that is not available (rare case), we should *convert* to plaintext.
Summary: [FEATURE] Want "View as plaintext" (strip HTML formatting) → [FEATURE] Want "View as plaintext" (version/conversion)
even when a plaintext alternative *is* available, there should still be an
option to view the HTML as plaintext

too many messages are sent out with a plaintext alternative effectively
equivilant to "this message was sent in HTML format" which is a total abuse of
the alternative, but that doesn't change the fact that it happens.

-matt
man enough advocacy already! I think everyone agrees we should have this. You
just need someone to implement it. Stop trying to convince "nobody@mozilla.org"
to fix this - nobody is not listening.
I received a private email asking which mailers do this (send bogus text
alternatives) and pointing out that such behavior is broken.

I agree that this behavior is broken.

I don't think any end-user MUAs do this (at least I haven't noticed any), but
I've seen this from a fair number of comercial concerns sending out their latest
newsletter or bulletin or promotion or whatever. I'm sure the mail is being sent
by a program which accepts a web page as a source, and makes no effort to come
up with a *real* text alternative. These programs are probably hand-rolled
things written at the behest of the company's marketing dept.

-matt
If nobody is not listening, does that mean everyone is listening?  (Sorry.)
Taking bug. See <http://bugzilla.mozilla.org/show_bug.cgi?id=69529#c32> for
current state.
Assignee: nobody → ben.bucksch
Component: Mail Window Front End → MIME
Keywords: helpwanted
OS: Linux → All
Hardware: PC → All
Summary: [FEATURE] Want "View as plaintext" (version/conversion) → Want "View as plaintext" (version/conversion)
Target Milestone: Future → mozilla0.9.9
Attached patch Suggested fix, Version 3 (obsolete) — Splinter Review
I have this implemented, the backend at least. Will attach patch


Documentation:
There are 2 prefs:

key: mailnews.display.prefer_plaintext
type: bool
default: false
descr: If the mail is multipart/alternative, ignore HTML ("text/html") parts,
  which (usually) causes the plaintext part to display.

key: mailnews.display.html_as
type: Int
values: 0, 1 or 2
default: 0
descr: Determines, how HTML parts will be displayed, *if* they are displayed
  (prefer_plaintext takes precedence).
  Values:
  0: Render sender's HTML unchanged.
  1: HTML->TXT->HTML - Convert HTML to plaintext and that to HTML again and
     display the result.
     This also happens to disable the inline-display of HTML attachments.
  2. Show HTML source (tags) in message pane (bug 69529)

Ceveats:
- When you SaveAs|File|.txt, you will get a result similar to what you see,
following the prefs. Expection is when html_as=2 is active, in which case you
get a complete mess when saving as HTML. I don't bother too much, because users
of that option are not too likely to save as HTML. Other cases work as
expected.
- I cannot test Print, because that crashes (even for the old/conventional
settings).
- No clue about I18N.

Reviewer:
Please check my usage of |nsString*|.
Consider buffer overflow exploits, because we are processing external source.
Only tested on Unix. Usual disclaimers.

QA:
I will attach 3 test messages, which represent the *minimal* test coverage.
I encourage everyone to firetest this and try to break it. Try malformed
|Content-Type|s and other such attacks.
Attached file Testcases
These are my testcases.
- multipart/alternative
  (with different text in each part, for better illustration)
- HTML-only
- Plaintext with inline HTML attachment
They are in mbox mail folder format, which is used by Mozilla. The first one
should also be loadable directly in Navigator by clicking on the attachment.

Some rough guidelines:
If you see "*bold*" (with stars), the memssage has gone through the HTML->TXT
converter, at the sender or recipient side.
If you see a bold "bold" without stars around it, the sender's HTML has been
rendered.
J-F, can you review, please?
Status: NEW → ASSIGNED
Keywords: patch, review
Whiteboard: Backend done, waiting for review
Yes I will...

This feature is kind of similar to another I receive a patch for but I did
refused because it wasn't working very well all the time. Let me find the bug
number...
Here is the UI.

It implements a View|Body submenu, located directly under "Headers". It has 2
radio items, "Allow Sender's HTML" and "As Plaintext"*. The former stands for
prefer_plaintext = false and html_as = 0, the latter for prefer_plaintext =
true and html_as = 1. Any other combination will cause none of the 2 options to
be selected, but you can still select them (overriding your custom prefs).

If you select one of the options, the prefs will be sat (premanently) and the
message will be reloaded with the new settings.

Thanks to the good ground-work of others, it works the same in the 3pane and
the standalone message window. Was trivial to implement. :-)

*Other options, e.g. for bug 28327, could be added here as well.
> This also happens to disable the inline-display of HTML attachments.

No, they are (usually) displayed following the html_as pref.

> Expection is when html_as=2 is active...

s/Expection is when html_as=2/However, when html_as=1/
Blocks: 108004
Blocks: 31907
Please simplify "Allow Sender's HTML" to "As HTML" - to make it consistent with
"As Plaintext".
Häkan, nope, see bug 108153. That option will be aded in the same menu.
Attachment #67191 - Attachment is obsolete: true
Attachment #67238 - Attachment is obsolete: true
For bug 108153, I had to touch libmime, incl. mimei.cpp, too, so I included it
in this patch. The changes are easy to tell apart, however: For that bug, I added
- mimethsa.(cpp|h) (which are very much like the mimethpl.* created for
  this bug, BTW)
- a few obvious lines to mimei.cpp
- a new pref + a new pref option
- a new menu item
For further discussion of these changes, see bug 108153.
That code depends on another part of bug 108153 (namely the sanitizer sink in
content/). Should that code take longer to review/checkin, I will just comment
out the menu item for that.

Please review all I attached, though, incl. the part for bug 108153. I will
attach the current version of the sink there.

I also made a few tweaks for this bug. Mainly, I fixed 2 I18N bugs. Japanese
bodies (HTML and plaintext) and txt attachments work fine now. Japanese HTML
attachments do not, however (charset problem, I guess). Hints appreciated. I
don't consider this a blocker, though.

I will still need to work on bug 126082.

In case, you already started to review, I will do a diff of the diffs on request.
I haven't started yet the review...
ducarroz, I'd like to get this in for 1.0, and I don't know, if I'll get it in
after the 20th. SO, it would be nice, if you could review today or tomorrow, if
possible.
I am looking at this patch... sorry for the delay!
My latest version. The rest of the patches should still be valid.

I found a problem: If the content type is (exactly) "text/html" (without
charset), nothing is shown. This is because the call sequence of the begin,
line, end, finalize functions are messed up in that case (see the comment in
the source), i.e. the real bug is outside of the new class, but the new classes
exposes it. I have no fix it that yet, and help to fix the true cause is
appreciated.
Attachment #70298 - Attachment is obsolete: true
starting the review again...
Comment on attachment 71589 [details] [diff] [review]
Same patch (version 5 MIME), but as diff -w


1) if you need to add
+		  content \
+		  htmlparser \
+		  layout \
to the unix makefile, you must add then also to the windows make file.


2) Need to add new files (only c++) to the mac project

3)
+
+  (There is another, somewhat more current, diagram in mimei.cpp, but
+  this one already shows the most important classes.)
+
Would be cleaner to complete/update the diagram in mimeei.h instead of adding
that comment.
I am pretty sure that somebody that read the diagram will miss your comment!

4) in mime_is_allowed_class, when allow_classes == avoid_bugs, you are blocking
AppleDouble attachment!
(MimeMultipartAppleDouble)! That would prevent the user to decode an
appledouble attachment! while you should block
it's content if it's an image or an html document instead.

5)
+	    (!attachments_inline
+	     && (
+		  clazz == (MimeObjectClass *)&mimeMessageClass
+		))
What are you trying to block: Inline attachment or inline message? this is
confusing!

6) do we realy need to have to mode of convertion:
mimeInlineTextHTMLAsPlaintextClass and
mimeInlineTextHTMLSanitizedClass?

7)
@@ -619,4 +646,5 @@
	   (clazz != (MimeObjectClass *)&mimeInlineTextPlainFlowedClass) &&
	   (clazz != (MimeObjectClass *)&mimeInlineTextHTMLClass) &&
+	   (clazz != (MimeObjectClass *)&mimeInlineTextHTMLSanitizedClass) &&
	   (clazz != (MimeObjectClass *)&mimeInlineTextRichtextClass) &&
	   (clazz != (MimeObjectClass *)&mimeInlineTextEnrichedClass) &&
should you test also for mimeInlineTextHTMLAsPlaintextClass?

8) instead of ifdefing mime_find_html_part_1, just remove it!

9)
+  // Create a parser
+  nsIParser* parser;
+  rv = nsComponentManager::CreateInstance(kParserCID, nsnull,
+					   kIParserIID,(void**)&parser);
use a nsCOMPtr<nsIParser> here else you are leaking the parser. 2 occurences.

10)Try to use do_CreateInstance instead of nsComponentManager::CreateInstance,
I see several occurences in mimemoz2.cpp

11) 
@@ -181,5 +181,5 @@
	 ) 
     {
-      if (cp)
+      if (cp) // this has been checked above already -- BenB
       {
cp is dynamically changed by every step of the if statment, it's possible that
the last part of the if set cp to null!
please remove this comment.

12) in mimeptfl.cpp, I don't think so it's a good idea to use an nsAutoString
as it's very likely that at least one of
the line of the message exceed 64 characters.  It's better to keep the actual
code which preset the buffer size to 100
characters.
Attachment #71589 - Flags: needs-work+
BTW, I haven't tested yet the patch and reviewed the new files
More comment:

13) Why does the new files aren't using an MPL licence? I am not sure you could
do that!

14) Mac Build error!
Error   : macro 'DEBUG' redefined
mimethsa.cpp line 25   #define DEBUG

15) Mac Build Error!
Error   : the file 'mozISanitizingHTMLSerializer.h' cannot be opened
mimemoz2.cpp line 99   #include "mozISanitizingHTMLSerializer.h"

I cannot find this file on lxr!! Where does it comes from?
> to the unix makefile, you must add then also to the windows make file.

Done.

> 2) Need to add new files (only c++) to the mac project

I can't  I have no Mac. (I don't have a Windows build env either, but the
Windows makesfiles qare similar enough to Unix to hack them.)

> Would be cleaner to complete/update the diagram in mimeei.h instead of adding
> that comment.

That would mean that we have 2 full lists - redundancy.

> allow_classes == avoid_bugs, you are blocking
> AppleDouble attachment!

I am blocking everything I don't know. I guess appledouble is common? I'll add it.

> > (!attachments_inline

> What are you trying to block: Inline attachment or inline message? this is
> confusing!

I am trying to block all attachments. Probably not good apprach. This mode is
not used in the UI. Should I remove it?

> do we realy need to have to mode of convertion:
> mimeInlineTextHTMLAsPlaintextClass and
> mimeInlineTextHTMLSanitizedClass?

?
These classes mainly implement this bug and bug 108153.
We need the former for this bug, if the sender sent only HTML (bit the user
wants only plaintext).

I think they are sufficiently different from the existing classes (for HTML and
plaintext) to warrent being their own ones.
It also makes the code cleaner, because I don't have to work with modes. Just
change the invoked class in mimei.cpp.

(line 619)
> should you test also for mimeInlineTextHTMLAsPlaintextClass?

This forces attachments to be inline. I am not sure, if users of the "As
Plaintext" mode want this for HTML attachments (which is what this would do).

> instead of ifdefing mime_find_html_part_1, just remove it!

OK. I didn't know what it does, and didn't care. (But it's not used, appearantly.)

> use a nsCOMPtr<nsIParser> here else you are leaking the parser. 2 occurences.

heh, quite a big thing to leak :). Thanks. Will fix.

> use do_CreateInstance instead of nsComponentManager::CreateInstance

Will fix.

> cp is dynamically changed by every step of the if statment, it's possible that
> the last part of the if set cp to null!

Isn't cp checked at the very end?

        ...
        (cp = PL_strncasestr(cp, "CHARSET=", length - (int)(cp - line))) 
        ) 
    {
      if (cp) // this has been checked above already -- BenB

> 12) in mimeptfl.cpp, I don't think so it's a good idea to use an nsAutoString
> as it's very likely that at least one of
> the line of the message exceed 64 characters.  It's better to keep the actual
> code which preset the buffer size to 100
> characters.

That's part of bug 126896 (unintentionally left in here), please add your
comments there.

> Why does the new files aren't using an MPL licence? I am not sure you could
> do that!

I am sick of the license problems (relicensing, the problems before it etc.
etc.). BSD is generally allowed in the Mozilla tree. I will check, if mixed
licenses (in one modules) are allowed.
(They are new files and practically all my code, so I'm not bound by the
copy&paste rules of MPL.)

If you have large objections, I may change to MPL/GPL/LGPL tripply license,
because it practically doesn't matter.

> Error   : macro 'DEBUG' redefined
> mimethsa.cpp line 25   #define DEBUG

Yeah, a few debug statements are left in in the latest patch. Will remove that
stuff before checkin.

> Error   : the file 'mozISanitizingHTMLSerializer.h' cannot be opened
> I cannot find this file on lxr!! Where does it comes from?

As I said, you need the first patch from bug 108153 for this to work.
(But i wasn't aware that it wouldn't even build.)
> do we realy need to have to mode of convertion:
> mimeInlineTextHTMLAsPlaintextClass and
> mimeInlineTextHTMLSanitizedClass?

Or do you mean that I could move these 2 into 1? I guess, that's possible, but
I'd have a lot of mode checks. I guess we would have more if blocks than common
code. But it's a choice.
*** Bug 128030 has been marked as a duplicate of this bug. ***
from mozilla.org/build:
Understanding the Macintosh Build System 
http://www.mozilla.org/build/mac-build-system.html
2) I will provide you the patch for the Mac project...

3) Either we remove on of the list or we keep both up-to-date. But having one
incomplete is not a good idea. Personnaly I would prefer the have both complete
as it use to be.

4)Apple double are always threated as attachment, therefore it safe to keep it.

5)>I am trying to block all attachments. Probably not good apprach. This mode is
  >not used in the UI. Should I remove it?
mimeMessageClass is for inline message/RFC822 and inline news message. You can
block then if you want but do not call the variable attachments_inline, call it
rather message_inline or something like that.

6) Does mimeInlineTextHTMLAsPlaintextClass will just display HTML part as Plain
text or if will do an HTML -> plain text and then back to HTML like I have seen
in one of you comment some where (I cannot found it anymore). If it's just as
plain text then it's fine to have the snitized mode too.

7)>This forces attachments to be inline. I am not sure, if users of the "As
  >Plaintext" mode want this for HTML attachments (which is what this would do).
If the user decide to see HTML as plain text, we should still let him to see
them  inline as we would have done normally. Avoiding displaying attachment
inline is part of another bug.

11) You are totally right. Then instead of adding a comment, just remove the
extra test.

13) I'll check if you can do that...

15) As it seems we already have a lot of overcrossing between this patch and the
one for bug 108153, could you at least add the needed .h to this patch in the
future or if you prefers just ifdef the implementation of sanitizer in mimemoz2.cpp
.
Be aware that the max filename size on Mac is 31 characters. You are current ly
at the limit with mozSanitizingHTMLSerializer.cpp. As MacCVS will add ~n at the
end when you pull a new version, would be nice if you could reduce the name a
little bit.


I still have to review the UI part...






Depends on: 108153
J-F, I missed your comment, sorry.

> I will provide you the patch for the Mac project...

Thanks.

> 3) Either we remove on of the list or we keep both up-to-date. But having one
> incomplete is not a good idea. Personnaly I would prefer the have both complete
> as it use to be.

Note: It isn't up-to-date in CVS either. Personally, I dislike redundancy, but I
will make both lists complete.

> mimeMessageClass is for inline message/RFC822 and inline news message. You can
> block then if you want but do not call the variable attachments_inline, call it
> rather message_inline or something like that.

Thanks. I can save that for another bug. I will remove my hack then.

> Does mimeInlineTextHTMLAsPlaintextClass will just display HTML part as Plain
> text or if will do an HTML -> plain text and then back to HTML like I have seen
> in one of you comment some where (I cannot found it anymore)

The comment is in the header file of the class and in mimei.cpp, IIRC. Yes, I do
HTML->Plaintext->HTML. The latter is, because
- we display using Gecko and that is primarily a HTML renderer. It can display
plaintext currently,
  but not too well.
- links wouldn't be clickable otherwise
- We wouldn't be able to take advantage of format=flowed (currently, we don't do
either, but
  that should be fixable).

> Then instead of adding a comment, just remove the extra test.

Will do.


Also, did you already review the 2 new MIME classes (the HTMLAsPlaintext and
HTMLSanitizer)? Once you've done that (and I assume you have feedback to that
code as well), I will make the changes together and attach a new patch,
incorporating your review comments, so that we can get this puppy in.
Attached patch Suggested fix, version 5, UI. (obsolete) — Splinter Review
Attachment #70300 - Attachment is obsolete: true
Attachment #70299 - Attachment is obsolete: true
J-F, here you have one (diff -uwN) patch with *everything* - all the patches
above combines plus the content/ one from bug 108153. This should be all you
need to build and run it.

This should be the same version as above, i.e. without incorporating your
comments (at at least most of them). So, you only need to review the files you
didn't look at already. You can also ignore everything in content/ - I will ask
akk to review that in bug 108153.

I will reduce the filename to mozSanitizingHTMLSerial.cpp. (I was not aware of
the Mac filename size limit.)
Oh, and as for the license. Please don't let that hold you up. I'd much rather
change it to the tripple license than to let this bug miss 1.0.
Attached file Crash report (obsolete) —
The comments I made to the bugreport #70 didn't seem to make it to bugzilla,
here's what I wrote:

I applied the patch to the 0.9.9 source release and got an error that the first
hunk had failed.
I continued to build (Mandrake Linux 8.1, gcc 3.0.2) and all went well.
I entered the following two entries into prefs.js
 user_pref("mailnews.display.html_as", 0);
 user_pref("mailnews.display.prefer_plaintext", true);
all was fine.
Changing mailnews.display.html_as to 2 displayed the html page source for html
mail as expected.
Changing mailnews.display.html_as to 1 worked fine on the messages I tried it
with except for one message, where Mozilla crashed and exited back to the OS.
I ran mozilla with the following command :
  mozilla >mozlog.txt 2>mozerr.txt
The attached bug report includes the output from these two files, the
mailWindowOverlay.js.rej report from the patch failure and the source of the
mail message on which it crashes. This makes the file about about 32K so I
thought it best to include it as an attachment rather than paste it all into
bugzilla.
J-F,

I made the promised changes in response to your (half) review. Attached.

> use a nsCOMPtr<nsIParser> here else you are leaking the parser. 2 occurences.


There were an NS_RELEASE at the end. I leaked in failure siituations, though.
Changed to nsCOMPtr.

> mimeMessageClass is for inline message/RFC822 and inline news message.
> You can block then if you want but do not call the variable
> attachments_inline, call it rather message_inline or something like that.

I just noticed that you probably misunderstood the purpose of that pref. It
didn't just block inline messages, but inline message *in addition to* the
other types blockes by the other pref modes.

It was a hack nevertheless and I removed it.


Adam Ruddock,

I couldn't reproduce the crash. I imported and loaded the msg, both from IMAP
and local folders, and saw no crash in any of the modes.

Make sure that the patch applies cleanly (remove the old patch leftover first)
and all changes take effect.

I added a few more printfs at the place which seems to crash, so please rerun
the same test with the latest patch.

BTW: You really need this bugfix - I saw unique IDs for you in the img URLs. I
tried not to trigger them, but accidently did nevertheless - sorry.
Attachment #71580 - Attachment is obsolete: true
Attachment #71589 - Attachment is obsolete: true
Attachment #73878 - Attachment is obsolete: true
Attachment #73879 - Attachment is obsolete: true
Attachment #73886 - Attachment is obsolete: true
No longer blocks: 108004
Whiteboard: Backend done, waiting for review → Done, waiting for review
Target Milestone: mozilla0.9.9 → mozilla1.0
Do I need to have applied any other patches first?
I'm working from the mozilla-source-0.9.9.tar.gz as downloaded from the
release/mozilla0.9.9 ftp directory.
I've run patch -p0 -u --verbose --dry-run -i patches/id74397.patch and have two
problems at this point.
1. Hunk #3 FAILED at 79.
This seems to be because #include "nsIComponentRegistrar.h" is not in mimei.cpp,
the final line in the #include block is #include "nsMsgUtils.h"
2. Hunk #10 FAILED at 581.
The patch file says :
@@ -447,7 +581,9 @@
         nsCOMPtr<nsIComponentRegistrar> reg;
         NS_GetComponentRegistrar(getter_AddRefs(reg));
         PRBool isReg = PR_FALSE;
-        nsCAutoString
decoderId(NS_LITERAL_CSTRING("@mozilla.org/image/decoder;2?type=") +
nsDependentCString(content_type));
+        nsCAutoString decoderId(
+                    NS_LITERAL_CSTRING("@mozilla.org/image/decoder;2?type=")
+                    + nsDependentCString(content_type));
From line 580 in mimei.cpp I have
  if (opts && opts->part_to_load)
	/* Always ignore Content-Disposition when we're loading some specific
	   sub-part (which may be within some container that we wouldn't otherwise
A search for the string "nsCAutoString" in that file turns up a blank!

I think I may have something fairly fundamentally wrong (or at least different).

BTW I know I receive e-mails fairly frequently with these "tracking images", one
of the reasons I'm particularly keen on this fix (plus a desire to wipe HTML
mail from the face of the planet altogether!). Luckily the message that I've
experienced this crash with wasn't one that I particularly need to keep secret -
thus the reason that I included it in its entirety in my crash report.
I am looking at the lastest patch...
Attached patch Mac project changes (obsolete) — Splinter Review
Modification needed in order to build the patch (version 6) on Mac
Comment on attachment 74397 [details] [diff] [review]
Fix, version 6, everything, diff -uwN. Contains 108153.

comment:

1) Licence in new file. Per Mozilla License-policy
(http://www.mozilla.org/MPL/license-policy.html)
Your are requested to use the MPL/GPL/LGPL triple licence in order to add new
files in the tree.
Unless you get an official exeption from Mozilla staff, I will not let you
check in that code.

2) in mimethpl.cpp
Please remove those printf you left.

3) in mimeei.cpp
+  PRBool avoid_html = PR_FALSE;
+  PRBool avoid_images = PR_FALSE;
+  PRBool avoid_strange_content = PR_FALSE;
+  PRBool avoid_bugs = PR_FALSE;
+  if (allow_classes >= 1)
+    avoid_html = PR_TRUE;
+  if (allow_classes >= 2)
+    avoid_images = PR_TRUE;
+  if (allow_classes >= 3)
+    avoid_strange_content = PR_TRUE;
+  if (allow_classes == 100)
+    avoid_bugs = PR_TRUE;

could be written

+  PRBool avoid_html = (allow_classes >= 1);
+  PRBool avoid_images = (allow_classes >= 2);
+  PRBool avoid_strange_content = (allow_classes >= 3);
+  PRBool avoid_bugs = (allow_classes == 100);

4) in mimeei.cpp @@ -617,4 +764,5 @@
	   (clazz != (MimeObjectClass *)&mimeInlineTextPlainFlowedClass) &&
	   (clazz != (MimeObjectClass *)&mimeInlineTextHTMLClass) &&
+	   (clazz != (MimeObjectClass *)&mimeInlineTextHTMLSanitizedClass) &&
	   (clazz != (MimeObjectClass *)&mimeInlineTextRichtextClass) &&
	   (clazz != (MimeObjectClass *)&mimeInlineTextEnrichedClass) &&

and what about mimeInlineTextHTMLAsPlaintextClass! we should display it inline
as well.

5) in mimemoz2.cpp
please remove debugging code and printf in function HTMLSanitize and
HTML2Plaintext.
If you want to keep some of the printf, use NS_ASSERT or NS_WARNING instead.
Also there is a NS_ERROR_NS_ERROR_FAILURE which prevent the pach to build!!!

6) in mailWindowOverlay.js
+var allow_processors_no_html = 1; /* the user preference,
+     if HTML is not allowed. I assume, that the user could have sat this to a

please correct typo: ...could have SET this...


7) I am not the right guy to review mozilla/content/... part of the fix.


I still haven't reviewed the UI part and I still need to test the patch, I'll
do it tomorrow on Mac, my build is ready...
Need to locate mozilla.org contributors for writing testcases and testing this
feature.  No resource within Netscape to test this feature.
QA Contact: lchiang → asa
the UI changes are fine execpt for the wording: In mail, we use Rich text when
speaking about HTML and Plain Text in in two word. I would rather see something
like:

+<!ENTITY bodyMenu.label "Body">
fine

+<!ENTITY bodyAllowHTML.label "Allow Sender's HTML">
Allow Rich Text (HTML)

+<!ENTITY bodySanitized.label "Sanitize HTML">
Sanitize Rich Text (HTML)

+<!ENTITY bodyAsPlaintext.label "As Plaintext">
As Plain Text


But Robin would be the right person to decide the best wording...We want to add
the following menu to control how we should display the body of an email message
in the mail3Pane window:

View/Body/Allow Rich (HTML) Text    --> we display the me message without
modification
View/Body/Sanitize Rich (HTML) Text --> we modify the HTML to prevent security issue
View/Body/As Plain Text --> we always convert the message to plain text


This feature works great and so far I haven't had any problem with it.

But when the message is sanitized or converted to plain text, we should somehow
give the user access to embedded images that got striped out. Maybe inserting a
link in the text or show the image as attachment.

Also find the sanitized mode cut too much! like colors attributes!

Good job Ben
+<!ENTITY bodyMenu.label "Body">
Should this be "Bodies"? Does this apply to all messages? If yes, then Bodies is
preferred. If this menu item only applies to the selected message, than Body is OK.

+<!ENTITY bodyAllowHTML.label "Allow Sender's HTML">
as HTML

+<!ENTITY bodySanitized.label "Sanitize HTML">
as Sanitized HTML

+<!ENTITY bodyAsPlaintext.label "As Plaintext">
as Plain Text
I was proposing the term Rich Text instead of HTML as is it what we are
currently doing in message compose (menu options/format/...)
You can't use "Rich Text" because there's a well-known format named "Rich Text
Format" (.rtf) already.  In many applications, choosing "Save As Rich Text..."
will save as this.   Robinf's suggestions sound good enough for me.
Comment on attachment 74397 [details] [diff] [review]
Fix, version 6, everything, diff -uwN. Contains 108153.

I already suggested to Ben that he look into the DTD classes and the way
composer
uses a dtd class to enforce which tags are allowed or illegal.	He's going to
look
into it for future revisions, but it shouldn't block getting a first
implementation
of this much-wanted functionality into the tree.

As for being qualified to review new serializers, I'm not sure anyone really is 
(I discussed some of the serializer API issues with jst while I was looking
through
this patch, and apparently no one understands the intent behind the APIs any
more).
I guess I'm as good a person as any (I'll also cc Tanu, who now owns the other
serializers).

>Index: content/base/public/mozISanitizingHTMLSerial.h

Can I ask why the filename is "Serial" rather than "Serializer", i.e. an
adjective
rather than a noun, and not the same as the name of the class?	If the filename
can't
be that long for some reason, I would have preferred dropping the "HTML" or 
abbreviating "Sanitizing" rather than adjectivizing "Serializer", 
though it may just be a personal preference.

>+  NS_IMETHOD Initialize(nsAWritableString* aOutString,
>+                        PRUint32 aFlags,
>+                        nsAReadableString& allowedTags) = 0;
>+  // This function violates string ownership rules, see impl.

I still don't understand why the function violates string ownership rules 
(likewise for the other place you added that comment).

>Index: content/base/src/mozSanitizingHTMLSerial.cpp
>===================================================================
>+mozSanitizingHTMLSerializer::Init(PRUint32 aFlags, PRUint32 dummy,
>+                                  nsIAtom* aCharSet, PRBool aIsCopying)
>+{
>+  nsresult rv;
>+
>+  nsCOMPtr<nsIPref> prefs(do_GetService(NS_PREF_CONTRACTID, &rv));
>+  if (NS_SUCCEEDED(rv) && prefs) {
>+    //PRBool tempBool;
>+    // Get some prefs that controls how we do formatted output
>+    //prefs->GetBoolPref(PREF_STRUCTS, &tempBool);
>+    //prefs->GetIntPref(PREF_HEADER_STRATEGY, &mHeaderStrategy);
>+  }

Maybe comment out the call to get the pref service if we're not actually 
getting any prefs?

>+mozSanitizingHTMLSerializer::Flush(nsAWritableString& aStr)
>+{
>+  mOutputString = &aStr;
>+  mOutputString = nsnull;
>+  return NS_OK;
>+}

I probably would have guessed that Flush would append the argument rather than
replacing the whole output string with it, but that's just a guess; I don't
see it documented anywhere, and the existing serializers treat it as a no-op.
I asked jst, but he has no idea what it's supposed to be either.
So you're free to use your own judgement there.

>+mozSanitizingHTMLSerializer::GetIdForContent(nsIContent* aContent,
>+                                             PRInt32* aID)

I suspect you could improve performance by using HTMLAtomTagToId instead of
HTMLStringTagToId, later when you start working on performance improvements.
I've been told getting the string for a content node is slow.

>+mozSanitizingHTMLSerializer::AppendText(nsIDOMText* aText, 
>+                                  PRInt32 aStartOffset,
>+                                  PRInt32 aEndOffset, 
>+                                  nsAWritableString& aStr)

I'm confused about this routine: you get the contents of the content node
into textstr, but then you don't do anything with textstr?
I see that some of this is copied from nsPlainTextSerializer.cpp, but that
file then makes use of textstr, whereas yours doesn't, so something must
be missing here.

>+/* XXX I don't really know, what these functions do, but they seem to be
>+   needed ;-). Mostly copied from nsPlaintextSerializer. */

I don't know either, sorry. :-/
I wonder if the sanitizing class could inherit from nsHTMLSerializer,
so that at least these methods that none of us understand only have to be
written once?  Also, you might be able to inherit some of the whitespace
and prettyprinting code, if you wanted that.  (Something to consider for 
future revisions -- don't worry about it right now.)

>+            Write(key); // I get an infinive loop with | + key + | !!!
				     "infinite"

>+mozSanitizingHTMLSerializer::SanitizeAttrValue(nsHTMLTag aTag,
>+                                               nsAReadableString& attr_name,
>+                                               nsString& aValue /*inout*/)
>+{
>+  nsAutoString value(Substring(aValue, 0, 1000));

Why 1000?  What is this doing?	Might be good to explain this odd code.

>+#if 0 /*
>+   somehow, this gets compiled!?!
>+  { "HTML sanitizing content serializer",
>+    MOZ_SANITIZINGHTMLSERIALIZER_CID,
>+    NS_CONTENTSERIALIZER_CONTRACTID_PREFIX "text/html",
>+    -- can I achieve it that I say e.g. "text/html;sanitized" and a
>+       consumer does the same and the stream is automatically transformed.
>+       e.g. through Necko's stream converter?
>+    CreateSanitizingHTMLSerializer },
>+      */
>+#endif

I don't understand this.  Why is it ifdef'ed commented out?
Do we need it or not?

>+  { "HTML sanitizing content serializer",
>+    // What's the difference between a "serializer" and a "sink"?

Ah, something I can answer. :-)
A serializer is something that works from a dom tree and produces serial output
(in html, plaintext or whatever format).
A sink is something that can act as the output of the parser -- it takes parser
calls and does something with them (maybe produces output, or maybe sends
commands
to layout to display on the screen).  Most of the classes that implement the 
serializer interface also implement the content sink interface.

Most of these comments are not blockers, just suggestions for future work.
I'd like to know about the last two comments before the last one, i.e.
the 1000 and the #if 0, but pending that you have my r=akkana.
(I may be out this afternoon, but if J-F or whoever sr's this is happy 
with the answers for those two issues, then I'm happy too.)
Comment on attachment 74397 [details] [diff] [review]
Fix, version 6, everything, diff -uwN. Contains 108153.

I already suggested to Ben that he look into the DTD classes and the way
composer
uses a dtd class to enforce which tags are allowed or illegal.	He's going to
look
into it for future revisions, but it shouldn't block getting a first
implementation
of this much-wanted functionality into the tree.

As for being qualified to review new serializers, I'm not sure anyone really is 
(I discussed some of the serializer API issues with jst while I was looking
through
this patch, and apparently no one understands the intent behind the APIs any
more).
I guess I'm as good a person as any (I'll also cc Tanu, who now owns the other
serializers).

>Index: content/base/public/mozISanitizingHTMLSerial.h

Can I ask why the filename is "Serial" rather than "Serializer", i.e. an
adjective
rather than a noun, and not the same as the name of the class?	If the filename
can't
be that long for some reason, I would have preferred dropping the "HTML" or 
abbreviating "Sanitizing" rather than adjectivizing "Serializer", 
though it may just be a personal preference.

>+  NS_IMETHOD Initialize(nsAWritableString* aOutString,
>+                        PRUint32 aFlags,
>+                        nsAReadableString& allowedTags) = 0;
>+  // This function violates string ownership rules, see impl.

I still don't understand why the function violates string ownership rules 
(likewise for the other place you added that comment).

>Index: content/base/src/mozSanitizingHTMLSerial.cpp
>===================================================================
>+mozSanitizingHTMLSerializer::Init(PRUint32 aFlags, PRUint32 dummy,
>+                                  nsIAtom* aCharSet, PRBool aIsCopying)
>+{
>+  nsresult rv;
>+
>+  nsCOMPtr<nsIPref> prefs(do_GetService(NS_PREF_CONTRACTID, &rv));
>+  if (NS_SUCCEEDED(rv) && prefs) {
>+    //PRBool tempBool;
>+    // Get some prefs that controls how we do formatted output
>+    //prefs->GetBoolPref(PREF_STRUCTS, &tempBool);
>+    //prefs->GetIntPref(PREF_HEADER_STRATEGY, &mHeaderStrategy);
>+  }

Maybe comment out the call to get the pref service if we're not actually 
getting any prefs?

>+mozSanitizingHTMLSerializer::Flush(nsAWritableString& aStr)
>+{
>+  mOutputString = &aStr;
>+  mOutputString = nsnull;
>+  return NS_OK;
>+}

I probably would have guessed that Flush would append the argument rather than
replacing the whole output string with it, but that's just a guess; I don't
see it documented anywhere, and the existing serializers treat it as a no-op.
I asked jst, but he has no idea what it's supposed to be either.
So you're free to use your own judgement there.

>+mozSanitizingHTMLSerializer::GetIdForContent(nsIContent* aContent,
>+                                             PRInt32* aID)

I suspect you could improve performance by using HTMLAtomTagToId instead of
HTMLStringTagToId, later when you start working on performance improvements.
I've been told getting the string for a content node is slow.

>+mozSanitizingHTMLSerializer::AppendText(nsIDOMText* aText, 
>+                                  PRInt32 aStartOffset,
>+                                  PRInt32 aEndOffset, 
>+                                  nsAWritableString& aStr)

I'm confused about this routine: you get the contents of the content node
into textstr, but then you don't do anything with textstr?
I see that some of this is copied from nsPlainTextSerializer.cpp, but that
file then makes use of textstr, whereas yours doesn't, so something must
be missing here.

>+/* XXX I don't really know, what these functions do, but they seem to be
>+   needed ;-). Mostly copied from nsPlaintextSerializer. */

I don't know either, sorry. :-/
I wonder if the sanitizing class could inherit from nsHTMLSerializer,
so that at least these methods that none of us understand only have to be
written once?  Also, you might be able to inherit some of the whitespace
and prettyprinting code, if you wanted that.  (Something to consider for 
future revisions -- don't worry about it right now.)

>+            Write(key); // I get an infinive loop with | + key + | !!!
				     "infinite"

>+mozSanitizingHTMLSerializer::SanitizeAttrValue(nsHTMLTag aTag,
>+                                               nsAReadableString& attr_name,
>+                                               nsString& aValue /*inout*/)
>+{
>+  nsAutoString value(Substring(aValue, 0, 1000));

Why 1000?  What is this doing?	Might be good to explain this odd code.

>+#if 0 /*
>+   somehow, this gets compiled!?!
>+  { "HTML sanitizing content serializer",
>+    MOZ_SANITIZINGHTMLSERIALIZER_CID,
>+    NS_CONTENTSERIALIZER_CONTRACTID_PREFIX "text/html",
>+    -- can I achieve it that I say e.g. "text/html;sanitized" and a
>+       consumer does the same and the stream is automatically transformed.
>+       e.g. through Necko's stream converter?
>+    CreateSanitizingHTMLSerializer },
>+      */
>+#endif

I don't understand this.  Why is it ifdef'ed commented out?
Do we need it or not?

>+  { "HTML sanitizing content serializer",
>+    // What's the difference between a "serializer" and a "sink"?

Ah, something I can answer. :-)
A serializer is something that works from a dom tree and produces serial output
(in html, plaintext or whatever format).
A sink is something that can act as the output of the parser -- it takes parser
calls and does something with them (maybe produces output, or maybe sends
commands
to layout to display on the screen).  Most of the classes that implement the 
serializer interface also implement the content sink interface.

Most of these comments are not blockers, just suggestions for future work.
I'd like to know about the last two comments before the last one, i.e.
the 1000 and the #if 0, but pending that you have my r=akkana for the
content part of the patch.
(I may be out this afternoon, but if J-F or whoever sr's this is happy 
with the answers for those two issues, then I'm happy too.)
Frist, thanks J-F and akk for the reviews.

re comment 76:
All OK with me. Esp. point 3 was a good catch

re wording:
- IMO, "Sender's" is important, because it stresses the processing of untrusted
content (which is exactly the reason for this whole bug) and clearly separates
it from the Sanitized option, which also allows (some) HTML.
- I agree with Hakan that "Rich Text" has a special meaning, and I think that
our send format UI is buggy in that respect. But I can also see the desire for
being understandable and consistent. I have no opinion, as long as "HTML" does
appear.
- "Bodies": It applies only to the selected message, but will persist for the
next messages.
I think the logic in the existing UI is that we use singular when refering to
the message pane (e.g. View|Message) (apart from "Headers", because there are
several headers in one message) and plural when referring to the thread pane
(e.g. View|Messages|Unread). This makes sense to me.
- "Plaintext" vs. "Plain Text": I do not care. You are right - we use "Plain
Text" elsewhere, so let's go for that.

> But when the message is sanitized or converted to plain text, we should
somehowgive the user
> access to embedded images that got striped out. Maybe inserting a link in the
text or show the
> image as attachment.

We should show the alt attribute, which /should/ offer an equiv. text
alternative. Unfortunately, that's prevented by bug 125983.

Showing an inline link would not be nice code-wise, because it would require
special-casing.

Showing /embedded/ images (multipart/related) as attachment would be nice, and
was intended. Maybe it's as easy as replacing m/r with m/m. Let's postpone that
until after the checkin, though.

Note that you can always switch back and forth between Allow Sender's HTML and
Sanitized HTML. (Just don't forget to switch back to Sanitized mode!)

> Also find the sanitized mode cut too much! like colors attributes!

Many people dislike sender-forced colors, often they are even a major reason to
not want HTML mail, I believe that enabling colors would make the feature much
less useful for the majority of users. The code is generic, though, and it's all
just a pref. Docs in the code / default prefs ;-).

Responses to akk in the next comment.
> apparently no one understands the intent behind the APIs any more

hehehe
/me holds back comments about documentation

> If the filename can't be that long for some reason

Exactly - Macs (see above).

> I would have preferred dropping the "HTML" or 
> abbreviating "Sanitizing" rather than adjectivizing "Serializer", 
> though it may just be a personal preference.

Yes, that's a good option, too. Anyone else agrees with akk?

> I still don't understand why the function violates string ownership rules 

me neither, because I don't know these rules at all. I just copied it from the
impl. in nsPlaintextSerializer. I thought that such a comment, if true, should
be in the interface, not just the impl..
I think that it won't be an actual problem (for any reasonable use of the class).

> Maybe comment out the call to get the pref service if we're not actually 
> getting any prefs?

Ops, yes. Excessing copying ... ;-P

> Flush

Will look into that.

> I'm confused about this routine:

me too!

> you get the contents of the content node
> into textstr, but then you don't do anything with textstr?

I tried to reduce the function to the part that - from my understanding -
actually did have an effect (just the mOutputString assignment), but it broke.

I might try to mess with it again.

> Why 1000?  What is this doing?	Might be good to explain this odd code.

Draconian attitude... (The more restrictive we are, the less likely are security
problems.)

I thought that attributes with values longer than 1000 chars seem bogus,
considering that we don't support any JS. The longest attributes I can think of
are URLs, and URLs with 1000 chars are likely to be bogus, too.

Will add a comment.

> I don't understand this.  Why is it ifdef'ed commented out?

Phew, don't ask me why, but my compiler somehow seemed to ignore the "#if 0".

I'll just remove it altogether.

> +    // What's the difference between a "serializer" and a "sink"?

Thanks for the explanation. I happened to pick the right one :). Will remove the
comment.
Re stripping colors: stripping colors is one of the biggest reasons people want
this (for people who send unreadable html mail that uses red on yellow or pink
on purple) so stripping colors is definitely desirable here.  In fact, see the
very first comment in this bug (which was inspired by an actual message).

Sorry about that double comment submission -- accidental submission the first time.

Re the 1000: thanks for the explanation; I had missed that it was attribute
length.  That sounds fine (but please do add a comment).
Revisiting my comment #80:

+<!ENTITY bodySanitized.label "Sanitize HTML">
as Sanitized HTML

I don't think users are going to know what we mean if we use the word
"Sanitized". We would also need to define this word (whichever word we decide to
use) elsewhere in the UI (such as in a pref) in order for users to understand
what this view is. cc:ing jglick.
This patch contains the previously announced changes and ducarroz's mac build
changes (thanks).

Per request, I changed the filename to moz[I]SanitizingSerializer.*.
I also shorted AppendText and Flush.

re Wording: I did not change the UI wording apart from "Plain Text", because
there's no common agreement on anything yet.

re "Rich Text": Ironically, I actually allow the real "RichText" (RTF) to
display in all modes, unless the pref mailnews.display.allow_processors is >=
3.

I changed to the tripple license, but made 2 minor changes to the license.
I'm checking it with mozilla.org (cc .license) - ignore it.

I made several tweaks to comments and a few tweaks to the code where I
found remaining problems (e.g. libmime error code 1 -> -1).

I am happy that I could remove even the biggest remaining problem, namely that
msgs with mimetype "text/html" without charset were not displayed at all.
Carefully re-reading over the code is sometimes quite helpful. :-) One ceveat,
though, see below.

This patch is a "RC", i.e. I'd check this in as-is, if possible.
So, if you find anything goofy like remaining unconditional printfs,
please tell me.

ducarroz, akk, please r=.

Remaining problems
(All of those are only problems for users of the new features):
- Bug 126887: Messages with a mimetype of exactly "text/html" (without charset)

  produce a messed-up call sequence within libmime. Result:
  In HTML->TXT->HTML case, no formatting whatsoever (not even linebreaks).
  Sanitized mode still looks fine.
- Bug 125983: In the sanitizer mode, inline images leave
  no trace whatsoever (no link) or just a little dot (linked).
- Bug 122876: In the HTML->TXT->HTML mode together with overzealous bolding
  by the sender, the output is even more overzealously bolded.

superreviewers/drivers:
The code is designed to have minimal impact on users who did not activate the
new modes. It does provide a workaround for bug 28327 (and bug 18427), though,
and is thus much-needed.
Attachment #74397 - Attachment is obsolete: true
Attachment #74913 - Attachment is obsolete: true
Same patch (v7 everything), this time without diff -w, i.e. with all whitespace
changes. I made a few in mimei.cpp.
Attached image Proposed Pref Panel
I think this is a great feature, but you need to keep in mind who this feature
is targeted for, a more advanced informed user. 

This bug suggests adding the following menu items to the View Menu. 

Body as --> Sender's HTML 
	    Sanitized HTML 
	    Convert HTML to plain text 

I don't think this belongs in the View menu because: 
1. While this may make sense to a select few, the majority of users are going
to have no clue what this means. What the heck is Sanitized html? In addition,
the majority of users, don't even need to know this.  
2. There isn't enough room in a menu item to properly explain what these items
mean. Sticking this in a menu when most users won't understand the options
isn't good.
3. Is this really something users need to be able to change on a per message
basis? Or will they more likely pick a setting and stick with it? 
4. The browser has a related feature, enabling/disabling js, as and Advanced
Pref.

I would suggest this belongs in Preferences. 
1. You can better explain the purpose of it there since you have a little more
space. 
2. We already have Plain text viewing options in Prefs, so html viewing options
makes sense. 
3. Its available for more experienced users but a little hidden for users who
don't care. 

A proposal pref panel is attached.
> 3. Is this really something users need to be able to change on a per message
> basis?

Yes. As ducarroz, there are cases where we don't show all content yet (namely
embedded images). Also, the mode is designed to be very restrictive and you may
want to release the restrictions, if you trust the sender to some extend (about
as much as you do webpages you visit), for example to see all formatting. In the
bugs filed around this feature (dups etc.), many reporters requested a way to
toggle this on a per-message basis.

All your other points are valid, but I think this one requires a menu item.

> 2. We already have Plain text viewing options in Prefs, so html viewing options
> makes sense.

I agree, that's why this blocks bug 31907.
> I think this is a great feature, but you need to keep in mind who this 
> feature is targeted for, a more advanced informed user. 

I don't think this is entirely true. It will be targeted for a more advanced
user only if it stays in the pref panel. When less advanced user gets
"unreadable" mail he/she will not know that something can be done about it. The
pref panel looks great for setting the initial view but I think that there
should be easy way to change it per message for example by using tabs as in
composer to view different interpretation of the same message. The pref panel
would then allow to customize the order of individual tabs and what tabs to
display (ie. do not ever show tab with the original html - I am a paranoid user).
I don't think it matters that most users might not understand the meaning of
sanitized. That option will sit right between two menu options whose meaning
will be very obvious to most users. Its placement between should imply something
between the two, and even when it doesn't, it's easy enough for them to try it,
and put back if not desired. The help menu should have an explanation for those
who require one.

OTOH, digging into prefs for a what if for a single message is clumsy.
I'm in favor of putting these controls in the menu as well. This is something
that we would like to be easily accessible to less experienced users, because
they can benefit from it as well, as a privacy-protection feature. How about
"Simple HTML" or "Limited HTML" instead of "Sanitized HTML?"
I don't thing either this feature will be practical is it in a pref panel. I
thing is fine for me to have in in the view menu like we already have a submenu
to decide how to display headers.

And what about "safe HTML" or "Inert HTML"
Simplified? [this implies some amount of mangling, and I think we're doing that]

I don't think simple or limited clearly describe what we're doing -- or I could be confused about what we're doing.
Forgive my ignorance, but what will "sanitized" mode filter out? Does it filter
out iframes?
I'd vote for menu items too if there was this kind of voting system in Bugzilla
:) Simplified HTML sounds reasonable. Or how about Purified HTML?
Out of the choices, "limited HTML" sounds the best as a menu item (if it's
determined that a menu item is necessary).   

Also, I hope that the default will be "sender's html" so that there are no
changes there, somewhat minimizing the amount of testing if we limit this to be
an opted-in choice, affecting only those advanced users who understand or care
about this feature.

Thanks
I like "Sanitized". A sanitizer is something that cleans things up, even at a
level where the "risks" removed are not visible. Isn't that exactly what this
code does? What is not understandable?

I would find "Simple" better. It would be my first alternative to "Sanitized".
"Purified" is good, too, IMO.

I would accept "Limited", but "Limited" is more ambiguous IMO. Even HTML with JS
turned off is already "limited".

"Safe" is not a good wording IMO. Yes, being safe is a major goal, but not the
only one (we also remove colors, which are safe, but an eye annoyance), and I
don't want to promise something that the code cannot deliver. 'There's no such
thing as a safe application.'
"Inert" (sound of a dictionary): Hm, would be OK.

> Forgive my ignorance, but what will "sanitized" mode filter out? Does it filter
> out iframes?

Yes, that's one of the first things to strip. E.g. bug 109249 is prevented.
Plugins, too. Generally, only harmless structural HTML tags are allowed and a
few harmless presentational ones like <b>.

However, that's just the default config. The list of allowed tags and attributes
is a user-configurable whitelist. The only restriction is that we strip (single
and double) quotes from attributes, so JS won't work (why would you want that?).

> Also, I hope that the default will be "sender's html" so that there are no
> changes there, somewhat minimizing the amount of testing if we limit this to be
> an opted-in choice, affecting only those advanced users who understand or care
> about this feature.

Exactly.
I agree with jglick that using "Sanitized HTML" is too ambiguous, and that users
won't know what we mean by it. It's best to avoid terms like "Secure" or "Safe"
for the same reasons. I'd suggest using "Simple" HTML, "Basic" HTML, or
"Limited" HTML. 
I was asked to add my suggestion to this bug:

View > HTML and Media
       HTML only
       Plaintext only
Comment on attachment 75529 [details] [diff] [review]
Same patch, with whitespace changes

The patch looks good but you still have too much debugging printf in
mimethpl.cpp
. Also before I can give you my ok, we need to wait for a final decision about
the ui from jglick
> The patch looks good

good, tnx.

> but you still have too much debugging printf in mimethpl.cpp

I inserted specifically to allow debugging of the crash Andy Ruddock saw. If,
with this patch, he can reproduce the bug again and provide the same info as
before (the console output), I would have a pretty good idea where the problem
is. Note that it's only active for DEBUG build. I could of course remove it and
let him use the patch attached here, since he builds himself. But maybe what he
saw was not local hockage and pther people see the bug, too.

So, are you sure I should remove it? (I will do so, if you say so.)
Removed the #if DEBUG blocks.
Attachment #75528 - Attachment is obsolete: true
Attachment #75529 - Attachment is obsolete: true
Jennifer, what's the final decision for the UI?
we should find a different QA contact for this since I'm not going to have any
time to do any testing of this feature. Suggestions?
asa, I think that there's enough interest in the community that we can get the
common cases covered "automatically". Formal QA would be needed for systematic
testing and edge cases only, but QA usually doesn't do that anyways :-/.

Still waiting for UI decision, sr= (and later a=). Time is running out :-(.
you haven't got an r= from JF yet.
a few comments: 

allow_classes should be an enum with descriptive names, instead of
0,1,2,3,100...makes the code more readable

+        clazz == (MimeObjectClass *)&mimeInlineTextHTMLSanitizedClass ||
+        clazz == (MimeObjectClass *)&mimeInlineTextHTMLAsPlaintextClass ||
+           /* The latter 2 classes bear some risk, because they use the Gecko
+              HTML parser, but the user has the option to make an explicit
+              choice in this case, via html_as. */
this comment is referring to mimeInlineTextHTMLSanitizedClass and
mimeInlineTextHTMLAsPlaintextClass? "latter 2" is not as clear as it could be.

mime_find_class: why not cache these two prefs instead of running through this
code every time we try to find a class? 

all the comments about the charset handling are very scary. You should get a
review from the i18n people as well.

If you're adding code to the content module, you need to get a review from the
content module owner.

If this is in the bug already, I apologize for asking about it, but there's a
lot of comments!  What is the exact list of things that "sanitized" mode takes
out?  It would be helpful for QA.

I think it would be worthwhile to have the preferences and menu item.  My guess
is that most users aren't going to want to change this very often. The
preferences give the ability to explain what this does and the menu item gives
the ability to override for an individual basis.  We do this in the compose
window in the Options menu.

Regarding wording, I think the first menu item shouldn't have anything to do
with html.  It should be something like "As Sent", "Allow sender's message" or
"original message".  I'm not good with coming up with these things, but I think
it should convey something like that.  It's the untouched message.  

The second menu item is harder.  I think "sanitized" and "purified" aren't good
because they can mean different things. Does that mean that we remove all foul
language from these messages as well?  I like "Simple" or "Basic" or "Limited" html.



Ben, you need to update your patch as nsAWritableString and nsAReadableString
has been replaced by nsAString (see bug 131899 for details)
Seems like folks really want this so lets go with it. I'm still concerned about 
users understanding the menu options though. Still trying to think of the 
wording...

View: HTML As -->
                  Original Message
                  Formating Tags Only
                  Plain Text
> allow_classes should be an enum with descriptive names, instead of
> 0,1,2,3,100...makes the code more readable

The content comes directly from an integer pref. Because the options are also
cumulative (e.g. avoid_html is true for allow_classes = 1, 2, 3 or 100), I think
that the current solution is better readable.

> +        clazz == (MimeObjectClass *)&mimeInlineTextHTMLSanitizedClass ||
> +        clazz == (MimeObjectClass *)&mimeInlineTextHTMLAsPlaintextClass ||
> +           /* The latter 2 classes bear some risk, because they use the Gecko
> +              HTML parser, but the user has the option to make an explicit
> +              choice in this case, via html_as. */
> this comment is referring to mimeInlineTextHTMLSanitizedClass and
> mimeInlineTextHTMLAsPlaintextClass? "latter 2" is not as clear as it could be.

It refers to both. I don't see what is unclear or how I could make it clearer
without repeating the classnames.

> all the comments about the charset handling are very scary.

Note that they apply only to users of the new features.

> You should get a review from the i18n people as well.

nhotta?

> What is the exact list of things that "sanitized" mode takes out?

The current list of what is *allowed* (i.e. what the mode *leaves in*) is
defined by a pref. The current default is:

pref("mailnews.display.html_sanitizer.allowed_tags", "html head title body p br
div(lang,title) h1 h2 h3 h4 h5 h6 ul ol li(value,start,compact) dl dt dd
blockquote(type,cite) pre noscript noframes strong em sub sup span(lang,title)
acronym(title) abbr(title) del(title,cite,datetime) ins(title,cite,datetime)
q(cite) a(href,name,title) img(alt,title,longdesc) base(href) area(alt)
applet(alt) object(alt) var samp dfn address kbd code cite s strike tt b i
table(align) caption tr(align,valign) td(rowspan,colspan,align,valign)
th(rowspan,colspan,align,valign)");

It's mostly structural HTML plus a few important presentational tags/attributes.
No embedded content, no scripts, no CSS.

> you need to update your patch as nsAWritableString and nsAReadableString
> has been replaced by nsAString (see bug 131899 for details)

arg! Didn't they *introduce* nsAReadableString within the last year? I guess,
they don't *want* anybody to use the string classes.
I don't worry about I18N issue yet as the proposed patch doesn't not affect
existing code or functionalities. Only new view mode (sanitized or as plain
text) are concerned and that could be fixed after this feaure has landed.
Regarding comment 116: You may wish to allow these tags as well:
* <tbody>
* <thead>
* <tfoot>
* <colgroup>
* <hr> (?)
* <center> (?)
* <map> (?)

And, is there any means to allow <link> for Site Navigation Bar purposes? (I'm
guessing that <link> for CSS purposes would be undesirable)

You may also wish to allow the "accesskey" attribute for <a> and the like.

I presume that <form>, <input>, <textara> and <select> are excluded on purpose, eh?
> And, is there any means to allow <link> for Site Navigation Bar purposes?

We are talking about email!

(The generic sanitizer gets its pref from the calling code, so you can have a
different set of allowed tags for Mailnews and the browser, if the latter will
ever make use of it.)

> (I'm guessing that <link> for CSS purposes would be undesirable)

Right, and the current implmentation can't distingish based on attribute values.

> I presume that <form>, <input>, <textara> and <select> are excluded on
purpose, eh?

Yes. You can set up a web-page and link to it to make people fill out forms.
There's more harm than use for forms in email IMO.

I'm trying to allow the minimal amount of Mozilla code to be triggered to allow
this feature to be workable.
> Regarding wording, I think the first menu item shouldn't have anything to do
> with html.

But the code *is* about HTML. The new options are to disable HTML (more or
less), and that option disables the new options. Not mentioning HTML could imply
more than it's supposed to. For example, if the user chose to ignore colors and
fonts of webpages in Prefs|Appearance, we won't show the colors and fonts "As
Sent" even in the first mode.


I suggest the following:
View > Body:
* Allow Sender's HTML
- Simplify HTML
- As Plaintext

I suggest "Simplify HTML", because "Simple HTML" could be confusing for the case
of plaintext mails. "Simple HTML" would be OK with me, too, though.

I suggest to use "Body" (and not "HTML As"), because the menu might be extended
to other options, e.g. for ROT13, plaintext display or whatever.

ACK / NOK?
Plain text, two words, not one.
FWIW, I like it... I'd just change "Body" to "Message Body"

View > Message Body:
* Allow Sender's HTML
- Simplify HTML
- As Plaintext
Good sugestions for wording. Suggest we keep the wording parallel, when possible:

View>Message Body:

Allow Sender's HTML
As Simple HTML
As Plain Text
Eh, yes, I meant "As Plain Text".

> I'd just change "Body" to "Message Body"

It's called View|Headers, too, not View|Message Headers.


jglick, robinf, ducarroz, bienvenu, even if not everybody is happy about it, can I
nevertheless check it in tonight (after I attached a new patch to change to that
UI wording and make it build against the trunk), and we discuss the UI wording
and I18N of the new modes further after the freeze / 1.0 release? I understand
that I might get an exception to check in after the freeze at midnight, but
better checking it in before the freeze than having it rejected while the trunk
is frozen.
>View:Message Body-->

>Allow Sender's HTML

I'd go with "As" instead of "Allow". "As Sender's HTML" or "As Original HTML"

>As Simple HTML
>As Plain Text
View: Message Body As -->

Original HTML
Simple HTML
Plain Text
Keywords: mozilla1.0+
Attachment #75779 - Attachment is obsolete: true
Attached patch Same patch, with whitespace. (obsolete) — Splinter Review
Changes:
- I used Jen's latest UI suggestion verbatim. I am not happy with it, but no
time left.
- Updated to the latest string API change.
- Merged with other conflict in build file. Had to redo changes to Mac build
file.
- License: Removed the jurisdiction and version change, because nobody from
mozilla.org responded yet. License is now plain MPL/LGPL/GPL boilerplate, with
US jurisdiction and all the other <censored>.
Attachment #75780 - Attachment is obsolete: true
You have r=akkana on the content parts (I implied that earlier but never
actually made it explicit).  Of course I'm assuming there are mac xml files
involved as well as win/linux makefiles.

+    //Write(NS_LITERAL_STRING("<title>"));  -- aValue already contains that!?!
+    Write(aValue);
+    //Write(NS_LITERAL_STRING("</title>"));

The reason for that is probably that it's called from OpenContainer or
BeginStartTag or one of those more general interfaces, so the start and end tag
already gets output.  I have no idea why the content sinks have to have all
those redundant html-specific interfaces -- it bloats the code and makes it
harder to understand.

// I get an infinive loop with | + key + | !!!
typo in "infinite".  It would be good to investigate this eventually (if you
haven't already filed a bug) since it's something that could bite other people too.
If you want to follows the new string guideline in
http://www.mozilla.org/projects/xpcom/string-quickref.html
you should use a const nsAString& for inString in the following new functions:
+nsresult
+HTML2Plaintext(const nsString& inString, nsString& outString,
+               int flags, int wrapCol);
+nsresult
+HTMLSanitize(const nsString& inString, nsString& outString,
+             int flags, const nsAString& allowedTags);
+


Also, jglick last proposal was: "Message Body As". Just change that and you have
R=ducarroz (don't need to post a new patch)
+<!ENTITY bodyMenu.label "Body as">

> mime_find_class: why not cache these two prefs instead of running through this
> code every time we try to find a class?

I don't know how. This is an exported C function, so I have no place to store
state apart from the global  scope (or static). However, the pref might change
during runtime (via the menu).

The function is called only once for each mime part, I think. For a normal
plaintext mail, that means once, for a normal multipart/alternative mail 3 times
(the m/a wrapper, the plaintext and the HTML part). Similar for attachments
(once for m/m + once per attachment). I guess that's OK perf-wise.


Thanks for the review, ducarroz and akk.
Now waiting for sr...
Once again, sorry if these questions were already answered in the bug and I
missed it.

bienvenu suggested you get an i18n review.  Did that happen?

Jean-Francois wrote,
>I don't worry about I18N issue yet as the proposed patch doesn't not affect
>existing code or functionalities. Only new view mode (sanitized or as plain
>text) are concerned and that could be fixed after this feaure has landed.

Can you explain this?  If it's a bug that's going to need to be fixed after
landing, can we fix it before landing to improve the quality of the feature
going in?

My understanding is that the code path for the default mode of leaving the
messages as the sender wrote it is not affected by these changes.  Is that true?

If it wasn't already planned, I'd like to see the sr go through bienvenu,
mscott, or sspitzer.

Because of where are at the end of 1.0, features like this need to have a test
build that QA can look at.  Has that happened?  Who has tested this?

I'd like to see this feature because it's a good solution to a lot of problems.
 But, in addition to the reviews, I'd like to see us take as many of the risk
reducing steps as possible.

> bienvenu suggested you get an i18n review.  Did that happen?

No, no lifesign from nhotta yet. Don't know, who else is qualified.

> If it's a bug that's going to need to be fixed after landing, can we fix it
before landing to improve
> the quality of the feature going in?

Not, if there's no time left. As I said, I'd like to get it in for 1.0, and we
already missed today's freeze deadline (10 am here and the waiting was
completely useless). Every day this is holding any longer, the risk of it
getting the 1.0 train reduces significantly (and my ability to care, too). IMO,
it is much better to have a such feature, even if it doesn't work perfect in all
the uncommon cases, than not to have it at all.

> My understanding is that the code path for the default mode of leaving the
> messages as the sender wrote it is not affected by these changes.  Is that true?

(What a sentence :-) ) Yes, the default code path is not touched at all apart
from a bit pref-checking.

> Because of where are at the end of 1.0, features like this need to have a test
> build that QA can look at.  Has that happened?  Who has tested this?

I didn't know about such a requirement for me nor do I know how to produce test
builds (apart from Linux). Also note that Netscape's QA decided not to care at all.

I'm uploading a Linux testbuild to
<http://www.beonex.de/download/mozilla/mozilla-bug30888-linux.tar.gz>.
Ops, I see that nhotta wasn't on the cc list. I thought I added him long ago.
Doing so and mailing him now.

Anyways, the I18N "bugs" are just
- bad formatting in one edge case in one of the new modes and
- a some code places where I don't exactly know what it should do and where I
intentionally added scary comments, so we can look at that later. My *tests*
didn't show any problems, though, not even with Japanese mails in several
charsets and formats.
Regarding test builds, QA refers to Mozilla QA as much as Netscape QA.  I just
want to make sure that testing is getting done by someone who isn't helping to
develop this (that's not a slight on you or JF, it's just good to have a other
people looking at it).  I had a talk with Asa today and this is a step that will
be key for getting in features that missed the deadline. I will see if there's
anything I can do to help in this area.
I justed downloaded and tested the build mentioned in Comment #133 and it looked
very good. I could show text content of all spam messages I found in my trash
folder :)

I had a message in ther which had only "***** This is an HTML Message ! *****"
as text/plain content, so it was no wonder that one didn't translate HTML to
plaintext ;-)

Overall, a very cool feature and the build looks good from what I saw. Good
work, Ben!
> +/* TODO:
> +  - I18N: Japanese attached HTML pages have the wrong charset, so garbage
> +    gets displayed. Dunno, why.
I think the charset is not wrong as long as you can view it as HTML.
Probably the charset is different from the mail body's charset.
We either parse META charset in the attachment or apply charset auto-detection
(if selected by menu).

> +    The body is displayed correct, however. Same with RFC822 attachments.
> +    Hints appreciated. I don't understand that whole charset handling
> +    in libmime anyways.
Basically, we converts to UTF-8 from a charset specified in the message. There
are special cases when saving as a local file.

> +  - If you Save As File .html with this mode, you get a total mess.
I think the save as .html should use the existing code. Is it expected to be
saved differently with the new mode?

> +  - Print is untested (crashes in all modes).
> +*/
> 


I applied the newest patch, built it and played with it from I18N point
of view. It worked great
for Korean and simplified Chinese (my spam folders don't have any
Japanese spam mails :-).  I thought I had a few in TC, but couldn't
find it :-)) text/html or multipart/alternative
emails. I paid a special attention to whether or not this patch
breaks lang/charset dependent font/glyph selection for CJK emails
which I fixed a while ago. I couldn't find any problem so far. 
 
At one point, I thought 'view body as plain text' had a problem,
but it turned out that text/plain part of multipart/alternative which
is in turn a part of multipart/related was empty. I guess it's decided
that when text/plain is available, it's gotta be used instead of generating
plain text rendering of text/html. That's fine, but later a check may
be added to see if text/plain part has completely empty (only made up
of new lines), in which case plain text rendering of text/html can be
used instead. 

>  - Print is untested (crashes in all modes).

   Hmm, my build(the whole source tree was updated a few days ago...) with your
patch doesn't crash and printing SC and K emails
in all three modes (original html, simple html, plain text) works fine.
It crashed once, but several other attempts didn't crash Mozilla. 

 Once again, it's a wonderful work !! Thank you.
Comment on attachment 76358 [details] [diff] [review]
Suggested fix, version 9. As usual.

I set the review bit for the patch, because both ducarroz (Mailnews part) and
akk (content/ part) gave their official R=.

Japanese attachments:
I noticed that I get the same garbage in the normal mode (in most cases). When
I switch to Autodetect Japanese, all is fine, in all modes :). I'll remove the
comment.

I found a crash for mails with Japanese HTML attachments, which I fixed in my
tree: In parse_eof:
-  if (!textHTMLSan || !textHTMLSan->complete_buffer)
+  if (obj->closed_p || !textHTMLSan || !textHTMLSan->complete_buffer)
    return -1;

I added lots of #ifdef DEBUG_BenB \n printf.... (note: DEBUG_BenB, not DEBUG)
to find it and would like to let them in, for later debugging. I hope that
nobody minds, otherwise I'll remove it.

> I think the save as .html should use the existing code. Is it expected to be
> saved differently with the new mode?

I don't know. I think I'd expect "WYSIWYG" for save as html, i.e. the view mode
changes what is being saved. But I am not sure. I'd certainly like to avoid
special code in  the new classes :).

Currently, it just happens that SaveAsFile .html invokes the mode selected for
viewing.

> Hmm, my build(the whole source tree was updated a few days ago...) with your
> patch doesn't crash

Yes, it's some time ago when I wrote that, possible that the print code has
been fixed in the meantime.
Attachment #76358 - Flags: review+
How do I apply the patch on window? I always get a bunch of patch malformated
errors! do I need to use a magic option?
ok, I have a window build, I'll push it on a public web site soon...
This message when viewed as plain text appears empty. The sanitized version
seems ok. Ben, can you look why the body it's empty when viewing it as plain
text!

I have tried other multibyte messages and some of them appears in plain text
some other not!
The text/plain alternative inside that message is deliberately empty.
J-F, yes, as Vadim said, the message is grossly broken. multipart/alternative is
called that way, because it is up to the recieving UA to chose, which part to
display. I.e. they have to have the same content, just in different formats. If
they are sending us an empty message, we display nothing.

You can work around it by setting
user_pref("mailnews.display.prefer_plaintext", false);
user_pref("mailnews.display.html_as", 1);
see comment 37.
There is no UI for it, but the UI will degrade nicely when it's used, and
chosing a UI option will overwrite the custom setting.
Question: 
Does the new option affect reply quoting?
For multipart/alternative with plain and html parts, which part is used for
reply quoting?

BTW: I have yet to see or hear about such a broken m/a mail (plaintext
alternative empty) that is *not* commercial bulk mail.

> Does the new option affect reply quoting?

Yes. And it should IMO, both for consistency and for security (not sure how much
is active in the composer and I don't want to bet on it).

That option being on would have prevented e.g. the bug where we forwarded JS
that snoops (internal) replies to a mail (at least in other MUAs).

> For multipart/alternative with plain and html parts, which part is used for
> reply quoting?

The same one you see.
There is per account setting for composing as plain or html.
So, for quoting, the view option supercedes the compose option (regarding the
way of selecting the data to quote)?
If the user sets view as plain and compose as html then the plain part is quoted
into the html compose window as pre formatted. Is that correct?
> If the user sets view as plain and compose as html then the plain part is
> quoted into the html compose window as pre formatted.

Yes, but not necessarily "preformatted" (in a <pre> tag), but the same way we
currently quote a plaintext-only mail in the HTML composer.
I generated a Window debug build base on this morning build which include the
patch for this bug.

from inside Netscape firewall:
http://peoplestage.netscape.com/ducarroz/mozilla/Mozilla-30888.zip

from outside, available starting from tomorrow early morning:
http://people.netscape.com/ducarroz/mozilla/Mozilla-30888.zip
I generated a Window debug build base on this morning build which include the
patch for this bug.

from inside Netscape firewall:
http://peoplestage.netscape.com/ducarroz/mozilla/Mozilla-30888.zip

from outside, available starting from tomorrow early morning:
http://people.netscape.com/ducarroz/mozilla/Mozilla-30888.zip
Add self to cc list.

>The content comes directly from an integer pref. Because the options are also
>cumulative (e.g. avoid_html is true for allow_classes = 1, 2, 3 or 100), I think
>that the current solution is better readable.

We don't have enum prefs, so I don't quite see how that's relevant. We file out
lots of things with integer prefs that we deal with internally as enums or
defines. 1,2,3 is more readable than constants or defines? That sounds like
someone who wrote the code, not like someone who might have to maintain it :-)

>+           /* The latter 2 classes bear some risk, because they use the Gecko
>+              HTML parser, but the user has the option to make an explicit
>+              choice in this case, via html_as. */
>this comment is referring to mimeInlineTextHTMLSanitizedClass and
>mimeInlineTextHTMLAsPlaintextClass? "latter 2" is not as clear as it could be.

>It refers to both. I don't see what is unclear or how I could make it clearer
>without repeating the classnames.

How about "the previous 2 classes"? Or moving the comment up and saying "the
next 2 classes" Either would be clearer.

I suspect that there's some misunderstanding on either my or bienvenu's side, so
I mailed him at 30. (trying to save some spam). Somebody said, I just attach it
here nevertheless, so I'll do.

David,

>http://bugzilla.mozilla.org/show_bug.cgi?id=30888
>
>bienvenu@netscape.com changed:
>
>1,2,3 is more readable than constants or defines?
>
The relevant code is:

+  PRInt32 allow_classes = 0;  /* Let only a few libmime classes
+       process incoming data. This protects from bugs (e.g. buffer overflows)
+       and from security loopholes (e.g. allowing unchecked HTML in some
+       obscure classes, although the user has html_as > 0).
+       This option is mainly for the UI of html_as.
+       0 = allow all available classes
+       1 = Use hardcoded blacklist to avoid rendering (incoming) HTML
+       2 = ... and images
+       3 = ... and some other uncommon content types
+       100 = Use hardcoded whitelist to avoid even more bugs(buffer overflows).
+           This mode will limit the features available (e.g. uncommon
+           attachment types and inline images) and is for paranoid users.
+       */

and later (in the processing; a bit earlier in the source file):

+  if (allow_classes == 0)
+    return PR_TRUE;
+  PRBool avoid_html = (allow_classes >= 1);
+  PRBool avoid_images = (allow_classes >= 2);
+  PRBool avoid_strange_content = (allow_classes >= 3);
+  PRBool avoid_bugs = (allow_classes == 100);


Don't the variables and the comment make the meaning clear already?

How should I call the enum fields?
enum avoid{ nothing = 0, avoid_html_only = 1, avoid_html_and_images = 2, 
avoid_html_and_images_and_other_strange_content = 3, use_whitelist = 100};
?

>How about "the previous 2 classes"?
>
OK.
the fact that this has a ui, but doesn't work for things like save as html, or
some i18n messages/charsets scares me a lot. Is that an accurate assessment from
the comments I see in the code? 

I've looked over all this, but I don't have the energy to deal with the whole sr
(I've been fighting a cold, and losing), so I'm just going to sr the mail/news
stuff, and you can get someone else to sr the other modules.

A lot of these comments have to do with readability, which is a pre-requisite
for maintainability - I know it's sometimes hard for the author of the code to
see that some things are confusing since he or she is so involved with it, but
bear with me.

the variable name "allow_classes" - that's an uninformative name, if I
understand the code correctly. A better name would be
"types_of_classes_to_allow, or even more accurately,
"types_of_classes_to_disallow". Does the pref "allow_processors" have the same
meaning? It's a bit of a confusing name - if I set it to 0, does that mean no
processors are allowed? That's not the feeling I get from the code.

avoid_bugs would perhaps be better called "allow_only_vanilla_classes"


+              clazz == (MimeObjectClass *)&mimeInlineTextHTMLClass
+                         /* Should not happen - we protect against that below.
+                            Still for safety... */
I don't understand this comment - I don't see a check for
mimeInlineTextHTMLClass in the code below. Perhaps you mean in the calling code,
mime_find_class()?


Why are we using naked ints here instead of PRInt32?:

+HTML2Plaintext(const nsString& inString, nsString& outString,
+               int flags, int wrapCol);
+nsresult
+HTMLSanitize(const nsString& inString, nsString& outString,
+             int flags, const nsAString& allowedTags);


Please don't use PR_ASSERT - it aborts the app in debug windows builds - use
NS_ASSERTION instead. I don't care that existing mime code uses it - they should
all be converted to NS_ASSERTION.

I believe these flags are defined as PRUint32, not int - at least the code in
nsPlaintextSerializer uses them that way
+  int flags = nsIDocumentEncoder::OutputFormatted
+    | nsIDocumentEncoder::OutputWrap
+    | nsIDocumentEncoder::OutputFormatFlowed
+    | nsIDocumentEncoder::OutputLFLineBreak
+    | nsIDocumentEncoder::OutputNoScriptContent
+    | nsIDocumentEncoder::OutputNoFramesContent
+    | nsIDocumentEncoder::OutputBodyOnly;


+  if (content_type)
+  {
+    char* charset = MimeHeaders_get_parameter(content_type,
+                                              HEADER_PARM_CHARSET,
+                                              NULL, NULL);
+    PR_DELETE(content_type);

shouldn't this be PR_FREE(content_type) and PR_FREE(charset)? We don't care
about nulling out these local variables after we free them since no one uses them.


How can you remove this null cp check? Is it hurting something to leave it in?
@@ -180,8 +180,6 @@
         (cp = PL_strncasestr(cp, "CHARSET=", length - (int)(cp - line))) 
         ) 
     {
-      if (cp)
-      {
> doesn't work for things like save as html

Save as HTML in the new modes doesn't work, that's correct, IIRC. Considering
that some argued to remove that feature altogether (i.e. it is not considered an
important feature), I don't worry too much about it for now.

> or some i18n messages/charsets scares me a lot.
> Is that an accurate assessment from the comments I see in the code?

Please see comment 138.

> (I've been fighting a cold, and losing)

Oh, that explains it. Thanks for responding nevertheless. I hope, you'll be
healty again soon.

> so I'm just going to sr the mail/news stuff

Thanks, that's all wanted :). I already asked jst to superreview the
content/ part.

> A lot of these comments have to do with readability, which is a pre-requisite
> for maintainability

Full agreement :).

> the variable name "allow_classes" - that's an uninformative name, if I
> understand the code correctly. A better name would be
> "types_of_classes_to_allow, or even more accurately,
> "types_of_classes_to_disallow"

I see what you mean, I will change it.

Does that take care of your above concerns?

> Does the pref "allow_processors" have the same meaning?

Yes. (I called it "processors", not "classes", as even power users probably
don't know, what "classes" means.)

> It's a bit of a confusing name - if I set it to 0, does that mean no
> processors are allowed? That's not the feeling I get from the code.

"disallow_processor_types"?

> avoid_bugs would perhaps be better called "allow_only_vanilla_classes"

OK.

> Perhaps you mean in the calling code, mime_find_class()?

Yes, I will change "below" to "in mime_find_class()".

> Why are we using naked ints here instead of PRInt32?

I copied that code from test code. I will change it to appropriate types
(whatever XPIDL generates).

> Please don't use PR_ASSERT - it aborts the app in debug windows builds - use
> NS_ASSERTION instead.

OK.

> I believe these flags are defined as PRUint32, not int

Might be, same as above.

> +    PR_DELETE(content_type);
>
> shouldn't this be PR_FREE(content_type) and PR_FREE(charset)?

Will change.

> How can you remove this null cp check? Is it hurting something to leave
> it in?

See discussion above. We already do check against that right before, in:

if ([...]
    (cp = PL_strncasestr(cp, "CHARSET=", length - (int)(cp - line))) 
   )
{
If Save as HTML in the new modes doesn't work, it should be disabled in the new
modes. Or fixed - we don't want to knowingly check in code that breaks something
at this point. People will get confused, file bugs, etc.

>> It's a bit of a confusing name - if I set it to 0, does that mean no
>> processors are allowed? That's not the feeling I get from the code.

>"disallow_processor_types"?
I think disallow_mime_class_handlers or something like that would be better.
"processors" means CPU's to me. Does "disallow_mime_class_handlers" capture the
sense of what's going on? I think anyone smart enough to want to fool with this
will understand about mime classes/types.

I see now why you removed the cp check, thx.
Regarding David's comments about the items that aren't working, we need to
disable UI for anything not working and hook up once working.   Are there
dependent bugs to have those items working for the Mozilla 1.0 where this
feature will be checked in (once approved)?   cc msanz/andreas since it mentions
i18n items not working in david's comment 153.

I agree with lchiang.  We should get the UI in the right state before checking in.  

I think this is an interesting feature and if it had landed earlier, I would be
much more in favor of taking this.  I think that taking this now for either
Mozilla 1.0 or Mach V is risky.  I think this should be landed as part of
1.1alpha and given time to bake on the trunk.
I think I can make SaveAs save the original HTML in all modes.
- In NS_NewSanitizingHTMLSerializer(), no need for the QI, a simple assignment
to the out param and an NS_ADDREF(*aSerializer) is what you want in stead.

- In mozSanitizingHTMLSerializer::GetParserService(), no need for the QI there,
assign and NS_ADDREF(). And no need for the .get() on mParserService either.

- Fix the indentation at:

+mozSanitizingHTMLSerializer::AppendText(nsIDOMText* aText, 
+                                  PRInt32 aStartOffset,
+                                  PRInt32 aEndOffset, 
+                                  nsAString& aStr)

- In mozSanitizingHTMLSerializer::SetDocumentCharset():

+  // No idea, if this works - it isn't invoked by |TestOutput|.
+  nsAutoString charsetline(NS_LITERAL_STRING(
+        "\n<meta http-equiv=\"Context-Type\" content=\"text/html; charset="));
+  charsetline += aCharset;
+  charsetline += NS_LITERAL_STRING("\">\n");
+  Write(charsetline);
+  return NS_OK;

You could avoid a string copy and nsAutoString if you'd use
NS_NAMED_LITERAL_STRING() for the first literal here, and then do:

  Write(the_literal + aCharset + NS_LITERAL_STRING(...));

- In mozSanitizingHTMLSerializer::DoOpenContainer():

+    Write(nsAutoString(tag_name));

Where tag_name is a const PRUnichar *, use nsDependentString() in stead of
nsAutoString, *much* cheaper.

- Hmm:

+            Write(NS_LITERAL_STRING("="));
+            Write(NS_LITERAL_STRING("\"") + value + NS_LITERAL_STRING("\""));

How about combining the "=" and "\"" strings into "=\"" and loosing the first
Write() call?

- In mozSanitizingHTMLSerializer::DoCloseContainer():

+    Write(nsAutoString(tag_name));

Again, use nsDependentString().

- In mozSanitizingHTMLSerializer::DoAddLeaf():

+    Write(NS_LITERAL_STRING("&"));
+    Write(aText);

Make  that one Write() call and use the + operator.

sr=jst if you fix all that.

I don't mean any disrespect to anyone here, but I've gotta say, I don't really
see or understand the value added by this in mozilla. But if mozilla.org thinks
this they want it, fine with me.
Thanks for the sr jst!

I addressed all of jst's comments. Some comments:

> In NS_NewSanitizingHTMLSerializer() [...]
> In mozSanitizingHTMLSerializer::GetParserService()

These were copied almost verbatim from nsPlaintextSerializer. Changed it here
to jst's suggestions nevertheless.

> You could avoid a string copy and nsAutoString if you'd use
> NS_NAMED_LITERAL_STRING() for the first literal here, and then do:
> Write(the_literal + aCharset + NS_LITERAL_STRING(...));

I used NS_LITERAL_STRING and wrote it directly into the Write() expression.

> +    Write(NS_LITERAL_STRING("&"));
> +    Write(aText);
> Make	that one Write() call and use the + operator.

I can't. aText is a nsAString& from the caller. In a similar case higher in the

same file, I got an infinitive loop with the + operator. (nsAutoString is known

to work, so no problem in most cases, but here, we don't know, which
implementation we get.)
I added a comment to that extend.

> sr=jst if you fix all that.

I assume that all that is OK with jst and I mark it sr.

> I don't really see or understand the value added by this in mozilla.

It provides a useable workaround for
- bug 28327 (which - for me - is the top usability problem of Mailnews)
- bug 22994 (see <http://news.com.com/2100-1023-875992.html>, yesterday's
  CNET News.com top story)
- bug 109249
- bug 18427
- several other known bugs
- yet unknown security bugs.


Will attach the mailnews part later.
Attachment #76358 - Attachment is obsolete: true
Attachment #76361 - Attachment is obsolete: true
*** Bug 40572 has been marked as a duplicate of this bug. ***
I addressed all of bienvenu's comments. New patch attached.

SaveAsFile|.html now saves the Original HTML, no matter which mode is being
selected for display.

But that isn't what took so long. I was trying to fix some bugs I found, but
failed. :-(
I noticed that the "text/html" without charset problem mentioned earlier
reappeared after preventing the crash in JPN attachments. Several days, a few
major changes (which I all discarded eventually) later and with a lot of hairs
less, I ended up changing almost nothing apart from a few comments and a few
safety checks. I couldn't work around the root problem in all cases at once, so
"text/html" msgs without charset (I think those msgs are invalid and usually
bulk mail) still won't display. Somebody will need to fix bug 126887.
Also, I am having a problem with m/r. ducarroz said, it works for him. No idea,
what's going on. Filed bug 137027.
I have no time to dig further into the depths of libmime, so I give up. I'd
need external help with those problems.

Summary of outstanding known problems:
- bug 126887
- bug 137027
- bug 125983
- bug 122876 (minor)
None of them impact non-users of the new prefs.

I'd like to check the current state into the trunk, bake it there for 2 or 3
calendar days and then check it into the mozilla1.0 branch, so it gets into
RC1.
Attachment #74098 - Attachment is obsolete: true
Attachment #77862 - Flags: superreview+
Attachment #77862 - Flags: review+
>"text/html" msgs without charset (I think those msgs are invalid and usually
>bulk mail) still won't display.
Is this specific to the new mode only, or view as html is also affected by the
patch?
> Is this specific to the new mode only

Yes. The normal Original HTML mode, the one that's used in current builds, is
not at all influenced by the new code, apart from pref-checking, of course.
Comment on attachment 78837 [details] [diff] [review]
Suggested Fix, version 11, Mailnews part, diff -uwN

please fix this spelling : syncronously -> synchronously

+#ifdef DEBUG_BenB
+printf(" E4\n");
+#endif

do we still need all of these?

sr=bienvenu
Attachment #78837 - Flags: superreview+
Thanks for the sr. Do I need another r=?

> do we still need all of these?

Unfortunately, yes, I need them to debug the odd call sequence that causes some
the remaining bugs. I know that they decrease readability, and I will remove
them when I don't need them anymore.
no, you don't need another r= for the mail part.
Fix checked into trunk. Yay, finally!

Thanks to all reviewers and all others involved.

This is not yet in the 1.0 branch. I still would like to check it into the
branch. If not, it would not be in any Mozilla 1.0 releases, but only trunk
nightlies and 1.1alpha.

I would like to revisit the UI wording later.

Please test the feature and try to circumvent it, i.e. get arbitary HTML
rendered although the new options are active. If so, please file a new bug
against Mailnews/MIME, owner me, and mention it here.

Some documentation can be found at <http://www.beonex.com/temp/index.html>. When
I have ftp access to my private server again, I will move it to
<http://www.bucksch.com/1/projects/mozilla/108153>.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Ben - I noticed you haven't asked for branch checkin yet.  Please do so after RC1
I am glad this get commited into the tree.
Several outlook users have asked me whether mozilla has this feature so I
suggest advertising this in the release notes when done.
Does outlook [express?] have this feature?
First I must say that this feature is _magnificent_. The first message I tried
it on (the one I attached here) just didn't show anything when I chose "Simple
HTML". Is there a reason for this? 

Another minor thing is that the plain text menu choice has double "as" ("View
Message Body as As Plain Text").
Filed bug 138125.
Looked over the content (and some of the mailnews) portion of the patch. 
Overall, I'm leaning towards RC2 approval.  1 nit:

I noted the hard-coded 1000-character constant for attribute length.  Why?  I
saw the justification, and I'm not sure I agree.  What limits do we use
elsewhere for attributes, if any?  If they exist, can we use the same limits? 
Could the limit in this code be eliminated?
Drivers want to see more baking/testing of this bug on trunk before considering
it for branch checkin.  Anyone interested in this bug please look for bugs or
possible regressions associated with it on trunk.
4th message of intl mail smoketest is blank when viewing as plain or simple HTML.
http://www.mozilla.org/quality/intl/tests/data/mailsmoketest.zip
Windows 98SE Build 2002041803

Attached pics will not display inline no matter what View setting is in "View 
Message Body as"
Jay, your problem is probably not releated to this feature, please file a new
bug and provide a test message. Thanks
Verified working on 2002041808 OS/2 trunk. Fabulous feature.
see bug 138333 for possible problem.
Status update (partly for drivers):
I had no major bug reports apart from the dreaded occasional blank msg display
for some messages (see bugs referenced above; workaround: switch to Original
HTML mode). I don't know how to fix that without help.
Otherwise, the feedback was very positive.
I don't know, if it should be checked into the branch. Many people said they
hope it will get in. I leave it up to drivers to decide, if it should go into
1.0 with a release-note or should wait for 1.1.
Whiteboard: Done, waiting for review → Checked into trunk
Status update:
I mailed drivers, askingif it should go in. I got differing opinions, no final
decision. That was about a week ago.
Keywords: patch, review
Whiteboard: Checked into trunk → Checked into trunk. Should go in 1.0?
Ben - do you know if encrypted mail works as well?  For example, if the user
chooses "View as plaintext" on encrypted mail, they will get the expected
behavior for this feature. (I'm not sure if encrypted mail gets treated
differently than non-encrypted once it's decrypted.)
> Ben - do you know if encrypted mail works as well?

No, I have never used S/MIME in Mozilla. But it should work.
I hope for 1.0, but doubt it since we're already at fc3.
In 1.0?
The keyword is "mozilla1.0+"
but now we are at RC3 (the last release 
candidate)...
Someone know?
*** Bug 148330 has been marked as a duplicate of this bug. ***
It won't be in 1.0.
too bad but at least we will be able to block remote images (bug 97055) and
block plugins (bug 147877) in email
Beonex Communicator 0.8-stable <http://www.beonex.com/communicator>, based on
Mozilla 1.0 final, contains this bugfix.
I can find this feature in 1.1a, but not in trunk! Why?! It was marked as
"checked into trunk" already?
Another question is whether it'll be checked into 1.0.1 branch?
Calvin: I'm running 2002062504 TRUNK here (on Win2k), and I see the feature :-/.
(View -> Message Body As -> Plain Text)

That doesn't appear for you?
Win98SE Mozilla 2002062508

Appears here and for every trunk previously for as long as I can remember back
when it was first included.
For solaris. I can't find the menu on nightly build 0624. And my own build of
trunk is always crash when switch between the formats. But 1.1a is ok.
I'll get a brand new trunk and try again.
“You can now convert all your e-mail to plain text,” [using Outlook 11]
Microsoft’s Marks said, another way of thwarting unwanted e-mail cookies and Web
beacons.
<http://www.msnbc.com/news/829223.asp?0si=&cp1=1>
Same feature available in MSOE comign with MSIE6SP1.

Excuse me, but: MUAHAHAHAHAHA!!!!! :-)
qa contact change
QA Contact: asa → esther
Using trunk builds 20021206 on winxp, macosx and linux we do have the menu item
to View messages as plain text.  It works and I checked the case where the
message has been encrypted.  After giving the cert password for encryption the
HTML message displays as plain text.  Also, I tested an HTML message being
viewed as plain text when saved as html is saved as html even though it's being
viewed in plain text.  Verified.  If someone has a different scenario that
doesn't work please log a new bug with the scenario and reference this one.
Status: RESOLVED → VERIFIED
The bug is fixed, but when the preferences are set to plain text, there is
nothing to indicate that their is HTML content present.

The HTML section should show up in the attachment box, or there should be some
other indication of it's presence.
Ben,

What adding another options which further restricts the classes which are allowed?

In PRBool mime_is_allowed_class(const MimeObjectClass *clazz,...
I am concerned about the following:
clazz == (MimeObjectClass *)&mimeMultipartAppleDoubleClass ||
clazz == (MimeObjectClass *)&mimeMessageClass ||
clazz == (MimeObjectClass *)&mimeInlineTextHTMLSanitizedClass ||
clazz == (MimeObjectClass *)&mimeInlineTextHTMLAsPlaintextClass ||
Especially AppleDouble class - isn't it kind.

Compatibility with existing behavior will still be maintained.

Georgi, I filed bug 191928 about the question you raised and answered you there.
*** Bug 119266 has been marked as a duplicate of this bug. ***
*** Bug 178548 has been marked as a duplicate of this bug. ***
Product: MailNews → Core
*** Bug 121853 has been marked as a duplicate of this bug. ***
Product: Core → MailNews Core
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: