Closed Bug 44439 Opened 24 years ago Closed 24 years ago

Support headers

Categories: Core :: DOM: Serializers, defect, P3
Status: VERIFIED FIXED
People: Reporter: BenB, Assigned: BenB
Whiteboard: Fixed.
Attachments: 10 files

Reproduce:
1. Load the testcase in the editor
2. Debug|OutputText

Actual result:
<<h1

foo h2

foo h1

foo h2

foo h3

foo h3

foo h4

foo h5

foo h6

foo
>>

Expected result:
Something more intelligent, e.g. Lynx's output:
bash$ lynx -dump header.html

                                      h1
                                       
   foo
   
h2

   foo
   
                                      h1
                                       
   foo
   
h2

   foo
   
  h3
  
   foo
   
  h3
  
   foo
   
    h4
    
   foo
   
      h5
      
   foo
   
        h6
        
   foo


Or something with numbers, like


  1. h1

foo

  1.1 h2

foo

  2. h1

foo

  2.1 h2

foo

  2.1.1 h3

foo

  2.1.2 h3

foo

  2.1.2.1. h4

foo

  2.1.2.1.1. h5

foo

  2.1.2.1.1.1. h6

foo


etc. Since we output normal text at column 0, I tend towards the numbering
scheme. Suggestions welcome, but fast, please :).
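
A minimal sketch of how such numbers could be derived from the header levels alone, written in Python rather than in the converter's own code. The function name `number_headers` and the trailing dot on every number are assumptions for illustration (the mock-up above mixes "1.1" and "2.1.2.1."); this is not the patch attached to this bug.

```python
def number_headers(levels):
    """Return one section-number string per header, in document order.

    levels: iterable of ints in 1..6, standing for h1..h6.
    """
    counters = [0] * 6                      # one counter per level h1..h6
    numbers = []
    for level in levels:
        counters[level - 1] += 1            # advance the counter of this level
        for deeper in range(level, 6):      # a new header resets all deeper levels
            counters[deeper] = 0
        parts = counters[:level]
        numbers.append(".".join(str(n) for n in parts) + ".")
    return numbers
```

For the simple testcase, `number_headers([1, 2, 1, 2, 3, 3, 4, 5, 6])` yields 1., 1.1., 2., 2.1., 2.1.1., 2.1.2., 2.1.2.1., 2.1.2.1.1., 2.1.2.1.1.1. A skipped level (h1 directly followed by h3) would show up as a 0 component in this sketch.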
Status: NEW → ASSIGNED
Target Milestone: --- → M17
Attached file Simple testcase
No proposed rendering in the HTML 4.0 spec, but
<quote src="http://www.w3.org/TR/REC-html40/struct/global.html#h-7.5.5">
HTML does not itself cause section numbers to be generated from headings. This
facility may be offered by user agents, however.
</quote>
Fixed. Used the numbering version.

akk, can you review, please?
Keywords: patch, review
Whiteboard: Fixed.
Attached patch Fix, version 1
Summary: Support headers better → Support headers
I'm not sure that numbering headers is the best approach, since it inserts
content with a meaning that the author might not have intended.

Imagine the following:
----------
A sends an HTML mail, with headers converted to plain text, to B. The mail gets
numbered headers.

B answers. "I agree with you. Especially 1.1.3 was very interesting". 

This will confuse A very much since he didn't send anything with "1.1.3". 
------------

Or consider mailing any web page where the author has used <h#> tags for
formatting rather than for logically organizing the content. Then our numbering
could be all wrong with respect to the content of the mail.

I think your other proposal, using indentation, is better. With wrapped text,
it should even be simple to center a header. (Left as an exercise for the
interested reader).
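
As a rough illustration of the centering idea (Python, not converter code; the name `center_header` and the default wrap width of 72 columns are hypothetical choices, not values taken from this bug):

```python
def center_header(text, wrap_width=72):
    """Pad text on the left so it sits in the middle of a wrap_width line."""
    text = text.strip()
    if len(text) >= wrap_width:
        return text                      # too long to center; leave at column 0
    left_pad = (wrap_width - len(text)) // 2
    return " " * left_pad + text         # left padding only, no trailing spaces
```

This only works when the wrap width is known at conversion time, which is why it is tied to the "wrapping text" case above.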
Attached file A harder testcase
Daniel,

the problem with the lynx style is that it does not scale. Lynx output of a
harder testcase:

                                      foo
                                       
   foo
   
foo

   foo
   
                                      foo
                                       
   foo
   
foo

   foo
   
  foo
  
   foo
   
  foo
  
     foo
     
    foo
    
   foo
          foo
          
   foo
          foo
          
      foo
      
   foo
   
        foo
        
   foo

The structure is completely unreadable. And this example is not even
far-fetched; I write such documents. OK, not quite that intense, and you can use
the content to guess the structure, but that defeats the purpose. Structure
should be clearly and unambiguously *visible*.

As we write normal text at column 0, we have to indent headers more. This would
look like:

                                      foo
                                       
foo
   
  foo

foo
   
                                      foo
                                       
foo
   
  foo

foo
   
    foo
  
foo
   
    foo
  
     foo
     
      foo
    
foo
       foo
          
foo
       foo
          
        foo
      
foo
   
          foo
        
foo

A bit better for normal text plus headers, but even less readable for headers
plus indented content (<blockquote>, <dl> , <ul> etc.).

As for your first example, there are two cases: Send as Plaintext and
multipart/alternative.
In the former case, the sender might have a copy looking the same in his Sent
folder, so he can see the numbering there. Unfortunate, but possible.
In the second case, the recipient uses a non-HTML-compliant mailer. He should be
used to multipart msgs and know that he might see different output than the
sender, so he will probably refer to the header name, not the number.

As for your second example, I was aware of that. If people abuse HTML, there is
not much we can do. The problem with abusing HTML is *exactly* that it won't
work well with configurations different from the author's. If authors become
aware of that: fine with me :). Note that the Composer says "Heading 1" etc. in
the UI, not "bigger" or so. I will add a comment to the askSendFormat dialog that
the plaintext version might look different from what the author saw in the
composer.

The HTML 4.0 spec authors were well aware of abuse of HTML. Nevertheless, they
explicitly allow numbering of headers.
Ah, sample output of my implementation:

  1. foo
  
foo


  1.1. foo
  
foo


  2. foo
  
foo


  2.1. foo
  
foo


  2.1.1. foo
  
foo


  2.1.2. foo
  
    foo


  2.1.2.1. foo
  
foo

foo

foo

foo


  2.1.2.1.1. foo
  
foo


  2.1.2.1.1.1. foo
  
foo

(We don't support <dl> yet.)

Note that my implementation is done. I don't really want to dump it and make
correct HTML unreadable just to make illegal HTML more readable.
I saw your implementation and it looked good but IMHO this doesn't improve 
Mozilla. That W3C allows a user agent to number headers is not enough reason to 
do it.

Numbered headers can be useful, but it's a feature that should be controlled by
the author. In this case we insert numbers with no feedback to the sender until
the mail is sent, and not even then if the user doesn't look in the Sent
folder. Honestly, how many people check their mail there after it is sent?

Your examples were quite unrealistic, you know. No real text or header consists
solely of the word foo. Normally the text and the headers contribute to the
understanding of the logical structure. Indentation could help that
understanding: a help that is discreet, but it's there, and it won't disturb the
content.

(Another example: what happens if someone has already numbered the headers
manually before the mail is sent?)
> Your examples were quite unrealistic you know.

I filed and fixed this bug exactly because 4.x doesn't convert headers well and
the plain text versions of my texts were not easily parsable/understandable.* I
often have paragraphs < 1 line and I happily use headers, lists etc. in the
wildest combinations. In mail/news. So, it was a drastic example, but it has a
valid, realistic and severe point.

> Normally the text and the headers contribute to the 
> understanding of the logical structure.

I intentionally include and rely on structure in my composition. You will have a
hard time understanding my texts if the structural information is not preserved.


*This gives me 3 options:
- Directly composing in plain text (including manual wrapping for lists)
- Writing the HTML version with the plain text version in mind and crippling the
HTML version by doing so. (This requires detailed knowledge about the HTML->TXT
converter - something only power users have.)
- Not caring about the plain text version.
Surely, none of these are acceptable.
I would not want the output system numbering headers which did not appear
numbered in the html.  That doesn't make sense at all.  If I wanted it numbered,
I'd use something like a numbered list -- or I'd just put in numbers and make
them part of the header.

I agree that the old code isn't working -- the text and the header shouldn't
appear on the same line, we should have a newline, at least -- but adding
numbers doesn't seem right.

As an alternate suggestion, we could add an identifier like "*" (since html
headers usually appear bold by default), or perhaps something new, like "<<" and
">>", around headers when converting to formatted plaintext, so <h1>foo</h1>
would look like <<foo>>.  You could even add more of them depending on the
level, e.g. <foo> is an h4, <<<<foo>>>> is an h1, etc.
> I would not want the output system numbering headers which did not appear
> numbered in the html.  That doesn't make sense at all.

It does make sense. Both the HTML spec authors and I came up with the same
proposal.

It might not make sense for *your* documents, which is still a legitimate
argument. Can you explain why you think this makes no sense? The two reasons
Daniel gave were valid, but IMO no reason to drop numbers.

BTW: I just added a clear statement to the askSendFormat dialog: "[...] the
plaintext version might look different from what you saw in the composer".
Suggestions for rephrasing welcome.

> If I wanted it
> numbered, I'd use something like a numbered list -- or I'd just put in numbers
> and make them part of the header.

See <http://www.bucksch.org/1/projects/mozilla/31906> rendered with that patch.
Although I do use numbers for headers in one section (creating 2 numbers for
each of those headers), it still looks better than rendered by lynx.

> You could even add more of them depending on the
> level, e.g. <foo> is an h4, <<<<foo>>>> is an h1, etc.

This is not obvious. This was exactly the problem which led me to my proposal:
making the hierarchy obvious (without looking at the content). This is a
*requirement*.
The point of doing formatted plaintext is to make the plaintext output look as
much as possible as the html that produced it, so that we're as close to wysiwyg
as possible.
IMO wrong. Impossible. How do you show bold in plaintext? Font sizes? You will
lose a lot of information if you try to emulate the look of a graphical
display.

The goal should be to carry over as much *information* as possible and output it
in a way as if a human had written it directly in plaintext.

And RFCs, a good example of formatted plaintext, use numbers for headers.

Maybe we should discuss this in a newsgroup?
Yes, bring it up in a newsgroup (mailnews, certainly, and probably crosspost to
editor).  If a majority of people say they want numbers added to their headers,
I'll go along, though I still won't like it myself (I'll probably stop using
headers and use bold instead, which isn't a big deal).
Posted to .mail-news and .editor.
Attached patch Fix, version 2
I gave in and implemented a more Lynx-like rendering method. I output 2 lines
before and 2 after the header and indent the header text 2x columns for h(x).
I.e. I decided not to indent h1 for now, because the current implementation
- is more consistent across header levels
- reduces the risk of confusion between h1 and h6
- is easier to implement

I did *not* insert any new characters like e.g. an underline. So, the current
implementation bears the risk of confusion between indentation and headers (as
already pointed out). Hopefully, that is not that bad in practice, since we
output 2 lines before the header; it will only be a problem if either the user
or the editor inserts two lines before a normal indentation.
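
The indentation rule can be sketched roughly as follows. This is a Python illustration, not the patch itself: `render_indented` and its (level, text) input format are hypothetical, and the blank-line counts are parameters because a follow-up comment in this bug corrects "2 after" to "1 after".

```python
def render_indented(blocks, blank_before=2, blank_after=1):
    """Render (level, text) pairs as plain text.

    level 0 means body text at column 0; level x in 1..6 means an h(x)
    header, indented 2*x columns and set off by blank lines.
    """
    lines = []
    for level, text in blocks:
        if level == 0:
            lines.append(text)                     # body text stays at column 0
            continue
        if lines:                                  # no blank lines at the very top
            lines.extend([""] * blank_before)
        lines.append(" " * (2 * level) + text)     # h(x) indented 2*x columns
        lines.extend([""] * blank_after)
    return "\n".join(lines)
```

With the defaults, `render_indented([(1, "h1"), (0, "foo"), (2, "h2"), (0, "foo")])` puts "h1" at column 2 and "h2" at column 4, each followed by one blank line, with two blank lines before "h2".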

I added a pref ("network.converter.html2txt.numbered_headers", default off for
now) to switch to the implementation with numbered headers.

The patch also contains a pref ("network.converter.html2txt.structs", default on
for now) for the output of structured phrases (strong, em, code (new), sub, sup,
b, i, u), i.e. either the 4.x or the previous Mozilla behaviour.

The patch also changes the <img> implementation: We no longer output both the
alt and src (URI) attributes (if present), but the alt, title *or* src
attribute (in decreasing order of preference, depending on what exists).
Attached patch Fix, version 3
akk, can you review now, please?

Example output with numbered headers pref off:

<example content="simple_testcase">
  h1
  
foo


    h2
    
foo


  h1
  
foo


    h2
    
foo


      h3
      
foo


      h3
      
foo


        h4
        
foo


          h5
          
foo


            h6
            
foo
</example>

<example content="harder_testcase">
  foo
  
foo


    foo
    
foo


  foo
  
foo


    foo
    
foo


      foo
      
foo


      foo
      
    foo


        foo
        
foo

foo

foo

foo


          foo
          
foo


            foo
            
foo
</example>
> I output 2 lines before and 2 after the header

s/2 after/1 after

> I.e. I decided not to indent h1 for now

s/indent/center


Note that I output brackets around the alt/title/src attribute of <img> (as I
did before).
I added some support for definition lists (<dl>, <dt>, <dd>), <th> and <q>
(hardcoding western quotation marks for the latter).

Also fixed a bug where the converter gets confused if a normal
<blockquote> is inside a <blockquote type=cite> (or the other way
around, I think), due to a completely broken algorithm (caught this
while reading source).

While reading the HTML 4.0 spec, I noticed that the "alt" attribute is
*required* to be specified in the document and *required* to be
rendered, if the img is not rendered. Interesting.
<quote
src="http://www.w3.org/TR/REC-html40/struct/objects.html#edef-IMG">
User agents must render alternate text when they cannot support images,
they cannot support a certain image type or when they are configured not
to display images.
</quote>
I changed the rendering of <img> again so we don't output anything if the value
for alt is empty (|alt=""|) - I read that somewhere. I still put "["/"]" out
around non-empty alt or title text.
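
A small sketch of the <img> rule as described in this bug: prefer alt, then title, then src; output nothing for an explicitly empty alt; put brackets around the text. It is written in Python for illustration only, `img_to_text` is a hypothetical name, and bracketing a src fallback as well is an assumption based on the earlier "brackets around the alt/title/src attribute" comment.

```python
def img_to_text(attrs):
    """Return the plain-text stand-in for an <img>, given its attributes as a dict."""
    if attrs.get("alt") == "":
        return ""                        # explicit alt="": render nothing at all
    for name in ("alt", "title", "src"): # decreasing order of preference
        value = attrs.get(name)
        if value:                        # first non-empty attribute wins
            return "[" + value + "]"
    return ""                            # no usable attribute
```

For example, `img_to_text({"alt": "logo", "src": "a.png"})` gives "[logo]", while `img_to_text({"alt": "", "title": "t"})` gives an empty string.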

I noticed that unknown tags are considered blocks, i.e. "unknown" inline tags are
rendered with linebreaks. I didn't fix that yet.

Sorry for overloading this patch, but this is the result, if the stuff
lies around for so long.
Attached patch Fix, version 4
Akk, I also uncommented the following code:

// Else make sure we'll separate block level tags,
// even if we're about to leave before doing any other formatting.
// Oddly, I can't find a case where this actually makes any difference.
//else if (IsBlockLevel(type))
//  EnsureVerticalSpace(0);

I *did* see a difference for <dt> and <dl>.

Ah, and I added some inline tags, so that we don't output linebreaks around them,
i.e. hotfixed the default-block problem described above. Note that this problem
exists right now in the tree; it is *not* introduced by the change above.
Attached patch Fix, version 5
Added new mode for akk, Joe et al: No indentation at all. Per akk, removed
"network." from prefs. See default pref diff for details.
Keywords: review → approval
Checked in.

Summary:
Support headers (3 modes), <dd>/<dt>, <q>, <code>, <th>.
Improved <img>, </blockquote>.
Improved "unknown" block and (some) inline tags.
Pref for structured phrases.

Opinion:
Now tables and some other minor improvements, and we have a decent HTML->TXT
converter :).
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Target Milestone: M17 → M18
*** Bug 41952 has been marked as a duplicate of this bug. ***
verified in 9/13 build.
Status: RESOLVED → VERIFIED