Closed Bug 42781 Opened 25 years ago Closed 24 years ago

Invalid tags destroyed/discarded by composer

Categories

(Core :: DOM: Editor, defect, P3)

Tracking

VERIFIED FIXED

People

(Reporter: timeless, Assigned: akkzilla)

Details

(Keywords: testcase, Whiteboard: relnote-user)

Build 2000061518. From bug 41959; test case: http://bugzilla.mozilla.org/showattachment.cgi?attach_id=9848 My understanding is that this should render as [centered]test[/] on a starry background. We have the background working, but the text `test` is missing. Composer shows no text. FWIW, the cursor is centered. <body class="mainpage"><br> </body>
This blocks verification of the testcase from 41959.
Blocks: 41959
This page is invalid. You've provided a <doctype HTML 4.0 strict> but your page is not compliant. To make it compliant, you must do this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Strict//EN" "http://www.w3.org/TR/REC-html40/strict.dtd">
<HTML>
<HEAD>
<TITLE>The Nexus</TITLE>
<LINK REL="STYLESHEET" HREF="http://sucs.swan.ac.uk/~firefury/beta/css/main.css" TYPE="text/css">
</HEAD>
<BODY CLASS="mainpage">
<p>test
</BODY>
</HTML>
Status: NEW → RESOLVED
Closed: 25 years ago
Resolution: --- → INVALID
Netscape 4's composer gladly preserves the following html [it adds random gibberish, but it doesn't delete anything]: <html><body><invalid></body></html> If a mistake is made in writing an html page [e.g. omitting a <p>] the flawed html should survive composer so that it can be fixed. If the answer is obvious, maybe it should be fixed [add, don't remove].
Status: RESOLVED → REOPENED
Component: Parser → Compositor
Keywords: 4xp
OS: Windows 2000 → All
Hardware: PC → All
Resolution: INVALID → ---
Summary: Text lost in <body> tag → Invalid tags destroyed/discarded by composer
change component to editor (was compositor)
Component: Compositor → Editor
The chief issue here is that the doctype specifies STRICT. In Mozilla, documents with a strict <doctype> MUST conform to the w3c STRICT dtd. (This is an essential part of our standards story). The spec is clear about unknown tags -- the user agent is free to do with them as it sees fit. In the strict world, they get dropped because we couldn't possibly predict the intended use of the tag. Marking invalid.
Status: REOPENED → RESOLVED
Closed: 25 years ago → 25 years ago
Resolution: --- → INVALID
Dawn just called this bug to my attention. I just today checked in a fix for bug 38154, which fixes the output system so that if we get an unknown tag like <foo></foo>, we'll preserve it into the output (and strip off the _moz-unknown attribute that got added somewhere along the line). That was an output system bug. But of course, that assumes that we get the tag in the first place. If the parser isn't putting the tag into the content model because the DTD is specified as strict, then the editor and output system won't know anything about it and the fix for 38154 won't help. You might want to check on that bug and see if that was really the bug you were hoping to see fixed, and see if things are working better for you now.
As akkana points out, we only drop tags in the strict/transitional DTDs. HTML 3.2 documents will store unknown tags as containers in the content model.
If someone lies about their document's doctype (as can easily happen with shared documents) we shouldn't punish them by quietly deleting stuff from their document. A much better solution would be to change the doctype.
I really hate to see us drop something just because we don't know what it is. Sounds rather harsh to me.
Since the user agent is free to do with unknown tags as it pleases, why not treat them as unknown containers? Just because you can't predict its intended use, does that mean it should completely be stripped out? Actually, at http://www.w3.org/TR/html4/appendix/notes.html#notes-invalid-docs it's recommended that: "If a user agent encounters an element it does not recognize, it should try to render the element's content.", to facilitate experimentation and interoperability between implementations of various versions of HTML. Extrapolating that to composer, I would interpret it as keeping the unknown element(s). One instance where it helps being able to use custom tags is in templates. For a nice example project, see http://freemarker.sourceforge.net/, which uses custom tags for things like <if foo="bar">some html code</if>. Being able to edit templates in composer (and leaving the custom tags alone) greatly enhances the value of Mozilla. Yes, in the above example, one could leave out the DTD, or specify a 3.2 DTD, but if the rest of the code is 4.0 DTD Strict, and after parsing and generating from the template one does get pure 4.0 DTD Strict html, and taking into account the recommendation quoted above, what would be wrong with being lenient towards these unknown tags, even in 4.0 DTD Strict?
The parser is stripping out invalid elements based on the DTD specified. Marking verified.
Status: RESOLVED → VERIFIED
"Conservative in what you generate, but lenient in what you accept" is an ancient protocol design maxim. Peter Annema makes this point eloquently, and with support from relevant standards. I don't think it's right for this bug and its commentary to be marked invalid. I'm reopening and reassigning for further consideration. /be
Status: VERIFIED → REOPENED
Resolution: INVALID → ---
David, are you willing to adjudicate this conflict? /be
Assignee: rickg → dbaron
Status: REOPENED → NEW
There are serious problems with "Be conservative in what you generate, but lenient in what you accept" when the large majority of those who are generating determine the validity of what they generate by whether it is accepted by a few (or even one) user-agent. This is why more recent web standards (e.g., CSS, XML) have moved towards much stricter error handling rules. However, there certainly is a tradition of lenient handling of HTML and we have some degree of obligation to meet past expectations that future browsers would continue to be lenient. However, I have to say I'm somewhat reluctant to support throwing out character data (as opposed to tags) that are not allowed, although I'm not sure what the parser should do to make them valid.
If the document reads strict it should be treated as such; the content provider has to specifically code in strict. A strict doctype is not the default. The default is transitional. If a content provider states strict, then that is exactly how the document should be processed. If the provider wants a more global, universal rendering of the document, then the default is the appropriate doctype to use. Remember -- it is a conscious decision to state strict. And the spec is pretty clear on how strict should be handled. If the expectation is to allow anything and support anything regardless of the doctype specified, then what is the point behind being standards compliant? If the content provider wants a 4.x level of standards support, then stick with the default.
I'd like to see the spec that is "pretty clear on how strict should be handled." See http://www.w3.org/TR/html4/appendix/notes.html#h-B.1
Adjudication isn't necessary; we're doing the right thing by preserving the content and ignoring the tags.
Trying to guess how to treat unknown tags is dubious at best. We *could* just assume unknown tags are container elements with required end tags, of a type that is appropriate within their parent container. This would allow them to be maintained in the document. If push comes to shove, it's trivial to do this. However, authors that specify strict are explicitly saying "hold my markup to a higher standard". In the XML world, if someone gets their markup wrong the browser simply refuses to display the document (and shows an error message instead). In HTML, the rule is to always show something, but since they've asked us to be strict we should make them comply with the DTD. Dawn: if someone tells us to apply the strict DTD, even by mistake, we should honor their request. They'll bring the document up in the browser and see that it (does or doesn't) render as they expect. No one is lying, and no one is being punished. We're following the spec as we've been asked to do. People will inevitably make mistakes, but that doesn't mean we should throw out the standards.
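[Editorial illustration, not part of the original thread] The two policies argued over in the comments above (drop unknown tags but keep their content, versus treat unknown tags as containers and keep them) can be sketched roughly as follows. This uses Python's standard `html.parser`, not the Mozilla parser; `KNOWN_TAGS`, `UnknownTagFilter` and `filter_markup` are hypothetical names, and the tag set is a tiny stand-in for a real DTD. Comments and doctype declarations are simply not re-emitted by this sketch.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stand-in for the elements a DTD allows.
KNOWN_TAGS = {"html", "head", "title", "body", "p", "br", "b"}


class UnknownTagFilter(HTMLParser):
    """Re-serialize markup under one of two policies for unknown tags."""

    def __init__(self, keep_unknown):
        super().__init__()
        self.keep_unknown = keep_unknown
        self.out = []

    def _emit_tag(self, tag):
        # Known tags are always emitted; unknown ones only if the policy says so.
        return self.keep_unknown or tag in KNOWN_TAGS

    def handle_starttag(self, tag, attrs):
        if self._emit_tag(tag):
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit once, with no separate end tag.
        if self._emit_tag(tag):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._emit_tag(tag):
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Character data survives under both policies.
        self.out.append(escape(data, quote=False))


def filter_markup(markup, keep_unknown):
    parser = UnknownTagFilter(keep_unknown)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


template = '<body><if foo="bar">some html code</if></body>'
print(filter_markup(template, keep_unknown=False))  # tags dropped, text kept
print(filter_markup(template, keep_unknown=True))   # unknown container preserved
```

With `keep_unknown=False` the template-style `<if>` element from the earlier comment loses its tags but not its text; with `keep_unknown=True` the markup round-trips unchanged, which is the behavior an editor would need.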
David: simple -- look at the DTD, look at the content model for the elements specified, it's clear as to what elements are acceptable within any and all allowable elements of the DTD.
beppe: You can say the exact same thing for HTML4 transitional, HTML 3.2, and HTML 2.0.
Suppose I edit someone else's page (which I do often, especially when i'm trying to figure out why moz/nc don't like a page). I want to be able to edit the page. If the first chance i get to alter the source, the page is already destroyed, that's really helpful</sarcasm> [please don't strip that]. It would be really nice if Mozilla Composer told me that the document failed to match its dtd and `would it be ok if the dtd was dropped and the document reparsed?` I also like the idea of being able to use a composer to see problems with a document, often our browser view-source doesn't work, and even if it does there's no guarantee the source is even close to readable in the source window. In netscape4 we had yellow tags for stuff we didn't recognize, if we also had say red tags for stuff we thought was broken (imo this should include the dtd, but apparently the <head> stuff isn't really editable in the composer view.) people could see problems and fix them. If you don't like that red tag stuff i could make it a separate RFE, but i really think that composer should be able to say the dtd isn't valid and offer to handle it some other way. yes I know that people can save a document and then strip the dtd, and then use composer, but that's silly when composer could enable us to do it. Remember, this isn't Browser, this is Composer, we're supposed to let people work on a document, including fixing stuff.
David: and that is the point, the DTD clearly specifies what is acceptable structure within the file, if you wish to alter that structure, then either 1. alter the DTD and publish it, and point to the new 'public' DTD in the doctype, or 2. add additional rules to the file that define the new elements. Simple, basic SGML constructs.
timeless: yes, it would be nice to somehow mark the areas that are bad in the file, especially in Composer. It would also be nice to throw up a dialog that lets the user know that the DTD specified and the elements used are not in sync with each other, giving the user the opportunity to either 1. remove the doctype, or 2. see where the offending elements are located. The file, however, goes through the parser before it reaches us. So, that will take coordination with the parser folks to see if we can work through that issue. I do like the red unknown tag idea, and maybe have the background greyed out or something to show them how far reaching the offending code is. We should certainly look into that, but we won't be able to really investigate this at present. One issue that we need to take into account is that if there is a doctype specified and the user chooses to keep the doctype in, then we should not output any element that is not within the specified DTD. We should only emit valid documents with respect to the specified DTD, and that becomes more and more important as we begin to support other element structures such as xml.
beppe (responding to 10:22 comment): What does the DTD have to do with this? I'm not saying the documents aren't wrong. What specifies the *error handling* behavior? What says error handling for documents with HTML 4.0 Strict DTDs should be different from error handling for documents with HTML 2.0 DTDs? AFAIK, there is absolutely no spec-based argument to make a distinction between HTML 4.0 strict and other DTDs, and you implied above that there was. The only reason we're doing stricter error handling for documents with HTML 4.0 strict DTDs and not other DTDs is because documents with HTML 4.0 strict DTDs are not yet common on the web. We assume that (almost) any strict HTML written on the web will be written considering our behavior. This will help the web in general *if* people move to HTML 4.0 strict that works on Mozilla. If the assumption is wrong that HTML 4.0 strict documents will be written with Mozilla in mind, then perhaps we need to reconsider our behavior. (Right now I think that assumption is close enough that we're OK. But, I'm not so sure that people will write strict HTML for Mozilla.)
David: you're joking, right? The DTD has everything to do with how we handle the structure of the file. That is what is used to determine what gets stripped and what doesn't. Why do you think we have a parser? That is what this topic is about -- the parser stripping out elements. And where do you think the parser gets that set of rules from? Yep, you betcha -- the DTD.
No, I'm not joking. Yes, the rules in the HTML 4 strict DTD are slightly different from other DTDs. However, we're not doing this type of error handling for other DTDs. The DTD does *not* specify error handling behavior. It only describes which documents are valid and which are not.
Oh, I see what you're trying to ask. Rick did try to preserve the Transitional constructs but we had to revert back to quirks mode because of the issue with 4.x Composer placing the doctype. In addition, we had to make a call as to how we could be backwards compatible and still move forward. I believe the decision was that transitional would be more forgiving and allow for the awful page construction that is out there. If transitional did adhere to the letter of the law, then a vast majority of the pages would not parse. So, an unfortunate choice had to be made -- do you break the vast majority of pages or do you provide two levels of adherence? Since transitional is the anything goes support, the strict is left for the true standards compliance support. Doctype statements that are less than the current level should also trigger a dialog informing the user that they are editing a document with an obsoleted DTD reference.
In non-strict mode, the parser adds a _moz-userdefined attribute to tags it doesn't recognize (which the output system then strips out upon output). We could use this to set up a style sheet (similar to the "show all tags" mode the editor already has) which showed "red tag" warnings for tags with this attribute. This should be filed as a separate RFE; cmanske would probably best know how to do this, but may not have time, so it might require help from outside. If we could get the tags in strict mode, or offer a dialog suggesting to the user that the doctype needs to be changed or else we'll throw out nonconforming data, this might solve most of the problems.
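[Editorial illustration, not part of the original thread] A rough sketch of the idea in the comment above: on output, strip the marker attribute the parser added to unrecognized tags, while remembering which tags carried it so that a "red tag" view could highlight them. This uses Python's standard `html.parser` rather than Mozilla's output system; the attribute name `_moz-userdefined` comes from the comment, while `MarkerStripper` and `strip_marker` are hypothetical names.

```python
from html import escape
from html.parser import HTMLParser

MARKER = "_moz-userdefined"  # attribute name as given in the comment above


class MarkerStripper(HTMLParser):
    """Re-serialize markup without the marker attribute, noting flagged tags."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.flagged = []  # tags that carried the marker (candidates for red tags)

    def _open_tag(self, tag, attrs):
        if any(name == MARKER for name, _ in attrs):
            self.flagged.append(tag)
        parts = [tag]
        for name, value in attrs:
            if name == MARKER:
                continue  # internal bookkeeping only; never written to output
            parts.append(name if value is None else f'{name}="{escape(value)}"')
        return "<" + " ".join(parts) + ">"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._open_tag(tag, attrs))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def strip_marker(markup):
    parser = MarkerStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out), parser.flagged


clean, flagged = strip_marker(
    '<body class="mainpage"><invalid _moz-userdefined="">x</invalid></body>'
)
print(clean)
print(flagged)
```

The `flagged` list is where a style sheet or warning UI would hook in; how Composer would actually surface it is left open here, as it is in the thread.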
<offtopic> beppe, wasn't the suggestion with regard to the Composer 4.x compatibility issue, that <!doctype ...> is case sensitive, and Composer 4.x uses the incorrect case, so it wouldn't/shouldn't be treated as Transitional anyway? </offtopic>
Yes, that was mentioned, but that is not how the problem was resolved; setting transitional back to quirks is what happened.
What needs to be done for this bug? Some (random?) thoughts: I think SGML-based HTML is dead. That is, I think it will always be tag soup, and attempts to get people to conform to SGML won't help. Our browser doesn't even accept lots of correct HTML. However, I don't want to force this dismal forecast on others and make it a reality, so I'm willing to accept the Strict DTD's changes, as far as the browser is concerned, as something that *could* help improve the quality of SGML-based HTML on the web. However, for the Editor, these changes cause serious problems. I like Dawn Endico's suggestion of 2000-06-16 16:39 (if there are errors when loading a page into the editor, we should ask the user whether the page should be parsed loosely and the DOCTYPE changed). It seems like a simple solution that wouldn't disturb much else. It requires notification of errors from the parser and some code in the editor to check if there were parser errors. I imagine it can't be too hard for the parser to notice when it's dropping things. Notification of errors would also be a very good thing for authors. It would at least give us a place to point when authors complain that we're dropping markup/data. These errors could be shown in something like the JS console. Should the JS error console be turned into a general error console, or should there be separate ones for different things? I tend to like the idea of one big error console. Is there a console service that allows these errors to be shown easily?
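[Editorial illustration, not part of the original thread] The proposal above (the parser remembers what it would drop, and the editor asks the user whether to re-parse loosely) might look roughly like this. It is a sketch using Python's standard `html.parser`; `STRICT_TAGS`, `ReportingParser`, `load_for_editing` and the `ask_user` callback are all hypothetical names, and the tag set is a tiny stand-in for the strict DTD.

```python
from html.parser import HTMLParser

# Hypothetical, tiny stand-in for the elements the strict DTD allows.
STRICT_TAGS = {"html", "head", "title", "link", "body", "p", "br", "b"}


class ReportingParser(HTMLParser):
    """Records, rather than silently discards, tags outside the allowed set."""

    def __init__(self):
        super().__init__()
        self.errors = []  # (line, column, tag) for each offending start tag

    def handle_starttag(self, tag, attrs):
        if tag not in STRICT_TAGS:
            line, column = self.getpos()
            self.errors.append((line, column, tag))


def load_for_editing(markup, ask_user):
    """Return the parse mode the editor should use for this document.

    ask_user is a callback that receives the error list and returns True
    if the user agrees to parse loosely (and have the doctype changed).
    """
    parser = ReportingParser()
    parser.feed(markup)
    parser.close()
    if parser.errors and ask_user(parser.errors):
        return "loose"
    return "strict"


def prompt(errors):
    # Stand-in for a dialog: report what would be dropped, then accept.
    for line, column, tag in errors:
        print(f"line {line}, column {column}: <{tag}> is not allowed by the doctype")
    return True


print(load_for_editing("<html><body><invalid></body></html>", prompt))
```

The same error list could equally be sent to a console, as the comment suggests; the callback just keeps that decision out of the parsing code.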
For lack of any responses to my previous comment... Since I think any proposed solution to this bug that involves keeping the Strict DTD requires some sort of error notification from the DTD, I'm assigning this to Harish so he can make the parser remember that there were errors (and maybe even what they were).
Assignee: dbaron → harishd
Reasonable:
Dawn Endico 2000-06-16 19:39;
David Baron 2000-07-01 00:11;
Akkana 2000-06-21 13:11
-------
I still stand by: timeless@bemail.org 2000-06-16 01:45; timeless@bemail.org 2000-06-21 10:17
wrt David Baron 2000-07-01 00:11: Yes there are console services, but i'm not sure i like that. I'd prefer red tags that have hints describing why they're red. I agree w/ your conclusion that parser needs to tell editor if the document failed and then allow it to be parsed as loose.
<offtopic> Maybe editor could also use the little widget describing quality of conformance. [I don't remember the bug #] </offtopic>
Strict DTD will not be supported in mozilla. Marking bug INVALID.
Status: NEW → RESOLVED
Closed: 25 years ago → 24 years ago
Resolution: --- → INVALID
Please don't mark this bug as invalid. If parser has fixed its faults in this then feel free to reassign this to Akkana in Editor to be marked as fixed, iirc we currently preserve the junk (something akin to my ongoing requests and comments by jag and dbaron) /me is sick of people invalidating bugs.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Timeless, since the bug was on my plate I took the liberty of marking this bug INVALID. In my understanding the strict DTD is the cause of the problem, and since this DTD will not be supported in Mozilla the problem in question should go away. Isn't that correct? Per timeless, reassigning to akkana.
Assignee: harishd → akkana
Status: REOPENED → NEW
This really was a parser bug, not an editor bug. I'll mark it fixed if you want, and if you say the parser is no longer destroying invalid tags, but I'm not really the right person to do it: I don't know any more than having read your various comments saying we're not supporting strict DTD (and I'm not entirely clear what that means -- does it mean that we've fixed this bug by deciding not to discard unknown tags? Or something else?) Timeless, what are you seeing? Is it working now for your test case?
------- Additional Comments From rickg@netscape.com 2000-06-20 23:34 -------
Adjudication isn't necessary; we're doing the right thing by preserving the content and ignoring the tags.
->test is still destroyed. IMO test is content.
------- Additional Comments From David Baron 2000-07-01 00:11 -------
What needs to be done for this bug? Some (random?) thoughts:
->Editor: Please trash the DTD before giving it to Parser. That's the end of the story.
------- Additional Comments From harishd@netscape.com 2000-08-27 12:21 -------
Strict DTD will not be supported in mozilla. Marking bug INVALID.
->I'm not going to try to figure out what a strict DTD is.
->Editor: Don't Support DTDs I'll be happy.
->Parser: Ignore any dtd's editor gives you. I'll be happy
->Strict: Die before parser mangles the content. I'll be happy
->Coffee: I need some.
------- Additional Comments From Akkana 2000-08-28 15:52 -------
Timeless, what are you seeing? Is it working now for your test case?
->Editor: Parser is killing test.
->Parser: Not supporting something shouldn't mean authorization to mangle it.
->Akkana: please reassign to someone in Editor to arrange for the DTD to be withheld.
->Brendan: These people don't believe in leniency.
This is a serious problem. Marking relnoteRTM. Shredding documents must be documented. We should not shred documents. Not supporting something does not justify shredding it. Priority: Critical due to dataloss.
Editor, if you don't like the idea of trashing the dtd, preserve it but don't give it to parser.
Severity: normal → critical
Keywords: relnoteRTM
It was decided that we would no longer use the strict DTD for HTML.
so, we don't support strict anymore -- Harish, does that now mean unknown and/or invalid elements will not be stripped and that the element and data will be preserved? If so, then Harish this bug is your call to respond to
Assignee: akkana → harishd
Harish, does that now mean unknown and/or invalid elements will not be stripped and that the element and data will be preserved? That's correct. But this will happen only when bug 50070 gets fixed.
Status: NEW → ASSIGNED
BTW, timeless, the editor doesn't have control over what it gives to the parser. The editor doesn't even get created until the document is finished loading and the dom tree is fully created. So we're at the parser's mercy on this one.
A long time ago
Status: ASSIGNED → RESOLVED
Closed: 24 years ago → 24 years ago
Resolution: --- → FIXED
changing qa contact to sujay
QA Contact: janc → sujay
Refuse to verify fixed.
testcase1: <html><body><invalid></body></html>
result: document is shredded.
testcase2: url.
result: I can't actually view the source for that page :O
I don't understand what in the world is going on here. QA: thanks for reminding me about this bug, i'm sorry that you and everyone else are stuck dealing w/ it. Could you please file and cc bugs or find bugs matching the two above problems? I'm tired of this and i only just looked at it for 5 mins. --very sorry-- maybe i should never use editor. --very sorry, running away-- --very sorry--
Keywords: testcase
timeless, REOPEN if you think this bug is not fixed....
Here is the content model for the simple testcase that timeless provided:
docshell=00C7E370
html@02EF4798 refcount=6<
  head@02EF46A8 refcount=2<
  >
  Text@02D0EAD0 refcount=3<\n >
  body@02D0B4B8 refcount=3<
    Text@02D42800 refcount=3<\n >
    invalid@02D43FC8 _moz-userdefined= refcount=3<
      Text@02D43D50 refcount=3<\n >
    >
  >
>
The invalid tag is in the content model. That is, parser did not discard it! This, IMO, is a composer issue not parser. Giving bug to Akkana and reopening the bug.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reassigning to akkana.
Assignee: harishd → akkana
Status: REOPENED → NEW
Can someone explain what the bug is at this point? If I create a file containing timeless' example, and (in a branch build, haven't tested on the trunk) I edit that file, then when I output it, I get (after adding a title):
<html>
<head>
<title>invalid</title>
</head>
<body>
<invalid></invalid>
</body>
</html>
In other words, the only differences are formatting, title, and the addition of a </invalid> close tag. If I go into html source mode in the editor, or do OutputHTML, I see the same thing. Is the close tag the problem? (I'm not clear how the output system can determine from the content model whether a close tag is needed on an unknown tag; we have a few special cases, like <p>, where we know not to add a close tag.) Is there a bug in the trunk that I'm not seeing on the branch? What am I missing? Adding anthonyd since he's inheriting output system bugs.
Sorry, I haven't been creating files. To reproduce: run composer, view html source, enter my testcase [or yours], view normal edit mode. Yours shows that the page is being considered (the title survives). This is w/ 11/01-04 w32 talkback.
Whiteboard: relnote-user
so, timeless -- are you still seeing the elements getting stripped? In a current build (release build) I displayed the sample page from the original entry in the browser, selected to edit page and it all renders correctly and I'm allowed to edit without incident.
Status: NEW → RESOLVED
Closed: 24 years ago → 24 years ago
Resolution: --- → FIXED
Timeless, please verify....thanks...
I'm having a real hard time verifying because composer seems to be remembering random things after they're destroyed. 2000111004 [i know it isn't current]
Steps:
Start with the original testcase http://bugzilla.mozilla.org/showattachment.cgi?attach_id=9848 in Navigator.
File> Edit Page. The content has survived.
View>HTML Source
Delete the <link> entry. The source should now be
<html>
<head>
<title>The Nexus</title>
</head>
<body class="mainpage">
test
</body>
</html>
Select View>Normal Edit Mode
afaik, the background should be gone because there is no longer a style sheet that gives it features for 'mainpage'. In practice, the page looks like it did when i first loaded it in Composer.
Back to HTML Source. cut ' class="mainpage"'
The page should now look like:
<html>
<head>
<title>The Nexus</title>
</head>
<body>
test
</body>
</html>
Go to normal edit mode
I see: test [left aligned as expected], on a cyan blue background w/ purple stars. ??
I am very confused. Yes I know this is really a separate bug, but the thing is the way I intended to verify was to simply enter my testcase in Show HTML, when in fact that is still practicing voodoo.
Next. Create a new blank html document [File>New Composer Page]
Go to HTML Source
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body>
<br>
</body>
</html>
- Replace with: <html><body><invalid></body></html>
Go to Normal Edit Mode. Return to HTML Source.
Result: the text you've entered is no longer there, you have the default blank page text. <invalid> is definitely not there.
I will gladly verify this bug, if and when I can follow the steps I have just described and do get the expected output.
Next: Go back to normal edit mode. Insert>HTML... type <invalid>
go back to HTML Source
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body>
<br>
<br>
</body>
</html>
Now each time i switch back and forth I get another <br>.
These are probably all bugs unrelated to my complaint, except for my concern that I can't insert <invalid> in composer.
Switch to View>Show All Tags
Insert>HTML... <invalid>
Result: nothing happens.
I can reproduce most of these steps on netscape6rtm [i'm not going to reproduce them all now, but I am certain they will].
My computer has a session available for interaction, if you want to watch as I take these steps, or show me what I should do then feel free to contact me on irc [Asa and others can also show you how to use my computer].
Reopening per total lack of ability to insert <invalid>. I did verify that inserting <b>test</b> does work. So insert html can work.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Timeless -- can you please file a separate bug for the new issue? I see it too, and we need to look at it, but it's definitely not the same bug as this, and we need a new bug to track it. Go ahead and assign it to me (but cc cmanske, he does a lot of work with view source), and I'll triage.
Status: REOPENED → RESOLVED
Closed: 24 years ago → 24 years ago
Resolution: --- → FIXED
I talked to Charley about this issue: he said it wasn't surprising that this happened, and that it might be very hard to fix this, because the style information is separate from the rest of the document, and doesn't get destroyed when the editor reloads the document. Definitely a separate bug, which should probably be assigned to him (cmanske@netscape.com) since it sounded like he was already thinking about ways to reload the document more completely. But cc me and sfraser@netscape.com.
Timeless, please verify this one and mark verified fixed.. I am crossing my fingers this time.
The basic issue here seems to be fixed. When I first open the file, the unknown tags are there (linux build 2001-01-29-08). There are ways to make composer drop them, and we will likely end up with a bunch of bugs on that (I just filed bug 67007, for example). Verified Linux build 2001012908. timeless says it works for him on windows as well. Marking verified.
Status: RESOLVED → VERIFIED