Closed Bug 188609 Opened 22 years ago Closed 20 years ago

Need separate tokenizer for View Source

Categories

(Core :: DOM: HTML Parser, defect)

Type: defect
Priority: Not set
Severity: minor

Tracking

Status: RESOLVED WONTFIX
Target Milestone: Future

People

(Reporter: choess, Assigned: mrbkap)

Currently "View Source" passes through the HTML tokenizer (so we can do syntax
highlighting), which often results in mangling of page source. We've discussed
creating a separate tokenizer before; while I doubt anyone has time to do this
now, I wanted to get this bug on the table (and also track the bugs caused by
our current setup).
Blocks: 57724
Or we could just pass this crap off to the system text editor....
Blocks: 189202
Status: NEW → ASSIGNED
Target Milestone: --- → Future
Blocks: 172947
Blocks: 182215
Blocks: 204573
I've been thinking about this, and I may even get a patch going at some point. In
the meantime, I have a couple of questions.

First, would it be best if the new tokenizer could leverage the old tokenizer's
code in some way? That may be tricky, and it keeps some of the complexity of view
source in the tokenizer, but code duplication is a bad thing and some of it may be
avoidable. Is it worth the extra complexity?

Second, how forgiving should the new tokenizer be? If a tag is missing its closing
'>', should we still colorize subsequent tags as if the '>' were there (without,
of course, actually adding it), or colorize the following words as though they
were attributes? I think this comes down to who the target audience is: being less
forgiving makes mistakes easier for developers to spot, but may make the page less
readable for everyone else.
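To make the trade-off concrete, here is a minimal standalone sketch of the
"forgiving" option (all names are invented for illustration; this is not the
existing tokenizer code). It ends a tag at the next '<' or at end of input even
when no '>' was ever seen, so later markup still gets colorized, and it records
that the tag was unterminated so a highlighter could flag it:

// Illustration only -- invented names, not the real parser API.
// A "forgiving" scan ends the tag when it reaches the next '<' (or the
// end of the buffer) even though no '>' was seen, so subsequent markup
// can still be colorized; a strict scan would treat everything up to
// the next '>' as part of the broken tag.
#include <cstddef>
#include <string>

struct TagSpan {
  size_t start;      // offset of the '<'
  size_t end;        // one past the last character belonging to the tag
  bool terminated;   // did we actually see the closing '>'?
};

static TagSpan ScanTagForgiving(const std::string& src, size_t start) {
  TagSpan span = {start, start + 1, false};
  for (size_t i = start + 1; i < src.size(); ++i) {
    if (src[i] == '>') {           // well-formed tag
      span.end = i + 1;
      span.terminated = true;
      return span;
    }
    if (src[i] == '<') {           // missing '>': pretend the tag ended here
      span.end = i;
      return span;
    }
  }
  span.end = src.size();           // ran off the end of the buffer
  return span;
}

Either audience could then be served by how the highlighter renders spans with
terminated == false: normal tag coloring for readability, or an error style for
developers.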
Any sharing of code would be great.

As for forgiveness, I'd love it if we could blink erroneous things and tokenize
as if the errors were corrected....
Assigning to me since I'm working on this.

bz, how forgiving should view-source be with XML constructs like CDATA?
Currently, we make a mockery of it (see bug 84430). A real XML parser would be
completely unforgiving, so I'm inclined to make view-source the same way.
However, the current method of handling it in CCDataSection is basically: find
the string ]]. If the next character is a '>', good, we're done. Otherwise, skip
everything until we find a '>' and then quit. I could follow this route (except
adding everything between the ]] and > to our string), or be ultra-strict. What
do you think?
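For concreteness, here is roughly what that loose scan looks like as a standalone
sketch (invented names and simplified types; not the actual CCDataSection code):

// Sketch of the "loose" scan described above (invented names, not the
// actual parser code). Starting just inside a CDATA section, look for
// "]]"; if the very next character is '>', the section ended cleanly.
// Otherwise skip ahead to the next '>' and give up. A strict XML-style
// scan would instead treat anything other than "]]>" as an error.
#include <cstddef>
#include <string>

// Returns the offset just past the CDATA terminator, or std::string::npos
// if the input ran out first. *aWellFormed reports whether "]]>" was
// found exactly or whether characters had to be skipped to reach a '>'.
static size_t SkipCDataLoose(const std::string& src, size_t pos,
                             bool* aWellFormed) {
  size_t marker = src.find("]]", pos);
  if (marker == std::string::npos) {
    *aWellFormed = false;
    return std::string::npos;             // unterminated section
  }
  if (marker + 2 < src.size() && src[marker + 2] == '>') {
    *aWellFormed = true;
    return marker + 3;                    // found "]]>" exactly
  }
  size_t gt = src.find('>', marker + 2);  // junk between "]]" and '>'
  *aWellFormed = false;
  return gt == std::string::npos ? std::string::npos : gt + 1;
}

The open question above is just what to do with the characters between the "]]"
and the '>' in the non-well-formed branch: append them to the text so nothing is
lost, or refuse to colorize the section at all.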
Assignee: harishd → mrbkap
Status: ASSIGNED → NEW
I have no real opinion beyond "we shouldn't lose content"....
The more I think about this, the less convinced I am that it's the right way to
go. I'm starting to feel that having two tokenizers doing almost exactly the same
thing isn't worth the maintenance cost and duplicated code.

The remaining bugs should be fixable with the current tokenizer (though a few
may need to be handled in the view-source DTD). If I run into any major problems
with the existing tokens, I'll just derive from CHTMLToken or something like that. 
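(To sketch the "derive from CHTMLToken" idea in standalone form, with invented
names rather than the real token classes: the point would be a token that
remembers the exact text it was scanned from, so view-source can emit it verbatim
instead of a re-serialized form.)

// Standalone sketch only -- invented names, not the actual CHTMLToken
// hierarchy. A view-source-specific token keeps the verbatim source
// text it was scanned from, so the view-source DTD can output exactly
// what the page contained.
#include <string>
#include <utility>

class Token {
 public:
  virtual ~Token() = default;
};

class VerbatimToken : public Token {
 public:
  explicit VerbatimToken(std::string aOriginal)
      : mOriginal(std::move(aOriginal)) {}

  // Exactly what the scanner consumed for this token, untouched.
  const std::string& Original() const { return mOriginal; }

 private:
  std::string mOriginal;
};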

We're never going to be able to show the unparsed HTML (just a representation),
but it's certainly possible to make that representation look exactly like the
original with the current codebase.

As an aside, I'm now trying to collect all view-source bugs and resolve out the
dupes, so if you see any not attached to either this bug or bug 57724, please
attach them.
Officially marking WONTFIX. The last of the real "major" munging errors is now
fixed. We don't need another tokenizer. Yay!
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → WONTFIX