Closed Bug 188609 Opened 22 years ago Closed 20 years ago

Need separate tokenizer for View Source

Categories

(Core :: DOM: HTML Parser, defect)

defect
Not set
minor

Tracking

RESOLVED WONTFIX
Future

People

(Reporter: choess, Assigned: mrbkap)

References

Details

Currently "View Source" passes through the HTML tokenizer (so we can do syntax highlighting), which often results in mangling of page source. We've discussed creating a separate tokenizer before; while I doubt anyone has time to do this now, I wanted to get this bug on the table (and also track the bugs caused by our current setup).
Blocks: 57724
Or we could just pass this crap off to the system text editor....
Blocks: 189202
Status: NEW → ASSIGNED
Target Milestone: --- → Future
Blocks: 172947
Blocks: 182215
Blocks: 204573
I've been thinking about this, and I may even get a patch going at some point. I was wondering about a couple of questions. First, would it be best if the new tokenizer could leverage the old tokenizer's code in some way? This may be tricky, and keeps some of the complexity of view source in the tokenizer, but code duplication is a bad thing, and some may be avoidable. Is it worth the extra complexity? Second, how forgiving should the new tokenizer be? If there is a missing '>' at the end of a tag, should it still colorize subsequent tags like there was a '>' (though, of course, it would not add the '>') or colorize subsequent words as though they were attributes? I think this question comes down to who is the 'target audience'. By being less forgiving, it makes mistakes easier to see for developers, but for others, may make a page less readable.
Any sharing of code would be great. As for forgiveness, I'd love it if we could blink erroneous things and tokenize as if the errors were corrected....
Assigning to me since I'm working on this. bz, how forgiving should view-source be with XML constructs like CDATA? Currently, we make a mockery of it (see bug 84430). A real XML parser would be completely unforgiving, so I'm inclined to make view-source the same way. However, the current method of handling it in CCDataSection is basically: find the string ]]. If the next character is a '>', good, we're done. Otherwise, skip everything until we find a '>' and then quit. I could follow this route (except adding everything between the ]] and > to our string), or be ultra-strict. What do you think?
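For readers following along, the lenient scan described in the comment above might be sketched roughly as follows. This is not the CCDataSection code; it is a minimal Python illustration, and the function name `scan_cdata`, the `keep_skipped` flag, and the choice to keep the stray `]]` in the returned content are all assumptions made for the example.

```python
def scan_cdata(text, start=0, keep_skipped=True):
    """Scan a CDATA body and return (content, end_index).

    Hypothetical helper: `start` is assumed to point just past "<![CDATA[".
    `keep_skipped` chooses between dropping the text skipped after a
    malformed "]]" (the behaviour described in the comment) and keeping it
    (the proposed change, so no content is lost).
    """
    close = text.find("]]", start)
    if close == -1:
        # No "]]" at all: treat the rest of the input as content.
        return text[start:], len(text)

    after = close + 2
    if text.startswith(">", after):
        # Well-formed terminator "]]>": done.
        return text[start:close], after + 1

    # Malformed terminator: skip ahead to the next '>' and stop there.
    gt = text.find(">", after)
    end = len(text) if gt == -1 else gt + 1
    stop = len(text) if gt == -1 else gt
    if keep_skipped:
        # Assumption: keep everything up to the '>', including the "]]".
        return text[start:stop], end
    return text[start:close], end
```

With `keep_skipped=False` the text between the `]]` and the `>` is silently dropped, which is the content loss under discussion; an ultra-strict variant would instead report an error in the malformed branch.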
Assignee: harishd → mrbkap
Status: ASSIGNED → NEW
I have no real opinion past "we shouldn't lose content"....
The more I think about this, the more I am not convinced this is the right way to go. I'm starting to feel that having two tokenizers doing almost exactly the same thing isn't worth the upkeep costs/duplicated code. The remaining bugs should be fixable with the current tokenizer (though a few may need to be handled in the view-source DTD). If I run into any major problems with the existing tokens, I'll just derive from CHTMLToken or something like that. We're never going to be able to show the unparsed HTML (just a representation), but it's certainly possible to make that representation look exactly like the original with the current codebase. As an aside, I'm now trying to collect all view-source bugs and resolve out the dupes, so if you see any not attached to either this bug or bug 57724, please attach them.
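The property being aimed for here (a token stream whose representation looks exactly like the original) can be illustrated with a small sketch. This is not the Mozilla tokenizer; it is a toy Python example under the assumption that every token keeps its raw source text, and the names `TOKEN_RE` and `tokenize` are invented for illustration.

```python
import re

# Toy pattern: a "tag" runs from '<' to the next '>' (or to end of input if
# the '>' is missing); anything else up to the next '<' is "text".
TOKEN_RE = re.compile(r"<[^>]*>?|[^<]+")


def tokenize(source):
    """Split source into (kind, raw) tokens without discarding any characters."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        raw = match.group(0)
        kind = "tag" if raw.startswith("<") else "text"
        tokens.append((kind, raw))
    return tokens


def render(tokens):
    """Rebuild the source from the raw text of each token."""
    return "".join(raw for _, raw in tokens)
```

Because every character lands in exactly one token, `render(tokenize(source))` gives back the input unchanged, even for malformed markup such as a tag with a missing '>'. A highlighter can then colour by `kind` without ever rewriting `raw`.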
Officially marking WONTFIX. The last of the real "major" munging errors is now fixed. We don't need another tokenizer. Yay!
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → WONTFIX