Closed
Bug 188609
Opened 22 years ago
Closed 20 years ago
Need separate tokenizer for View Source
Categories
(Core :: DOM: HTML Parser, defect)
Tracking
RESOLVED
WONTFIX
Future
People
(Reporter: choess, Assigned: mrbkap)
Currently "View Source" passes through the HTML tokenizer (so we can do syntax highlighting), which often results in mangling of page source. We've discussed creating a separate tokenizer before; while I doubt anyone has time to do this now, I wanted to get this bug on the table (and also track the bugs caused by our current setup).
Comment 1•22 years ago
Or we could just pass this crap off to the system text editor....
Comment 2•20 years ago (Assignee)
I've been thinking about this, and I may even get a patch going at some point. I was wondering about a couple of questions. First, would it be best if the new tokenizer could leverage the old tokenizer's code in some way? This may be tricky, and keeps some of the complexity of view source in the tokenizer, but code duplication is a bad thing, and some may be avoidable. Is it worth the extra complexity? Second, how forgiving should the new tokenizer be? If there is a missing '>' at the end of a tag, should it still colorize subsequent tags like there was a '>' (though, of course, it would not add the '>') or colorize subsequent words as though they were attributes? I think this question comes down to who is the 'target audience'. By being less forgiving, it makes mistakes easier to see for developers, but for others, may make a page less readable.
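To make the second question concrete, here is a minimal Python sketch of the two behaviours for a tag that is missing its '>'. It is only an illustration: `split_tags` and its `forgiving` flag are hypothetical names, not anything in the existing tokenizer, and the sketch ignores comments, quoted attribute values, and CDATA.

```python
def split_tags(source, forgiving=True):
    """Split source into ("tag" | "text", raw) spans for colorizing.

    Hypothetical sketch: no characters are added or dropped, so joining
    the raw pieces always gives back the original source.
    """
    spans = []
    i, n = 0, len(source)
    while i < n:
        if source[i] == "<":
            gt = source.find(">", i + 1)
            lt = source.find("<", i + 1)
            if forgiving and lt != -1 and (gt == -1 or lt < gt):
                # Missing '>': end the tag just before the next '<',
                # as though the '>' were there (it is not inserted).
                end = lt
            elif gt != -1:
                # Unforgiving (or well-formed): run to the real '>',
                # so following words colorize as attributes.
                end = gt + 1
            else:
                end = n
            spans.append(("tag", source[i:end]))
        else:
            lt = source.find("<", i)
            end = n if lt == -1 else lt
            spans.append(("text", source[i:end]))
        i = end
    return spans


broken = "<p class=a<b>bold</b>"
forgiving_spans = split_tags(broken, forgiving=True)
strict_spans = split_tags(broken, forgiving=False)
```

With the forgiving setting, `<b>` is still colorized as its own tag; with the unforgiving setting it is swallowed into the broken `<p` tag, which makes the mistake much more visible to a developer.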
Comment 3•20 years ago
Any sharing of code would be great. As for forgiveness, I'd love it if we could blink erroneous things and tokenize as if the errors were corrected....
Comment 4•20 years ago (Assignee)
Assigning to me since I'm working on this. bz, how forgiving should view-source be with XML constructs like CDATA? Currently, we make a mockery of it (see bug 84430). A real XML parser would be completely unforgiving, so I'm inclined to make view-source the same way. However, the current method of handling it in CCDataSection is basically: find the string ]]. If the next character is a '>', good, we're done. Otherwise, skip everything until we find a '>' and then quit. I could follow this route (except adding everything between the ]] and > to our string), or be ultra-strict. What do you think?
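A rough Python sketch of the two options, as I read the description above. This is not the CCDataSection code: `scan_cdata_end` and its `strict` flag are made-up names, and the lenient branch keeps the text between the ]] and the '>' instead of skipping it.

```python
def scan_cdata_end(text, start, strict=False):
    """Find where a CDATA section ends.

    `start` is the index just past "<![CDATA[". Returns (end, well_formed),
    where text[start:end] is everything view-source should keep.
    Hypothetical sketch; like the description, it only looks at the
    first "]]" it finds.
    """
    close = text.find("]]", start)
    if close == -1:
        if strict:
            raise ValueError("unterminated CDATA section")
        return len(text), False
    if text.startswith("]]>", close):
        # "]]" immediately followed by '>': we're done.
        return close + 3, True
    if strict:
        raise ValueError("']]' not followed by '>'")
    # Lenient: run on to the next '>' but keep what lies in between.
    gt = text.find(">", close + 2)
    end = len(text) if gt == -1 else gt + 1
    return end, False


prefix = "<![CDATA["
good = prefix + "x]]>y"
bad = prefix + "a ]] b> tail"
good_end, good_ok = scan_cdata_end(good, len(prefix))
bad_end, bad_ok = scan_cdata_end(bad, len(prefix))
```

In the lenient case the caller can still flag the section as malformed (the second return value) without losing any of the source text.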
Assignee: harishd → mrbkap
Status: ASSIGNED → NEW
Comment 5•20 years ago
I have no real opinion past "we shouldn't lose content"....
Comment 6•20 years ago (Assignee)
The more I think about this, the more I am not convinced this is the right way to go. I'm starting to feel that having two tokenizers doing almost exactly the same thing isn't worth the upkeep costs/duplicated code. The remaining bugs should be fixable with the current tokenizer (though a few may need to be handled in the view-source DTD). If I run into any major problems with the existing tokens, I'll just derive from CHTMLToken or something like that. We're never going to be able to show the unparsed HTML (just a representation), but it's certainly possible to make that representation look exactly like the original with the current codebase. As an aside, I'm now trying to collect all view-source bugs and resolve out the dupes, so if you see any not attached to either this bug or bug 57724, please attach them.
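One way to state "look exactly like the original" as a checkable property: the text of the tokens, concatenated, must equal the input. A small Python sketch of such a check follows; `first_divergence` is a hypothetical helper, and the token lists are hand-written examples, not output from the real tokenizer.

```python
def first_divergence(original, token_texts):
    """Return the first offset where the rebuilt source differs, or None.

    Hypothetical helper: `token_texts` is the source text each token
    would display, in order.
    """
    rebuilt = "".join(token_texts)
    for offset, (a, b) in enumerate(zip(original, rebuilt)):
        if a != b:
            return offset
    if len(original) != len(rebuilt):
        # One side is a prefix of the other: content was lost or added.
        return min(len(original), len(rebuilt))
    return None


page = "<P  CLASS=a>"
faithful = first_divergence(page, ["<P", "  CLASS=a", ">"])
munged = first_divergence(page, ["<p", ' class="a"', ">"])
```

Here `faithful` comes back as None, while `munged` points at offset 1, where the tag name's case was changed.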
Comment 7•20 years ago (Assignee)
Officially marking WONTFIX. The last of the real "major" munging errors is now fixed. We don't need another tokenizer. Yay!
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → WONTFIX