Open Bug 1536835 Opened 4 years ago Updated 8 months ago

Consider adding .html and .xhtml to the list of "js" extensions

Categories

(Webtools :: Searchfox, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

People

(Reporter: bzbarsky, Unassigned)

References

Details

See https://bugzilla.mozilla.org/show_bug.cgi?id=1517483#c16 -- our "search for IDL method" bits don't find uses in .html (e.g. tests).

Adding .xhtml should not be too bad; we can add an XHTMLParser subclass of XMLParser and so forth.

Adding .html would involve importing an HTML parser implemented in JS, which I don't think mozsearch has so far. Ideally one with a SAX interface, so most of the rest of the bits could be reused... I poked around a bit just now but haven't found anything obvious. Maybe Henri knows of something?

Flags: needinfo?(hsivonen)

For reference, here is the code we use to parse JS out of the various file types we currently support: https://github.com/mozsearch/mozsearch/blob/98eaf39482e3b5e0d46d5640f3e7a6d27f91eca6/scripts/js-analyze.js#L1243

https://hg.mozilla.org/projects/htmlparser/ at least used to be compilable to JS using GWT. I haven't invoked the GWT build in many years, so there might be some bitrot to fix. IIRC, David Flanagan wrote an HTML parser directly in JS in the B2G days.

Flags: needinfo?(hsivonen) → needinfo?(djf)

Mine actually predates B2G. This was part of Andreas's dom.js project: he thought it might be useful for Servo, and since Rust wasn't stable back in 2011, he had me working on implementing the DOM in js. It never got adopted for anything, but the parser was robust, derived directly from the spec, and actually passed all the test suites I could find. The parser itself is at https://github.com/andreasgal/dom.js/blob/master/src/impl/HTMLParser.js I'm not sure how useful it would be on its own without the rest of dom.js

I would guess that today, something like JSDOM https://github.com/jsdom/jsdom would be your best bet. I haven't used it myself.

Flags: needinfo?(djf)

Compiling html5ever / xml5ever to wasm would also be a pretty cool project :P

For completeness, the b2g email app had a hacked-up version of JResig's hacked up HTML parser thing that we used for whitelist-based HTML sanitization on workers that worked okay enough. See https://github.com/mozilla-b2g/bleach.js/blob/worker-thread-friendly/lib/bleach.js#L308 if morbidly curious.

https://lists.mozilla.org/pipermail/dev-platform/2019-May/024045.html is relevant to this bug.

On the topic of HTML parsing, I don't know if we need to have the parser in JS. It could be in some other language (e.g. rust), as long as it can somehow pass the JS bits to the JS analyzer. Presumably it wouldn't be too hard to take html5ever and write a wrapper around it such that when fed a (X)HTML file, it just strips away all the non-JS stuff and leaves the JS stuff at the same line/column numbers as in the original file, which we can then feed to the js analyzer.
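To make the "strip the non-JS stuff but keep line/column numbers" idea concrete, here is a minimal sketch. It is only an illustration: it uses a naive regex in Python as a stand-in for a real parser such as html5ever, and the name `blank_non_script` is hypothetical, not something that exists in mozsearch. Everything outside inline script bodies is replaced with spaces, and newlines are kept, so each JS token stays at its original line and column.

```python
import re

# Naive matcher for inline <script> elements. This is an assumption made for
# the sketch; a real HTML parser handles many cases this regex does not
# (comments, attribute values containing ">", and so on).
SCRIPT_RE = re.compile(
    r"<script\b[^>]*>(.*?)</script\s*>", re.IGNORECASE | re.DOTALL
)


def blank_non_script(source: str) -> str:
    """Replace everything outside <script> bodies with spaces, keeping newlines."""
    keep = [False] * len(source)
    for match in SCRIPT_RE.finditer(source):
        # Group 1 is the script body; mark its characters to be preserved.
        for i in range(match.start(1), match.end(1)):
            keep[i] = True
    # Newlines are always preserved so line numbers do not shift.
    return "".join(
        ch if keep[i] or ch == "\n" else " "
        for i, ch in enumerate(source)
    )
```

The output has the same length and the same number of lines as the input, so it could be handed to a JS analyzer unchanged and any positions it reports would already be in the coordinates of the original file.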

I suspect using a rust HTML parser will be faster than a JS parser, but I could be wrong. I don't know how the state of the art implementations compare in terms of speed. And performance will matter here, because we're talking about parsing thousands of files.

We're now using https://crates.io/crates/lol_html in searchfox for other purposes and it works pretty well, so that seems like a likely candidate. I think we'll likely try to move to using lsif-node for JS parsing (bug 1740290), so it'd be straightforward to normalize lines/columns there "manually" if we can't just emit magic comments like a pre-processor or easily do sourcemap stuff[1].

1: I know very little about source maps at this moment so that might not even be the right spec.
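As a rough sketch of what normalizing lines/columns "manually" could look like: if each script body is analyzed on its own, positions reported relative to that body can be mapped back to the enclosing file given where the body starts. The `Pos` type and `to_file_pos` function below are hypothetical names for illustration, written in Python; they are not part of searchfox or lsif-node, and the 1-based line / 0-based column convention is an assumption.

```python
from typing import NamedTuple


class Pos(NamedTuple):
    line: int    # 1-based line number (assumed convention)
    column: int  # 0-based column (assumed convention)


def to_file_pos(script_start: Pos, rel: Pos) -> Pos:
    """Map a position inside an extracted script back to the enclosing file."""
    if rel.line == 1:
        # Still on the line where the script body starts: shift the column.
        return Pos(script_start.line, script_start.column + rel.column)
    # Later lines begin at column 0 in both coordinate systems,
    # so only the line number needs adjusting.
    return Pos(script_start.line + rel.line - 1, rel.column)
```

For example, with a script body starting at line 2, column 14 of the HTML file, a token the analyzer reports at line 3, column 4 of the script maps to line 4, column 4 of the file.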

See Also: → 1740290