Open Bug 1536835 Opened 6 years ago Updated 2 years ago

Consider adding .html and .xhtml to the list of "js" extensions

Tracking

(Not tracked)

Status:

NEW

People

(Reporter: bzbarsky, Unassigned)

Boris Zbarsky [:bzbarsky]

Reporter

Description

6 years ago

See https://bugzilla.mozilla.org/show_bug.cgi?id=1517483#c16 -- our "search for IDL method" bits don't find uses in .html (e.g. tests).

Adding .xhtml should not be too bad; we can add an XHTMLParser subclass of XMLParser and so forth.
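To give a feel for the .xhtml case, here is a minimal sketch in Python rather than in the JS that mozsearch actually uses. It relies on the standard-library `xml.parsers.expat` module, not on mozsearch's XMLParser, and the name `collect_xhtml_scripts` is hypothetical. It assumes that only inline `<script>` elements in the XHTML namespace matter and that a start line is enough to locate them.

```python
import xml.parsers.expat

XHTML_NS = "http://www.w3.org/1999/xhtml"


def collect_xhtml_scripts(xhtml_text):
    """Return [(start_line, source_text)] for each XHTML <script> element."""
    # With a namespace separator, element names arrive as "<namespace URI> <local name>".
    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    script_name = XHTML_NS + " script"
    scripts = []
    current = None  # [start_line, [text chunks]] while inside a script element

    def start(name, attrs):
        nonlocal current
        if name == script_name:
            current = [None, []]

    def chars(data):
        # expat may deliver one text node in several chunks, so accumulate them
        # and remember the line of the first chunk only.
        if current is not None:
            if current[0] is None:
                current[0] = parser.CurrentLineNumber
            current[1].append(data)

    def end(name):
        nonlocal current
        if name == script_name and current is not None:
            scripts.append((current[0], "".join(current[1])))
            current = None

    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xhtml_text, True)
    return scripts
```

Event-handler attributes such as `onclick` are not covered by this sketch; they would need separate handling in the start-element callback.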

Adding .html would involve importing an HTML parser implemented in JS, which I don't think mozsearch has so far. Ideally one with a SAX interface, so most of the rest of the bits could be reused... I poked around a bit just now but haven't found anything obvious. Maybe Henri knows of something?
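For illustration only, this is roughly what a SAX-style interface buys here. The sketch is in Python and uses the standard-library `html.parser.HTMLParser`, which is callback-driven; it is not a JS parser and not part of mozsearch. `ScriptCollector` and `collect_scripts` are hypothetical names, and the sketch assumes only inline `<script>` bodies are of interest.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect inline <script> text along with where each chunk starts."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.scripts = []  # (1-based line, 0-based column, source text)
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        # Skip external scripts; their body is not the code we want to index.
        if tag == "script" and "src" not in dict(attrs):
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            # getpos() reports the position of the data being delivered.
            line, col = self.getpos()
            self.scripts.append((line, col, data))


def collect_scripts(html_text):
    parser = ScriptCollector()
    parser.feed(html_text)
    parser.close()
    return parser.scripts
```

The parser may deliver one script body in more than one chunk (for example when the JS contains `</` inside a string), which is why each chunk is recorded with its own position instead of being joined.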

Flags: needinfo?(hsivonen)

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 1

6 years ago

For reference, here is the code we use to parse JS out of the various files we currently support: https://github.com/mozsearch/mozsearch/blob/98eaf39482e3b5e0d46d5640f3e7a6d27f91eca6/scripts/js-analyze.js#L1243

Henri Sivonen (:hsivonen) (temporarily away from Bugzilla)

Comment 2

6 years ago

https://hg.mozilla.org/projects/htmlparser/ at least used to be compilable to JS using GWT. I haven't invoked the GWT build in many years, so there might be some bitrot to fix. IIRC, David Flanagan wrote an HTML parser directly in JS in the B2G days.

Flags: needinfo?(hsivonen) → needinfo?(djf)

David Flanagan [:djf]

Comment 3

6 years ago

Mine actually predates B2G. This was part of Andreas's dom.js project: he thought it might be useful for Servo, and since Rust wasn't stable back in 2011, he had me working on implementing the DOM in JS. It never got adopted for anything, but the parser was robust, derived directly from the spec, and actually passed all the test suites I could find. The parser itself is at https://github.com/andreasgal/dom.js/blob/master/src/impl/HTMLParser.js. I'm not sure how useful it would be on its own without the rest of dom.js.

I would guess that today, something like JSDOM https://github.com/jsdom/jsdom would be your best bet. I haven't used it myself.

Flags: needinfo?(djf)

Emilio Cobos Álvarez (:emilio)

Comment 4

6 years ago

Compiling html5ever / xml5ever to wasm would also be a pretty cool project :P

Andrew Sutherland [:asuth] (he/him)

Comment 5

6 years ago

For completeness, the b2g email app had a hacked-up version of JResig's hacked up HTML parser thing that we used for whitelist-based HTML sanitization on workers that worked okay enough. See https://github.com/mozilla-b2g/bleach.js/blob/worker-thread-friendly/lib/bleach.js#L308 if morbidly curious.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 6

5 years ago

https://lists.mozilla.org/pipermail/dev-platform/2019-May/024045.html is relevant to this bug.

On the topic of HTML parsing, I don't know if we need to have the parser in JS. It could be in some other language (e.g. Rust), as long as it can somehow pass the JS bits to the JS analyzer. Presumably it wouldn't be too hard to take html5ever and write a wrapper around it such that when fed an (X)HTML file, it just strips away all the non-JS stuff and leaves the JS stuff at the same line/column numbers as in the original file, which we can then feed to the JS analyzer.
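A rough sketch of that "strip everything but the JS, keep positions" idea, written in Python with the standard-library `html.parser` instead of html5ever, purely to show the shape of it. `blank_non_script` and `_ScriptSpans` are hypothetical names; the sketch assumes only inline `<script>` bodies count as JS and that lines are separated by `\n`.

```python
from html.parser import HTMLParser


class _ScriptSpans(HTMLParser):
    """Record absolute (start, end) offsets of inline <script> text."""

    def __init__(self, text):
        super().__init__(convert_charrefs=True)
        # Offset at which each line starts (the parser counts "\n" only).
        self._line_starts = [0]
        for i, ch in enumerate(text):
            if ch == "\n":
                self._line_starts.append(i + 1)
        self.spans = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            line, col = self.getpos()
            start = self._line_starts[line - 1] + col
            # Script text is delivered raw, so its length matches the source.
            self.spans.append((start, start + len(data)))


def blank_non_script(text):
    """Replace everything outside <script> bodies with spaces, keeping newlines."""
    parser = _ScriptSpans(text)
    parser.feed(text)
    parser.close()
    out = [ch if ch == "\n" else " " for ch in text]
    for start, end in parser.spans:
        out[start:end] = text[start:end]
    return "".join(out)
```

Because the output has the same length and the same newlines as the input, any line/column the JS analyzer reports for the blanked file also applies to the original file. Non-JS script types (e.g. `type="text/plain"`) and inline event handlers are not handled here.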

I suspect using a Rust HTML parser will be faster than a JS parser, but I could be wrong. I don't know how the state-of-the-art implementations compare in terms of speed. And performance will matter here, because we're talking about parsing thousands of files.

Andrew Sutherland [:asuth] (he/him)

Comment 7

3 years ago

We're now using https://crates.io/crates/lol_html in searchfox for other purposes and it works pretty well, so that seems like a likely candidate. I think we'll likely try to move to using lsif-node for JS parsing (bug 1740290), so it'd be straightforward to normalize lines/columns there "manually" if we can't just emit magic comments like a pre-processor or easily do sourcemap stuff[1].

1: I know very little about source maps at this moment so that might not even be the right spec.
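To make "normalize lines/columns manually" concrete, here is a small Python sketch under stated assumptions: each script is extracted as standalone text, its start position in the host file is known, lines are 1-based and columns 0-based. `to_host_position` is a hypothetical helper, not something lsif-node or searchfox provides.

```python
def to_host_position(script_start, pos):
    """Map a position inside an extracted script back to the host (X)HTML file.

    script_start: (line, column) in the host file where the script text begins.
    pos: (line, column) inside the extracted script.
    Lines are 1-based, columns 0-based.
    """
    start_line, start_col = script_start
    line, col = pos
    if line == 1:
        # Still on the line that holds the <script> tag: shift the column.
        return (start_line, start_col + col)
    # Later lines start at column 0 in both files, so only the line shifts.
    return (start_line + line - 1, col)
```

For example, with a script that begins at line 3, column 10 of the host file, `to_host_position((3, 10), (1, 4))` gives `(3, 14)` and `to_host_position((3, 10), (2, 4))` gives `(4, 4)`.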

Categories

(Webtools :: Searchfox, enhancement)
