Closed Bug 1232011 Opened 9 years ago Closed 8 years ago

binary to text transform for wasm view source

Categories

(Core :: JavaScript Engine: JIT, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: jsantell, Unassigned)

References

Details

Attachments

(1 file)

      No description provided.
Could you please clarify if this bug deals with viewing the 'wasm' text source code (a decoding of the deployed source code) and/or the upstream source code that has been source-to-source compiled/assembled to 'wasm' source code?
I believe its the former (decoding of deployed source), but Luke knows more
With wasm, the binary format (https://github.com/WebAssembly/design/blob/master/BinaryEncoding.md) comes over the wire to the browser.  In any place where devtools currently show JS source, the devtools should display the wasm text format (https://github.com/WebAssembly/design/blob/master/TextFormat.md) by generating it from the binary format according to the spec-defined algorithm.
Blocks: wasm
No longer blocks: wasm
Random question: should this be implemented in C++ or in JS? If it's only for debugger purposes, I guess JS is enough; if there any other uses (and I can't find any, other than testing purposes -- although a JS library could be used as well for testing), C++ would make sense. Thoughts?
You're right that, since this won't be perf-sensitive, we could use JS, however there are a few practicalities that make me think it'll be a bit simpler to do in C++:
 - There will already be some factored-out .wasm decoding code shared between wasm Ion/baseline compilers that we could share for the binary-to-text algo.
 - We'd want to call this algo from 3 places: a shell function used for writing unit tests (so we could write our unit tests in text), Debugger.Script.prototype.source, a JS_FRIEND_API called by the non-debugger View Source path

Also, since the algo won't be triggered by plain content, it won't be as security-sensitive.
Assignee: nobody → evilpies
Blocks: wasm
There are a few additional constraints here.

1. The translation should preserve a bi-directional mapping between code locations and source locations. This would allow us to set breakpoints and highlight individual expressions when single-stepping.

2. Since the text format can be very large, it would be nice to be able to decode the binary format incrementally.

3. Should |decode(encode(decode(encode(text)))) == text|? I think it should, even though |decode(encode(text)) == text| need not necessarily be true.

Are there any other constraints?
Attached file JavaScript Prototype
I am probably not the best person to actually complete this, but I wrote a prototype in JS that should be able to handle almost everything that is currently implemented.
(In reply to Michael Bebenita [:mbx] from comment #6)
> 1. The translation should preserve a bi-directional mapping between code
> locations and source locations. This would allow us to set breakpoints and
> highlight individual expressions when single-stepping.

One way this could be expressed as having an optional Vector<uint32> outparam v[i] gave you the bytecode offset of line i.  Probably we can hold off on implementing this until we need it though.

> 2. Since the text format can be very large, it would be nice to be able to
> decode the binary format incrementally.

This does seem like a cross-cutting concern we'd want to incorporate in the algorithm before getting too far.  An initial use case could be that the error message displayed could include the preceding N lines of text and a cursor at the point when things when bad.  It'd also be nice if at each point where validation could go bad we could display the offending byte with several bytes of surrounding context.

> 3. Should |decode(encode(decode(encode(text)))) == text|? I think it should,
> even though |decode(encode(text)) == text| need not necessarily be true.

Agreed that the simpler |decode(encode(text)) == text| won't be true due to whitespace and names, but I don't think adding another decode(encode(...)) would make it true.  Related, I think we should have a top-level symbol section which simply has a list of function names and list of list of local names.  With these, you could define a whitespace-stripping 'canonicalize' and then we should have: |decode(encode(text)) == canonicalize(text)|.  (Technically, the s-expr format names some other things (like exports and signatures) that we'd have to consider whether it was worth it to name or strip in canonicalize.)
(In reply to Tom Schuster [:evilpie] from comment #7)
Cool, thanks!  We should chat more and figure out what our plan should be here.  One thing to consider is that the binary format is going to churn a good deal over the next few months and we should switch to a more JS-y text format from the s-expression current form (https://github.com/WebAssembly/design/blob/master/TextFormat.md#official-text-format).
(In reply to Luke Wagner [:luke] from comment #8)
> (In reply to Michael Bebenita [:mbx] from comment #6)
> > 1. The translation should preserve a bi-directional mapping between code
> > locations and source locations. This would allow us to set breakpoints and
> > highlight individual expressions when single-stepping.
> 
> One way this could be expressed as having an optional Vector<uint32>
> outparam v[i] gave you the bytecode offset of line i.  Probably we can hold
> off on implementing this until we need it though.

What about column information? The algorithm might want to print several bytecode worth of code on one line.

> > 2. Since the text format can be very large, it would be nice to be able to
> > decode the binary format incrementally.
> 
> This does seem like a cross-cutting concern we'd want to incorporate in the
> algorithm before getting too far.  An initial use case could be that the
> error message displayed could include the preceding N lines of text and a
> cursor at the point when things when bad.  It'd also be nice if at each
> point where validation could go bad we could display the offending byte with
> several bytes of surrounding context.
> 
> > 3. Should |decode(encode(decode(encode(text)))) == text|? I think it should,
> > even though |decode(encode(text)) == text| need not necessarily be true.
> 
> Agreed that the simpler |decode(encode(text)) == text| won't be true due to
> whitespace and names, but I don't think adding another decode(encode(...))
> would make it true.  Related, I think we should have a top-level symbol
> section which simply has a list of function names and list of list of local
> names.  With these, you could define a whitespace-stripping 'canonicalize'
> and then we should have: |decode(encode(text)) == canonicalize(text)|. 
> (Technically, the s-expr format names some other things (like exports and
> signatures) that we'd have to consider whether it was worth it to name or
> strip in canonicalize.)

My concern was more that |while (...) { text = decode(encode(text)); }| should converge after one iteration. The text format should match the binary format exactly. This would mean that both encode/decode don't do any canonicalization beyond whitespace-stripping and symbol renaming.
(In reply to Michael Bebenita [:mbx] from comment #10)
> What about column information? The algorithm might want to print several
> bytecode worth of code on one line.

This is pretty-printed text, so the columns should be short and also the devtools' UI only allow placing breakpoints on lines.

> My concern was more that |while (...) { text = decode(encode(text)); }|
> should converge after one iteration.

I understand the desire, but my point was just that |encode(x) { return empty module}| would converge after one iteration :)  A similar attractive invariant would be that |binary == encode(decode(binary))| but I think there is a small amount of canonicalization that happens there too (e.g., minimizing variable-length integer encoding length).
The browser accepts the binary encoding, not the text format, so it has already been largely canonicalized. Unnecessary redundancy in the binary encoding should be removed at the spec level, so that a text format can represent the exact binary encoding without frustrating side information, so the identity encode(decode(binary))=binary should hold. Areas that cause difficulty here should be reported as encoding issues.

For example the v8 encoding currently specifies the function names by an offset, allowing them anywhere in the binary, in any order. This frustrates identities, but can be easily fixed by placing them in order, and this would even save space in the binary. We should canonicalize leb128 so that only the shortest encoding is valid etc

One issue I would request you consider is allowing extensions to pretty-print the wasm binary, so that alternative presentations could be explored by the community.

My hope is that all the bulk data will be OOL wrt the sections definitions, so that the sections can be pre-decoded and would give the high level layout, and a pretty printer might want to do this.

It does appear to have been conceded that a wasm decoder would be required to work on a per-function granularity, and doing so might be a good place to start. This assumes that you are working with a layer-0 encoding. It might still be possible to only emit the text for a range of lines, but the entire functions AST will likely need to decoded and probably annotated with formatting information first.
In LEB128, it will be possible to encode the integer 1 as [0x1] or [0x80, 0x1] and I don't think we will attempt to define the latter as illegal.  The former will be the canonical representation.

> One issue I would request you consider is allowing extensions to
> pretty-print the wasm binary, so that alternative presentations could be
> explored by the community.

There has actually been some discussion in this direction already.  In particular, a wasm module could include a section of opaque debug metadata along with the URL of "my debugger" which devtools could launch (in a separate Realm/Worker) and send requests to to render source or perform various debugger operations (break, step, etc).  We just have the rough sketches of an idea at this point, though.  The first step is to make Debugger API and View Source work with the spec-defined wasm text projection.
Specifying the source location by the line will also be a big hit. It would be necessary to decode all preceding functions to determine the line. This could be cached. Would it be at all workable to be able to specify a source position with at least a function number prefix. In lisp it is common to use a list of form indexes to specify the location for debugging, so for example even the element in a block would have an index to quickly narrow down the AST node. For a text source code this is also resilient to changes in other AST sub-trees, but this is probably not relevant here.
(In reply to Douglas Crosher [:dougc] from comment #14)
> Specifying the source location by the line will also be a big hit.

Presumably, if we had the abovementioned interface for rendering a "window" of text, then list of bytecodes would also be given only for that window.  It would still be necessary to perform a pretend decoding up to that point, but only insofar as to increment the line number and, as you said, we could cache the starting line of every function so subsequent renderings would be cheap.
(In reply to Luke Wagner [:luke] from comment #13)
> In LEB128, it will be possible to encode the integer 1 as [0x1] or [0x80,
> 0x1] and I don't think we will attempt to define the latter as illegal.  The
> former will be the canonical representation.

That would be unfortunate and demand that a text format wanting to uniquely represent an encoding need to be able to represent the non-canonical leb128 encodings. Issue https://github.com/WebAssembly/design/issues/533 has been submitted to address this.

Practically, your text format could just spew side-information out at the end of the file and perhaps people would not even see it if only looking at part of a file, but I hope Mozilla can support addressing as much if this redundancy as possible in the binary encoding.

> > One issue I would request you consider is allowing extensions to
> > pretty-print the wasm binary, so that alternative presentations could be
> > explored by the community.
> 
> There has actually been some discussion in this direction already.  In
> particular, a wasm module could include a section of opaque debug metadata
> along with the URL of "my debugger" which devtools could launch (in a
> separate Realm/Worker) and send requests to to render source or perform
> various debugger operations (break, step, etc).  We just have the rough
> sketches of an idea at this point, though.  The first step is to make
> Debugger API and View Source work with the spec-defined wasm text projection.

Sounds great, thank you.
It's more work to detect non-canonical encoding and would also inhibit use cases that want to do backpatch, like our own asm.js->wasm internal translation.
Unfortunately Luke Wagner of Mozilla has failed to support the principle https://github.com/WebAssembly/design/issues/533 Luke also holds a position of Chair in the Community Group blocking this matter. Could the Mozilla management please review this decision?
Heh, anyway, this conversation has gotten pretty far from the goal of the bug.

From irc, I was telling mbx that, seeing as *both* the binary format and text format are likely to undergo significant churn over the next few months, it might be a good idea to hold off landing anything unless we thought it would be really valuable in the short term to, e.g., aid in the triage of fuzz bugs.
Assignee: evilpies → nobody
The impetus for this bug was from what was discussed in Orlando for initial support of wasm in the debugger -- mostly to render something in text for wasm sources rather than a message of "support coming soon"; if this is a relatively simple patch, it'd provide value for having something in tools before other browsers, and maybe something for GDC, even without the rest of the debugger's features
See Also: → 1274618
This has been done and it is used in the devtools (open devtools, then compile wasm module, then look at sources) and it's tested in the JS shell through wasmBinaryToText for both the sexpr format and the JSy format.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: