Closed Bug 1322681 Opened 8 years ago Closed 8 years ago

js shell: segfault when instantiating wasm module from a Promise

Categories

(Core :: JavaScript Engine: JIT, defect, P1)

x86_64 Linux

Tracking


Status: RESOLVED FIXED
Target Milestone: mozilla53
Tracking Status
firefox53 --- fixed

People

(Reporter: pipcet, Assigned: pipcet)

Details

Attachments

(1 file, 2 obsolete files)

I am running into segfaults when instantiating a large wasm module from a promise in the JS shell. My test code is:

Promise.resolve(1).then(() => {
  WebAssembly.compile(os.file.readFile("libc.wasm", "binary")).then((module) => {
    imports = {};
    imports.sys = {};
    imports.sys.eh_return = () => {};
    imports.sys.indcall = () => {};
    imports.sys.trace = () => {};
    imports.sys.got = 0;
    imports.sys.gpo = 0;
    imports.sys.plt = 0;
    imports.sys.table = new WebAssembly.Table({element:"anyfunc",initial:65536,maximum:65536});
    imports.sys.memory = new WebAssembly.Memory({initial: 16384, maximum: 16384});
    console.log("here");
    return WebAssembly.instantiate(module, imports)
  }).then((e) => {
    console.log("here");
  }).catch(e => console.log(e));
});

which merely instantiates the "libc.wasm" wasm module with the imports it needs and prints "here" twice.

Investigation with gdb shows that the issue is in js.cpp:DrainJobQueue: it finishes all async tasks, then runs a job which adds another outstanding async task; we then shut down the JS context while that task is still running, and when it finishes, it segfaults trying to call the (now-poisoned) ->finishAsyncTaskCallback(). The attached patch appears to fix things here by adding an additional loop to DrainJobQueue: as long as there are async tasks or queued jobs, run all async tasks, then all queued jobs.
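For readers without js.cpp at hand, the fix described above boils down to roughly the following loop structure (a sketch only, pieced together from this description and from the names sc->asyncTasks, outstanding, and finished that appear in the review comments below; the real shell code has more detail):

-----
// Sketch of the extra loop in the shell's DrainJobQueue (not the exact patch).
while (true) {
    // 1. Wait for the outstanding async tasks (wasm compilation/instantiation)
    //    and run their finish callbacks on the main thread.  (elided)

    // 2. Run promise jobs until the job queue is empty; these jobs may start
    //    new async tasks, e.g. via WebAssembly.instantiate.  (elided)

    // 3. Stop only once nothing is left on either side; otherwise loop again.
    ExclusiveData<ShellAsyncTasks>::Guard asyncTasks = sc->asyncTasks.lock();
    if (asyncTasks->outstanding == 0 && asyncTasks->finished.length() == 0)
        break;
}
-----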
Attached patch proposed patch (obsolete) — Splinter Review
Thanks for the report. Please double-check me, but I think it's shell-only and tied to the fact that sync and async promise tasks are interleaved. Does it fix the issue if you call drainJobQueue in a loop?
Component: JavaScript Engine → JavaScript Engine: JIT
(In reply to Benjamin Bouvier [:bbouvier] from comment #2)
> Thanks for the report. Please double-check me, but I think it's shell-only
> and tied to the fact that sync and async promise tasks are interleaved. Does
> it fix the issue if you call drainJobQueue in a loop?

Er, that's what my patch does, right? I mean, the loop is inside drainJobQueue rather than around it, but the effect should be the same? So I'm not quite sure what to test. A restricted loop in drainJobQueue fixes things, so it's a safe bet an unrestricted loop around it would, too. Do you want me to try that?
Flags: needinfo?(bbouvier)
Comment on attachment 8817611 [details] [diff] [review]
proposed patch

Review of attachment 8817611 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry I was on PTO after my last comment. I've given this some thought and I think it's fine to do this: it's a shell-only issue in a testing-only function, and it doesn't complicate the code too much with a comment. Looks good but requesting a few changes, so I'd like to have another look before r+, thanks.

::: js/src/shell/js.cpp
@@ +819,5 @@
> 
> +    RootedObject job(cx);
> +    JS::HandleValueArray args(JS::HandleValueArray::empty());
> +    RootedValue rval(cx);
> +    // Execute jobs in a loop until we've reached the end of the queue.

pre-existing nit: can you add a \n before this comment please?

@@ +837,1 @@
>     {

nit: can you add a \n before this new block and a comment before the top surrounding loop explaining why we need this? (something like "loop in case an async task ends up queuing other async tasks")

@@ +837,4 @@
>     {
> +        ExclusiveData<ShellAsyncTasks>::Guard asyncTasks = sc->asyncTasks.lock();
> +        if (!asyncTasks->outstanding)
> +            break;

how about a do {} while loop instead, with sc->asyncTasks.lock()->outstanding as the loop condition?
Attachment #8817611 - Flags: feedback+
Flags: needinfo?(bbouvier)
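(For reference, the do {} while shape suggested in the review above would look roughly like this; this is only a sketch, and the following comments explain why checking ->outstanding alone turns out to be insufficient:)

-----
// Reviewer's suggestion from the comment above (sketch; not what eventually landed).
do {
    // ... run finished async tasks, then drain the job queue ...
} while (sc->asyncTasks.lock()->outstanding);
-----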
Actually, upon rereading my code I think it introduces a race condition. I'll have another look, if that's okay.
(In reply to Benjamin Bouvier [:bbouvier] from comment #4)
> I've given this some thought and I think it's fine to do this: it's a
> shell-only issue in a testing-only function, and it doesn't complicate the
> code too much with a comment. Looks good but requesting a few changes, so
> I'd like to have another look before r+, thanks.

As I said, I think there was a new race condition: ->outstanding might be 0 if an async task launched by a promise job had already finished by the time of the check, so we need to check ->finished.length() as well. Sadly, that means we can't use the do {} while() loop (but we could if we added an AsyncTasksPending() function, where "pending" = "outstanding or finished"). Further feedback would be appreciated.
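For illustration, the AsyncTasksPending() helper mentioned above could look something like this (a hypothetical sketch only; no such function exists in the tree, the ShellContext parameter type is an assumption, and the other names are taken from the shell code quoted in the reviews):

-----
// Hypothetical helper: a task is "pending" if it is still outstanding, or if it
// has finished but its finish callback hasn't been run on the main thread yet.
static bool
AsyncTasksPending(ShellContext* sc)  // parameter type assumed for this sketch
{
    ExclusiveData<ShellAsyncTasks>::Guard asyncTasks = sc->asyncTasks.lock();
    return asyncTasks->outstanding > 0 || asyncTasks->finished.length() > 0;
}
-----

With a helper like this, the outer loop could be written as do { ... } while (AsyncTasksPending(sc)); instead of breaking out of a while (true) loop.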
Severity: normal → minor
Attached patch proposed patch (obsolete) — Splinter Review
Attachment #8817611 - Attachment is obsolete: true
Comment on attachment 8821293 [details] [diff] [review]
proposed patch

Review of attachment 8821293 [details] [diff] [review]:
-----------------------------------------------------------------

Duh, it sounds much right, thanks for digging a bit. Any chance you could run your patch with rr in chaos mode, with your initial test case in a loop (10k iterations or something big), to make sure there's not another race condition hidden out there? If there's no hidden monster, the last few nits I've posted should be easy to address. Thanks!

::: js/src/shell/js.cpp
@@ +779,5 @@
>          return true;
> 
> +    // It's possible for jobs on the job queue to add async tasks and vice
> +    // versa, so loop until the queue is empty and there are no outstanding
> +    // tasks.

I'd remove this comment; the one above the last block seems enough to me.

@@ +807,5 @@
> 
> +    // It doesn't make sense for job queue draining to be reentrant. At the
> +    // same time we don't want to assert against it, because that'd make
> +    // drainJobQueue unsafe for fuzzers. We do want fuzzers to test this, so
> +    // we simply ignore nested calls of drainJobQueue.

(fwiw, comments should fit in <80 chars, so not sure this will be the case after adding one indent, here and above)

@@ +835,4 @@
>     {
> +        ExclusiveData<ShellAsyncTasks>::Guard asyncTasks = sc->asyncTasks.lock();
> +        if (asyncTasks->outstanding == 0 &&
> +            asyncTasks->finished.length() == 0) {

nit: should fit in one line (< 100 chars)

@@ +836,5 @@
> +        ExclusiveData<ShellAsyncTasks>::Guard asyncTasks = sc->asyncTasks.lock();
> +        if (asyncTasks->outstanding == 0 &&
> +            asyncTasks->finished.length() == 0) {
> +            // No outstanding or unfinished async tasks, and the job queue
> +            // is empty. We're done.

nit: the code above + the comment above the block seem explanatory enough that this comment isn't needed. In this case, the `if` body can fit on one line and braces aren't needed, per the SpiderMonkey style guide.
Attachment #8821293 - Flags: feedback+
(In reply to Benjamin Bouvier [:bbouvier] from comment #8)

Thanks for the feedback! I really appreciate it, sorry it's taking me a while to get used to the coding style.

> Duh, it sounds much right, thanks for digging a bit. Any chance you could
> run your patch with rr in chaos mode, with your initial test case in a loop
> (10k iterations or something big), to make sure there's not another race
> condition hidden out there?

Unfortunately, I don't have a recent Intel CPU. I've been trying to port rr, but I don't have a working binary right now.

I've done some minimal testing to make sure that both parts of our break condition are independently true when we expect them to be, at least. I'll try running it a few thousand times to see whether anything odd shows up.
Attachment #8821293 - Attachment is obsolete: true
(In reply to pipcet from comment #9)
> (In reply to Benjamin Bouvier [:bbouvier] from comment #8)
> Thanks for the feedback! I really appreciate it, sorry it's taking me a
> while to get used to the coding style.

Thank you for the patch! The coding style is a bit tedious, and I'm being finicky, so no worries to have here :)

> Unfortunately, I don't have a recent Intel CPU. I've been trying to port rr,
> but I don't have a working binary right now.
> 
> I've done some minimal testing to make sure that both parts of our break
> condition are independently true when we expect them to be, at least. I'll
> try running it a few thousand times to see whether anything odd shows up.

OK, I'll test for you, assuming the test case in comment 0 could trigger the issue.
(In reply to Benjamin Bouvier [:bbouvier] from comment #11)
> > I've done some minimal testing to make sure that both parts of our break
> > condition are independently true when we expect them to be, at least. I'll
> > try running it a few thousand times to see whether anything odd shows up.
> 
> OK, I'll test for you, assuming the test case in comment 0 could trigger the
> issue.

What I've been doing is adding a second, long-running promise task like this:

Promise.resolve(1).then(() => { var now = Date.now(); while (Date.now() - now < 10000); });

to ensure that the task is no longer outstanding before we check whether it's in ->finished. The race condition would happen if the promise actually resolved at precisely the same time as the compilation task finished; I've been trying to simulate that with gdb breakpoints, and it appears to work: asyncTasks gets properly locked and we only look at consistent states of it.
Assignee: nobody → pipcet
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Comment on attachment 8821526 [details] [diff] [review]
moz-20161223.patch

Review of attachment 8821526 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the updated patch; having a static helper function to allow a do{}while loop would be slightly nicer, but this is just aesthetics.

Do you have a full test case you could provide to reproduce the issue, or a hosted libc.wasm file, so I could reproduce the initial issue and try under rr, please? My few attempts have been unsuccessful with AngryBots or simple wast modules.
Attachment #8821526 - Flags: review+
Sure, it's at https://gist.github.com/pipcet/ed5e1a3ffeae8167bdf7905ed035834d/raw/dfae121b638dd50a228fea04abe2e4bb1fdceab5/libc.wasm Strangely, I can reproduce the issue with AngryBots13.wasm instead of libc.wasm, but only in a build with reduced optimization. So it's timing-sensitive, probably, which would make sense. I'll try a fresh build to see whether that makes any difference.
Okay, I must have had my builds confused. Now I'm able to reproduce the issue with the original script and with AngryBots13.wasm, in both debug and optimized builds. This is on a (somewhat slow) quad-core system, but --helper-threads=1 doesn't make a difference.
(In reply to pipcet from comment #14)
> Sure, it's at
> https://gist.github.com/pipcet/ed5e1a3ffeae8167bdf7905ed035834d/raw/
> dfae121b638dd50a228fea04abe2e4bb1fdceab5/libc.wasm
> 
> Strangely, I can reproduce the issue with AngryBots13.wasm instead of
> libc.wasm, but only in a build with reduced optimization. So it's
> timing-sensitive, probably, which would make sense. I'll try a fresh build
> to see whether that makes any difference.

Thanks for the link! I can trigger the issue copying the code in comment 0, on an inbound build. But if I call drainJobQueue() after the entire chain of promises, which is to be done after promises have been launched, the issue does not reproduce.

Do you have a longer piece of code to reproduce the initial issue, including a call to drainJobQueue()? I assume you have a way to reproduce even with this call, otherwise the patch would not make sense; I'd just want to have a reliable test case so I can repeat the test under rr chaos mode.

(Feel free to go on irc, #ionmonkey or #jsapi on irc.mozilla.org, for quicker back and forth; I'm bbouvier there)
Flags: needinfo?(pipcet)
(In reply to Benjamin Bouvier [:bbouvier] from comment #16)
> Thanks for the link! I can trigger the issue copying the code in comment 0,
> on an inbound build. But if I call drainJobQueue() after the entire chain of
> promises, which is to be done after promises have been launched, the issue
> does not reproduce.

Thanks for explaining; I finally understand now! No, a single call to drainJobQueue() is not sufficient in general, though it does help with the very simple code in comment 0. Try this:

-----
Promise.resolve(1).then(() => {
  return WebAssembly.compile(os.file.readFile("libc.wasm", "binary")).then((module) => {
    imports = {};
    imports.sys = {};
    imports.sys.eh_return = () => {};
    imports.sys.call = () => {};
    imports.sys.indcall = () => {};
    imports.sys.trace = () => {};
    imports.sys.got = 0;
    imports.sys.gpo = 0;
    imports.sys.plt = 0;
    imports.sys.table = new WebAssembly.Table({element:"anyfunc",initial:65536,maximum:65536});
    imports.sys.memory = new WebAssembly.Memory({initial: 16384, maximum: 16384});
    console.log("here");
    return WebAssembly.instantiate(module, imports)
  }).then((e) => {
    console.log("here");
  }).catch(e => console.log(e));
}).then(() => {}).then(() => {
  return WebAssembly.compile(os.file.readFile("libc.wasm", "binary")).then((module) => {
    imports = {};
    imports.sys = {};
    imports.sys.eh_return = () => {};
    imports.sys.call = () => {};
    imports.sys.indcall = () => {};
    imports.sys.trace = () => {};
    imports.sys.got = 0;
    imports.sys.gpo = 0;
    imports.sys.plt = 0;
    imports.sys.table = new WebAssembly.Table({element:"anyfunc",initial:65536,maximum:65536});
    imports.sys.memory = new WebAssembly.Memory({initial: 16384, maximum: 16384});
    console.log("here");
    return WebAssembly.instantiate(module, imports)
  }).then((e) => {
    console.log("here");
  }).catch(e => console.log(e));
});
drainJobQueue();
-----

> Do you have a longer piece of code to reproduce the initial issue, including
> a call to drainJobQueue()?

Is that okay, or am I still misunderstanding?

> I assume you have a way to reproduce even with
> this call, otherwise the patch would not make sense; I'd just want to have a
> reliable test case so I can repeat the test under rr chaos mode. (Feel free
> to go on irc, #ionmonkey or #jsapi on irc.mozilla.org, for quicker back and
> forth; I'm bbouvier there)

Will be on irc in a minute.

(I must be misreading the shell code; I thought the implicit call to DrainJobQueue (capital D) at the end was sufficient to avoid having to call drainJobQueue, so I never bothered doing that!)
Flags: needinfo?(pipcet)
I didn't see that DrainJobQueue was called by the shell; this actually explains why a single call to drainJobQueue() in JS fixes the first test case: the async task "instantiate" is handled by the second, implicit call to DrainJobQueue. Here's a try build: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9d6251cbb9d9b3aff7d89cad165b9b3157480875
Pushed by bbouvier@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/59610e848db3
JS shell: wait for async tasks in DrainJobQueue(); r=bbouvier
As it looked pretty green (the only red was a recently introduced static analysis build that is known to fail), I've pushed it to inbound. Thanks again for the report and the patch!
Priority: -- → P1
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53