Closed Bug 1322681 Opened 8 years ago Closed 8 years ago

js shell: segfault when instantiating wasm module from a Promise

Categories

(Core :: JavaScript Engine: JIT, defect, P1)

x86_64 Linux

Tracking


Status: RESOLVED FIXED
Target Milestone: mozilla53
Tracking Status
firefox53 --- fixed

People

(Reporter: pipcet, Assigned: pipcet)

Details

Attachments

(1 file, 2 obsolete files)

I am running into segfaults when instantiating a large wasm module from a promise in the JS shell. My test code is:

Promise.resolve(1).then(() => {
  WebAssembly.compile(os.file.readFile("libc.wasm", "binary")).then((module) => {
    imports = {};
    imports.sys = {};
    imports.sys.eh_return = () => {};
    imports.sys.indcall = () => {};
    imports.sys.trace = () => {};
    imports.sys.got = 0;
    imports.sys.gpo = 0;
    imports.sys.plt = 0;
    imports.sys.table = new WebAssembly.Table({element:"anyfunc",initial:65536,maximum:65536});
    imports.sys.memory = new WebAssembly.Memory({initial: 16384, maximum: 16384});
    console.log("here");
    return WebAssembly.instantiate(module, imports)
  }).then((e) => {
    console.log("here");
  }).catch(e => console.log(e));
});

which merely instantiates the "libc.wasm" wasm module with the imports it needs and prints "here" twice.

Investigation with gdb shows that the issue is in js.cpp:DrainJobQueue: it finishes all async tasks, then runs a job which adds another outstanding async task; we then shut down the JS context while that task is still running, and when it finishes, it segfaults trying to call the (now-poisoned) ->finishAsyncTaskCallback(). The attached patch appears to fix things here by adding an additional loop to DrainJobQueue: as long as there are async tasks or queued jobs, run all async tasks, then all queued jobs.
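For readers without js.cpp at hand, the fix described above boils down to roughly the following loop structure (a sketch only, pieced together from this description and from the names sc->asyncTasks, outstanding, and finished that appear in the review comments below; the real shell code has more detail):

-----
// Sketch of the extra loop in the shell's DrainJobQueue (not the exact patch).
while (true) {
    // 1. Wait for the outstanding async tasks (wasm compilation/instantiation)
    //    and run their finish callbacks on the main thread.  (elided)

    // 2. Run promise jobs until the job queue is empty; these jobs may start
    //    new async tasks, e.g. via WebAssembly.instantiate.  (elided)

    // 3. Stop only once nothing is left on either side; otherwise loop again.
    ExclusiveData<ShellAsyncTasks>::Guard asyncTasks = sc->asyncTasks.lock();
    if (asyncTasks->outstanding == 0 && asyncTasks->finished.length() == 0)
        break;
}
-----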
Attached patch proposed patch (obsolete) — Splinter Review
Thanks for the report. Please double-check me, but I think it's shell-only and tied to the fact that sync and async promise tasks are interleaved. Does it fix the issue if you call drainJobQueue in a loop?
Component: JavaScript Engine → JavaScript Engine: JIT
(In reply to Benjamin Bouvier [:bbouvier] from comment #2)
> Thanks for the report. Please double-check me, but I think it's shell-only
> and tied to the fact that sync and async promise tasks are interleaved. Does
> it fix the issue if you call drainJobQueue in a loop?

Er, that's what my patch does, right? I mean, the loop is inside drainJobQueue rather than around it, but the effect should be the same? So I'm not quite sure what to test. A restricted loop in drainJobQueue fixes things, so it's a safe bet an unrestricted loop around it would, too. Do you want me to try that?
Flags: needinfo?(bbouvier)
Comment on attachment 8817611 [details] [diff] [review]
proposed patch

Review of attachment 8817611 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry I was on PTO after my last comment. I've given this some thought and I think it's fine to do this: it's a shell-only issue in a testing-only function, and it doesn't complicate the code too much with a comment. Looks good but requesting a few changes, so I'd like to have another look before r+, thanks.

::: js/src/shell/js.cpp
@@ +819,5 @@
> 
> +    RootedObject job(cx);
> +    JS::HandleValueArray args(JS::HandleValueArray::empty());
> +    RootedValue rval(cx);
> +    // Execute jobs in a loop until we've reached the end of the queue.

pre-existing nit: can you add a \n before this comment please?

@@ +837,1 @@
>     {

nit: can you add a \n before this new block and a comment before the top surrounding loop explaining why we need this? (something like "loop in case an async task ends up queuing other async tasks")

@@ +837,4 @@
>     {
> +        ExclusiveData<ShellAsyncTasks>::Guard asyncTasks = sc->asyncTasks.lock();
> +        if (!asyncTasks->outstanding)
> +            break;

how about a do {} while loop instead, with sc->asyncTasks.lock()->outstanding as the loop condition?
Attachment #8817611 - Flags: feedback+
Flags: needinfo?(bbouvier)
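(For reference, the do {} while shape suggested in the review above would look roughly like this; this is only a sketch, and the following comments explain why checking ->outstanding alone turns out to be insufficient:)

-----
// Reviewer's suggestion from the comment above (sketch; not what eventually landed).
do {
    // ... run finished async tasks, then drain the job queue ...
} while (sc->asyncTasks.lock()->outstanding);
-----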
Actually, upon rereading my code I think it introduces a race condition. I'll have another look, if that's okay.
(In reply to Benjamin Bouvier [:bbouvier] from comment #4)
> I've given this some thought and I think it's fine to do this: it's a
> shell-only issue in a testing-only function, and it doesn't complicate the
> code too much with a comment. Looks good but requesting a few changes, so
> I'd like to have another look before r+, thanks.

As I said, I think there was a new race condition: ->outstanding might be 0 if an async task launched by a promise job had already finished by the time of the check, so we need to check ->finished.length() as well. Sadly, that means we can't use the do {} while() loop (but we could if we added an AsyncTasksPending() function, where "pending" = "outstanding or finished"). Further feedback would be appreciated.
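For illustration, the AsyncTasksPending() helper mentioned above could look something like this (a hypothetical sketch only; no such function exists in the tree, the ShellContext parameter type is an assumption, and the other names are taken from the shell code quoted in the reviews):

-----
// Hypothetical helper: a task is "pending" if it is still outstanding, or if it
// has finished but its finish callback hasn't been run on the main thread yet.
static bool
AsyncTasksPending(ShellContext* sc)  // parameter type assumed for this sketch
{
    ExclusiveData<ShellAsyncTasks>::Guard asyncTasks = sc->asyncTasks.lock();
    return asyncTasks->outstanding > 0 || asyncTasks->finished.length() > 0;
}
-----

With a helper like this, the outer loop could be written as do { ... } while (AsyncTasksPending(sc)); instead of breaking out of a while (true) loop.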
Severity: normal → minor
Attached patch proposed patch (obsolete) — Splinter Review
Attachment #8817611 - Attachment is obsolete: true
Comment on attachment 8821293 [details] [diff] [review]
proposed patch

Review of attachment 8821293 [details] [diff] [review]:
-----------------------------------------------------------------

Duh, it sounds much right, thanks for digging a bit. Any chance you could run your patch with rr in chaos mode, with your initial test case in a loop (10k iterations or something big), to make sure there's not another race condition hidden out there? If there's no hidden monster, the last few nits I've posted should be easy to address. Thanks!

::: js/src/shell/js.cpp
@@ +779,5 @@
>          return true;
> 
> +    // It's possible for jobs on the job queue to add async tasks and vice
> +    // versa, so loop until the queue is empty and there are no outstanding
> +    // tasks.

I'd remove this comment; the one above the last block seems enough to me.

@@ +807,5 @@
> 
> +    // It doesn't make sense for job queue draining to be reentrant. At the
> +    // same time we don't want to assert against it, because that'd make
> +    // drainJobQueue unsafe for fuzzers. We do want fuzzers to test this, so
> +    // we simply ignore nested calls of drainJobQueue.

(fwiw, comments should fit in <80 chars, so not sure this will be the case after adding one indent, here and above)

@@ +835,4 @@
>     {
> +        ExclusiveData<ShellAsyncTasks>::Guard asyncTasks = sc->asyncTasks.lock();
> +        if (asyncTasks->outstanding == 0 &&
> +            asyncTasks->finished.length() == 0) {

nit: should fit in one line (< 100 chars)

@@ +836,5 @@
> +        ExclusiveData<ShellAsyncTasks>::Guard asyncTasks = sc->asyncTasks.lock();
> +        if (asyncTasks->outstanding == 0 &&
> +            asyncTasks->finished.length() == 0) {
> +            // No outstanding or unfinished async tasks, and the job queue
> +            // is empty. We're done.

nit: the code above + the comment above the block seem explanatory enough that this comment isn't needed. In this case, the `if` body can fit on one line and braces aren't needed, per the SpiderMonkey style guide.
Attachment #8821293 - Flags: feedback+
(In reply to Benjamin Bouvier [:bbouvier] from comment #8)

Thanks for the feedback! I really appreciate it, sorry it's taking me a while to get used to the coding style.

> Duh, it sounds much right, thanks for digging a bit. Any chance you could
> run your patch with rr in chaos mode, with your initial test case in a loop
> (10k iterations or something big), to make sure there's not another race
> condition hidden out there?

Unfortunately, I don't have a recent Intel CPU. I've been trying to port rr, but I don't have a working binary right now.

I've done some minimal testing to make sure that both parts of our break condition are independently true when we expect them to be, at least. I'll try running it a few thousand times to see whether anything odd shows up.
Attachment #8821293 - Attachment is obsolete: true
(In reply to pipcet from comment #9)
> (In reply to Benjamin Bouvier [:bbouvier] from comment #8)
> Thanks for the feedback! I really appreciate it, sorry it's taking me a
> while to get used to the coding style.

Thank you for the patch! The coding style is a bit tedious, and I'm being finicky, so no worries to have here :)

> Unfortunately, I don't have a recent Intel CPU. I've been trying to port rr,
> but I don't have a working binary right now.
> 
> I've done some minimal testing to make sure that both parts of our break
> condition are independently true when we expect them to be, at least. I'll
> try running it a few thousand times to see whether anything odd shows up.

OK, I'll test for you, assuming the test case in comment 0 could trigger the issue.
(In reply to Benjamin Bouvier [:bbouvier] from comment #11)
> > I've done some minimal testing to make sure that both parts of our break
> > condition are independently true when we expect them to be, at least. I'll
> > try running it a few thousand times to see whether anything odd shows up.
> 
> OK, I'll test for you, assuming the test case in comment 0 could trigger the
> issue.

What I've been doing is adding a second, long-running promise task like this:

Promise.resolve(1).then(() => { var now = Date.now(); while (Date.now() - now < 10000); });

to ensure that the task is no longer outstanding before we check whether it's in ->finished. The race condition would happen if the promise actually resolved at precisely the same time as the compilation task finished; I've been trying to simulate that with gdb breakpoints, and it appears to work: asyncTasks gets properly locked and we only look at consistent states of it.
Assignee: nobody → pipcet
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Comment on attachment 8821526 [details] [diff] [review]
moz-20161223.patch

Review of attachment 8821526 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the updated patch; having a static helper function to allow a do{}while loop would be slightly nicer, but this is just aesthetics.

Do you have a full test case you could provide to reproduce the issue, or a hosted libc.wasm file, so I could reproduce the initial issue and try under rr, please? My few attempts have been unsuccessful with AngryBots or simple wast modules.
Attachment #8821526 - Flags: review+
Sure, it's at https://gist.github.com/pipcet/ed5e1a3ffeae8167bdf7905ed035834d/raw/dfae121b638dd50a228fea04abe2e4bb1fdceab5/libc.wasm Strangely, I can reproduce the issue with AngryBots13.wasm instead of libc.wasm, but only in a build with reduced optimization. So it's timing-sensitive, probably, which would make sense. I'll try a fresh build to see whether that makes any difference.
Okay, I must have had my builds confused. Now I'm able to reproduce the issue with the original script and with AngryBots13.wasm, in both debug and optimized builds. This is on a (somewhat slow) quad-core system, but --helper-threads=1 doesn't make a difference.
(In reply to pipcet from comment #14)
> Sure, it's at
> https://gist.github.com/pipcet/ed5e1a3ffeae8167bdf7905ed035834d/raw/
> dfae121b638dd50a228fea04abe2e4bb1fdceab5/libc.wasm
> 
> Strangely, I can reproduce the issue with AngryBots13.wasm instead of
> libc.wasm, but only in a build with reduced optimization. So it's
> timing-sensitive, probably, which would make sense. I'll try a fresh build
> to see whether that makes any difference.

Thanks for the link! I can trigger the issue copying the code in comment 0, on an inbound build. But if I call drainJobQueue() after the entire chain of promises, which is to be done after promises have been launched, the issue does not reproduce.

Do you have a longer piece of code to reproduce the initial issue, including a call to drainJobQueue()? I assume you have a way to reproduce even with this call, otherwise the patch would not make sense; I'd just want to have a reliable test case so I can repeat the test under rr chaos mode.

(Feel free to go on irc, #ionmonkey or #jsapi on irc.mozilla.org, for quicker back and forth; I'm bbouvier there)
Flags: needinfo?(pipcet)
(In reply to Benjamin Bouvier [:bbouvier] from comment #16)
> Thanks for the link! I can trigger the issue copying the code in comment 0,
> on an inbound build. But if I call drainJobQueue() after the entire chain of
> promises, which is to be done after promises have been launched, the issue
> does not reproduce.

Thanks for explaining; I finally understand now! No, a single call to drainJobQueue() is not sufficient in general, though it does help with the very simple code in comment 0. Try this:

-----
Promise.resolve(1).then(() => {
  return WebAssembly.compile(os.file.readFile("libc.wasm", "binary")).then((module) => {
    imports = {};
    imports.sys = {};
    imports.sys.eh_return = () => {};
    imports.sys.call = () => {};
    imports.sys.indcall = () => {};
    imports.sys.trace = () => {};
    imports.sys.got = 0;
    imports.sys.gpo = 0;
    imports.sys.plt = 0;
    imports.sys.table = new WebAssembly.Table({element:"anyfunc",initial:65536,maximum:65536});
    imports.sys.memory = new WebAssembly.Memory({initial: 16384, maximum: 16384});
    console.log("here");
    return WebAssembly.instantiate(module, imports)
  }).then((e) => {
    console.log("here");
  }).catch(e => console.log(e));
}).then(() => {}).then(() => {
  return WebAssembly.compile(os.file.readFile("libc.wasm", "binary")).then((module) => {
    imports = {};
    imports.sys = {};
    imports.sys.eh_return = () => {};
    imports.sys.call = () => {};
    imports.sys.indcall = () => {};
    imports.sys.trace = () => {};
    imports.sys.got = 0;
    imports.sys.gpo = 0;
    imports.sys.plt = 0;
    imports.sys.table = new WebAssembly.Table({element:"anyfunc",initial:65536,maximum:65536});
    imports.sys.memory = new WebAssembly.Memory({initial: 16384, maximum: 16384});
    console.log("here");
    return WebAssembly.instantiate(module, imports)
  }).then((e) => {
    console.log("here");
  }).catch(e => console.log(e));
});
drainJobQueue();
-----

> Do you have a longer piece of code to reproduce the initial issue, including
> a call to drainJobQueue()?

Is that okay, or am I still misunderstanding?

> I assume you have a way to reproduce even with
> this call, otherwise the patch would not make sense; I'd just want to have a
> reliable test case so I can repeat the test under rr chaos mode. (Feel free
> to go on irc, #ionmonkey or #jsapi on irc.mozilla.org, for quicker back and
> forth; I'm bbouvier there)

Will be on irc in a minute.

(I must be misreading the shell code; I thought the implicit call to DrainJobQueue (capital D) at the end was sufficient to avoid having to call drainJobQueue, so I never bothered doing that!)
Flags: needinfo?(pipcet)
I didn't see that DrainJobQueue was called by the shell; this actually explains why a single call to drainJobQueue() in JS fixes the first test case: the async task "instantiate" is handled by the second, implicit call to DrainJobQueue. Here's a try build: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9d6251cbb9d9b3aff7d89cad165b9b3157480875
Pushed by bbouvier@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/59610e848db3
JS shell: wait for async tasks in DrainJobQueue(); r=bbouvier
As it looked pretty green (the only red was a recently introduced static analysis build that is known to fail), I've pushed it to inbound. Thanks again for the report and the patch!
Priority: -- → P1
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53