Closed Bug 961317 Opened 11 years ago Closed 11 years ago

Intermittent TEST-UNEXPECTED-FAIL | test_bug_795164.html | Test timed out. | test_uninstall_errors.html | mozApps error: MULTIPLE_APPS_PER_ORIGIN_FORBIDDEN

Categories

(Toolkit Graveyard :: OS.File, defect)

x86_64
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(firefox29 unaffected, firefox30 fixed, firefox31 fixed, firefox-esr24 unaffected, b2g-v1.3 unaffected, b2g-v1.4 fixed, b2g-v2.0 fixed)

RESOLVED FIXED
mozilla31
Tracking Status
firefox29 --- unaffected
firefox30 --- fixed
firefox31 --- fixed
firefox-esr24 --- unaffected
b2g-v1.3 --- unaffected
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed

People

(Reporter: KWierso, Assigned: Yoric)

References

Details

(Keywords: intermittent-failure)

Attachments

(2 files, 3 obsolete files)

https://tbpl.mozilla.org/php/getParsedLog.php?id=33195780&tree=Fx-Team slave: tst-linux64-ec2-120 14:44:11 INFO - 876 INFO TEST-START | /tests/dom/apps/tests/test_bug_795164.html 14:44:11 INFO - Server Etag: apptype-hosted-getmanifest-true-3 14:44:11 INFO - No Client Etag 14:44:11 INFO - System JS : ERROR resource://gre/modules/osfile/_PromiseWorker.jsm:163 - NS_ERROR_FAILURE: 14:44:11 INFO - 877 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/apps/tests/test_bug_795164.html | Test timed out. 14:44:11 INFO - 878 INFO TEST-INFO | MEMORY STAT vsize after test: 110116864 14:44:11 INFO - 879 INFO TEST-INFO | MEMORY STAT residentFast after test: 55431168 14:44:11 INFO - 880 INFO TEST-INFO | MEMORY STAT heapAllocated after test: 18777504 14:44:11 INFO - 881 INFO TEST-END | /tests/dom/apps/tests/test_bug_795164.html | finished in 317802ms 14:44:41 INFO - 1128 INFO TEST-START | /tests/dom/apps/tests/test_uninstall_errors.html 14:44:41 INFO - Server Etag: apptype-hosted-getmanifest-true-3 14:44:41 INFO - No Client Etag 14:44:41 INFO - 1129 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/apps/tests/test_uninstall_errors.html | mozApps error: MULTIPLE_APPS_PER_ORIGIN_FORBIDDEN 14:44:41 INFO - 1130 INFO TEST-INFO | MEMORY STAT vsize after test: 110510080 14:44:41 INFO - 1131 INFO TEST-INFO | MEMORY STAT residentFast after test: 59572224 14:44:41 INFO - 1132 INFO TEST-INFO | MEMORY STAT heapAllocated after test: 22362364 14:44:41 INFO - 1133 INFO TEST-END | /tests/dom/apps/tests/test_uninstall_errors.html | finished in 1664ms
The error is referring to OS.File, it's possible this intermittent failure is happening after my patch in bug 959420, but it looks like an OS.File specific problem.
These is a very frequent orange on TBPL (as you can see in the comments here). This needs an owner ASAP.
Flags: needinfo?(overholt)
Flags: needinfo?(dteller)
Jan, please take a look here. The timing of the oranges indicates it was triggered by Marcos' patch as he suspects in comment #1 but he thinks it's an underlying bug.
Assignee: nobody → Jan.Varga
Flags: needinfo?(overholt)
Let's look at the errors: 01:26:19 INFO - System JS : ERROR resource://gre/modules/osfile/_PromiseWorker.jsm:180 - NS_ERROR_FAILURE: I have triple-checked – this is an error that actually cannot be launched by this line of code, neither in JS (which doesn't call any XPCOM) nor in the underlying C++ (which is purely JSAPI-based so cannot launched XPCOM errors). So, I suspect that the error is actually somewhere else, but is reported at this line due to some JSAPI-level error somewhere else in the code. Note that, a few lines above, we have 01:26:19 INFO - [Exception... "[JavaScript Error: "DOMApplicationRegistry is not defined" {file: "resource://gre/modules/BrowserElementParent.jsm" line: 238}]'[JavaScript Error: "DOMApplicationRegistry is not defined" {file: "resource://gre/modules/BrowserElementParent.jsm" line: 238}]' when calling method: [nsIObserver::observe]" nsresult: "0x80570021 (NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS)" location: "JS frame :: http://mochi.test:8888/tests/dom/apps/tests/test_app_update.html :: checkAppState :: line 259" data: yes] Which looks like a possible source of error, too. I have the feeling that this is totally unrelated to OS.File.
Flags: needinfo?(dteller)
(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #73) > 01:26:19 INFO - [Exception... "[JavaScript Error: > "DOMApplicationRegistry is not defined" {file: > "resource://gre/modules/BrowserElementParent.jsm" line: 238}]'[JavaScript > Error: "DOMApplicationRegistry is not defined" {file: > "resource://gre/modules/BrowserElementParent.jsm" line: 238}]' when calling > method: [nsIObserver::observe]" nsresult: "0x80570021 > (NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS)" location: "JS frame :: > http://mochi.test:8888/tests/dom/apps/tests/test_app_update.html :: > checkAppState :: line 259" data: yes] Patrick, could this be fallout from bug 960641? Actually, the dates don't line up but maybe you have some idea anyway :)
Flags: needinfo?(pwang)
I don't really have an idea how this happen. But I would guess that there might be exception thrown during loading WebApps.jsm, so the symbol is not exposed by the lazy getter properly.
Flags: needinfo?(pwang)
(In reply to Patrick Wang (ChihKai) (:kk1fff) from comment #79) > I don't really have an idea how this happen. But I would guess that there > might be exception thrown during loading WebApps.jsm, so the symbol is not > exposed by the lazy getter properly. The error that appears before "DOMApplicationRegistry is not defined" is: 01:20:05 INFO - ************************************************************ 01:20:05 INFO - * Call to xpconnect wrapped JSObject produced this error: * 01:20:25 INFO - [Exception... "Component returned failure code: 0x80570016 (NS_ERROR_XPC_GS_RETURNED_FAILURE) [nsIJSCID.getService]" nsresult: "0x80570016 (NS_ERROR_XPC_GS_RETURNED_FAILURE)" location: "JS frame :: resource://gre/modules/XPCOMUtils.jsm :: XPCU_serviceLambda :: line 202" data: no] 01:20:25 INFO - ************************************************************ XPCOMUtils.jsm line 202 is in a lambda inside defineLazyServiceGetter, and there are three calls to defineLazyServiceGetter in Webapps.jsm: XPCOMUtils.defineLazyServiceGetter(this, "ppmm", "@mozilla.org/parentprocessmessagemanager;1", "nsIMessageBroadcaster"); XPCOMUtils.defineLazyServiceGetter(this, "cpmm", "@mozilla.org/childprocessmessagemanager;1", "nsIMessageSender"); … XPCOMUtils.defineLazyServiceGetter(this, "dataStoreService", "@mozilla.org/datastore-service;1", "nsIDataStoreService"); So presumably one of these is throwing the exception, which causes DOMApplicationRegistry to be undefined.
This is the key line in the log: 09:38:26 INFO - System JS : ERROR resource://gre/modules/osfile/_PromiseWorker.jsm:180 - NS_ERROR_FAILURE: That is a call to worker.postMessage: <http://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/modules/_PromiseWorker.jsm#180> Which means that MessageEventRunnable::Dispatch is failing: <http://mxr.mozilla.org/mozilla-central/source/dom/workers/WorkerPrivate.cpp#2734> There are a limited number of ways that function can fail, basically read <http://mxr.mozilla.org/mozilla-central/source/dom/workers/WorkerRunnable.cpp#80> and everything that function calls. Somebody needs to add some printf-debugging code to output something on each one of those try paths, send that to try and trigger a bunch of test runs to get this fail there and look at the log and see which one of the printf's show up. Then we will know where this error is coming from. Note that Webapps.jsm includes osfile.jsm so this is how all of this stuff is bubbling up the chain.
Assignee: Jan.Varga → nobody
Depends on: 972927
I've tried to use NetUtil instead of OS.File.writeAtomic, no failures in 5 try runs.
OK, I basically did what I recommended in comment 85 on the try server: https://tbpl.mozilla.org/?tree=Try&rev=86b7c7c29ad0. In the log, I see messages like: 11:40:56 INFO - XXXehsan DispatchInternal6 -2147418113 11:40:56 INFO - XXXehsan 2 11:40:56 INFO - XXXehsan DispatchInternal6 -2147418113 11:40:56 INFO - XXXehsan 2 11:40:56 INFO - XXXehsan DispatchInternal6 -2147418113 11:40:56 INFO - XXXehsan 2 11:40:56 INFO - XXXehsan DispatchInternal6 -2147418113 11:40:56 INFO - XXXehsan 2 -2147418113 is NS_ERROR_UNEXPECTED. If you look at my patch in that try push, you'll see that the error is coming from <http://hg.mozilla.org/mozilla-central/annotate/0085a162499f/dom/workers/WorkerPrivate.cpp#l2225> which means that OS.file is somehow attempting to post a message to the promise worker when it's shutting down, which results in the failure change observed here. This is extremely bad, since it can cause into all sorts of failures in any tests which uses apps. We have been seeing this pattern in the data store tests as well (which are currently disabled because they're timing out all the time.) It might actually be a real bug which also manifests itself to our users, but it's hard to say for sure without further investigation. <http://hg.mozilla.org/mozilla-central/annotate/0085a162499f/toolkit/components/osfile/modules/osfile_async_front.jsm#l256> seems to be the only place where we post something to this worker. There are two catch blocks there but of course none of them catch this exception. Perhaps the right thing to do here would be to add another catch block for other types of errors? It's hard for me to say since I don't understand this code much. David, can you please take a look? Thanks!
Component: DOM: Apps → OS.File
Flags: needinfo?(dteller)
Product: Core → Toolkit
(In reply to Marco Castelluccio [:marco] from comment #98) > I've tried to use NetUtil instead of OS.File.writeAtomic, no failures in 5 > try runs. Marco, can you please post this patch here? Thanks!
Flags: needinfo?(mar.castelluccio)
Nathan, it seems like Yoirc is out sick. This bug is really serious and it hits pretty much every b2g test which does anything with apps. Can you please take a look? See my analysis in comment 99. Thanks.
Flags: needinfo?(nfroyd)
Hm, bug 965527 looks like it touches sort of the same area, but I don't think it actually helps with this. Why is the promise worker shutting down in the first place?
Flags: needinfo?(nfroyd)
(In reply to comment #102) > Hm, bug 965527 looks like it touches sort of the same area, but I don't think > it actually helps with this. > > Why is the promise worker shutting down in the first place? I don't know, this code is written in a style which makes it *very* hard to read. I tried and failed, which is why I am needinfo-ing other people here. :-)
(In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness, emailapocalypse) from comment #103) > (In reply to comment #102) > > Hm, bug 965527 looks like it touches sort of the same area, but I don't think > > it actually helps with this. > > > > Why is the promise worker shutting down in the first place? > > I don't know, this code is written in a style which makes it *very* hard to > read. I tried and failed, which is why I am needinfo-ing other people here. > :-) Ah. :) A plausible-looking scenario is this: - Something calls resetWorker: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/modules/osfile_async_front.jsm#1239 This API was added at the request of the B2G folks so that the OS.File worker didn't have to hang around all the time, consuming memory. - which then attempts to shutdown the worker thread: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/modules/osfile_async_worker.js#283 - and we attempt to send back the reply: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/modules/osfile_async_worker.js#99 The interesting line is here: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/modules/osfile_async_worker.js#107 where we call self.close() in an attempt to shutdown the worker. I don't immediately see where |self| is coming from here, though this is what the MDN page recommends? https://developer.mozilla.org/en-US/docs/Web/Guide/Performance/Using_web_workers#Terminating_a_worker I think the expected order of events here is: - The worker thread shuts down; - The handler for the Meta_reset call runs and nulls out worker: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/modules/osfile_async_front.jsm#1245 - Future post() calls then notice that the worker no longer exists, and restart the worker. But since this involves threading, the sequence looks like: - The worker thread shuts down; - post() is called and tries to send something to the thread; - Calamity ensues, because the worker thread is gone, even though the scheduler doesn't know it yet. And so we have a race. Does that sound plausible to anybody else?
Argh, apparently generating those mxr links from an old tree means they don't quite point at the right places. Ideally people will be able to follow what's going on. =/
I don't understand anything from this code to be honest, but your theory sounds plausible at least!
I'll post my patch in bug 981085.
Flags: needinfo?(mar.castelluccio)
(In reply to Nathan Froyd (:froydnj) from comment #104) > The interesting line is here: > > http://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/ > modules/osfile_async_worker.js#107 > > where we call self.close() in an attempt to shutdown the worker. I don't > immediately see where |self| is coming from here, though this is what the > MDN page recommends? > > https://developer.mozilla.org/en-US/docs/Web/Guide/Performance/ > Using_web_workers#Terminating_a_worker |self| is a global of workers. > I think the expected order of events here is: [...] > And so we have a race. Does that sound plausible to anybody else? Shouldn't this have been fixed by the latest patch to bug 965527? Or are we talking about an older codebase?
Flags: needinfo?(dteller)
Also, I can't find any trace of a OS.File error in the logs. Ehsan: Indeed, the code of OS.File is often hard to follow. Some of it can be improved, but it's going to be a long task – neither ChromeWorkers nor js-ctypes make it very easy to work nice code.
(In reply to David Rajchenbach Teller [:Yoric] (currently away from bugzilla – ETA March 11th – please use "needinfo?") from comment #110) > > And so we have a race. Does that sound plausible to anybody else? > > Shouldn't this have been fixed by the latest patch to bug 965527? > Or are we talking about an older codebase? You can test to see if that fix has worked by pushing a try run which re-enables the tests disabled here. Most of the tests that have been hitting this bug have been disabled because they fail intermittently too many times. (In reply to David Rajchenbach Teller [:Yoric] (currently away from bugzilla – ETA March 11th – please use "needinfo?") from comment #111) > Also, I can't find any trace of a OS.File error in the logs. Comment 85 describes the things you see in the log, and comment 99 explains my debugging session on try up to the point I narrowed this down to the OS.File code. Did you read both of them? The latter is very important as it explains everything I've done over the period of a few days debugging this on try. > Ehsan: Indeed, the code of OS.File is often hard to follow. Some of it can > be improved, but it's going to be a long task – neither ChromeWorkers nor > js-ctypes make it very easy to work nice code. I talked to Taras and Nathan about this on Friday (since you were out on that day.) Nathan told me that we're rewriting parts of this code in C++, which should help make things better, and I did look at the code and it didn't look very simple to me. I argued the case that JS is a very strange choice to have implemented this in, and Taras argued the opposite case saying that node.js has been able to do this in js, so should we. That didn't really go anywhere I don't think. Anyways, I filed bug 981085 to get the DOM apps code moved away from OS.File mostly because of this bug and the fact that it seems to be much slower and more resource intensive than the alternative. My immediate interest is to see this bug fixed so that we can switch back to OS.File once the rest of the problems with it have been addressed. If you'd like to discuss future improvements, please ping me, as I'm interested to see this component improve, but none of this should block fixing this bug. :-)
Flags: needinfo?(dteller)
(In reply to comment #111) > Also, I can't find any trace of a OS.File error in the logs. (If this refers to Nathan's try push, please ignore that part of my reply! I don't know what Nathan has pushed to try, but presumably if it doesn't help discover anything then we need more logging or something.) Also, note that I expect this to be reproducible locally in an emulator build.
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #112) > (In reply to David Rajchenbach Teller [:Yoric] (currently away from bugzilla > – ETA March 11th – please use "needinfo?") from comment #110) > > > And so we have a race. Does that sound plausible to anybody else? > > > > Shouldn't this have been fixed by the latest patch to bug 965527? > > Or are we talking about an older codebase? > > You can test to see if that fix has worked by pushing a try run which > re-enables the tests disabled here. Most of the tests that have been > hitting this bug have been disabled because they fail intermittently too > many times. Let me rephrase: are we talking about a B2G built with an old Gecko or with the latest Gecko? It is my understanding that some of the work on B2G is still based on Gecko 26. I *guess* we are talking about recent Gecko, but being sure would help. > (In reply to David Rajchenbach Teller [:Yoric] (currently away from bugzilla > – ETA March 11th – please use "needinfo?") from comment #111) > > Also, I can't find any trace of a OS.File error in the logs. > > Comment 85 describes the things you see in the log, and comment 99 explains > my debugging session on try up to the point I narrowed this down to the > OS.File code. Did you read both of them? The latter is very important as > it explains everything I've done over the period of a few days debugging > this on try. Sorry, I should have been clearer: I was talking about the oranges in comment 107 i.e. https://tbpl.mozilla.org/php/getParsedLog.php?id=35819755&tree=Try&full=1 fails at test_bug_795164 without any error message related to _PromiseWorker or OS.File. Your investigation described in comment 85 and comment 99 clearly confirms that the call to post() causes the weird xpcom error, but I suspect that the unexpected failure that prompted this bug is unrelated. This doesn't mean that there OS.File is cleared (see below), but that might mean that there is a problem entirely unrelated to OS.File, too. > [from comment 99] > There are two catch blocks there but of course none of them catch this exception. Perhaps the right thing to do here would be to add another catch block for other types of errors? We could certainly improve error reporting in that case. Filed bug 981875. > which means that OS.file is somehow attempting to post a message to the promise worker when it's shutting down, which results in the failure change observed here 1/ B2G reset timer kicks in, sends message "Meta_reset"; 2/ We post another message, which gets enqueued; 3/ Reset succeeds, reply arrives, but before the reply is treated, the second message is sent to the worker; 4/ Since there is no worker, we fail. If this is the cause of the xpcom error, it is very unlikely to happen in the wild, but it might cause a failure in the test suite.
Flags: needinfo?(dteller)
So, with the result of comment 107, do we still have any reason to believe that the error is related to OS.File?
Flags: needinfo?(ehsan)
(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #114) > (In reply to :Ehsan Akhgari (needinfo? me!) from comment #112) > > (In reply to David Rajchenbach Teller [:Yoric] (currently away from bugzilla > > – ETA March 11th – please use "needinfo?") from comment #110) > > > > And so we have a race. Does that sound plausible to anybody else? > > > > > > Shouldn't this have been fixed by the latest patch to bug 965527? > > > Or are we talking about an older codebase? > > > > You can test to see if that fix has worked by pushing a try run which > > re-enables the tests disabled here. Most of the tests that have been > > hitting this bug have been disabled because they fail intermittently too > > many times. > > Let me rephrase: are we talking about a B2G built with an old Gecko or with > the latest Gecko? It is my understanding that some of the work on B2G is > still based on Gecko 26. I *guess* we are talking about recent Gecko, but > being sure would help. We're talking about what runs on tbpl, which is the latest Gecko. Once we determine the cause of the bug we might decide to backport it to older versions but let's focus on trunk for now. > > (In reply to David Rajchenbach Teller [:Yoric] (currently away from bugzilla > > – ETA March 11th – please use "needinfo?") from comment #111) > > > Also, I can't find any trace of a OS.File error in the logs. > > > > Comment 85 describes the things you see in the log, and comment 99 explains > > my debugging session on try up to the point I narrowed this down to the > > OS.File code. Did you read both of them? The latter is very important as > > it explains everything I've done over the period of a few days debugging > > this on try. > > Sorry, I should have been clearer: I was talking about the oranges in > comment 107 > i.e. > https://tbpl.mozilla.org/php/getParsedLog.php?id=35819755&tree=Try&full=1 > fails at test_bug_795164 without any error message related to _PromiseWorker > or OS.File. I see. That suggests to me that we don't have sufficient logging in the place where this bug happens. > Your investigation described in comment 85 and comment 99 clearly confirms > that the call to post() causes the weird xpcom error, Which weird xpcom error are you talking about? As I said in comment 99, Worker.postMessage() is throwing from here: <http://hg.mozilla.org/mozilla-central/annotate/0085a162499f/dom/workers/WorkerPrivate.cpp#l2225>, because a message is being posted to a worker that is shutting down. > but I suspect that the > unexpected failure that prompted this bug is unrelated. Why do you suspect that? I have debugging evidence in comment 99 to suggest that the failure is exactly related to what OS.File does to its worker. > This doesn't mean > that there OS.File is cleared (see below), but that might mean that there is > a problem entirely unrelated to OS.File, too. I don't follow your reasoning. > > [from comment 99] > > There are two catch blocks there but of course none of them catch this exception. Perhaps the right thing to do here would be to add another catch block for other types of errors? > > We could certainly improve error reporting in that case. Filed bug 981875. OK, FWIW that was merely a theory, I have no reason to believe that my theory is correct. > > which means that OS.file is somehow attempting to post a message to the promise worker when it's shutting down, which results in the failure change observed here > > 1/ B2G reset timer kicks in, sends message "Meta_reset"; > 2/ We post another message, which gets enqueued; > 3/ Reset succeeds, reply arrives, but before the reply is treated, the > second message is sent to the worker; > 4/ Since there is no worker, we fail. > > If this is the cause of the xpcom error, it is very unlikely to happen in > the wild, but it might cause a failure in the test suite. That's plausible, sure, but unless we've pinpointed the bug exactly, I feel its premature to speculate about how likely it is to occur in the wild. Even if this doesn't happen in the wild, it's still pretty bad since it has caused us not to be able to run our tests since the middle of February! (In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #115) > So, with the result of comment 107, do we still have any reason to believe > that the error is related to OS.File? Absolutely, everything I said in comment 99 still stands. Like I said, the only thing that comment 107 suggests is that we need better logging there.
Flags: needinfo?(ehsan)
I suggest taking my debugging patch in comment 99 <https://hg.mozilla.org/try/rev/21a6518fa995>, studying it to figure out exactly how it narrowed down the bug, and then add your own logging/stacktraces/etc to the patch and pushing it to try until you manage to narrow down the bug further.
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #116) > I see. That suggests to me that we don't have sufficient logging in the > place where this bug happens. Well, I expect that 14:44:11 INFO - System JS : ERROR resource://gre/modules/osfile/_PromiseWorker.jsm:163 - NS_ERROR_FAILURE: in comment 1 would still appear if we add more logging. > > Your investigation described in comment 85 and comment 99 clearly confirms > > that the call to post() causes the weird xpcom error, > > Which weird xpcom error are you talking about? As I said in comment 99, > Worker.postMessage() is throwing from here: > <http://hg.mozilla.org/mozilla-central/annotate/0085a162499f/dom/workers/ > WorkerPrivate.cpp#l2225>, because a message is being posted to a worker that > is shutting down. Well, "weird" insofar as we don't expect a XPCOM error to be thrown when we call a (web-exposed) JS API. > > but I suspect that the > > unexpected failure that prompted this bug is unrelated. > > Why do you suspect that? I have debugging evidence in comment 99 to suggest > that the failure is exactly related to what OS.File does to its worker. > > > This doesn't mean > > that there OS.File is cleared (see below), but that might mean that there is > > a problem entirely unrelated to OS.File, too. > > I don't follow your reasoning. Well, the reasoning is the following: if I understand correctly, when the OS.File issue takes place (xpcom error thrown from |postMessage()|) , it is visible in the logs. Here, we seem to be in a setting in which the WebApps.jsm unexpected failure appears but not the OS.File issue. From which I deduce that they are probably independent. Perhaps I don't understand your comment 99, but it seems to me that you're looking for the source of the OS.File xpcom error, which is great, but with no certainty that this xpcom error causes the WebApps.jsm error. Am I missing something? Now, I'm going to try and fix the codepath that leads to this xpcom error. I just don't have high hopes that it will also fix the current issue. > That's plausible, sure, but unless we've pinpointed the bug exactly, I feel > its premature to speculate about how likely it is to occur in the wild. > Even if this doesn't happen in the wild, it's still pretty bad since it has > caused us not to be able to run our tests since the middle of February! Ack.
(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #118) > Now, I'm going to try and fix the codepath that leads to this xpcom error. I > just don't have high hopes that it will also fix the current issue. I did trigger five try runs using NetUtil instead of OS.File.writeAtomic in the Webapps.jsm::_writeFile function, and they didn't fail. This may be a coincidence, but IIRC these tests were failing with a high probability, so five tests may be enough to think that the problem was actually in the _writeFile function. The patch I sent to try was this: https://bug981085.bugzilla.mozilla.org/attachment.cgi?id=8387863 It's also possible that it's a timing issue that wasn't visible and was uncovered by using OS.File more than before.
(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #118) > (In reply to :Ehsan Akhgari (needinfo? me!) from comment #116) > > I see. That suggests to me that we don't have sufficient logging in the > > place where this bug happens. > > Well, I expect that > 14:44:11 INFO - System JS : ERROR > resource://gre/modules/osfile/_PromiseWorker.jsm:163 - NS_ERROR_FAILURE: > in comment 1 would still appear if we add more logging. I am not familiar enough with XPConnect error reporting to tell you why this doesn't appear in some of the logs, but my first assumption would be that the XPConnect error reporting doesn't work correctly here. Anyways, this is *very* easy to check, just wrap the places where postMessage is called in a try/catch and log any exceptions that are raised somehow. > > > Your investigation described in comment 85 and comment 99 clearly confirms > > > that the call to post() causes the weird xpcom error, > > > > Which weird xpcom error are you talking about? As I said in comment 99, > > Worker.postMessage() is throwing from here: > > <http://hg.mozilla.org/mozilla-central/annotate/0085a162499f/dom/workers/ > > WorkerPrivate.cpp#l2225>, because a message is being posted to a worker that > > is shutting down. > > Well, "weird" insofar as we don't expect a XPCOM error to be thrown when we > call a (web-exposed) JS API. We do that *all the time*. :-) Lots of JS exposed platform APIs can throw various types of exceptions. If you need to check a specific method to see whether it can throw, check the corresponding webidl file and look for a [Throws] extended attribute (example: http://mxr.mozilla.org/mozilla-central/source/dom/webidl/Worker.webidl#20). > > > but I suspect that the > > > unexpected failure that prompted this bug is unrelated. > > > > Why do you suspect that? I have debugging evidence in comment 99 to suggest > > that the failure is exactly related to what OS.File does to its worker. > > > > > This doesn't mean > > > that there OS.File is cleared (see below), but that might mean that there is > > > a problem entirely unrelated to OS.File, too. > > > > I don't follow your reasoning. > > Well, the reasoning is the following: if I understand correctly, when the > OS.File issue takes place (xpcom error thrown from |postMessage()|) , it is > visible in the logs. Assuming if the XPConnect error reporting layer works correctly. We know that at least in some cases it doesn't. See bug 957666 for example where I chased an incomplete error message. > Here, we seem to be in a setting in which the > WebApps.jsm unexpected failure appears but not the OS.File issue. From which > I deduce that they are probably independent. I did not follow the exception path very closely to see where that exception propagates to. But from my experience chasing these types of intermittent failures, it's usually best to assume as few things work correctly as possible. :-) Please note that once we identify exactly what's happening here, we should file a bug to get better error reporting here. There is no good reason why we had to go through this much pain to figure out which code throws. Ideally we should have had a clean stack trace printed to stdout or something! > Perhaps I don't understand your comment 99, but it seems to me that you're > looking for the source of the OS.File xpcom error, which is great, but with > no certainty that this xpcom error causes the WebApps.jsm error. So in comment 85, I narrowed the failure down to the Worker.postMessage call made by OS.File throwing. Then I wanted to know more about _why_ that code is throwing, and comment 99 summarizes my debugging scenario to figure out the exact place where we return the first error code which then propagates through the callers until it reaches the JS caller. (Note that comment 99 misses a number of previous simpler try pushes, I basically started with sticking a small number of printf's in a bunch of places and move my way from that point to narrow down the failure point.) Given the fact that the code I linked to in comment 99 throws (and there is no doubt about that, see the printf outputs in the test output log), there is no doubt that the postMessage call is throwing (you can follow the C++ callers to get to PostMessage). And given how busted the current XPConnect error reporting story is, I am not surprised at all that something might be eating up that _PromiseWorker.jsm error message. But the fact remains that we know that Worker.postMessage() is throwing, we know that it is the one used by OS.File that is throwing, we know that OS.File doesn't properly protect against that exception, so it will propagate through the JS callers until it reaches the global scope of WebApps.jsm, possibly breaking the loading of that module, etc. (This last bit is the part that I was talking about before when I said I didn't investigate the exception propagation path very carefully.) > Am I missing something? > > Now, I'm going to try and fix the codepath that leads to this xpcom error. I > just don't have high hopes that it will also fix the current issue. Here's one idea, just make sure that the postMessage call cannot propagate its exceptions, push to try and see if the failure goes away, or changes drastically. If it does, that confirms my theory (although that will not necessarily be the right fix, and the race condition that Nathan was suspecting may manifest itself in any number of ways with that wallpaper fix.) (In reply to Marco Castelluccio [:marco] from comment #119) > (In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from > comment #118) > > Now, I'm going to try and fix the codepath that leads to this xpcom error. I > > just don't have high hopes that it will also fix the current issue. > > I did trigger five try runs using NetUtil instead of OS.File.writeAtomic in > the Webapps.jsm::_writeFile function, and they didn't fail. > This may be a coincidence, but IIRC these tests were failing with a high > probability, so five tests may be enough to think that the problem was > actually in the _writeFile function. > > The patch I sent to try was this: > https://bug981085.bugzilla.mozilla.org/attachment.cgi?id=8387863 > > It's also possible that it's a timing issue that wasn't visible and was > uncovered by using OS.File more than before. In my experience, this code is faily enough that 5 runs should probably show the failure, but to double check, you can push the same patch to try again and trigger 10-20 runs on it. If they all turn up green, we at least know which OS.File call site is the busted one.
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #120) > In my experience, this code is faily enough that 5 runs should probably show > the failure, but to double check, you can push the same patch to try again > and trigger 10-20 runs on it. If they all turn up green, we at least know > which OS.File call site is the busted one. I will do that in the next couple of days. Setting needinfo=me for a reminder.
Flags: needinfo?(mar.castelluccio)
(In reply to Marco Castelluccio [:marco] from comment #121) > (In reply to :Ehsan Akhgari (needinfo? me!) from comment #120) > > In my experience, this code is faily enough that 5 runs should probably show > > the failure, but to double check, you can push the same patch to try again > > and trigger 10-20 runs on it. If they all turn up green, we at least know > > which OS.File call site is the busted one. > > I will do that in the next couple of days. Setting needinfo=me for a > reminder. FWIW my comment was addressed to David, since I believe he's already looking into this.
Flags: needinfo?(mar.castelluccio)
Much to my bafflement, Try seems to suggest that bug 981875 is sufficient to fix this bug. Or has anything landed in the meantime that could explain the bug not appearing? https://tbpl.mozilla.org/?tree=Try&rev=757826e09eba
Flags: needinfo?(mar.castelluccio)
Flags: needinfo?(ehsan)
(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #123) > Much to my bafflement, Try seems to suggest that bug 981875 is sufficient to > fix this bug. Or has anything landed in the meantime that could explain the > bug not appearing? > > https://tbpl.mozilla.org/?tree=Try&rev=757826e09eba Hmm, why do you think the bug is gone away on this push? See: https://tbpl.mozilla.org/php/getParsedLog.php?id=36139883&tree=Try&full=1 https://tbpl.mozilla.org/php/getParsedLog.php?id=36139917&tree=Try&full=1 (Don't be misled by what TBPL suggests, these are all the same bug, see the debug spew and also the "DOMApplicationRegistry is not defined" error in the log, etc.)
Flags: needinfo?(ehsan)
Ok, my bad, back to tracking this bug.
Flags: needinfo?(mar.castelluccio)
(In reply to Myk Melez [:myk] [@mykmelez] from comment #80) > (In reply to Patrick Wang (ChihKai) (:kk1fff) from comment #79) > > I don't really have an idea how this happen. But I would guess that there > > might be exception thrown during loading WebApps.jsm, so the symbol is not > > exposed by the lazy getter properly. > > The error that appears before "DOMApplicationRegistry is not defined" is: > > 01:20:05 INFO - > ************************************************************ > 01:20:05 INFO - * Call to xpconnect wrapped JSObject produced this > error: * > 01:20:25 INFO - [Exception... "Component returned failure code: > 0x80570016 (NS_ERROR_XPC_GS_RETURNED_FAILURE) [nsIJSCID.getService]" > nsresult: "0x80570016 (NS_ERROR_XPC_GS_RETURNED_FAILURE)" location: "JS > frame :: resource://gre/modules/XPCOMUtils.jsm :: XPCU_serviceLambda :: line > 202" data: no] > 01:20:25 INFO - > ************************************************************ I have instrumented the code a little further https://tbpl.mozilla.org/php/getParsedLog.php?id=36451003&tree=Try&full=1#error0 According to the logs, this error is due to @mozilla.org/parentprocessmessagemanager;1 and it could very well explain that "DOMApplicationRegistry is not defined". Still trying to understand the ChromeWorker-related problems, but I figure we have a more likely suspect for all our trouble.
Flags: needinfo?(ehsan)
Full stack: 17:01:27 INFO - Call stack: XPCU_serviceLambda@resource://gre/modules/XPCOMUtils.jsm:211:11 17:01:27 INFO - XPCU_defineLazyGetter/<.get@resource://gre/modules/XPCOMUtils.jsm:177:9 17:01:27 INFO - activities_init/<@resource://gre/modules/ActivitiesService.jsm:170:7 17:01:27 INFO - activities_init@resource://gre/modules/ActivitiesService.jsm:171:1 17:01:27 INFO - @resource://gre/modules/ActivitiesService.jsm:412:1 17:01:27 INFO - @resource://gre/modules/Webapps.jsm:39:1 17:01:27 INFO - @resource://gre/modules/BrowserElementParent.jsm:24:3 17:01:27 INFO - XPCU_defineLazyGetter/<.get@resource://gre/modules/XPCOMUtils.jsm:177:9 17:01:27 INFO - BrowserElementParent.prototype._registerAppManifest@resource://gre/modules/BrowserElementParent.jsm:238:1 17:01:27 INFO - BrowserElementParent@resource://gre/modules/BrowserElementParent.jsm:202:5 17:01:27 INFO - create@resource://gre/modules/BrowserElementParent.jsm:91:5 17:01:27 INFO - BrowserElementParentFactory.prototype._createBrowserElementParent@jar:file:///system/b2g/omni.ja!/components/BrowserElementParent.js:106:1 17:01:27 INFO - BrowserElementParentFactory.prototype._observeInProcessBrowserFrameShown@jar:file:///system/b2g/omni.ja!/components/BrowserElementParent.js:89:1 17:01:27 INFO - BrowserElementParentFactory.prototype.observe@jar:file:///system/b2g/omni.ja!/components/BrowserElementParent.js:123:7 17:01:27 INFO - checkAppState@http://mochi.test:8888/tests/dom/apps/tests/test_app_update.html:259:5 17:01:27 INFO - runTest@http://mochi.test:8888/tests/dom/apps/tests/test_app_update.html:92:5 17:01:27 INFO - continueTest@http://mochi.test:8888/tests/dom/apps/tests/test_app_update.html:32:11 17:01:27 INFO - WebappsRegistry.prototype.receiveMessage@jar:file:///system/b2g/omni.ja!/components/Webapps.js:47:9
I'm in the process of tracking down the WorkerPrivate/WorkerRuntime uncaught errors. Some of them are indeed due to a race condition around the B2G worker reset (which would explain why we didn't notice them before) and some due to internal race conditions in workers, in which we attempt to send connection information to workers although the workers are shutting down. Patch in progress. We'll see if it fixes the current bug.
(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #126) > (In reply to Myk Melez [:myk] [@mykmelez] from comment #80) > > (In reply to Patrick Wang (ChihKai) (:kk1fff) from comment #79) > > > I don't really have an idea how this happen. But I would guess that there > > > might be exception thrown during loading WebApps.jsm, so the symbol is not > > > exposed by the lazy getter properly. > > > > The error that appears before "DOMApplicationRegistry is not defined" is: > > > > 01:20:05 INFO - > > ************************************************************ > > 01:20:05 INFO - * Call to xpconnect wrapped JSObject produced this > > error: * > > 01:20:25 INFO - [Exception... "Component returned failure code: > > 0x80570016 (NS_ERROR_XPC_GS_RETURNED_FAILURE) [nsIJSCID.getService]" > > nsresult: "0x80570016 (NS_ERROR_XPC_GS_RETURNED_FAILURE)" location: "JS > > frame :: resource://gre/modules/XPCOMUtils.jsm :: XPCU_serviceLambda :: line > > 202" data: no] > > 01:20:25 INFO - > > ************************************************************ > > I have instrumented the code a little further > https://tbpl.mozilla.org/php/getParsedLog. > php?id=36451003&tree=Try&full=1#error0 > > According to the logs, this error is due to > @mozilla.org/parentprocessmessagemanager;1 and it could very well explain > that "DOMApplicationRegistry is not defined". Hmm, reading NS_NewParentProcessesMessageManager <http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsFrameMessageManager.cpp#1834>, it seems like the only two scenarios for that function to fail are either if we are in the content process or if QueryInterface fails. The latter is very unlikely because nsFrameMessageManager definitely implements nsIMessageBroadcaster: <http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsFrameMessageManager.cpp#108> (Both mChrome and mIsBroadcaster will be true there.) Fabrice, is there any possibility of Webapps.jsm being loaded into a content process? If so, then is it expected that accessing ppmm in Webapps.jsm to throw?
Flags: needinfo?(ehsan) → needinfo?(fabrice)
So, Webapps.jsm should *never* be loaded in a content process. That looks a lot like the same issue as bug 902165 I have patch that may work here, let me find it.
Flags: needinfo?(fabrice)
Here it is. I can move it to bug 902165 it that's more in-context.
Attachment #8395033 - Flags: feedback?(bent.mozilla)
Comment on attachment 8395033 [details] [diff] [review] no webapps.jsm in content processes Review of attachment 8395033 [details] [diff] [review]: ----------------------------------------------------------------- Seems ok to me without the commented code :)
Attachment #8395033 - Flags: feedback?(bent.mozilla) → feedback+
I don't understand why this is intermittent though.
There is more than one bug in the logs, so that might not be the [only?] source of failures.
Depends on: 902165
Attached patch Reworking OS.File reset/shutdown (obsolete) — Splinter Review
I believe that this should take care of the XPCOM errors.
Attachment #8395614 - Flags: review?(nfroyd)
(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #135) > Created attachment 8395614 [details] [diff] [review] > Reworking OS.File reset/shutdown > > I believe that this should take care of the XPCOM errors. Can you please post a link to the try server run? Thanks!
Comment on attachment 8395614 [details] [diff] [review] Reworking OS.File reset/shutdown Review of attachment 8395614 [details] [diff] [review]: ----------------------------------------------------------------- Apologies on the review lag, between work week and getting sick... I have questions about bits of the patch. ::: toolkit/components/osfile/modules/osfile_async_front.jsm @@ +245,5 @@ > } > > let {openedFiles, openedDirectoryIterators, killed} = resources; > + if (!reset && (openedFiles && openedFiles.length > + || ( openedDirectoryIterators && openedDirectoryIterators.length))) { Nit: the indentation on this conditional is bizarre. Make it more like it was before, please. @@ +269,2 @@ > > + this.shutdown = shutdown; Should this be killed || shutdown? @@ -380,5 @@ > if (firstLaunch) { > Scheduler._updateTelemetry(); > } > - > - Scheduler.restartTimer(); So in my m-c pull, this is actually conditionalized. Where did the conditional go prior to this patch? @@ +522,5 @@ > // configured for testing by the testsuite. > let phase = AsyncShutdown._getPhase(TOPIC); > phase.addBlocker( > "(for testing purposes) OS.File: warn about unclosed files", > + () => Scheduler.kill({shutdown: false, reset: false}) Why are we changing the reset behavior here? @@ +1320,2 @@ > File.resetWorker = function() { > + return Task.spawn(function*() { Why are we making this async?
Attachment #8395614 - Flags: review?(nfroyd) → feedback+
(In reply to Nathan Froyd (:froydnj) from comment #137) > @@ +269,2 @@ > > > > + this.shutdown = shutdown; > > Should this be killed || shutdown? No, |killed| informs us that the thread was killed, but in a way that will let us resurrect it. On the other hand, |this.shutdown| informs us that we should reject any new message that could possibly resurrect the worker. Different stuff. > @@ -380,5 @@ > > if (firstLaunch) { > > Scheduler._updateTelemetry(); > > } > > - > > - Scheduler.restartTimer(); > > So in my m-c pull, this is actually conditionalized. Where did the > conditional go prior to this patch? Mmmh.... Obviously, I forgot to fold stuff. > > @@ +522,5 @@ > > // configured for testing by the testsuite. > > let phase = AsyncShutdown._getPhase(TOPIC); > > phase.addBlocker( > > "(for testing purposes) OS.File: warn about unclosed files", > > + () => Scheduler.kill({shutdown: false, reset: false}) > > Why are we changing the reset behavior here? > > @@ +1320,2 @@ > > File.resetWorker = function() { > > + return Task.spawn(function*() { > > Why are we making this async? I believe that this has always been async.
Attachment #8395614 - Attachment is obsolete: true
Attachment #8397835 - Flags: review?(nfroyd)
(Note that with bug 981085 being fixed, you need to back out the patch for that before testing this on try.)
Attachment #8397835 - Flags: review?(nfroyd) → review+
(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #140) > (Note that with bug 981085 being fixed, you need to back out the patch for > that before testing this on try.) Yes, my test was using a tree prior landing of bug 981085. We now need to land "Reworking OS.File reset/shutdown, v2."
Keywords: checkin-needed
Whiteboard: [land Reworking OS.File reset/shutdown, v2]
Assignee: nobody → dteller
Flags: in-testsuite+
Keywords: checkin-needed
Whiteboard: [land Reworking OS.File reset/shutdown, v2] → [fixed-in-fx-team]
Ah, of course, it works better if I apply my trivial fix in both places where the problem occurs.
Attachment #8400537 - Attachment is obsolete: true
Attachment #8400584 - Flags: review+
Keywords: checkin-needed
Whiteboard: [land Reworking OS.File reset/shutdown, v4]
Keywords: checkin-needed
Whiteboard: [land Reworking OS.File reset/shutdown, v4] → [fixed-in-fx-team]
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [fixed-in-fx-team]
Target Milestone: --- → mozilla31
The following changesets are now in Firefox Nightly: > 4eb0d4408fd2 Bug 961317 - Clean up OS.File shutdown race condition and rework OS.File reset/shutdown. r=froydnj > e47bd7fca2cf Backed out changeset 4eb0d4408fd2 (bug 961317) for Windows xpcshell failures. Nightly Build Information: ID: 20140402030201 Changeset: 4941a2ac0786109b08856738019b016a6c5a66a6 Version: 31.0a1 TBPL: https://tbpl.mozilla.org/?rev=4941a2ac0786 URL: https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/04/2014-04-02-03-02-01-mozilla-central Download Links: > Linux x86: https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/04/2014-04-02-03-02-01-mozilla-central/firefox-31.0a1.en-US.linux-i686.tar.bz2 > Linux x86_64: https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/04/2014-04-02-03-02-01-mozilla-central/firefox-31.0a1.en-US.linux-x86_64.tar.bz2 > Linux x86_64 ASAN: https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/04/2014-04-02-03-02-01-mozilla-central/firefox-31.0a1.en-US.linux-x86_64-asan.tar.bz2 > Mac: https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/04/2014-04-02-03-02-01-mozilla-central/firefox-31.0a1.en-US.mac.dmg > Win32: https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/04/2014-04-02-03-02-01-mozilla-central/firefox-31.0a1.en-US.win32.installer.exe > Win64: https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/04/2014-04-02-03-02-01-mozilla-central/firefox-31.0a1.en-US.win64-x86_64.installer.exe Previous Nightly Build Information: ID: 20140401030203 Changeset: 1417d180a1d8665b1a91b897d1cc4cc31e7980d4 Version: 31.0a1 TBPL: https://tbpl.mozilla.org/?rev=1417d180a1d8 URL: https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/04/2014-04-01-03-02-03-mozilla-central
Do we want this on Aurora for v1.4?
Deflecting to ehsan.
Flags: needinfo?(ehsan)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #150) > Do we want this on Aurora for v1.4? Yes, I think so.
Flags: needinfo?(ehsan)
In that case, can I please get a branch nomination? :)
Comment on attachment 8400584 [details] [diff] [review] Reworking OS.File reset/shutdown, v4 [Approval Request Comment] Bug caused by (feature/regressing bug #): bug 927560 User impact if declined: Possible freezes/crashes in b2g Testing completed (on m-c, etc.): Has landed 6 days ago. Risk to taking this patch (and alternatives if risky): If it doesn't work, we will find out quickly. We can back this out in case of problem. String or IDL/UUID changes made by this patch: none
Attachment #8400584 - Flags: approval-mozilla-aurora?
Flags: needinfo?(dteller)
Attachment #8400584 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Product: Toolkit → Toolkit Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: