961317 - Intermittent TEST-UNEXPECTED-FAIL | test_bug_795164.html | Test timed out. | test_uninstall_errors.html | mozApps error: MULTIPLE_APPS_PER_ORIGIN_FORBIDDEN

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Reporter

Description

•

11 years ago

https://tbpl.mozilla.org/php/getParsedLog.php?id=33195780&tree=Fx-Team slave: tst-linux64-ec2-120 14:44:11 INFO - 876 INFO TEST-START | /tests/dom/apps/tests/test_bug_795164.html 14:44:11 INFO - Server Etag: apptype-hosted-getmanifest-true-3 14:44:11 INFO - No Client Etag 14:44:11 INFO - System JS : ERROR resource://gre/modules/osfile/_PromiseWorker.jsm:163 - NS_ERROR_FAILURE: 14:44:11 INFO - 877 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/apps/tests/test_bug_795164.html | Test timed out. 14:44:11 INFO - 878 INFO TEST-INFO | MEMORY STAT vsize after test: 110116864 14:44:11 INFO - 879 INFO TEST-INFO | MEMORY STAT residentFast after test: 55431168 14:44:11 INFO - 880 INFO TEST-INFO | MEMORY STAT heapAllocated after test: 18777504 14:44:11 INFO - 881 INFO TEST-END | /tests/dom/apps/tests/test_bug_795164.html | finished in 317802ms 14:44:41 INFO - 1128 INFO TEST-START | /tests/dom/apps/tests/test_uninstall_errors.html 14:44:41 INFO - Server Etag: apptype-hosted-getmanifest-true-3 14:44:41 INFO - No Client Etag 14:44:41 INFO - 1129 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/apps/tests/test_uninstall_errors.html | mozApps error: MULTIPLE_APPS_PER_ORIGIN_FORBIDDEN 14:44:41 INFO - 1130 INFO TEST-INFO | MEMORY STAT vsize after test: 110510080 14:44:41 INFO - 1131 INFO TEST-INFO | MEMORY STAT residentFast after test: 59572224 14:44:41 INFO - 1132 INFO TEST-INFO | MEMORY STAT heapAllocated after test: 22362364 14:44:41 INFO - 1133 INFO TEST-END | /tests/dom/apps/tests/test_uninstall_errors.html | finished in 1664ms

Marco Castelluccio [:marco]

Comment 1

•

11 years ago

The error is referring to OS.File, it's possible this intermittent failure is happening after my patch in bug 959420, but it looks like an OS.File specific problem.

Comment hidden (Legacy TBPL/Treeherder Robot)

philor https://tbpl.mozilla.org/php/getParsedLog.php?id=33958067&tree=Mozilla-Inbound b2g_emulator_vm mozilla-inbound opt test mochitest-3 on 2014-02-01 15:30:33 revision: 68022ef6e3ed slave: tst-linux64-ec2-321 21663 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/apps/tests/test_bug_795164.html | Test timed out. 21932 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/apps/tests/test_uninstall_errors.html | mozApps error: MULTIPLE_APPS_PER_ORIGIN_FORBIDDEN 29034 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/datastore/tests/test_basic.html | Test timed out. 29040 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/datastore/tests/test_bug924104.html | Error callback invoked 29047 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/datastore/tests/test_certifiedApp.html | Error callback invoked 29053 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/datastore/tests/test_changes.html | Error callback invoked 29059 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/datastore/tests/test_duplicate.html | Error callback invoked 29065 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/datastore/tests/test_keys.html | Error callback invoked 29071 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/datastore/tests/test_oop.html | Error callback invoked 29077 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/datastore/tests/test_readonly.html | Error callback invoked 29083 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/datastore/tests/test_sync.html | Error callback invoked 02-02 00:10:23.096 742 742 I GeckoDump: 21663 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/apps/tests/test_bug_795164.html | Test timed out. 02-02 00:11:13.285 742 742 I GeckoDump: 21932 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/apps/tests/test_uninstall_errors.html | mozApps error: MULTIPLE_APPS_PER_ORIGIN_FORBIDDEN 02-02 00:35:02.169 742 742 I GeckoDump: 29034 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/datastore/tests/test_basic.html | Test timed out. 02-02 00:35:06.152 742 742 I GeckoDump: 29040 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/datastore/tests/test_bug924104.html | Error callback invoked 02-02 00:35:08.793 742 742 I GeckoDump: 29047 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/datastore/tests/test_certifiedApp.html | Error callback invoked 02-02 00:35:11.438 742 742 I GeckoDump: 29053 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/datastore/tests/test_changes.html | Error callback invoked 02-02 00:35:14.115 742 742 I GeckoDump: 29059 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/datastore/tests/test_duplicate.html | Error callback invoked 02-02 00:35:16.925 742 742 I GeckoDump: 29065 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/datastore/tests/test_keys.html | Error callback invoked 02-02 00:35:20.389 742 742 I GeckoDump: 29071 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/datastore/tests/test_oop.html | Error callback invoked Only displaying first 20 of 22 failures - .

Comment hidden (Legacy TBPL/Treeherder Robot)

Ryan VanderMeulen [:RyanVM]

Comment 60

•

11 years ago

These is a very frequent orange on TBPL (as you can see in the comments here). This needs an owner ASAP.

Flags: needinfo?(overholt)

Flags: needinfo?(dteller)

Comment hidden (Legacy TBPL/Treeherder Robot)

Andrew Overholt [:overholt]

Comment 62

•

11 years ago

Jan, please take a look here. The timing of the oranges indicates it was triggered by Marcos' patch as he suspects in comment #1 but he thinks it's an underlying bug.

Assignee: nobody → Jan.Varga

Flags: needinfo?(overholt)

Comment hidden (Legacy TBPL/Treeherder Robot)

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 73

•

11 years ago

Let's look at the errors: 01:26:19 INFO - System JS : ERROR resource://gre/modules/osfile/_PromiseWorker.jsm:180 - NS_ERROR_FAILURE: I have triple-checked – this is an error that actually cannot be launched by this line of code, neither in JS (which doesn't call any XPCOM) nor in the underlying C++ (which is purely JSAPI-based so cannot launched XPCOM errors). So, I suspect that the error is actually somewhere else, but is reported at this line due to some JSAPI-level error somewhere else in the code. Note that, a few lines above, we have 01:26:19 INFO - [Exception... "[JavaScript Error: "DOMApplicationRegistry is not defined" {file: "resource://gre/modules/BrowserElementParent.jsm" line: 238}]'[JavaScript Error: "DOMApplicationRegistry is not defined" {file: "resource://gre/modules/BrowserElementParent.jsm" line: 238}]' when calling method: [nsIObserver::observe]" nsresult: "0x80570021 (NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS)" location: "JS frame :: http://mochi.test:8888/tests/dom/apps/tests/test_app_update.html :: checkAppState :: line 259" data: yes] Which looks like a possible source of error, too. I have the feeling that this is totally unrelated to OS.File.

Flags: needinfo?(dteller)

Comment hidden (Legacy TBPL/Treeherder Robot)

Andrew Overholt [:overholt]

Comment 78

•

11 years ago

(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #73) > 01:26:19 INFO - [Exception... "[JavaScript Error: > "DOMApplicationRegistry is not defined" {file: > "resource://gre/modules/BrowserElementParent.jsm" line: 238}]'[JavaScript > Error: "DOMApplicationRegistry is not defined" {file: > "resource://gre/modules/BrowserElementParent.jsm" line: 238}]' when calling > method: [nsIObserver::observe]" nsresult: "0x80570021 > (NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS)" location: "JS frame :: > http://mochi.test:8888/tests/dom/apps/tests/test_app_update.html :: > checkAppState :: line 259" data: yes] Patrick, could this be fallout from bug 960641? Actually, the dates don't line up but maybe you have some idea anyway :)

Flags: needinfo?(pwang)

Patrick Wang (Chih-Kai Wang) (:kk1fff)

Comment 79

•

11 years ago

I don't really have an idea how this happen. But I would guess that there might be exception thrown during loading WebApps.jsm, so the symbol is not exposed by the lazy getter properly.

Flags: needinfo?(pwang)

Myk Melez [:myk] [@mykmelez]

Comment 80

•

11 years ago

(In reply to Patrick Wang (ChihKai) (:kk1fff) from comment #79) > I don't really have an idea how this happen. But I would guess that there > might be exception thrown during loading WebApps.jsm, so the symbol is not > exposed by the lazy getter properly. The error that appears before "DOMApplicationRegistry is not defined" is: 01:20:05 INFO - ************************************************************ 01:20:05 INFO - * Call to xpconnect wrapped JSObject produced this error: * 01:20:25 INFO - [Exception... "Component returned failure code: 0x80570016 (NS_ERROR_XPC_GS_RETURNED_FAILURE) [nsIJSCID.getService]" nsresult: "0x80570016 (NS_ERROR_XPC_GS_RETURNED_FAILURE)" location: "JS frame :: resource://gre/modules/XPCOMUtils.jsm :: XPCU_serviceLambda :: line 202" data: no] 01:20:25 INFO - ************************************************************ XPCOMUtils.jsm line 202 is in a lambda inside defineLazyServiceGetter, and there are three calls to defineLazyServiceGetter in Webapps.jsm: XPCOMUtils.defineLazyServiceGetter(this, "ppmm", "@mozilla.org/parentprocessmessagemanager;1", "nsIMessageBroadcaster"); XPCOMUtils.defineLazyServiceGetter(this, "cpmm", "@mozilla.org/childprocessmessagemanager;1", "nsIMessageSender"); … XPCOMUtils.defineLazyServiceGetter(this, "dataStoreService", "@mozilla.org/datastore-service;1", "nsIDataStoreService"); So presumably one of these is throwing the exception, which causes DOMApplicationRegistry to be undefined.

Comment hidden (Legacy TBPL/Treeherder Robot)

(no longer active)

Comment 85

•

11 years ago

This is the key line in the log: 09:38:26 INFO - System JS : ERROR resource://gre/modules/osfile/_PromiseWorker.jsm:180 - NS_ERROR_FAILURE: That is a call to worker.postMessage: <http://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/modules/_PromiseWorker.jsm#180> Which means that MessageEventRunnable::Dispatch is failing: <http://mxr.mozilla.org/mozilla-central/source/dom/workers/WorkerPrivate.cpp#2734> There are a limited number of ways that function can fail, basically read <http://mxr.mozilla.org/mozilla-central/source/dom/workers/WorkerRunnable.cpp#80> and everything that function calls. Somebody needs to add some printf-debugging code to output something on each one of those try paths, send that to try and trigger a bunch of test runs to get this fail there and look at the log and see which one of the printf's show up. Then we will know where this error is coming from. Note that Webapps.jsm includes osfile.jsm so this is how all of this stuff is bubbling up the chain.

Assignee: Jan.Varga → nobody

Comment hidden (Legacy TBPL/Treeherder Robot)

Ryan VanderMeulen [:RyanVM]

Updated

•

11 years ago

Depends on: 972927

Comment hidden (Legacy TBPL/Treeherder Robot)

Marco Castelluccio [:marco]

Comment 98

•

11 years ago

I've tried to use NetUtil instead of OS.File.writeAtomic, no failures in 5 try runs.

(no longer active)

Comment 99

•

11 years ago

OK, I basically did what I recommended in comment 85 on the try server: https://tbpl.mozilla.org/?tree=Try&rev=86b7c7c29ad0. In the log, I see messages like: 11:40:56 INFO - XXXehsan DispatchInternal6 -2147418113 11:40:56 INFO - XXXehsan 2 11:40:56 INFO - XXXehsan DispatchInternal6 -2147418113 11:40:56 INFO - XXXehsan 2 11:40:56 INFO - XXXehsan DispatchInternal6 -2147418113 11:40:56 INFO - XXXehsan 2 11:40:56 INFO - XXXehsan DispatchInternal6 -2147418113 11:40:56 INFO - XXXehsan 2 -2147418113 is NS_ERROR_UNEXPECTED. If you look at my patch in that try push, you'll see that the error is coming from <http://hg.mozilla.org/mozilla-central/annotate/0085a162499f/dom/workers/WorkerPrivate.cpp#l2225> which means that OS.file is somehow attempting to post a message to the promise worker when it's shutting down, which results in the failure change observed here. This is extremely bad, since it can cause into all sorts of failures in any tests which uses apps. We have been seeing this pattern in the data store tests as well (which are currently disabled because they're timing out all the time.) It might actually be a real bug which also manifests itself to our users, but it's hard to say for sure without further investigation. <http://hg.mozilla.org/mozilla-central/annotate/0085a162499f/toolkit/components/osfile/modules/osfile_async_front.jsm#l256> seems to be the only place where we post something to this worker. There are two catch blocks there but of course none of them catch this exception. Perhaps the right thing to do here would be to add another catch block for other types of errors? It's hard for me to say since I don't understand this code much. David, can you please take a look? Thanks!

Component: DOM: Apps → OS.File

Flags: needinfo?(dteller)

Product: Core → Toolkit

(no longer active)

Comment 100

•

11 years ago

(In reply to Marco Castelluccio [:marco] from comment #98) > I've tried to use NetUtil instead of OS.File.writeAtomic, no failures in 5 > try runs. Marco, can you please post this patch here? Thanks!

Flags: needinfo?(mar.castelluccio)

(no longer active)

Comment 101

•

11 years ago

Nathan, it seems like Yoirc is out sick. This bug is really serious and it hits pretty much every b2g test which does anything with apps. Can you please take a look? See my analysis in comment 99. Thanks.

Flags: needinfo?(nfroyd)

Nathan Froyd [:froydnj]

Comment 102

•

11 years ago

Hm, bug 965527 looks like it touches sort of the same area, but I don't think it actually helps with this. Why is the promise worker shutting down in the first place?

Flags: needinfo?(nfroyd)

(no longer active)

Comment 103

•

11 years ago

(In reply to comment #102) > Hm, bug 965527 looks like it touches sort of the same area, but I don't think > it actually helps with this. > > Why is the promise worker shutting down in the first place? I don't know, this code is written in a style which makes it *very* hard to read. I tried and failed, which is why I am needinfo-ing other people here. :-)

Nathan Froyd [:froydnj]

Comment 104

•

11 years ago

(In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness, emailapocalypse) from comment #103) > (In reply to comment #102) > > Hm, bug 965527 looks like it touches sort of the same area, but I don't think > > it actually helps with this. > > > > Why is the promise worker shutting down in the first place? > > I don't know, this code is written in a style which makes it *very* hard to > read. I tried and failed, which is why I am needinfo-ing other people here. > :-) Ah. :) A plausible-looking scenario is this: - Something calls resetWorker: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/modules/osfile_async_front.jsm#1239 This API was added at the request of the B2G folks so that the OS.File worker didn't have to hang around all the time, consuming memory. - which then attempts to shutdown the worker thread: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/modules/osfile_async_worker.js#283 - and we attempt to send back the reply: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/modules/osfile_async_worker.js#99 The interesting line is here: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/modules/osfile_async_worker.js#107 where we call self.close() in an attempt to shutdown the worker. I don't immediately see where |self| is coming from here, though this is what the MDN page recommends? https://developer.mozilla.org/en-US/docs/Web/Guide/Performance/Using_web_workers#Terminating_a_worker I think the expected order of events here is: - The worker thread shuts down; - The handler for the Meta_reset call runs and nulls out worker: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/modules/osfile_async_front.jsm#1245 - Future post() calls then notice that the worker no longer exists, and restart the worker. But since this involves threading, the sequence looks like: - The worker thread shuts down; - post() is called and tries to send something to the thread; - Calamity ensues, because the worker thread is gone, even though the scheduler doesn't know it yet. And so we have a race. Does that sound plausible to anybody else?

Nathan Froyd [:froydnj]

Comment 105

•

11 years ago

Argh, apparently generating those mxr links from an old tree means they don't quite point at the right places. Ideally people will be able to follow what's going on. =/

Nathan Froyd [:froydnj]

Comment 106

•

11 years ago

A try run with OS.File logging turned on everywhere: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/modules/osfile_shared_allthreads.jsm#57 might help confirm...

Ryan VanderMeulen [:RyanVM]

Comment 107

•

11 years ago

Why not. https://tbpl.mozilla.org/?tree=Try&rev=7417a14dcfcb

(no longer active)

Comment 108

•

11 years ago

I don't understand anything from this code to be honest, but your theory sounds plausible at least!

Marco Castelluccio [:marco]

Comment 109

•

11 years ago

I'll post my patch in bug 981085.

Flags: needinfo?(mar.castelluccio)

Georg Fritzsche [:gfritzsche]

Updated

•

11 years ago

Blocks: 980379

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 110

•

11 years ago

(In reply to Nathan Froyd (:froydnj) from comment #104) > The interesting line is here: > > http://mxr.mozilla.org/mozilla-central/source/toolkit/components/osfile/ > modules/osfile_async_worker.js#107 > > where we call self.close() in an attempt to shutdown the worker. I don't > immediately see where |self| is coming from here, though this is what the > MDN page recommends? > > https://developer.mozilla.org/en-US/docs/Web/Guide/Performance/ > Using_web_workers#Terminating_a_worker |self| is a global of workers. > I think the expected order of events here is: [...] > And so we have a race. Does that sound plausible to anybody else? Shouldn't this have been fixed by the latest patch to bug 965527? Or are we talking about an older codebase?

Flags: needinfo?(dteller)

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 111

•

11 years ago

Also, I can't find any trace of a OS.File error in the logs. Ehsan: Indeed, the code of OS.File is often hard to follow. Some of it can be improved, but it's going to be a long task – neither ChromeWorkers nor js-ctypes make it very easy to work nice code.

(no longer active)

Comment 112

•

11 years ago

(In reply to David Rajchenbach Teller [:Yoric] (currently away from bugzilla – ETA March 11th – please use "needinfo?") from comment #110) > > And so we have a race. Does that sound plausible to anybody else? > > Shouldn't this have been fixed by the latest patch to bug 965527? > Or are we talking about an older codebase? You can test to see if that fix has worked by pushing a try run which re-enables the tests disabled here. Most of the tests that have been hitting this bug have been disabled because they fail intermittently too many times. (In reply to David Rajchenbach Teller [:Yoric] (currently away from bugzilla – ETA March 11th – please use "needinfo?") from comment #111) > Also, I can't find any trace of a OS.File error in the logs. Comment 85 describes the things you see in the log, and comment 99 explains my debugging session on try up to the point I narrowed this down to the OS.File code. Did you read both of them? The latter is very important as it explains everything I've done over the period of a few days debugging this on try. > Ehsan: Indeed, the code of OS.File is often hard to follow. Some of it can > be improved, but it's going to be a long task – neither ChromeWorkers nor > js-ctypes make it very easy to work nice code. I talked to Taras and Nathan about this on Friday (since you were out on that day.) Nathan told me that we're rewriting parts of this code in C++, which should help make things better, and I did look at the code and it didn't look very simple to me. I argued the case that JS is a very strange choice to have implemented this in, and Taras argued the opposite case saying that node.js has been able to do this in js, so should we. That didn't really go anywhere I don't think. Anyways, I filed bug 981085 to get the DOM apps code moved away from OS.File mostly because of this bug and the fact that it seems to be much slower and more resource intensive than the alternative. My immediate interest is to see this bug fixed so that we can switch back to OS.File once the rest of the problems with it have been addressed. If you'd like to discuss future improvements, please ping me, as I'm interested to see this component improve, but none of this should block fixing this bug. :-)

(no longer active)

Updated

•

11 years ago

Flags: needinfo?(dteller)

(no longer active)

Comment 113

•

11 years ago

(In reply to comment #111) > Also, I can't find any trace of a OS.File error in the logs. (If this refers to Nathan's try push, please ignore that part of my reply! I don't know what Nathan has pushed to try, but presumably if it doesn't help discover anything then we need more logging or something.) Also, note that I expect this to be reproducible locally in an emulator build.

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 114

•

11 years ago

(In reply to :Ehsan Akhgari (needinfo? me!) from comment #112) > (In reply to David Rajchenbach Teller [:Yoric] (currently away from bugzilla > – ETA March 11th – please use "needinfo?") from comment #110) > > > And so we have a race. Does that sound plausible to anybody else? > > > > Shouldn't this have been fixed by the latest patch to bug 965527? > > Or are we talking about an older codebase? > > You can test to see if that fix has worked by pushing a try run which > re-enables the tests disabled here. Most of the tests that have been > hitting this bug have been disabled because they fail intermittently too > many times. Let me rephrase: are we talking about a B2G built with an old Gecko or with the latest Gecko? It is my understanding that some of the work on B2G is still based on Gecko 26. I *guess* we are talking about recent Gecko, but being sure would help. > (In reply to David Rajchenbach Teller [:Yoric] (currently away from bugzilla > – ETA March 11th – please use "needinfo?") from comment #111) > > Also, I can't find any trace of a OS.File error in the logs. > > Comment 85 describes the things you see in the log, and comment 99 explains > my debugging session on try up to the point I narrowed this down to the > OS.File code. Did you read both of them? The latter is very important as > it explains everything I've done over the period of a few days debugging > this on try. Sorry, I should have been clearer: I was talking about the oranges in comment 107 i.e. https://tbpl.mozilla.org/php/getParsedLog.php?id=35819755&tree=Try&full=1 fails at test_bug_795164 without any error message related to _PromiseWorker or OS.File. Your investigation described in comment 85 and comment 99 clearly confirms that the call to post() causes the weird xpcom error, but I suspect that the unexpected failure that prompted this bug is unrelated. This doesn't mean that there OS.File is cleared (see below), but that might mean that there is a problem entirely unrelated to OS.File, too. > [from comment 99] > There are two catch blocks there but of course none of them catch this exception. Perhaps the right thing to do here would be to add another catch block for other types of errors? We could certainly improve error reporting in that case. Filed bug 981875. > which means that OS.file is somehow attempting to post a message to the promise worker when it's shutting down, which results in the failure change observed here 1/ B2G reset timer kicks in, sends message "Meta_reset"; 2/ We post another message, which gets enqueued; 3/ Reset succeeds, reply arrives, but before the reply is treated, the second message is sent to the worker; 4/ Since there is no worker, we fail. If this is the cause of the xpcom error, it is very unlikely to happen in the wild, but it might cause a failure in the test suite.

Flags: needinfo?(dteller)

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 115

•

11 years ago

So, with the result of comment 107, do we still have any reason to believe that the error is related to OS.File?

Flags: needinfo?(ehsan)

(no longer active)

Comment 116

•

11 years ago

(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #114) > (In reply to :Ehsan Akhgari (needinfo? me!) from comment #112) > > (In reply to David Rajchenbach Teller [:Yoric] (currently away from bugzilla > > – ETA March 11th – please use "needinfo?") from comment #110) > > > > And so we have a race. Does that sound plausible to anybody else? > > > > > > Shouldn't this have been fixed by the latest patch to bug 965527? > > > Or are we talking about an older codebase? > > > > You can test to see if that fix has worked by pushing a try run which > > re-enables the tests disabled here. Most of the tests that have been > > hitting this bug have been disabled because they fail intermittently too > > many times. > > Let me rephrase: are we talking about a B2G built with an old Gecko or with > the latest Gecko? It is my understanding that some of the work on B2G is > still based on Gecko 26. I *guess* we are talking about recent Gecko, but > being sure would help. We're talking about what runs on tbpl, which is the latest Gecko. Once we determine the cause of the bug we might decide to backport it to older versions but let's focus on trunk for now. > > (In reply to David Rajchenbach Teller [:Yoric] (currently away from bugzilla > > – ETA March 11th – please use "needinfo?") from comment #111) > > > Also, I can't find any trace of a OS.File error in the logs. > > > > Comment 85 describes the things you see in the log, and comment 99 explains > > my debugging session on try up to the point I narrowed this down to the > > OS.File code. Did you read both of them? The latter is very important as > > it explains everything I've done over the period of a few days debugging > > this on try. > > Sorry, I should have been clearer: I was talking about the oranges in > comment 107 > i.e. > https://tbpl.mozilla.org/php/getParsedLog.php?id=35819755&tree=Try&full=1 > fails at test_bug_795164 without any error message related to _PromiseWorker > or OS.File. I see. That suggests to me that we don't have sufficient logging in the place where this bug happens. > Your investigation described in comment 85 and comment 99 clearly confirms > that the call to post() causes the weird xpcom error, Which weird xpcom error are you talking about? As I said in comment 99, Worker.postMessage() is throwing from here: <http://hg.mozilla.org/mozilla-central/annotate/0085a162499f/dom/workers/WorkerPrivate.cpp#l2225>, because a message is being posted to a worker that is shutting down. > but I suspect that the > unexpected failure that prompted this bug is unrelated. Why do you suspect that? I have debugging evidence in comment 99 to suggest that the failure is exactly related to what OS.File does to its worker. > This doesn't mean > that there OS.File is cleared (see below), but that might mean that there is > a problem entirely unrelated to OS.File, too. I don't follow your reasoning. > > [from comment 99] > > There are two catch blocks there but of course none of them catch this exception. Perhaps the right thing to do here would be to add another catch block for other types of errors? > > We could certainly improve error reporting in that case. Filed bug 981875. OK, FWIW that was merely a theory, I have no reason to believe that my theory is correct. > > which means that OS.file is somehow attempting to post a message to the promise worker when it's shutting down, which results in the failure change observed here > > 1/ B2G reset timer kicks in, sends message "Meta_reset"; > 2/ We post another message, which gets enqueued; > 3/ Reset succeeds, reply arrives, but before the reply is treated, the > second message is sent to the worker; > 4/ Since there is no worker, we fail. > > If this is the cause of the xpcom error, it is very unlikely to happen in > the wild, but it might cause a failure in the test suite. That's plausible, sure, but unless we've pinpointed the bug exactly, I feel its premature to speculate about how likely it is to occur in the wild. Even if this doesn't happen in the wild, it's still pretty bad since it has caused us not to be able to run our tests since the middle of February! (In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #115) > So, with the result of comment 107, do we still have any reason to believe > that the error is related to OS.File? Absolutely, everything I said in comment 99 still stands. Like I said, the only thing that comment 107 suggests is that we need better logging there.

Flags: needinfo?(ehsan)

(no longer active)

Comment 117

•

11 years ago

I suggest taking my debugging patch in comment 99 <https://hg.mozilla.org/try/rev/21a6518fa995>, studying it to figure out exactly how it narrowed down the bug, and then add your own logging/stacktraces/etc to the patch and pushing it to try until you manage to narrow down the bug further.

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 118

•

11 years ago

(In reply to :Ehsan Akhgari (needinfo? me!) from comment #116) > I see. That suggests to me that we don't have sufficient logging in the > place where this bug happens. Well, I expect that 14:44:11 INFO - System JS : ERROR resource://gre/modules/osfile/_PromiseWorker.jsm:163 - NS_ERROR_FAILURE: in comment 1 would still appear if we add more logging. > > Your investigation described in comment 85 and comment 99 clearly confirms > > that the call to post() causes the weird xpcom error, > > Which weird xpcom error are you talking about? As I said in comment 99, > Worker.postMessage() is throwing from here: > <http://hg.mozilla.org/mozilla-central/annotate/0085a162499f/dom/workers/ > WorkerPrivate.cpp#l2225>, because a message is being posted to a worker that > is shutting down. Well, "weird" insofar as we don't expect a XPCOM error to be thrown when we call a (web-exposed) JS API. > > but I suspect that the > > unexpected failure that prompted this bug is unrelated. > > Why do you suspect that? I have debugging evidence in comment 99 to suggest > that the failure is exactly related to what OS.File does to its worker. > > > This doesn't mean > > that there OS.File is cleared (see below), but that might mean that there is > > a problem entirely unrelated to OS.File, too. > > I don't follow your reasoning. Well, the reasoning is the following: if I understand correctly, when the OS.File issue takes place (xpcom error thrown from |postMessage()|) , it is visible in the logs. Here, we seem to be in a setting in which the WebApps.jsm unexpected failure appears but not the OS.File issue. From which I deduce that they are probably independent. Perhaps I don't understand your comment 99, but it seems to me that you're looking for the source of the OS.File xpcom error, which is great, but with no certainty that this xpcom error causes the WebApps.jsm error. Am I missing something? Now, I'm going to try and fix the codepath that leads to this xpcom error. I just don't have high hopes that it will also fix the current issue. > That's plausible, sure, but unless we've pinpointed the bug exactly, I feel > its premature to speculate about how likely it is to occur in the wild. > Even if this doesn't happen in the wild, it's still pretty bad since it has > caused us not to be able to run our tests since the middle of February! Ack.

Marco Castelluccio [:marco]

Comment 119

•

11 years ago

(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #118) > Now, I'm going to try and fix the codepath that leads to this xpcom error. I > just don't have high hopes that it will also fix the current issue. I did trigger five try runs using NetUtil instead of OS.File.writeAtomic in the Webapps.jsm::_writeFile function, and they didn't fail. This may be a coincidence, but IIRC these tests were failing with a high probability, so five tests may be enough to think that the problem was actually in the _writeFile function. The patch I sent to try was this: https://bug981085.bugzilla.mozilla.org/attachment.cgi?id=8387863 It's also possible that it's a timing issue that wasn't visible and was uncovered by using OS.File more than before.

(no longer active)

Comment 120

•

11 years ago

(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #118) > (In reply to :Ehsan Akhgari (needinfo? me!) from comment #116) > > I see. That suggests to me that we don't have sufficient logging in the > > place where this bug happens. > > Well, I expect that > 14:44:11 INFO - System JS : ERROR > resource://gre/modules/osfile/_PromiseWorker.jsm:163 - NS_ERROR_FAILURE: > in comment 1 would still appear if we add more logging. I am not familiar enough with XPConnect error reporting to tell you why this doesn't appear in some of the logs, but my first assumption would be that the XPConnect error reporting doesn't work correctly here. Anyways, this is *very* easy to check, just wrap the places where postMessage is called in a try/catch and log any exceptions that are raised somehow. > > > Your investigation described in comment 85 and comment 99 clearly confirms > > > that the call to post() causes the weird xpcom error, > > > > Which weird xpcom error are you talking about? As I said in comment 99, > > Worker.postMessage() is throwing from here: > > <http://hg.mozilla.org/mozilla-central/annotate/0085a162499f/dom/workers/ > > WorkerPrivate.cpp#l2225>, because a message is being posted to a worker that > > is shutting down. > > Well, "weird" insofar as we don't expect a XPCOM error to be thrown when we > call a (web-exposed) JS API. We do that *all the time*. :-) Lots of JS exposed platform APIs can throw various types of exceptions. If you need to check a specific method to see whether it can throw, check the corresponding webidl file and look for a [Throws] extended attribute (example: http://mxr.mozilla.org/mozilla-central/source/dom/webidl/Worker.webidl#20). > > > but I suspect that the > > > unexpected failure that prompted this bug is unrelated. > > > > Why do you suspect that? I have debugging evidence in comment 99 to suggest > > that the failure is exactly related to what OS.File does to its worker. > > > > > This doesn't mean > > > that there OS.File is cleared (see below), but that might mean that there is > > > a problem entirely unrelated to OS.File, too. > > > > I don't follow your reasoning. > > Well, the reasoning is the following: if I understand correctly, when the > OS.File issue takes place (xpcom error thrown from |postMessage()|) , it is > visible in the logs. Assuming if the XPConnect error reporting layer works correctly. We know that at least in some cases it doesn't. See bug 957666 for example where I chased an incomplete error message. > Here, we seem to be in a setting in which the > WebApps.jsm unexpected failure appears but not the OS.File issue. From which > I deduce that they are probably independent. I did not follow the exception path very closely to see where that exception propagates to. But from my experience chasing these types of intermittent failures, it's usually best to assume as few things work correctly as possible. :-) Please note that once we identify exactly what's happening here, we should file a bug to get better error reporting here. There is no good reason why we had to go through this much pain to figure out which code throws. Ideally we should have had a clean stack trace printed to stdout or something! > Perhaps I don't understand your comment 99, but it seems to me that you're > looking for the source of the OS.File xpcom error, which is great, but with > no certainty that this xpcom error causes the WebApps.jsm error. So in comment 85, I narrowed the failure down to the Worker.postMessage call made by OS.File throwing. Then I wanted to know more about _why_ that code is throwing, and comment 99 summarizes my debugging scenario to figure out the exact place where we return the first error code which then propagates through the callers until it reaches the JS caller. (Note that comment 99 misses a number of previous simpler try pushes, I basically started with sticking a small number of printf's in a bunch of places and move my way from that point to narrow down the failure point.) Given the fact that the code I linked to in comment 99 throws (and there is no doubt about that, see the printf outputs in the test output log), there is no doubt that the postMessage call is throwing (you can follow the C++ callers to get to PostMessage). And given how busted the current XPConnect error reporting story is, I am not surprised at all that something might be eating up that _PromiseWorker.jsm error message. But the fact remains that we know that Worker.postMessage() is throwing, we know that it is the one used by OS.File that is throwing, we know that OS.File doesn't properly protect against that exception, so it will propagate through the JS callers until it reaches the global scope of WebApps.jsm, possibly breaking the loading of that module, etc. (This last bit is the part that I was talking about before when I said I didn't investigate the exception propagation path very carefully.) > Am I missing something? > > Now, I'm going to try and fix the codepath that leads to this xpcom error. I > just don't have high hopes that it will also fix the current issue. Here's one idea, just make sure that the postMessage call cannot propagate its exceptions, push to try and see if the failure goes away, or changes drastically. If it does, that confirms my theory (although that will not necessarily be the right fix, and the race condition that Nathan was suspecting may manifest itself in any number of ways with that wallpaper fix.) (In reply to Marco Castelluccio [:marco] from comment #119) > (In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from > comment #118) > > Now, I'm going to try and fix the codepath that leads to this xpcom error. I > > just don't have high hopes that it will also fix the current issue. > > I did trigger five try runs using NetUtil instead of OS.File.writeAtomic in > the Webapps.jsm::_writeFile function, and they didn't fail. > This may be a coincidence, but IIRC these tests were failing with a high > probability, so five tests may be enough to think that the problem was > actually in the _writeFile function. > > The patch I sent to try was this: > https://bug981085.bugzilla.mozilla.org/attachment.cgi?id=8387863 > > It's also possible that it's a timing issue that wasn't visible and was > uncovered by using OS.File more than before. In my experience, this code is faily enough that 5 runs should probably show the failure, but to double check, you can push the same patch to try again and trigger 10-20 runs on it. If they all turn up green, we at least know which OS.File call site is the busted one.

Marco Castelluccio [:marco]

Comment 121

•

11 years ago

(In reply to :Ehsan Akhgari (needinfo? me!) from comment #120) > In my experience, this code is faily enough that 5 runs should probably show > the failure, but to double check, you can push the same patch to try again > and trigger 10-20 runs on it. If they all turn up green, we at least know > which OS.File call site is the busted one. I will do that in the next couple of days. Setting needinfo=me for a reminder.

Flags: needinfo?(mar.castelluccio)

(no longer active)

Comment 122

•

11 years ago

(In reply to Marco Castelluccio [:marco] from comment #121) > (In reply to :Ehsan Akhgari (needinfo? me!) from comment #120) > > In my experience, this code is faily enough that 5 runs should probably show > > the failure, but to double check, you can push the same patch to try again > > and trigger 10-20 runs on it. If they all turn up green, we at least know > > which OS.File call site is the busted one. > > I will do that in the next couple of days. Setting needinfo=me for a > reminder. FWIW my comment was addressed to David, since I believe he's already looking into this.

Marco Castelluccio [:marco]

Updated

•

11 years ago

Flags: needinfo?(mar.castelluccio)

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 123

•

11 years ago

Much to my bafflement, Try seems to suggest that bug 981875 is sufficient to fix this bug. Or has anything landed in the meantime that could explain the bug not appearing? https://tbpl.mozilla.org/?tree=Try&rev=757826e09eba

David Teller [:Yoric] - still alive but not very active

Assignee

Updated

•

11 years ago

Flags: needinfo?(mar.castelluccio)

Flags: needinfo?(ehsan)

(no longer active)

Comment 124

•

11 years ago

(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #123) > Much to my bafflement, Try seems to suggest that bug 981875 is sufficient to > fix this bug. Or has anything landed in the meantime that could explain the > bug not appearing? > > https://tbpl.mozilla.org/?tree=Try&rev=757826e09eba Hmm, why do you think the bug is gone away on this push? See: https://tbpl.mozilla.org/php/getParsedLog.php?id=36139883&tree=Try&full=1 https://tbpl.mozilla.org/php/getParsedLog.php?id=36139917&tree=Try&full=1 (Don't be misled by what TBPL suggests, these are all the same bug, see the debug spew and also the "DOMApplicationRegistry is not defined" error in the log, etc.)

Flags: needinfo?(ehsan)

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 125

•

11 years ago

Ok, my bad, back to tracking this bug.

Marco Castelluccio [:marco]

Updated

•

11 years ago

Flags: needinfo?(mar.castelluccio)

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 126

•

11 years ago

(In reply to Myk Melez [:myk] [@mykmelez] from comment #80) > (In reply to Patrick Wang (ChihKai) (:kk1fff) from comment #79) > > I don't really have an idea how this happen. But I would guess that there > > might be exception thrown during loading WebApps.jsm, so the symbol is not > > exposed by the lazy getter properly. > > The error that appears before "DOMApplicationRegistry is not defined" is: > > 01:20:05 INFO - > ************************************************************ > 01:20:05 INFO - * Call to xpconnect wrapped JSObject produced this > error: * > 01:20:25 INFO - [Exception... "Component returned failure code: > 0x80570016 (NS_ERROR_XPC_GS_RETURNED_FAILURE) [nsIJSCID.getService]" > nsresult: "0x80570016 (NS_ERROR_XPC_GS_RETURNED_FAILURE)" location: "JS > frame :: resource://gre/modules/XPCOMUtils.jsm :: XPCU_serviceLambda :: line > 202" data: no] > 01:20:25 INFO - > ************************************************************ I have instrumented the code a little further https://tbpl.mozilla.org/php/getParsedLog.php?id=36451003&tree=Try&full=1#error0 According to the logs, this error is due to @mozilla.org/parentprocessmessagemanager;1 and it could very well explain that "DOMApplicationRegistry is not defined". Still trying to understand the ChromeWorker-related problems, but I figure we have a more likely suspect for all our trouble.

Flags: needinfo?(ehsan)

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 127

•

11 years ago

Full stack: 17:01:27 INFO - Call stack: XPCU_serviceLambda@resource://gre/modules/XPCOMUtils.jsm:211:11 17:01:27 INFO - XPCU_defineLazyGetter/<.get@resource://gre/modules/XPCOMUtils.jsm:177:9 17:01:27 INFO - activities_init/<@resource://gre/modules/ActivitiesService.jsm:170:7 17:01:27 INFO - activities_init@resource://gre/modules/ActivitiesService.jsm:171:1 17:01:27 INFO - @resource://gre/modules/ActivitiesService.jsm:412:1 17:01:27 INFO - @resource://gre/modules/Webapps.jsm:39:1 17:01:27 INFO - @resource://gre/modules/BrowserElementParent.jsm:24:3 17:01:27 INFO - XPCU_defineLazyGetter/<.get@resource://gre/modules/XPCOMUtils.jsm:177:9 17:01:27 INFO - BrowserElementParent.prototype._registerAppManifest@resource://gre/modules/BrowserElementParent.jsm:238:1 17:01:27 INFO - BrowserElementParent@resource://gre/modules/BrowserElementParent.jsm:202:5 17:01:27 INFO - create@resource://gre/modules/BrowserElementParent.jsm:91:5 17:01:27 INFO - BrowserElementParentFactory.prototype._createBrowserElementParent@jar:file:///system/b2g/omni.ja!/components/BrowserElementParent.js:106:1 17:01:27 INFO - BrowserElementParentFactory.prototype._observeInProcessBrowserFrameShown@jar:file:///system/b2g/omni.ja!/components/BrowserElementParent.js:89:1 17:01:27 INFO - BrowserElementParentFactory.prototype.observe@jar:file:///system/b2g/omni.ja!/components/BrowserElementParent.js:123:7 17:01:27 INFO - checkAppState@http://mochi.test:8888/tests/dom/apps/tests/test_app_update.html:259:5 17:01:27 INFO - runTest@http://mochi.test:8888/tests/dom/apps/tests/test_app_update.html:92:5 17:01:27 INFO - continueTest@http://mochi.test:8888/tests/dom/apps/tests/test_app_update.html:32:11 17:01:27 INFO - WebappsRegistry.prototype.receiveMessage@jar:file:///system/b2g/omni.ja!/components/Webapps.js:47:9

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 128

•

11 years ago

I'm in the process of tracking down the WorkerPrivate/WorkerRuntime uncaught errors. Some of them are indeed due to a race condition around the B2G worker reset (which would explain why we didn't notice them before) and some due to internal race conditions in workers, in which we attempt to send connection information to workers although the workers are shutting down. Patch in progress. We'll see if it fixes the current bug.

(no longer active)

Comment 129

•

11 years ago

(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #126) > (In reply to Myk Melez [:myk] [@mykmelez] from comment #80) > > (In reply to Patrick Wang (ChihKai) (:kk1fff) from comment #79) > > > I don't really have an idea how this happen. But I would guess that there > > > might be exception thrown during loading WebApps.jsm, so the symbol is not > > > exposed by the lazy getter properly. > > > > The error that appears before "DOMApplicationRegistry is not defined" is: > > > > 01:20:05 INFO - > > ************************************************************ > > 01:20:05 INFO - * Call to xpconnect wrapped JSObject produced this > > error: * > > 01:20:25 INFO - [Exception... "Component returned failure code: > > 0x80570016 (NS_ERROR_XPC_GS_RETURNED_FAILURE) [nsIJSCID.getService]" > > nsresult: "0x80570016 (NS_ERROR_XPC_GS_RETURNED_FAILURE)" location: "JS > > frame :: resource://gre/modules/XPCOMUtils.jsm :: XPCU_serviceLambda :: line > > 202" data: no] > > 01:20:25 INFO - > > ************************************************************ > > I have instrumented the code a little further > https://tbpl.mozilla.org/php/getParsedLog. > php?id=36451003&tree=Try&full=1#error0 > > According to the logs, this error is due to > @mozilla.org/parentprocessmessagemanager;1 and it could very well explain > that "DOMApplicationRegistry is not defined". Hmm, reading NS_NewParentProcessesMessageManager <http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsFrameMessageManager.cpp#1834>, it seems like the only two scenarios for that function to fail are either if we are in the content process or if QueryInterface fails. The latter is very unlikely because nsFrameMessageManager definitely implements nsIMessageBroadcaster: <http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsFrameMessageManager.cpp#108> (Both mChrome and mIsBroadcaster will be true there.) Fabrice, is there any possibility of Webapps.jsm being loaded into a content process? If so, then is it expected that accessing ppmm in Webapps.jsm to throw?

Flags: needinfo?(ehsan) → needinfo?(fabrice)

[:fabrice] Fabrice Desré

Comment 130

•

11 years ago

So, Webapps.jsm should *never* be loaded in a content process. That looks a lot like the same issue as bug 902165 I have patch that may work here, let me find it.

Flags: needinfo?(fabrice)

[:fabrice] Fabrice Desré

Comment 131

•

11 years ago

Attached patch no webapps.jsm in content processes — Details — Splinter Review

Here it is. I can move it to bug 902165 it that's more in-context.

Attachment #8395033 - Flags: feedback?(bent.mozilla)

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 132

•

11 years ago

Comment on attachment 8395033 [details] [diff] [review] no webapps.jsm in content processes Review of attachment 8395033 [details] [diff] [review]: ----------------------------------------------------------------- Seems ok to me without the commented code :)

Attachment #8395033 - Flags: feedback?(bent.mozilla) → feedback+

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 133

•

11 years ago

I don't understand why this is intermittent though.

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 134

•

11 years ago

There is more than one bug in the logs, so that might not be the [only?] source of failures.

(no longer active)

Updated

•

11 years ago

Depends on: 902165

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 135

•

11 years ago

Attached patch Reworking OS.File reset/shutdown (obsolete) — Details — Splinter Review

I believe that this should take care of the XPCOM errors.

Attachment #8395614 - Flags: review?(nfroyd)

(no longer active)

Comment 136

•

11 years ago

(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #135) > Created attachment 8395614 [details] [diff] [review] > Reworking OS.File reset/shutdown > > I believe that this should take care of the XPCOM errors. Can you please post a link to the try server run? Thanks!

Nathan Froyd [:froydnj]

Comment 137

•

11 years ago

Comment on attachment 8395614 [details] [diff] [review] Reworking OS.File reset/shutdown Review of attachment 8395614 [details] [diff] [review]: ----------------------------------------------------------------- Apologies on the review lag, between work week and getting sick... I have questions about bits of the patch. ::: toolkit/components/osfile/modules/osfile_async_front.jsm @@ +245,5 @@ > } > > let {openedFiles, openedDirectoryIterators, killed} = resources; > + if (!reset && (openedFiles && openedFiles.length > + || ( openedDirectoryIterators && openedDirectoryIterators.length))) { Nit: the indentation on this conditional is bizarre. Make it more like it was before, please. @@ +269,2 @@ > > + this.shutdown = shutdown; Should this be killed || shutdown? @@ -380,5 @@ > if (firstLaunch) { > Scheduler._updateTelemetry(); > } > - > - Scheduler.restartTimer(); So in my m-c pull, this is actually conditionalized. Where did the conditional go prior to this patch? @@ +522,5 @@ > // configured for testing by the testsuite. > let phase = AsyncShutdown._getPhase(TOPIC); > phase.addBlocker( > "(for testing purposes) OS.File: warn about unclosed files", > + () => Scheduler.kill({shutdown: false, reset: false}) Why are we changing the reset behavior here? @@ +1320,2 @@ > File.resetWorker = function() { > + return Task.spawn(function*() { Why are we making this async?

Attachment #8395614 - Flags: review?(nfroyd) → feedback+

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 138

•

11 years ago

(In reply to Nathan Froyd (:froydnj) from comment #137) > @@ +269,2 @@ > > > > + this.shutdown = shutdown; > > Should this be killed || shutdown? No, |killed| informs us that the thread was killed, but in a way that will let us resurrect it. On the other hand, |this.shutdown| informs us that we should reject any new message that could possibly resurrect the worker. Different stuff. > @@ -380,5 @@ > > if (firstLaunch) { > > Scheduler._updateTelemetry(); > > } > > - > > - Scheduler.restartTimer(); > > So in my m-c pull, this is actually conditionalized. Where did the > conditional go prior to this patch? Mmmh.... Obviously, I forgot to fold stuff. > > @@ +522,5 @@ > > // configured for testing by the testsuite. > > let phase = AsyncShutdown._getPhase(TOPIC); > > phase.addBlocker( > > "(for testing purposes) OS.File: warn about unclosed files", > > + () => Scheduler.kill({shutdown: false, reset: false}) > > Why are we changing the reset behavior here? > > @@ +1320,2 @@ > > File.resetWorker = function() { > > + return Task.spawn(function*() { > > Why are we making this async? I believe that this has always been async.

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 139

•

11 years ago

Attached patch Reworking OS.File reset/shutdown, v2 (obsolete) — Details — Splinter Review

Properly folded, this time. Try: https://tbpl.mozilla.org/?tree=Try&rev=723171e4d7ec

Attachment #8395614 - Attachment is obsolete: true

Attachment #8397835 - Flags: review?(nfroyd)

(no longer active)

Comment 140

•

11 years ago

(Note that with bug 981085 being fixed, you need to back out the patch for that before testing this on try.)

Nathan Froyd [:froydnj]

Updated

•

11 years ago

Attachment #8397835 - Flags: review?(nfroyd) → review+

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 141

•

11 years ago

(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #140) > (Note that with bug 981085 being fixed, you need to back out the patch for > that before testing this on try.) Yes, my test was using a tree prior landing of bug 981085. We now need to land "Reworking OS.File reset/shutdown, v2."

Keywords: checkin-needed

Whiteboard: [land Reworking OS.File reset/shutdown, v2]

Ryan VanderMeulen [:RyanVM]

Comment 142

•

11 years ago

https://hg.mozilla.org/integration/fx-team/rev/4eb0d4408fd2

Assignee: nobody → dteller

Flags: in-testsuite+

Keywords: checkin-needed

Whiteboard: [land Reworking OS.File reset/shutdown, v2] → [fixed-in-fx-team]

Ryan VanderMeulen [:RyanVM]

Comment 143

•

11 years ago

Backed out for Windows xpcshell failures. https://hg.mozilla.org/integration/fx-team/rev/e47bd7fca2cf https://tbpl.mozilla.org/php/getParsedLog.php?id=37072948&tree=Fx-Team

Whiteboard: [fixed-in-fx-team]

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 144

•

11 years ago

Attached patch Reworking OS.File reset/shutdown, v3 (obsolete) — Details — Splinter Review

Trivial testfix Try: https://tbpl.mozilla.org/?tree=Try&rev=a3468330408d

Attachment #8397835 - Attachment is obsolete: true

Attachment #8400537 - Flags: review+

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 145

•

11 years ago

Attached patch Reworking OS.File reset/shutdown, v4 — Details — Splinter Review

Ah, of course, it works better if I apply my trivial fix in both places where the problem occurs.

Attachment #8400537 - Attachment is obsolete: true

Attachment #8400584 - Flags: review+

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 146

•

11 years ago

Try: https://tbpl.mozilla.org/?tree=Try&rev=9335a5b4c44c

Keywords: checkin-needed

Whiteboard: [land Reworking OS.File reset/shutdown, v4]

Ryan VanderMeulen [:RyanVM]

Comment 147

•

11 years ago

https://hg.mozilla.org/integration/fx-team/rev/aa93afa78cd5

Keywords: checkin-needed

Whiteboard: [land Reworking OS.File reset/shutdown, v4] → [fixed-in-fx-team]

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Reporter

Comment 148

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/aa93afa78cd5

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Whiteboard: [fixed-in-fx-team]

Target Milestone: --- → mozilla31

Firefox Commit Bot

Comment 149

•

11 years ago

Ryan VanderMeulen [:RyanVM]

Comment 150

•

11 years ago

Do we want this on Aurora for v1.4?

status-b2g-v1.3: --- → unaffected

status-b2g-v1.4: --- → affected

status-b2g-v2.0: --- → fixed

status-firefox29: --- → unaffected

status-firefox30: --- → affected

status-firefox31: --- → fixed

status-firefox-esr24: --- → unaffected

Flags: needinfo?(dteller)

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 151

•

11 years ago

Deflecting to ehsan.

Flags: needinfo?(ehsan)

(no longer active)

Comment 152

•

11 years ago

(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #150) > Do we want this on Aurora for v1.4? Yes, I think so.

Flags: needinfo?(ehsan)

Ryan VanderMeulen [:RyanVM]

Comment 153

•

11 years ago

In that case, can I please get a branch nomination? :)

Comment hidden (Legacy TBPL/Treeherder Robot)

David Teller [:Yoric] - still alive but not very active

Assignee

Comment 155

•

11 years ago

Comment on attachment 8400584 [details] [diff] [review] Reworking OS.File reset/shutdown, v4 [Approval Request Comment] Bug caused by (feature/regressing bug #): bug 927560 User impact if declined: Possible freezes/crashes in b2g Testing completed (on m-c, etc.): Has landed 6 days ago. Risk to taking this patch (and alternatives if risky): If it doesn't work, we will find out quickly. We can back this out in case of problem. String or IDL/UUID changes made by this patch: none

Attachment #8400584 - Flags: approval-mozilla-aurora?

Flags: needinfo?(dteller)

Lukas Blakk [:lsblakk] use ?needinfo

Updated

•

11 years ago

Attachment #8400584 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Ryan VanderMeulen [:RyanVM]

Comment 156

•

11 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/8e92bb5cb712

status-b2g-v1.4: affected → fixed

status-firefox30: affected → fixed

David Teller [:Yoric] - still alive but not very active

Assignee

Updated

•

10 years ago

Blocks: 927560

Ben Kelly [:bkelly, not reviewing]

Updated

•

8 years ago

Blocks: 1279389

BMO Automation

Updated

•

1 year ago

Product: Toolkit → Toolkit Graveyard

no webapps.jsm in content processes 11 years ago [:fabrice] Fabrice Desré 4.30 KB, patch	bent.mozilla : feedback+	Details \| Diff \| Splinter Review
Reworking OS.File reset/shutdown 11 years ago David Teller [:Yoric] - still alive but not very active 11.45 KB, patch	froydnj : feedback+	Details \| Diff \| Splinter Review
Reworking OS.File reset/shutdown, v2 11 years ago David Teller [:Yoric] - still alive but not very active 14.42 KB, patch	froydnj : review+	Details \| Diff \| Splinter Review
Reworking OS.File reset/shutdown, v3 11 years ago David Teller [:Yoric] - still alive but not very active 15.37 KB, patch	Yoric : review+	Details \| Diff \| Splinter Review
Reworking OS.File reset/shutdown, v4 11 years ago David Teller [:Yoric] - still alive but not very active 16.02 KB, patch	Yoric : review+ lsblakk : approval-mozilla-aurora+	Details \| Diff \| Splinter Review