Closed Bug 1300927 Opened 8 years ago Closed 7 years ago

Intermittent dom/indexedDB/test/test_filehandle_workers.html | Test timed out.

Categories

(Core :: Storage: IndexedDB, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox52 --- fixed
firefox-esr52 --- fixed
firefox53 --- fixed
firefox54 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: bevis)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed])

Attachments

(1 file, 1 obsolete file)

Recent failures spiked around Jan 29. They are all linux and all e10s. Logs look like:

[task 2017-02-03T15:43:36.003249Z] 15:43:36     INFO - TEST-START | dom/indexedDB/test/test_filehandle_workers.html
[task 2017-02-03T15:48:40.447063Z] 15:48:40     INFO - TEST-INFO | started process screentopng
[task 2017-02-03T15:48:41.076205Z] 15:48:41     INFO - TEST-INFO | screentopng: exit 0
[task 2017-02-03T15:48:41.079101Z] 15:48:41     INFO - Buffered messages logged at 15:43:36
[task 2017-02-03T15:48:41.079305Z] 15:48:41     INFO - Running
[task 2017-02-03T15:48:41.079543Z] 15:48:41     INFO - Pushing preferences
[task 2017-02-03T15:48:41.081408Z] 15:48:41     INFO - Pushing permissions
[task 2017-02-03T15:48:41.085256Z] 15:48:41     INFO - Clearing old databases
[task 2017-02-03T15:48:41.087230Z] 15:48:41     INFO - TEST-FAIL | dom/indexedDB/test/test_filehandle_workers.html | Skipping test in a worker because it's not structured properly 
[task 2017-02-03T15:48:41.089044Z] 15:48:41     INFO - Running test in main thread
[task 2017-02-03T15:48:41.090928Z] 15:48:41     INFO - TEST-PASS | dom/indexedDB/test/test_filehandle_workers.html | Correct byte length 
[task 2017-02-03T15:48:41.092829Z] 15:48:41     INFO - TEST-PASS | dom/indexedDB/test/test_filehandle_workers.html | Got correct event type 
[task 2017-02-03T15:48:41.094844Z] 15:48:41     INFO - TEST-PASS | dom/indexedDB/test/test_filehandle_workers.html | Got correct event type 
[task 2017-02-03T15:48:41.098279Z] 15:48:41     INFO - TEST-PASS | dom/indexedDB/test/test_filehandle_workers.html | Got exception. 
[task 2017-02-03T15:48:41.100377Z] 15:48:41     INFO - TEST-PASS | dom/indexedDB/test/test_filehandle_workers.html | Good error. 
[task 2017-02-03T15:48:41.102327Z] 15:48:41     INFO - TEST-PASS | dom/indexedDB/test/test_filehandle_workers.html | Good error code. 
[task 2017-02-03T15:48:41.104229Z] 15:48:41     INFO - TEST-PASS | dom/indexedDB/test/test_filehandle_workers.html | Got exception. 
[task 2017-02-03T15:48:41.106110Z] 15:48:41     INFO - TEST-PASS | dom/indexedDB/test/test_filehandle_workers.html | Good error. 
[task 2017-02-03T15:48:41.107922Z] 15:48:41     INFO - TEST-PASS | dom/indexedDB/test/test_filehandle_workers.html | Good error code. 
[task 2017-02-03T15:48:41.109694Z] 15:48:41     INFO - Buffered messages finished
[task 2017-02-03T15:48:41.111666Z] 15:48:41     INFO - TEST-UNEXPECTED-FAIL | dom/indexedDB/test/test_filehandle_workers.html | Test timed out.
:hsinyi, this failure is showing up often in automation primarily on linux64 (all configs- debug/opt/pgo) and on linux32 as well.

As indicated in this bug this started about 2 weeks ago.  I believe there are a few other indexedDB bugs, although this one is related to a time out and the most frequent time out bug for indexeddb.

Can you take a look at this failure and get it resolved in the next 2 weeks?
Flags: needinfo?(htsai)
(In reply to Joel Maher ( :jmaher) from comment #4)
> :hsinyi, this failure is showing up often in automation primarily on linux64
> (all configs- debug/opt/pgo) and on linux32 as well.
> 
> As indicated in this bug this started about 2 weeks ago.  I believe there
> are a few other indexedDB bugs, although this one is related to a time out
> and the most frequent time out bug for indexeddb.
> 
> Can you take a look at this failure and get it resolved in the next 2 weeks?

Defer to Bevis for deeper investigation.
Flags: needinfo?(htsai) → needinfo?(btseng)
take this to follow up.
Assignee: nobody → btseng
Flags: needinfo?(btseng)
2 symptoms after adding some logs and run the test in treeherder:
1. The request of adding a IDBMutableFile into an ObjectStore was never returned.
2. If it returns, the next IDBFactory::Open() will never be returned.
I'll add more logs to narrow down the problem.
(In reply to Bevis Tseng[:bevistseng][:btseng] from comment #9)
> 2 symptoms after adding some logs and run the test in treeherder:
> 1. The request of adding a IDBMutableFile into an ObjectStore was never
> returned.
> 2. If it returns, the next IDBFactory::Open() will never be returned.
> I'll add more logs to narrow down the problem.

In today's log, I found that either RecvPBackgroundIDBRequestConstructor or RecvPBackgroundIDBFactoryRequestConstructor was never been called after their SendXxxConstructors were called when these 2 symptoms happened.

It seems that the PBackground thread in parent was stuck.
I'll set name of each PBackground Thread runnable to see which one is not returned.
Update finding so far. (BTW, this symptom can be reproduced more easier with rr chaos mode.)
In short, when the symptom happened, the *IPDL BackgroundThread* in parent was stuck at ConnectActorRunnable which seems unexpected according to the comment:
http://searchfox.org/mozilla-central/rev/cac6cb6a10afb8ebb2ecfbeeedaff7c66f57dd75/ipc/glue/MessageLink.cpp#125

Need to figure why this only happened in this test case.

#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000000bc84d7 in PR_WaitCondVar (cvar=0x1f642c0, timeout=4294967295)
    at /home/bevis/Projects/Build/gecko/src/nsprpub/pr/src/pthreads/ptsynch.c:396
#2  0x00003ec54b346d8c in mozilla::CondVar::Wait (this=0x44ee40e75ee0, aInterval=4294967295)
    at /home/bevis/Projects/Build/gecko/src/xpcom/threads/BlockingResourceBase.cpp:499
#3  0x00003ec54b312bb0 in mozilla::Monitor::Wait (this=0x44ee40e75ec0, aInterval=4294967295)
    at /home/bevis/Projects/Build/gecko/src/obj-firefox-central/dist/include/mozilla/Monitor.h:40
#4  0x00003ec54bb690ed in mozilla::ipc::ProcessLink::Open (this=0x27d2689f6e50, 
    aTransport=0x3534a51b700, aIOLoop=0x79bc5964bd20, aSide=mozilla::ipc::ParentSide)
    at /home/bevis/Projects/Build/gecko/src/ipc/glue/MessageLink.cpp:125
#5  0x00003ec54bb61a75 in mozilla::ipc::MessageChannel::Open (this=0x27d268a9a0f8, 
    aTransport=0x3534a51b700, aIOLoop=0x79bc5964bd20, aSide=mozilla::ipc::ParentSide)
    at /home/bevis/Projects/Build/gecko/src/ipc/glue/MessageChannel.cpp:675
#6  0x00003ec54bb86229 in mozilla::ipc::IToplevelProtocol::Open (this=0x27d268a9a000, 
    aTransport=0x3534a51b700, aOtherPid=19663, aThread=0x79bc5964bd20, aSide=mozilla::ipc::ParentSide)
    at /home/bevis/Projects/Build/gecko/src/ipc/glue/ProtocolUtils.cpp:594
#7  0x00003ec54bb751dd in mozilla::ipc::Endpoint<mozilla::ipc::PBackgroundParent>::Bind (
    this=0x70891fa4ebd0, aActor=0x27d268a9a000)
    at /home/bevis/Projects/Build/gecko/src/obj-firefox-central/dist/include/mozilla/ipc/ProtocolUtils.h:612
#8  0x00003ec54bb575fc in (anonymous namespace)::ParentImpl::ConnectActorRunnable::Run (
    this=0x236e66b7c440) at /home/bevis/Projects/Build/gecko/src/ipc/glue/BackgroundImpl.cpp:1459
#9  0x00003ec54b374891 in nsThread::ProcessNextEvent (this=0x7f965ac35cc0, aMayWait=true, 
    aResult=0x70891fa4ed07) at /home/bevis/Projects/Build/gecko/src/xpcom/threads/nsThread.cpp:1261
#10 0x00003ec54b379d83 in NS_ProcessNextEvent (aThread=0x7f965ac35cc0, aMayWait=true)
treeherder shows the same result as comment 11 in raw logs:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=74be24b91d395e2a3e9a5a94aeac0e16ed335736&selectedJob=77904099

The last runnable in Background thread is printed before Run() but was never returned from Run():
> BVS, Background Runnable: ParentImpl::ConnectActorRunnable
Otherwise, will see something like this:
> BVS, Background Runnable: ParentImpl::ConnectActorRunnable is finished.
The ConnectActorRunnable is related to the newly initiated BackgroundActor in worker thread from child that is going to be tested in next step. There is no reason why this runnable keep waiting the change from IO loop.

Since there is not much clue that shows anything wrong in IDB and this symptom can be reproduced much easier by rr chaos mode, I'll bisect to figure out which change set is more related to this problem.
Bug 1329797 landed on January 28th and involved PBackground, that might be a good starting point to check out.
Thanks! I'll try it! :)
Unfortunately, this symptom seems to be existed for a long time but becomes more frequent in these 2 weeks. Even I bisect from the push log on 1-1-2017, it's still reproducible. :(

Hence, I looked into why the IPDL Background Thread was stuck in comment 11, and I found something weird.
In parent process, the reason why the newly created actor was block at [1] is that even though ProcessLink::OnChannelOpened was invoked in IOLoop to set the mChannelState to *ChannelOpening* and then notify the monitor, the MessageChannel::MaybeInterceptSpecialIOMessage() is called unexpectedly (that means shutdown is ongoing) and change the mChannelState to *ChannelClosing* before while (!mChan->Connected() && ...) is check!

So the next step is to figure out why *shutdown* was notified during testing.

[1] http://searchfox.org/mozilla-central/rev/cac6cb6a10afb8ebb2ecfbeeedaff7c66f57dd75/ipc/glue/MessageLink.cpp#125
I think the possible reason we hit this intermittent failure is that, in test_filehandle_workers.html, we new a worker and terminated it immediately. This could be a race to close the channel before the parent side acknowledge the *ChannelOpening* state change. :(
(In reply to Bevis Tseng[:bevistseng][:btseng] from comment #17)
> I think the possible reason we hit this intermittent failure is that, in
> test_filehandle_workers.html, we new a worker and terminated it immediately.
> This could be a race to close the channel before the parent side acknowledge
> the *ChannelOpening* state change. :(

http://searchfox.org/mozilla-central/rev/ea31c4b64f34a29415a69fb768f8051495547315/dom/indexedDB/test/test_filehandle_workers.html#80-91
This modification to the original test case proves what I have investigated from comment 11 to comment 18:
If terminating the worker right after its instantiation, there will be a race in parent side to close the ipc channel before acknowledging *ChannelOpening* to the newly created parent actor.

Hi Ben,

Do you have any suggestion for how to fix this in worker in a short time?
Otherwise, I might have to work around it with this modification to prevent this test case being disabled.
Flags: needinfo?(bkelly)
Bevis, can you point me at the code that is starting the operation in the Worker thread?  Its unclear from the comments if you are referring to an IDB protocol actor or the PBackground actor creation itself.

The short answer, though, is that we should be using a WorkerHolder to prevent the worker thread from being destroyed during this operation.
Flags: needinfo?(bkelly) → needinfo?(btseng)
(In reply to Ben Kelly [not reviewing due to deadline][:bkelly] from comment #20)
> Bevis, can you point me at the code that is starting the operation in the
> Worker thread?  Its unclear from the comments if you are referring to an IDB
> protocol actor or the PBackground actor creation itself.
The problematic one is this:
http://searchfox.org/mozilla-central/rev/ea31c4b64f34a29415a69fb768f8051495547315/dom/indexedDB/test/test_filehandle_workers.html#80-91
which tried to post a FileHandle to worker thread but shall be returned immediately as expected with exception of DATA_CLONE_ERR. Hence, there should be no operation in worker thread.

Besides, while tracing, the problematic PBackgroundChild are both created and closed at WorkerThreadPrimaryRunnable::Run():
http://searchfox.org/mozilla-central/source/dom/workers/RuntimeService.cpp#2810,2885
So, I am pretty sure that it's not the one for IDB's 1st open request on current thread but the one for newly created worker thread.

> The short answer, though, is that we should be using a WorkerHolder to
> prevent the worker thread from being destroyed during this operation.

Thanks! I'll try WorkerHolder to see if this can be improved.

However, does this means we still have this start-and-immediately-stop problem in worker?
Flags: needinfo?(btseng)
(In reply to Bevis Tseng[:bevistseng][:btseng] from comment #21)
> Thanks! I'll try WorkerHolder to see if this can be improved.
It seems that the WorkerHolder is something in implementation level and 
> However, does this means we still have this start-and-immediately-stop
> problem in worker?

It might take me too much time to fix this in worker.
I'd like to safe this test case first from being disabled. (The test criteria is more related to DATA_CLONE_ERR test of the FileHandle.)

Bug 1340941 was filed to address this problem instead.
(In reply to Bevis Tseng[:bevistseng][:btseng] from comment #23)
> It might take me too much time to fix this in worker.
After thinking twice, it shall be more about an ipc implementation issue according to comment 16.
> Bug 1340941 was filed to address this problem instead.
The component category of Bug 1340941 has been updated to *IPC* instead.
According to the analysis in comment 16 to comment 18, bug 1340941 is filed to address the potential race in ipc protocol.

Since the timing of worker.terminate() is the test criteria in test_filehandle_worker.html, I'd like to workaround bug 1340941 first to prevent being disabled from auto-test.

Hi Jan,

May I have your review for this work around?

Thanks!
Attachment #8838636 - Attachment is obsolete: true
Attachment #8839046 - Flags: review?(jvarga)
(In reply to Bevis Tseng[:bevistseng][:btseng] from comment #25)
> Since the timing of worker.terminate() is the test criteria in
                                           ^ not
BTW, it's not reproducible anymore with the workaround in 100+ tests in treeherder on linux{32,x64} debug build:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8d9a4023ef1b57ff82336fe265fb0381023d698b&group_state=expanded
Comment on attachment 8839046 [details] [diff] [review]
Patch: Terminate all workers at the end of the test to work around bug 1340941.

Review of attachment 8839046 [details] [diff] [review]:
-----------------------------------------------------------------

Ok
Attachment #8839046 - Flags: review?(jvarga) → review+
Pushed by btseng@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3f4226570a86
Terminate all workers at the end of the test to work around bug 1340941. r=janv
https://hg.mozilla.org/mozilla-central/rev/3f4226570a86
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Whiteboard: [stockwell fixed]
You need to log in before you can comment on or make changes to this bug.