1300927 - Intermittent dom/indexedDB/test/test_filehandle_workers.html | Test timed out.

Reporter

Description

•

8 years ago

treeherder

Filed by: wkocher [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=35393102&repo=mozilla-inbound

https://queue.taskcluster.net/v1/task/Rkp6o-AqQQu_0SjfUnkgbw/runs/0/artifacts/public%2Flogs%2Flive_backing.log

Comment hidden (Intermittent Failures Robot)

6 failures in 749 pushes (0.008 failures/push) were associated with this bug in the last 7 days.  

Repository breakdown:
* mozilla-inbound: 4
* autoland: 2

Platform breakdown:
* linux64: 4
* linux32: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1300927&startday=2017-01-23&endday=2017-01-29&tree=all

Geoff Brown [:gbrown]

Comment 2

•

7 years ago

Recent failures spiked around Jan 29. They are all linux and all e10s. Logs look like:

[task 2017-02-03T15:43:36.003249Z] 15:43:36     INFO - TEST-START | dom/indexedDB/test/test_filehandle_workers.html
[task 2017-02-03T15:48:40.447063Z] 15:48:40     INFO - TEST-INFO | started process screentopng
[task 2017-02-03T15:48:41.076205Z] 15:48:41     INFO - TEST-INFO | screentopng: exit 0
[task 2017-02-03T15:48:41.079101Z] 15:48:41     INFO - Buffered messages logged at 15:43:36
[task 2017-02-03T15:48:41.079305Z] 15:48:41     INFO - Running
[task 2017-02-03T15:48:41.079543Z] 15:48:41     INFO - Pushing preferences
[task 2017-02-03T15:48:41.081408Z] 15:48:41     INFO - Pushing permissions
[task 2017-02-03T15:48:41.085256Z] 15:48:41     INFO - Clearing old databases
[task 2017-02-03T15:48:41.087230Z] 15:48:41     INFO - TEST-FAIL | dom/indexedDB/test/test_filehandle_workers.html | Skipping test in a worker because it's not structured properly 
[task 2017-02-03T15:48:41.089044Z] 15:48:41     INFO - Running test in main thread
[task 2017-02-03T15:48:41.090928Z] 15:48:41     INFO - TEST-PASS | dom/indexedDB/test/test_filehandle_workers.html | Correct byte length 
[task 2017-02-03T15:48:41.092829Z] 15:48:41     INFO - TEST-PASS | dom/indexedDB/test/test_filehandle_workers.html | Got correct event type 
[task 2017-02-03T15:48:41.094844Z] 15:48:41     INFO - TEST-PASS | dom/indexedDB/test/test_filehandle_workers.html | Got correct event type 
[task 2017-02-03T15:48:41.098279Z] 15:48:41     INFO - TEST-PASS | dom/indexedDB/test/test_filehandle_workers.html | Got exception. 
[task 2017-02-03T15:48:41.100377Z] 15:48:41     INFO - TEST-PASS | dom/indexedDB/test/test_filehandle_workers.html | Good error. 
[task 2017-02-03T15:48:41.102327Z] 15:48:41     INFO - TEST-PASS | dom/indexedDB/test/test_filehandle_workers.html | Good error code. 
[task 2017-02-03T15:48:41.104229Z] 15:48:41     INFO - TEST-PASS | dom/indexedDB/test/test_filehandle_workers.html | Got exception. 
[task 2017-02-03T15:48:41.106110Z] 15:48:41     INFO - TEST-PASS | dom/indexedDB/test/test_filehandle_workers.html | Good error. 
[task 2017-02-03T15:48:41.107922Z] 15:48:41     INFO - TEST-PASS | dom/indexedDB/test/test_filehandle_workers.html | Good error code. 
[task 2017-02-03T15:48:41.109694Z] 15:48:41     INFO - Buffered messages finished
[task 2017-02-03T15:48:41.111666Z] 15:48:41     INFO - TEST-UNEXPECTED-FAIL | dom/indexedDB/test/test_filehandle_workers.html | Test timed out.

Comment hidden (Intermittent Failures Robot)

50 failures in 733 pushes (0.068 failures/push) were associated with this bug in the last 7 days. 

This is the #30 most frequent failure this week. 

** This failure happened more than 50 times this week! Resolving this bug is a high priority. **

Repository breakdown:
* autoland: 22
* mozilla-inbound: 21
* try: 5
* mozilla-central: 1
* graphics: 1

Platform breakdown:
* linux64: 43
* linux32: 7

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1300927&startday=2017-01-30&endday=2017-02-05&tree=all

Joel Maher ( :jmaher ) (UTC -8)

Comment 4

•

7 years ago

:hsinyi, this failure is showing up often in automation primarily on linux64 (all configs- debug/opt/pgo) and on linux32 as well.

As indicated in this bug this started about 2 weeks ago.  I believe there are a few other indexedDB bugs, although this one is related to a time out and the most frequent time out bug for indexeddb.

Can you take a look at this failure and get it resolved in the next 2 weeks?

Flags: needinfo?(htsai)

Comment hidden (Intermittent Failures Robot)

47 failures in 836 pushes (0.056 failures/push) were associated with this bug in the last 7 days. 

This is the #42 most frequent failure this week. 

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. **

Repository breakdown:
* mozilla-inbound: 27
* autoland: 13
* try: 3
* mozilla-central: 3
* graphics: 1

Platform breakdown:
* linux64: 26
* linux32: 21

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1300927&startday=2017-02-06&endday=2017-02-12&tree=all

Hsin-Yi Tsai (she/her) [:hsinyi]

Comment 6

•

7 years ago

(In reply to Joel Maher ( :jmaher) from comment #4)
> :hsinyi, this failure is showing up often in automation primarily on linux64
> (all configs- debug/opt/pgo) and on linux32 as well.
> 
> As indicated in this bug this started about 2 weeks ago.  I believe there
> are a few other indexedDB bugs, although this one is related to a time out
> and the most frequent time out bug for indexeddb.
> 
> Can you take a look at this failure and get it resolved in the next 2 weeks?

Defer to Bevis for deeper investigation.

Flags: needinfo?(htsai) → needinfo?(btseng)

Bevis Tseng[:bevis][:btseng](Exited)

Assignee

Comment 7

•

7 years ago

take this to follow up.

Assignee: nobody → btseng

Flags: needinfo?(btseng)

Comment hidden (Intermittent Failures Robot)

24 failures in 161 pushes (0.149 failures/push) were associated with this bug yesterday.  
Repository breakdown:
* mozilla-inbound: 8
* graphics: 7
* autoland: 7
* try: 1
* mozilla-central: 1

Platform breakdown:
* linux64: 17
* linux32: 6
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1300927&startday=2017-02-14&endday=2017-02-14&tree=all

Bevis Tseng[:bevis][:btseng](Exited)

Assignee

Comment 9

•

7 years ago

2 symptoms after adding some logs and run the test in treeherder:
1. The request of adding a IDBMutableFile into an ObjectStore was never returned.
2. If it returns, the next IDBFactory::Open() will never be returned.
I'll add more logs to narrow down the problem.

Bevis Tseng[:bevis][:btseng](Exited)

Assignee

Comment 10

•

7 years ago

(In reply to Bevis Tseng[:bevistseng][:btseng] from comment #9)
> 2 symptoms after adding some logs and run the test in treeherder:
> 1. The request of adding a IDBMutableFile into an ObjectStore was never
> returned.
> 2. If it returns, the next IDBFactory::Open() will never be returned.
> I'll add more logs to narrow down the problem.

In today's log, I found that either RecvPBackgroundIDBRequestConstructor or RecvPBackgroundIDBFactoryRequestConstructor was never been called after their SendXxxConstructors were called when these 2 symptoms happened.

It seems that the PBackground thread in parent was stuck.
I'll set name of each PBackground Thread runnable to see which one is not returned.

Bevis Tseng[:bevis][:btseng](Exited)

Assignee

Comment 11

•

7 years ago

Update finding so far. (BTW, this symptom can be reproduced more easier with rr chaos mode.)
In short, when the symptom happened, the *IPDL BackgroundThread* in parent was stuck at ConnectActorRunnable which seems unexpected according to the comment:
http://searchfox.org/mozilla-central/rev/cac6cb6a10afb8ebb2ecfbeeedaff7c66f57dd75/ipc/glue/MessageLink.cpp#125

Need to figure why this only happened in this test case.

#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000000bc84d7 in PR_WaitCondVar (cvar=0x1f642c0, timeout=4294967295)
    at /home/bevis/Projects/Build/gecko/src/nsprpub/pr/src/pthreads/ptsynch.c:396
#2  0x00003ec54b346d8c in mozilla::CondVar::Wait (this=0x44ee40e75ee0, aInterval=4294967295)
    at /home/bevis/Projects/Build/gecko/src/xpcom/threads/BlockingResourceBase.cpp:499
#3  0x00003ec54b312bb0 in mozilla::Monitor::Wait (this=0x44ee40e75ec0, aInterval=4294967295)
    at /home/bevis/Projects/Build/gecko/src/obj-firefox-central/dist/include/mozilla/Monitor.h:40
#4  0x00003ec54bb690ed in mozilla::ipc::ProcessLink::Open (this=0x27d2689f6e50, 
    aTransport=0x3534a51b700, aIOLoop=0x79bc5964bd20, aSide=mozilla::ipc::ParentSide)
    at /home/bevis/Projects/Build/gecko/src/ipc/glue/MessageLink.cpp:125
#5  0x00003ec54bb61a75 in mozilla::ipc::MessageChannel::Open (this=0x27d268a9a0f8, 
    aTransport=0x3534a51b700, aIOLoop=0x79bc5964bd20, aSide=mozilla::ipc::ParentSide)
    at /home/bevis/Projects/Build/gecko/src/ipc/glue/MessageChannel.cpp:675
#6  0x00003ec54bb86229 in mozilla::ipc::IToplevelProtocol::Open (this=0x27d268a9a000, 
    aTransport=0x3534a51b700, aOtherPid=19663, aThread=0x79bc5964bd20, aSide=mozilla::ipc::ParentSide)
    at /home/bevis/Projects/Build/gecko/src/ipc/glue/ProtocolUtils.cpp:594
#7  0x00003ec54bb751dd in mozilla::ipc::Endpoint<mozilla::ipc::PBackgroundParent>::Bind (
    this=0x70891fa4ebd0, aActor=0x27d268a9a000)
    at /home/bevis/Projects/Build/gecko/src/obj-firefox-central/dist/include/mozilla/ipc/ProtocolUtils.h:612
#8  0x00003ec54bb575fc in (anonymous namespace)::ParentImpl::ConnectActorRunnable::Run (
    this=0x236e66b7c440) at /home/bevis/Projects/Build/gecko/src/ipc/glue/BackgroundImpl.cpp:1459
#9  0x00003ec54b374891 in nsThread::ProcessNextEvent (this=0x7f965ac35cc0, aMayWait=true, 
    aResult=0x70891fa4ed07) at /home/bevis/Projects/Build/gecko/src/xpcom/threads/nsThread.cpp:1261
#10 0x00003ec54b379d83 in NS_ProcessNextEvent (aThread=0x7f965ac35cc0, aMayWait=true)

Bevis Tseng[:bevis][:btseng](Exited)

Assignee

Comment 12

•

7 years ago

treeherder shows the same result as comment 11 in raw logs:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=74be24b91d395e2a3e9a5a94aeac0e16ed335736&selectedJob=77904099

The last runnable in Background thread is printed before Run() but was never returned from Run():
> BVS, Background Runnable: ParentImpl::ConnectActorRunnable
Otherwise, will see something like this:
> BVS, Background Runnable: ParentImpl::ConnectActorRunnable is finished.

Bevis Tseng[:bevis][:btseng](Exited)

Assignee

Comment 13

•

7 years ago

The ConnectActorRunnable is related to the newly initiated BackgroundActor in worker thread from child that is going to be tested in next step. There is no reason why this runnable keep waiting the change from IO loop.

Since there is not much clue that shows anything wrong in IDB and this symptom can be reproduced much easier by rr chaos mode, I'll bisect to figure out which change set is more related to this problem.

Andrew Sutherland [:asuth] (he/him)

Comment 14

•

7 years ago

Bug 1329797 landed on January 28th and involved PBackground, that might be a good starting point to check out.

Bevis Tseng[:bevis][:btseng](Exited)

Assignee

Comment 15

•

7 years ago

Thanks! I'll try it! :)

Bevis Tseng[:bevis][:btseng](Exited)

Assignee

Comment 16

•

7 years ago

Unfortunately, this symptom seems to be existed for a long time but becomes more frequent in these 2 weeks. Even I bisect from the push log on 1-1-2017, it's still reproducible. :(

Hence, I looked into why the IPDL Background Thread was stuck in comment 11, and I found something weird.
In parent process, the reason why the newly created actor was block at [1] is that even though ProcessLink::OnChannelOpened was invoked in IOLoop to set the mChannelState to *ChannelOpening* and then notify the monitor, the MessageChannel::MaybeInterceptSpecialIOMessage() is called unexpectedly (that means shutdown is ongoing) and change the mChannelState to *ChannelClosing* before while (!mChan->Connected() && ...) is check!

So the next step is to figure out why *shutdown* was notified during testing.

[1] http://searchfox.org/mozilla-central/rev/cac6cb6a10afb8ebb2ecfbeeedaff7c66f57dd75/ipc/glue/MessageLink.cpp#125

Bevis Tseng[:bevis][:btseng](Exited)

Assignee

Comment 17

•

7 years ago

I think the possible reason we hit this intermittent failure is that, in test_filehandle_workers.html, we new a worker and terminated it immediately. This could be a race to close the channel before the parent side acknowledge the *ChannelOpening* state change. :(

Bevis Tseng[:bevis][:btseng](Exited)

Assignee

Comment 18

•

7 years ago

(In reply to Bevis Tseng[:bevistseng][:btseng] from comment #17)
> I think the possible reason we hit this intermittent failure is that, in
> test_filehandle_workers.html, we new a worker and terminated it immediately.
> This could be a race to close the channel before the parent side acknowledge
> the *ChannelOpening* state change. :(

http://searchfox.org/mozilla-central/rev/ea31c4b64f34a29415a69fb768f8051495547315/dom/indexedDB/test/test_filehandle_workers.html#80-91

Bevis Tseng[:bevis][:btseng](Exited)

Assignee

Comment 19

•

7 years ago

Attached patch Test: Terminate all workers at once. (obsolete) — Details — Splinter Review

This modification to the original test case proves what I have investigated from comment 11 to comment 18:
If terminating the worker right after its instantiation, there will be a race in parent side to close the ipc channel before acknowledging *ChannelOpening* to the newly created parent actor.

Hi Ben,

Do you have any suggestion for how to fix this in worker in a short time?
Otherwise, I might have to work around it with this modification to prevent this test case being disabled.

Flags: needinfo?(bkelly)

Ben Kelly [:bkelly, not reviewing]

Comment 20

•

7 years ago

Bevis, can you point me at the code that is starting the operation in the Worker thread?  Its unclear from the comments if you are referring to an IDB protocol actor or the PBackground actor creation itself.

The short answer, though, is that we should be using a WorkerHolder to prevent the worker thread from being destroyed during this operation.

Flags: needinfo?(bkelly) → needinfo?(btseng)

Bevis Tseng[:bevis][:btseng](Exited)

Assignee

Comment 21

•

7 years ago

(In reply to Ben Kelly [not reviewing due to deadline][:bkelly] from comment #20)
> Bevis, can you point me at the code that is starting the operation in the
> Worker thread?  Its unclear from the comments if you are referring to an IDB
> protocol actor or the PBackground actor creation itself.
The problematic one is this:
http://searchfox.org/mozilla-central/rev/ea31c4b64f34a29415a69fb768f8051495547315/dom/indexedDB/test/test_filehandle_workers.html#80-91
which tried to post a FileHandle to worker thread but shall be returned immediately as expected with exception of DATA_CLONE_ERR. Hence, there should be no operation in worker thread.

Besides, while tracing, the problematic PBackgroundChild are both created and closed at WorkerThreadPrimaryRunnable::Run():
http://searchfox.org/mozilla-central/source/dom/workers/RuntimeService.cpp#2810,2885
So, I am pretty sure that it's not the one for IDB's 1st open request on current thread but the one for newly created worker thread.

> The short answer, though, is that we should be using a WorkerHolder to
> prevent the worker thread from being destroyed during this operation.

Thanks! I'll try WorkerHolder to see if this can be improved.

However, does this means we still have this start-and-immediately-stop problem in worker?

Flags: needinfo?(btseng)

Comment hidden (Intermittent Failures Robot)

70 failures in 833 pushes (0.084 failures/push) were associated with this bug in the last 7 days. 

This is the #30 most frequent failure this week. 

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. **

Repository breakdown:
* autoland: 25
* mozilla-inbound: 24
* graphics: 10
* mozilla-central: 6
* try: 5

Platform breakdown:
* linux64: 49
* linux32: 20
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1300927&startday=2017-02-13&endday=2017-02-19&tree=all

Bevis Tseng[:bevis][:btseng](Exited)

Assignee

Comment 23

•

7 years ago

(In reply to Bevis Tseng[:bevistseng][:btseng] from comment #21)
> Thanks! I'll try WorkerHolder to see if this can be improved.
It seems that the WorkerHolder is something in implementation level and 
> However, does this means we still have this start-and-immediately-stop
> problem in worker?

It might take me too much time to fix this in worker.
I'd like to safe this test case first from being disabled. (The test criteria is more related to DATA_CLONE_ERR test of the FileHandle.)

Bug 1340941 was filed to address this problem instead.

Bevis Tseng[:bevis][:btseng](Exited)

Assignee

Comment 24

•

7 years ago

(In reply to Bevis Tseng[:bevistseng][:btseng] from comment #23)
> It might take me too much time to fix this in worker.
After thinking twice, it shall be more about an ipc implementation issue according to comment 16.
> Bug 1340941 was filed to address this problem instead.
The component category of Bug 1340941 has been updated to *IPC* instead.

Bevis Tseng[:bevis][:btseng](Exited)

Assignee

Comment 25

•

7 years ago

Attached patch Patch: Terminate all workers at the end of the test to work around bug 1340941. — Details — Splinter Review

According to the analysis in comment 16 to comment 18, bug 1340941 is filed to address the potential race in ipc protocol.

Since the timing of worker.terminate() is the test criteria in test_filehandle_worker.html, I'd like to workaround bug 1340941 first to prevent being disabled from auto-test.

Hi Jan,

May I have your review for this work around?

Thanks!

Attachment #8838636 - Attachment is obsolete: true

Attachment #8839046 - Flags: review?(jvarga)

Bevis Tseng[:bevis][:btseng](Exited)

Assignee

Comment 26

•

7 years ago

(In reply to Bevis Tseng[:bevistseng][:btseng] from comment #25)
> Since the timing of worker.terminate() is the test criteria in
                                           ^ not

Bevis Tseng[:bevis][:btseng](Exited)

Assignee

Comment 27

•

7 years ago

BTW, it's not reproducible anymore with the workaround in 100+ tests in treeherder on linux{32,x64} debug build:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8d9a4023ef1b57ff82336fe265fb0381023d698b&group_state=expanded

Jan Varga [:janv]

Comment 31

•

7 years ago

Comment on attachment 8839046 [details] [diff] [review]
Patch: Terminate all workers at the end of the test to work around bug 1340941.

Review of attachment 8839046 [details] [diff] [review]:
-----------------------------------------------------------------

Ok

Attachment #8839046 - Flags: review?(jvarga) → review+

Pulsebot

Comment 32

•

7 years ago

Pushed by btseng@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3f4226570a86
Terminate all workers at the end of the test to work around bug 1340941. r=janv

Carsten Book [:Tomcat]

Comment 33

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/3f4226570a86

Status: NEW → RESOLVED

Closed: 7 years ago

status-firefox54: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla54

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

status-firefox52: --- → affected

status-firefox53: --- → affected

Ryan VanderMeulen [:RyanVM]

Comment 34

•

7 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-aurora/rev/13e44cc8e7c0

status-firefox53: affected → fixed

Flags: in-testsuite+

Ryan VanderMeulen [:RyanVM]

Comment 35

•

7 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/6181fc70ffc8

status-firefox52: affected → fixed

Ryan VanderMeulen [:RyanVM]

Comment 36

•

7 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-esr52/rev/6181fc70ffc8

status-firefox-esr52: --- → fixed

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

7 years ago

Whiteboard: [stockwell fixed]

Comment hidden (Intermittent Failures Robot)

7 failures in 812 pushes (0.009 failures/push) were associated with this bug in the last 7 days.  
Repository breakdown:
* mozilla-inbound: 3
* autoland: 2
* try: 1
* mozilla-central: 1

Platform breakdown:
* linux32: 5
* linux64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1300927&startday=2017-02-20&endday=2017-02-26&tree=all

Test: Terminate all workers at once. 7 years ago Bevis Tseng[:bevis][:btseng](Exited) 2.15 KB, patch		Details \| Diff \| Splinter Review
Patch: Terminate all workers at the end of the test to work around bug 1340941. 7 years ago Bevis Tseng[:bevis][:btseng](Exited) 2.28 KB, patch	janv : review+	Details \| Diff \| Splinter Review