Closed Bug 1151666 Opened 9 years ago Closed 9 years ago

Intermittent services/fxaccounts/tests/xpcshell/test_accounts.js | Test timed out

Categories

(Firefox :: Firefox Accounts, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Firefox 40
Tracking Status
firefox38 --- fixed
firefox39 --- fixed
firefox40 --- fixed
firefox-esr31 --- unaffected

People

(Reporter: RyanVM, Assigned: markh)

References

Details

(Keywords: intermittent-failure)

Attachments

(2 files, 1 obsolete file)

15:55:08 INFO - TEST-START | services/fxaccounts/tests/xpcshell/test_accounts.js
16:00:08 WARNING - TEST-UNEXPECTED-TIMEOUT | services/fxaccounts/tests/xpcshell/test_accounts.js | Test timed out
16:00:08 INFO - TEST-INFO took 300034ms
16:00:09 INFO - xpcshell return code: -1
16:00:18 INFO - Exception in thread ProcessReader:
16:00:18 INFO - Traceback (most recent call last):
16:00:18 INFO - File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
16:00:18 INFO - self.run()
16:00:18 INFO - File "/usr/lib/python2.7/threading.py", line 504, in run
16:00:18 INFO - self.__target(*self.__args, **self.__kwargs)
16:00:18 INFO - File "/builds/slave/test/build/venv/local/lib/python2.7/site-packages/mozprocess/processhandler.py", line 902, in _read
16:00:18 INFO - self.timeout_callback()
16:00:18 INFO - File "/builds/slave/test/build/venv/local/lib/python2.7/site-packages/mozprocess/processhandler.py", line 813, in __call__
16:00:18 INFO - e(*args, **kwargs)
16:00:18 INFO - File "/builds/slave/test/build/venv/local/lib/python2.7/site-packages/mozdevice/devicemanagerADB.py", line 138, in _raise
16:00:18 INFO - raise DMError("Timeout exceeded for shell call")
16:00:18 INFO - DMError: Timeout exceeded for shell call
Flags: needinfo?(mhammond)
I'm looking at the log from https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=8640424 - http://ftp.mozilla.org/pub/mozilla.org/b2g/try-builds/rstrong@mozilla.com-bf263a12b3b4/try-emulator-debug/try_ubuntu64_vm-b2g-emulator-debug_test-xpcshell-2-bm116-tests1-linux64-build167.txt.gz

> 20:46:02     INFO -  TEST-START | services/fxaccounts/tests/xpcshell/test_accounts.js
> 20:51:02  WARNING -  TEST-UNEXPECTED-TIMEOUT | services/fxaccounts/tests/xpcshell/test_accounts.js | Test timed out
> 20:51:02     INFO -  TEST-INFO took 300030ms

So it was killed after 300 seconds.  Then:

> 20:51:18     INFO - Copy/paste: cat /builds/slave/test/build/emulator-5554.log

Looks like it is getting the log from the remote. The very next line:

> 20:51:18     INFO -  04-09 03:10:35.698 I/qemu-props(   35): connected to 'boot-properties' qemud service.

has a timestamp of 3 mins 10 seconds, and the log implies the emulator is still booting. The first line related to the test is at:

> 20:51:24     INFO -  04-09 03:46:57.378 I/Gecko   ( 1853): {"action":"log","time":1428551217348,"thread":null,"pid":null,"source":"xpcshell/head.js","level":"INFO","message":"(xpcshell/head.js) | test MAIN run_test pending (1)"}

So the test has only just started, but the timestamp is already at 3:46.  The *last* line I see related to the test is:

> 20:51:25     INFO -  04-09 03:51:12.048 I/Gecko   ( 1853): Services.FxAccounts.test	DEBUG	Bob verifying his email ...

Then entries like this start appearing:

> 20:51:25     INFO -  04-09 03:51:12.979 I/ServiceManager(   33): service 'media.resource_manager' died
> 20:51:25     INFO -  04-09 03:51:12.989 I/ServiceManager(   33): service 'permission' died
...
> 20:51:25     INFO -  04-09 03:51:17.458 I/AudioPolicyService( 1931): Loaded audio policy from LEGACY Audio Policy HAL (audio_policy)
> 20:51:25     INFO - Return code: 0

Which I believe is the process being killed due to the timeout.  So unless I'm missing something:

* Of the 300 seconds timeout, only 3:51 (231 seconds) is accounted for.
* The test was only actually running for 5 seconds (3:46 -> 3:51) before being killed.

Ryan, any insights here?
Flags: needinfo?(mhammond) → needinfo?(ryanvm)
(In reply to Mark Hammond [:markh] from comment #15)
> > 20:51:18     INFO -  04-09 03:10:35.698 I/qemu-props(   35): connected to 'boot-properties' qemud service.

hmm - on second thought, "03:10:35.698" looks like wall time or the clock - 3 hours, 10 minutes, 35 seconds - but:

> > 20:51:24     INFO -  04-09 03:46:57.378 I/Gecko   ( 1853): {"action":"log","time":1428551217348,"thread":null,"pid":null,"source":"xpcshell/head.js","level":"INFO","message":"(xpcshell/head.js) | test MAIN run_test pending (1)"}

Being 36 minutes later makes even less sense.  So I'm confused.
Those timestamps are clearly hh:mm:ss, so excuse my brain-fart.
Flags: needinfo?(ryanvm)
I think this patch will fix the problem. It also increases the verbosity of logging for this test which will help in the future if it turns out to not be a complete fix.

* The 5/15 second timeouts for verification are now stored as constants on the FxAInternal object, and the test changes the initial poll from 15 seconds to 100ms. This saves ~20 seconds on the test on b2g and 10 seconds on desktop (I'm not sure exactly why this difference, but I guess that FxOS is slower so we start the poll in cases where desktop doesn't.

* Bug 1139743 introduced a "test regression" that means the MockStorage object used by the test wasn't actually used in some cases. This means we often spent ~5 seconds doing file IO for the storage where we weren't before - see also bug 1153682.

* The log is now set to "Trace" level, this should be fine as the output is only saves when the test fails.  This, combined with bug 1153691, means we can get meaningful timestamps from the logs on FxOS, which means we will be in far better shape to diagnose future issues.

* 2 additional changes that don't seem to be related to the failure:
** A long-winded 'deferred' promise is changed to a single line that does exactly the same thing.
** A missing "return" of a promise in a task means that the one test will be considered finished before it actually is. This might cause a failure to be attributed to a different test - as mentioned, I didn't actually see this happen but the change makes it correct.

A try is at https://treeherder.mozilla.org/#/jobs?repo=try&revision=01bacb1c06ad - note it's a dit different as I've chopped and changed some of the revisions, but the changes are the same in aggregate.
Assignee: nobody → mhammond
Status: NEW → ASSIGNED
Attachment #8591437 - Flags: review?(zack.carter)
Comment on attachment 8591437 [details] [diff] [review]
0008-Bug-1151666-fix-intermittent-orange-by-reducing-veri.patch

Review of attachment 8591437 [details] [diff] [review]:
-----------------------------------------------------------------

Nice cleanup overall.
Attachment #8591437 - Flags: review?(zack.carter) → review+
Comment on attachment 8591437 [details] [diff] [review]
0008-Bug-1151666-fix-intermittent-orange-by-reducing-veri.patch

Review of attachment 8591437 [details] [diff] [review]:
-----------------------------------------------------------------

Hmm, the timeout issue persists: https://pastebin.mozilla.org/8829915
Attachment #8591437 - Flags: review+
Attachment #8591437 - Flags: review+
(In reply to Mark Hammond [:markh] from comment #28)
> A try is at
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=01bacb1c06ad

FTR, I screwed that try push up - the error in that patch has been fixed in what I pushed to fx-team
https://hg.mozilla.org/mozilla-central/rev/347f8029ab4d
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
Yikes, something made this angry recently.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla40 → ---
Looking into the new failures, I also noticed that some timers seemed to be running forever. Most of these are test-only issues (we create a new fxa instance for each test, so we need to ensure the timer for that instance is stopped by calling .signOut().) 

However, there does seem to be one non-test issue - resendVerificationMail() unconditionally starts a new timer. I fixed this by having pollEmailStatus("start", ...) kill any existing timers.

I'm not sure this will fix the timeout, but seems like a reasonable attempt :)
Attachment #8597856 - Flags: review?(zack.carter)
Comment on attachment 8597856 [details] [diff] [review]
0005-Bug-1151666-avoid-multiple-verified-email-timers-fro.patch

Actually, I don't think the failures are caused by this - the test seem to complete but the harness thinks it hasn't (so for a long timer we just see the "polling" messages)

So just f? now - I'm pretty sure we want this *and* something else ;)
Attachment #8597856 - Flags: review?(zack.carter) → feedback?(zack.carter)
This patch is the same as the previous one, but also removes all "pending" calls from the test - these aren't necessary when add_test/add_task is used, and I suspect something is going wrong there (as I mentioned, the logs show the final test completing successfully but the harness doesn't seem to think everything is done)

This is almost perma-orange but the try at https://treeherder.mozilla.org/#/jobs?repo=try&revision=e61d83f4a3eb with this patch (and many re-triggers) looks good.
Attachment #8597856 - Attachment is obsolete: true
Attachment #8597856 - Flags: feedback?(zack.carter)
Attachment #8598473 - Flags: review?(zack.carter)
Attachment #8598473 - Flags: review?(zack.carter) → review+
https://hg.mozilla.org/mozilla-central/rev/461b2cd49f3c
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
Product: Core → Firefox
Target Milestone: mozilla40 → Firefox 40
You need to log in before you can comment on or make changes to this bug.