Closed Bug 809437 Opened 7 years ago Closed 7 years ago

Frequent b2g emulator failures with "ERROR - Return code: 1" after ": Timeout waiting for marionette on port 'nnnnn'" or "Timeout waiting for the b2g process to start" or "TypeError: %d format: a number is required, not NoneType"

Categories

(Testing :: Marionette, defect)

defect
Not set

Tracking

(firefox17 unaffected, firefox18 affected, firefox19 affected)

RESOLVED FIXED
Tracking Status
firefox17 --- unaffected
firefox18 --- affected
firefox19 --- affected

People

(Reporter: ahal, Assigned: ahal)

References

Details

(Keywords: intermittent-failure)

Attachments

(2 files, 6 obsolete files)

Another one: https://tbpl.mozilla.org/php/getParsedLog.php?id=16818281&tree=Mozilla-Inbound&full=1

In both cases there was a retry when pushing one of the files (see bug 800102 (sigh))
Nice, a third one that doesn't have a retried push: https://tbpl.mozilla.org/php/getParsedLog.php?id=16823573&tree=Mozilla-Inbound&full=1

I'd bet a sleep would fix the issue, but I'll try and figure out what event we need to be waiting for before we're supposed to restart the emulator.
Blocks: 438871
Summary: Possible race condition when restarting b2g in install_gecko → Intermitent B2G emulator "Could not successfully complete transport of message to Gecko, socket closed?", due to possible race condition when restarting b2g in install_gecko
Whiteboard: [orange]
Depends on: 809447
Assignee: nobody → ahalberstadt
Status: NEW → ASSIGNED
Ed, this shouldn't cause any problems with the tbpl error regexes should it?
Attachment #679232 - Flags: review?(jgriffin)
Attachment #679232 - Flags: feedback?(bmo)
Comment on attachment 679232 [details] [diff] [review]
Patch 1.0 - ensure b2g process has stopped

Review of attachment 679232 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. I think we should add these two new strings to the error list of the mozharness scripts, so that they'll turn the jobs red:

http://hg.mozilla.org/build/mozharness/file/739cfaf3d165/scripts/b2g_emulator_unittest.py#l80
http://hg.mozilla.org/build/mozharness/file/739cfaf3d165/scripts/marionette.py#l80
Attachment #679232 - Flags: review?(jgriffin) → review+
Comment on attachment 679232 [details] [diff] [review]
Patch 1.0 - ensure b2g process has stopped

These are fine from a TBPL POV (whilst TBPL won't recognise them as is; with the comment 5 change plus my patch in bug 809436, it will).

Thank you for checking though :-)
Attachment #679232 - Flags: feedback?(bmo) → feedback+
Depends on: 809529
Whiteboard: [orange] → [orange][automation-needed-in-aurora]
I saw this yet again on inbound :(

https://tbpl.mozilla.org/php/getParsedLog.php?id=16837096&tree=Mozilla-Inbound&full=1
Whiteboard: [orange][automation-needed-in-aurora] → [orange][automation-needed-in-aurora][leave-open]
Attached patch Patch 1.1 - Possible fix (obsolete) — Splinter Review
This has been a pretty annoying problem as I can't reproduce it locally at all. All I know is that we never had these timeouts before when we stopped b2g before copying the files. Perhaps this will finally be the fix.
Attachment #679426 - Flags: review?(jgriffin)
Comment on attachment 679426 [details] [diff] [review]
Patch 1.1 - Possible fix

Review of attachment 679426 [details] [diff] [review]:
-----------------------------------------------------------------

Worth a shot!
Attachment #679426 - Flags: review?(jgriffin) → review+
Nope. If anything that last one seems to have made it worse. Next idea is to go back to pushing the files to /data/local first and then copying to /system/b2g like we had it before. Maybe that combined with the retrying of the push files will work.
(In reply to Andrew Halberstadt [:ahal] from comment #12)
> Nope. If anything that last one seems to have made it worse.

Feel free to backout said broken patches next time. This was causing a lot of orange/red on b2g Arm opt builds.

Backed out:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1ab2043a4669
Sorry about that, I was the only one around when I noticed and I don't have level 3 :)
Attachment #679775 - Flags: review?(jgriffin) → review+
(In reply to Andrew Halberstadt [:ahal] from comment #15)
> Sorry about that, I was the only one around when I noticed and I don't have
> level 3 :)

You can always ping a sheriff on IRC.

(In reply to Jonathan Griffin (:jgriffin) from comment #18)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/53d5080a6749

Backed out for b2g mochitest orange.
https://hg.mozilla.org/integration/mozilla-inbound/rev/960037a36758

13:40:45 WARNING - TEST-UNEXPECTED-FAIL | automation.py | application timed out after 330 seconds with no output
13:43:54 ERROR - Return code: 1
The previous patch timed out in automation.py (not waiting for marionette like what this bug is about). There was an error loading libxpcom.so in both the reftests and mochitests (though apparently it didn't matter for reftest).

I don't know why this is, libxpcom.so was getting copied to /system/b2g like it was supposed to. Maybe pushing to /data/local vs /system/b2g results in different file permissions?
Duplicate of this bug: 810371
Updating title to reflect the updated error message from attachment 679232 [details] [diff] [review].
Summary: Intermitent B2G emulator "Could not successfully complete transport of message to Gecko, socket closed?", due to possible race condition when restarting b2g in install_gecko → Frequent b2g emulator failures with "ERROR - Return code: 1" after ": Timeout waiting for marionette on port 'nnnnn'"
https://tbpl.mozilla.org/php/getParsedLog.php?id=16935211&tree=Mozilla-Inbound

(and a dozen or so starred by people who think it's a good idea not to paste logs)
Somehow this patch: https://hg.mozilla.org/try/rev/b5744527e483
Resulted in this orange: https://tbpl.mozilla.org/php/getParsedLog.php?id=16960991&tree=Try

I'm starting to think the easiest way to solve this is to get releng to provide full stack emulator builds.
Maybe we should just put a time.sleep(10) between 'stop b2g' and 'start b2g'.  Yucky, I know.
So I think this patch fixes the issue. I re-triggered it a whole bunch of times on try and haven't seen any timeouts: https://tbpl.mozilla.org/?tree=Try&rev=80318b816089

I also don't think that is enough evidence to call it fixed, but I think we should land this and keep an eye out in the future.
Attachment #680832 - Flags: review?(jgriffin)
Attachment #680832 - Flags: review?(jgriffin) → review+
Whiteboard: [orange][automation-needed-in-aurora][leave-open] → [orange][automation-needed-in-aurora]
Attached patch Patch 3.1 - fix typo (obsolete) — Splinter Review
Evidently my 40+ try re-triggers didn't actually manage to test the failing codepath, otherwise this typo would have been caught.

https://tbpl.mozilla.org/php/getParsedLog.php?id=16991177&tree=Mozilla-Inbound
Attachment #681033 - Flags: review?(mdas)
Attachment #681033 - Flags: review?(mdas) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/c3b7e68ef6a0

So seeing as all those try re-triggers never hit the failing codepath, I have no idea if this is actually fixed or not. I'll keep an eye out.
When we re-tried the restart this happened: https://tbpl.mozilla.org/php/getParsedLog.php?id=16994991&tree=Mozilla-Inbound

Possible solutions:
1) Restart whole emulator instead of b2g process
2) Get releng to provide full stack emulator builds
3) Figure out why this is happening and fix properly (would require dumping logcat to the logs)
Whiteboard: [orange][automation-needed-in-aurora] → [orange][automation-needed-in-aurora][leave-open]
If we go with option 1), there is some code in emulator.py to handle userdata.img files, but none to handle system.img files, so we'd have to add that before rebooting the emulator would be a viable option.

We should pursue 3) regardless, since if the problem is related to restarting B2G and has nothing to do with installing gecko, then 2) won't help us.  If we can verify that the problem only occurs when installing gecko, then we could ask for 2).
Summary: Frequent b2g emulator failures with "ERROR - Return code: 1" after ": Timeout waiting for marionette on port 'nnnnn'" → Frequent b2g emulator failures with "ERROR - Return code: 1" after ": Timeout waiting for marionette on port 'nnnnn'" or "Timeout waiting for the b2g process to start"
I tried making a patch that restarts the emulator and preserves system.img, but I run into bug 802877

I have another patch that just does a dirty long sleep in between the b2g start/stop here: https://tbpl.mozilla.org/?tree=Try&rev=b75a849e2840
I'll re-trigger a whole bunch more times before I consider landing it though.
https://tbpl.mozilla.org/php/getParsedLog.php?id=17055710&tree=Mozilla-Inbound

21:36:13     INFO -  restarting B2G (attempt 2 of 10)
21:36:13     INFO -  Traceback (most recent call last):
21:36:13     INFO -    File "runreftestb2g.py", line 563, in <module>
21:36:13     INFO -      sys.exit(main())
21:36:13     INFO -    File "runreftestb2g.py", line 483, in main
21:36:13     INFO -      marionette = Marionette(**kwargs)
21:36:13     INFO -    File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/marionette/marionette.py", line 148, in __init__
21:36:13     INFO -      self.emulator.install_gecko(self.gecko_path, self)
21:36:13     INFO -    File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/marionette/emulator.py", line 436, in install_gecko
21:36:13     INFO -      self._restart_b2g(marionette)
21:36:13     INFO -    File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/marionette/emulator.py", line 380, in _restart_b2g
21:36:13     INFO -      self.dm.shellCheckOutput(['stop', 'b2g'])
21:36:13     INFO -    File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/mozdevice/devicemanager.py", line 65, in shellCheckOutput
21:36:13     INFO -      raise DMError("Non-zero return code for command: %s (output: '%s', retval: '%i')" % (cmd, output, retval))
21:36:13     INFO -  TypeError: %d format: a number is required, not NoneType
21:36:13    ERROR - Return code: 1
Summary: Frequent b2g emulator failures with "ERROR - Return code: 1" after ": Timeout waiting for marionette on port 'nnnnn'" or "Timeout waiting for the b2g process to start" → Frequent b2g emulator failures with "ERROR - Return code: 1" after ": Timeout waiting for marionette on port 'nnnnn'" or "Timeout waiting for the b2g process to start" or "TypeError: %d format: a number is required, not NoneType"
(In reply to Phil Ringnalda (:philor) from comment #56)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=17055710&tree=Mozilla-
> Inbound
> 
...
> 21:36:13     INFO -    File
> "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/
> mozdevice/devicemanager.py", line 65, in shellCheckOutput
> 21:36:13     INFO -      raise DMError("Non-zero return code for command: %s
> (output: '%s', retval: '%i')" % (cmd, output, retval))
> 21:36:13     INFO -  TypeError: %d format: a number is required, not NoneType
> 21:36:13    ERROR - Return code: 1

This is bug 811916.
Depends on: 811916
Blocks: 812149
This patch makes me sad. All I can say is that after 60 try retriggers the error hasn't been reproduced: https://tbpl.mozilla.org/?tree=Try&rev=b75a849e2840&noignore=1

I don't think that means much, but we need this bug fixed asap so I will try anything at this point.
Attachment #679232 - Attachment is obsolete: true
Attachment #679426 - Attachment is obsolete: true
Attachment #679775 - Attachment is obsolete: true
Attachment #680832 - Attachment is obsolete: true
Attachment #681033 - Attachment is obsolete: true
Attachment #681987 - Flags: review?(mdas)
Comment on attachment 681987 [details] [diff] [review]
Patch 4.0 - I will atone for my sins

Review of attachment 681987 [details] [diff] [review]:
-----------------------------------------------------------------

r+'d since we need this working ASAP, but with the caveat that we should file a follow up bug to investigate what we should be waiting for, instead of sleeping for an arbitrary amount of time.
Attachment #681987 - Flags: review?(mdas) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/473d3748d994

Also filed bug 812190 as a follow up to fix this issue properly.
Whiteboard: [orange][automation-needed-in-aurora][leave-open] → [orange][automation-needed-in-aurora]
Nope: https://tbpl.mozilla.org/php/getParsedLog.php?id=17071920&tree=Mozilla-Inbound
Whiteboard: [orange][automation-needed-in-aurora] → [orange][automation-needed-in-aurora][leave-open]
Pretty sure there is some difference on try that makes it less reproducible.
Attached patch WIP restart emulator patch (obsolete) — Splinter Review
:jgriffin, here's the wip emulator restart patch I had if you want to take a look
(In reply to Andrew Halberstadt [:ahal] from comment #67)
> Created attachment 682067 [details] [diff] [review]
> WIP restart emulator patch
> 
> :jgriffin, here's the wip emulator restart patch I had if you want to take a
> look

I got the emulator restart stuff working, and I'll attach it to the bug for posterity, BUT it doesn't help us.  Changes made to the system partition do persist between restarts.  You can verify this manually by launching the emulator manually (using ./run-emulator.sh) and copying some file to /system/b2g.  Then close the emulator, restart it, and the file will be missing. :(
This has some extra debugging info that would need to be taken out, but I don't intent to land it since it doesn't actually solve any problems.
Attachment #682067 - Attachment is obsolete: true
We're basically down to two options, AFAICT:

1 - Get someone on the B2G team to try and figure out why B2G doesn't restart correctly in this situation and fix it
2 - Change the mozharness script to detect this kind of failure and re-run the test from scratch, up to N times

1 will likely take weeks, so that leaves us with 2.  This is pretty brutal, but seems likely to work.
(In reply to Jonathan Griffin (:jgriffin) from comment #70)
> We're basically down to two options, AFAICT:
> 
> 1 - Get someone on the B2G team to try and figure out why B2G doesn't
> restart correctly in this situation and fix it
> 2 - Change the mozharness script to detect this kind of failure and re-run
> the test from scratch, up to N times
> 
> 1 will likely take weeks, so that leaves us with 2.  This is pretty brutal,
> but seems likely to work.

Oh, there is an option 3:  Have rel-eng produce full emulator builds, so we don't have to update gecko in the emulator.  This may be the best medium-term solution.  I think we should probably implement 2 in the interim, however.
According to the emulator docs (http://developer.android.com/tools/devices/emulator.html):

The emulator creates two writeable images at startup that it deletes at device power-off. The images are:

    A writable copy of the Android system image
    The /cache partition image

The emulator does not permit renaming the temporary system image or persisting it at device power-off.
https://tbpl.mozilla.org/php/getParsedLog.php?id=17093856&tree=Mozilla-Aurora

(At least it generously waited until the next push after your patch was pushed to aurora, instead of still failing directly on your push.)
(In reply to Jonathan Griffin (:jgriffin) from comment #71)
> medium-term solution.  I think we should probably implement 2 in the
> interim, however.

This is bug 812395
Depends on: 812395
Also:
https://tbpl.mozilla.org/php/getParsedLog.php?id=17138342&tree=Mozilla-Inbound
{
11:10:59    ERROR -  errors.MarionetteException: Could not successfully complete transport of message to Gecko, socket closed?
11:11:00    ERROR - Return code: 1
}
I just landed bug 812395 on inbound, which could possibly fix this problem.
I landed bug 812395 on aurora and beta as well. I haven't seen this issue anywhere since landing so I'm going to resolve fixed this bug.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Whiteboard: [orange][automation-needed-in-aurora][leave-open] → [orange]
Whiteboard: [orange]
You need to log in before you can comment on or make changes to this bug.