Closed Bug 800102 Opened 7 years ago Closed 7 years ago

Emulator sometimes stalls while Marionette is installing a new gecko in it resulting in a "Timeout exceeded for _checkCmd call"

Categories

(Testing :: Marionette, defect)

defect
Not set

Tracking

(firefox18 fixed, firefox19 fixed)

RESOLVED FIXED
mozilla19
Tracking Status
firefox18 --- fixed
firefox19 --- fixed

People

(Reporter: jgriffin, Assigned: ahal)

References

Details

(Keywords: intermittent-failure)

Attachments

(3 files, 3 obsolete files)

When passing --gecko-path to Marionette, the "adb push" call which copies a new gecko to the emulator sometimes stalls for an indefinite period.  It usually can be woken up again by doing a few simple adb calls in another terminal, e.g., "adb shell".

I have no idea why this happens, but I have a patch that fixes it on my machine.
I have no idea why this works, but I haven't seen this issue at all with this patch.I didn't do a more aggressive 'rm -rf *' because it deletes files that aren't part of the separate gecko builds...I'm guessing they're unpacked during the initial boot.
Attachment #670091 - Flags: review?(ahalberstadt)
Assignee: nobody → jgriffin
Comment on attachment 670091 [details] [diff] [review]
Delete *.so files before updating gecko in the emulator

Review of attachment 670091 [details] [diff] [review]:
-----------------------------------------------------------------

r+. I was hoping to test it out, but by murphy's law I stopped being able to reproduce the issue when I went to do so. But this change shouldn't hurt anything at the least so if it works for you then lgtm.
Attachment #670091 - Flags: review?(ahalberstadt) → review+
https://hg.mozilla.org/mozilla-central/rev/8e047a9698f0
Target Milestone: --- → mozilla18
Comment on attachment 670091 [details] [diff] [review]
Delete *.so files before updating gecko in the emulator

[Approval Request Comment]
Bug caused by (feature/regressing bug #): N/A
User impact if declined: We won't be able to run Marionette tests on TBPL on aurora without this patch.
Testing completed (on m-c, etc.): N/A
Risk to taking this patch (and alternatives if risky): None
String or UUID changes made by this patch: N/A
Attachment #670091 - Flags: approval-mozilla-aurora?
Comment on attachment 670091 [details] [diff] [review]
Delete *.so files before updating gecko in the emulator

Approving , as this is much needed for tests
Attachment #670091 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
https://hg.mozilla.org/releases/mozilla-aurora/rev/fd0dcda77b60
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Sadly, this appears not be fixed, as it's showing up on cedar.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch Attempted fix (obsolete) — Splinter Review
Will land this on cedar to see if it helps.
Pushed another attempted fix to cedar: https://hg.mozilla.org/projects/cedar/rev/0705e84d547d
Attachment #673030 - Attachment is obsolete: true
This is a modified version of a patch which has been running successfully on cedar for about 5 days, with no occurrences of this problem.

The 'fix' (really workaround) is to first copy the files to the /data partition, then copy them from there to /system/b2g using 'dd'.  Additionally, the initial copy must be performed *before* b2g is shut down.  The reason is that 'adb push' is much slower while b2g is running, and this seems to help prevent the deadlock.
Attachment #676710 - Flags: review?(ahalberstadt)
Attachment #674700 - Attachment is obsolete: true
Attachment #675663 - Attachment is obsolete: true
Blocks: 784856
Blocks: 807119
No longer blocks: 784856
Comment on attachment 676710 [details] [diff] [review]
Work around adb deadlock when installing gecko in emulator,

Review of attachment 676710 [details] [diff] [review]:
-----------------------------------------------------------------

r+, keeping in mind that these changes will need to be upstreamed to the mozdevice emulator.py

::: testing/marionette/client/marionette/emulator.py
@@ +387,5 @@
> +        # partition.
> +        for root, dirs, files in os.walk(gecko_path):
> +            for filename in files:
> +                data_local_file = os.path.join('/data/local', filename)
> +                print 'pushing', data_local_file

Were these print statements left on purpose? Won't this generate a lot of redundant logs?
Attachment #676710 - Flags: review?(ahalberstadt) → review+
I left the print statements in so if it hangs on a TPBL job, we can at least see where it hangs (e.g., which copying stage and which file).

I've filed bug 807119 to port this (and other changes) to mozdevice's emulator classes.
Since this touches setup.py, I'm taking the precaution of running it through try:  https://tbpl.mozilla.org/?tree=Try&rev=a6be101c6d42&noignore=1
try run 100% green
https://hg.mozilla.org/mozilla-central/rev/baa695a1f5ca
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Whiteboard: [automation-needed-in-aurora]
Target Milestone: mozilla18 → mozilla19
Duplicate of this bug: 808771
I saw this happen once again on Cedar with reftests:
https://tbpl.mozilla.org/php/getParsedLog.php?id=16759480&tree=Cedar
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
cc'ing sheriffs since it's possible this will occur in the wild before it's fixed.

The output in the log when this happens is something like:

11:53:55     INFO -  pushing /data/local/omni.ja
11:53:55     INFO -  Traceback (most recent call last):
11:53:55     INFO -    File "runreftestb2g.py", line 563, in <module>
11:53:55     INFO -      sys.exit(main())
11:53:55     INFO -    File "runreftestb2g.py", line 483, in main
11:53:55     INFO -      marionette = Marionette(**kwargs)
11:53:55     INFO -    File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/marionette/marionette.py", line 148, in __init__
11:53:55     INFO -      self.emulator.install_gecko(self.gecko_path, self)
11:53:55     INFO -    File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/marionette/emulator.py", line 392, in install_gecko
11:53:55     INFO -      self.dm.pushFile(os.path.join(root, filename), data_local_file)
11:53:55     INFO -    File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/mozdevice/devicemanagerADB.py", line 180, in pushFile
11:53:55     INFO -      self._checkCmd(["push", os.path.realpath(localname), destname])
11:53:55     INFO -    File "/home/cltbld/talos-slave/test/build/venv/lib/python2.6/site-packages/mozdevice/devicemanagerADB.py", line 769, in _checkCmd
11:53:55     INFO -      raise DMError("Timeout exceeded for _checkCmd call")
11:53:55     INFO -  mozdevice.devicemanager.DMError: Timeout exceeded for _checkCmd call
https://tbpl.mozilla.org/php/getParsedLog.php?id=16765363&tree=Mozilla-Inbound
Blocks: 438871
Whiteboard: [automation-needed-in-aurora] → [automation-needed-in-aurora][orange]
https://tbpl.mozilla.org/php/getParsedLog.php?id=16767581&tree=Mozilla-Inbound
Summary: Emulator sometimes stalls while Marionette is installing a new gecko in it → Emulator sometimes stalls while Marionette is installing a new gecko in it resulting in a "Timeout exceeded for _checkCmd call"
I'll take a look at this, but I'm having trouble reproducing it locally. I may just have to implement a system that will retry the install-gecko step if it fails.
Assignee: jgriffin → ahalberstadt
Status: REOPENED → ASSIGNED
Pushed a fix to cedar: https://tbpl.mozilla.org/?tree=Cedar&rev=1bc7d82bebf2

I'll re-trigger a bunch of times to try and reproduce. I'm pretty sure this will get around the problem (though I can't say I'm super happy about it).
Comment on attachment 678858 [details] [diff] [review]
Patch 1.0 - Retry adb push if it times out when installing gecko

Review of attachment 678858 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks.  Third time's the charm!
Attachment #678858 - Flags: review?(jgriffin) → review+
https://hg.mozilla.org/mozilla-central/rev/a09cb93f9d82
Status: ASSIGNED → RESOLVED
Closed: 7 years ago7 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
(In reply to Andrew Halberstadt [:ahal] from comment #36)
> Pushed a fix to cedar: https://tbpl.mozilla.org/?tree=Cedar&rev=1bc7d82bebf2
> 
> I'll re-trigger a bunch of times to try and reproduce. I'm pretty sure this
> will get around the problem (though I can't say I'm super happy about it).

So ahal, this on cedar somehow caused an additional "head" on "default" with hg.

Two options which I'd like to do one of for us:

(a) Merge this change to cedar's tip, along with a current m-c
(b) Backout this change, *then* merge the head down to tip [effectively doing a no-op -- I think, would verify]

So to summarize, "do we still want this changeset on cedar"?

This does block me asking IT to [make sure and] enable the single-head hook.
Flags: needinfo?(ahalberstadt)
Yes, we still want this on cedar, so I think 1) is the way to go.
(In reply to Justin Wood (:Callek) from comment #42)
> So to summarize, "do we still want this changeset on cedar"?

Yes, sorry about that
Flags: needinfo?(ahalberstadt)
Just to round back, we talked about this on IRC. the patch landed on cedar conflicts with what is currently on cedar's tip (which likely includes the patch that was pushed to m-c)

So we'll just merge this head away, and keep what is/was on m-c instead. I'll take care of that.
Flags: needinfo?(bugspam.Callek)
hg heads
hg up -r 112389
hg commit --close-branch -m "rogue head"
hg up -r default
hg ident
hg pull http://hg.mozilla.org/mozilla-central
hg merge
hg commit
hg push

Rogue head closed.
Flags: needinfo?(bugspam.Callek)
(In reply to Aki Sasaki [:aki] from comment #46)
> hg heads
> hg up -r 112389
> hg commit --close-branch -m "rogue head"
> hg up -r default
> hg ident
> hg pull http://hg.mozilla.org/mozilla-central
> hg merge
> hg commit
> hg push
> 
> Rogue head closed.

This extra head will need to be dummy merged rather than closed, if cedar ever needs to merge to mozilla-central (due to the single head hook). Or is cedar just a testing-ground and accepted changes going to be transplanted/cherry-picked to m-c?
For B2G we won't need to merge cedar to m-c; I can't say about other users.
I've only got a single changed file for mozharness talos; pretty easy to cherry pick that one when I'm ready to merge.
Whiteboard: [orange]
You need to log in before you can comment on or make changes to this bug.