Intermittent "Cleanup Device failed" with nothing but a sudden "program finished with exit code 1"

RESOLVED WORKSFORME

Status

Release Engineering
General
RESOLVED WORKSFORME
6 years ago
5 years ago

People

(Reporter: philor, Unassigned)

Tracking

({intermittent-failure})

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

6 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=11365038&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=11373963&tree=Fx-Team

recv'ing...
response: /data/data/com.mozilla.SUTAgentAndroid/files/tests
$>
program finished with exit code 1
elapsedTime=91.100019
I have the theory that we have to add stdout.flush() in certain spots before a point where we can fail.
bear mentioned something about that it was some sort of interaction between python and buildbot.
Callek, can you please look into making this failure mode a little more informative? I almost feel that the right approach is to add a print statement before any setflag() which we set before returning a code and exiting.

I was also hoping to turn this "red" (RETCODE_ERROR==1) job into "purple" (4) or "blue" (5) (whatever it actually re-triggers a job).

https://tbpl.mozilla.org/?rev=83ff77ce8d6c&tree=Mozilla-Inbound&jobname=Android%20XUL%20Tegra%20250%20mozilla-inbound%20opt%20test%20mochitest-7


(In reply to Phil Ringnalda (:philor) from comment #0)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=11365038&tree=Mozilla-
> Inbound
> https://tbpl.mozilla.org/php/getParsedLog.php?id=11373963&tree=Fx-Team
> 
> recv'ing...
> response: /data/data/com.mozilla.SUTAgentAndroid/files/tests
> $>
> program finished with exit code 1
> elapsedTime=91.100019

These first two finish exactly the same way.
After checking that the processes have uninstalled it goes into trying to delete the devRoot but it fails silently around there.

Perhaps all we need is to print what flags we are going to set before returning any codes.

http://hg.mozilla.org/build/tools/file/default/sut_tools/cleanup.py#l54


(In reply to Phil Ringnalda (:philor) from comment #1)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=11342315&tree=Mozilla-Aurora

On this one,
we start deleting devRoot but suddenly we can't delete anymore and fail
Deleting file(s) from /mnt/sdcard/tests/profile/extensions/staged/workerbootstrap-test@mozilla.org
	Deleted worker.js
	Deleted install.rdf
	Deleted bootstrap.js
Unable to delete directory /mnt/sdcard/tests/profile/extensions/staged/workerbootstrap-test@mozilla.org
Unable 
recv'ing...
response: to delete directory /mnt/sdcard/tests/profile/extensions/staged
Unable to delete directory /mnt/sdcard/tests/profile/extensions
	Unable to delete userChrome.css
	Unable to delete user.js
	Unable to delete tests.manifest
	Unable to delete tests.jar
	Unable to delete permissions.sqlite
Unable to delete directory /mnt/sdcard/tests/profile
Deleting file(s) from /mnt/sdcard/tests/logs
	Unable to delete mochitest.log
Unable to delete directory /mnt/sdcard/tests/logs
	Unable to delete robocop.apk
	Unable to delete fennec-15.0a1.en-US.android-arm.apk
Unable to delete directory /mnt/sdcard/tests
$>
removeDir() returned

and fail because we check to see if it is gone:
http://hg.mozilla.org/build/tools/file/default/sut_tools/cleanup.py#l66

I assume that you pointed out this log because it ends dramatically with signal 15.
05-01 20:22:32.829 W/Zygote  (  938): Preloaded drawable resource #0x1080process killed by signal 15
program finished with exit code -1
elapsedTime=192.726851

(In reply to Phil Ringnalda (:philor) from comment #2)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=11377001&tree=Firefox

This last log is the same thing as the previous one.
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #4)
> Callek, can you please look into making this failure mode a little more
> informative? I almost feel that the right approach is to add a print
> statement before any setflag() which we set before returning a code and
> exiting.

Yea that is the overall right approach -- the problem actually stems from another issue though.

setFlag should be outputting for us more details here:
http://hg.mozilla.org/build/tools/file/fd310f3edd12/sut_tools/sut_lib.py#l376

log.info unfortunately is not actually doing output.

I have bug 749863 (awaiting review) that should fix this, by making the log.info (and log.debug) calls actually output when called. Marking dep so that if for some reason this doesn't get fixed by that we can notice!

> I was also hoping to turn this "red" (RETCODE_ERROR==1) job into "purple"
> (4) or "blue" (5) (whatever it actually re-triggers a job).

In the cases where cleanup happens AFTER the regular job portions this would be a bad idea, as it would re-run the job for failing to do post-cleanup. If it retries on early-cleanup failures then it is a good thing.

Also part of the issue I forsee though is that clientproxy is racy with error.flg and will kill buildbot once it notices it, which might be before we get any indication to actually RETRY, this is followup fodder though, once we see how much of a problem it is when I get my current patch-queue deployed.
Depends on: 749863

Updated

6 years ago
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → WORKSFORME
(Assignee)

Updated

6 years ago
Keywords: intermittent-failure
(Assignee)

Updated

6 years ago
Whiteboard: [orange]
(Assignee)

Updated

5 years ago
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.