Closed Bug 945610 Opened 11 years ago Closed 7 years ago

Try-server cleanup of b2g test zips is too aggressive

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: allstars.chh, Unassigned)

Details

In the try
https://tbpl.mozilla.org/?tree=Try&rev=3ff723c26233

The C2 job on the emulator is red, even after re-triggering the test several times.

I asked the sheriff on IRC (philor),
and his suggestion was to file a bug here.
Hmm, so the download-and-extract action is where this was failing. It uses the testbase.py action to download things like tests.zip, which goes through mozharness's download_file (a retrying command), so there should have been multiple attempts: http://mxr.mozilla.org/build/source/mozharness/scripts/b2g_emulator_unittest.py#174
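To illustrate why multiple attempts would not have helped here, below is a minimal sketch of a retrying download. It is not mozharness's actual download_file; the function name download_with_retry, its parameters, and the injected fetch callable are all hypothetical. The point is only that when the file is missing on the server, every attempt fails the same way.

```python
import time


def download_with_retry(fetch, url, attempts=3, sleep_seconds=0.0):
    """Call fetch(url) up to `attempts` times and return the first success.

    `fetch` is any callable that raises OSError on failure
    (urllib.error.URLError and HTTPError are OSError subclasses).
    This is an illustrative stand-in, not the mozharness implementation.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError as error:
            last_error = error
            # Only pause when another attempt is still coming.
            if attempt < attempts:
                time.sleep(sleep_seconds)
    # Every attempt failed, e.g. because the file was already pruned.
    raise last_error
```

A transient network error can be recovered by the second or third attempt, but a pruned tests.zip fails on all of them, which matches the persistent red C2.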

It was trying to grab the test url from: http://pvtbuilds.pvt.build.mozilla.org//pub/mozilla.org/b2g/try-builds/psiddh@gmail.com-3ff723c26233/try-generic/b2g-28.0a1.en-US.android-arm.tests.zip

I think it had access but the location did not exist. I can't verify whether it was ever there, because it seems http://pvtbuilds.pvt.build.mozilla.org//pub/mozilla.org/b2g/try-builds/ is regularly flushed (only revs from Dec 3rd and 4th exist in it now).

I believe this URL is set in the sendchange and then saved as a buildbot prop in ['sourcestamp']['changes']['name']

Looking at today and yesterday in the Try tree, I have seen green builds for C2, so I am not sure why this particular one failed. I'll try to figure out how this tests URL is constructed and whether it might not have existed at the time.
jlund, thanks for looking into this.

I can see that :yoshi has done more pushes and they have worked:
https://tbpl.mozilla.org/?tree=Try&rev=2ca8af7875c5&jobname=b2g_emulator_vm%20try%20opt%20test%20crashtest-2

The problem is that he retriggered jobs on builds that had been pruned by then.
If we look at the first C2 you can see that this is the date (from the 30th):
> b2g_emulator_vm try opt test crashtest-2 on 2013-11-30 01:32:10 PST for push 3ff723c26233
The first re-triggered job failed about two and a half days later:
> b2g_emulator_vm try opt test crashtest-2 on 2013-12-02 12:32:40 PST for push 3ff723c26233

As you noticed, we're very aggressive at cleaning up the try server uploads.
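As a quick check on the timeline, this sketch computes the gap between the two job timestamps quoted above and compares it with a retention window. The 48-hour figure is an assumption based on the hourly `-mtime +1` cleanup quoted later in this bug. The real cutoff depends on the upload directory's mtime, which would be earlier than the first test job.

```python
from datetime import datetime, timedelta

# Timestamps quoted in the comment above (both PST, so no timezone math needed).
original_run = datetime(2013, 11, 30, 1, 32, 10)
first_retrigger = datetime(2013, 12, 2, 12, 32, 40)

# Assumption: uploads are pruned once they are roughly 48 hours old.
ASSUMED_RETENTION = timedelta(hours=48)

elapsed = first_retrigger - original_run
print(f"elapsed between runs: {elapsed}")
print(f"past assumed retention: {elapsed > ASSUMED_RETENTION}")
```

The gap is about 59 hours, so under that assumption the tests.zip would already have been removed by the time of the first retrigger.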
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
Yeah, that was why I asked Yoshi to file, because that pruning seemed insanely fast to me. Since it frequently takes 12 hours to get try results, that means that people who only do their job during working hours cannot actually retrigger things on a try push which is pushed any later in the week than Thursday afternoon, because if you push first thing Friday morning you cannot count on getting results before the end of the day Friday, and apparently on Monday you will not be able to retrigger.

Is that actually what we want, that a try push from Friday must be retriggered either on Friday night, or Saturday or Sunday?
Also, the build and the symbols were still there; it was just the tests that were gone. Why were the tests on pvtbuilds, and is that why they are cleaned up so aggressively? We don't clean up non-b2g test zips even remotely that quickly: I can still retrigger non-b2g tests on that push even now.

Being able to retrigger tests from a Friday push on a Monday seems like an absolute minimum.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Summary: Crashtest failure when running B2G emulator on try-server → Try-server cleanup of b2g test zips is too aggressive
Someone needs to adjust the cleanup cronjobs to be more lenient, but given the volume of try usage, that won't be undertaken lightly. There are hundreds of gigabytes of build artifacts being generated daily and we can't keep them all.

cc-ing Nick in case we have better options here now with S3.
Component: Buildduty → Tools
QA Contact: armenzg → hwine
Looks like this is running on the older private builds host:

[b2gbld@pvtbuilds2.dmz.scl3 ~]$ cat /etc/cron.d/b2gtry-cleanup
@hourly b2gtry find /pub/mozilla.org/b2g/try-builds/ -mindepth 1 -maxdepth 1 -type d -mtime +1 -print0 | xargs -0 rm -rf

That will need IT intervention to remove, as it's owned by root.
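For anyone unsure what `-mtime +1` selects: GNU find measures age in whole 24-hour periods and discards the fraction, so `+1` matches entries last modified at least 48 hours ago. The sketch below mimics that selection in Python without deleting anything. The function names would_prune and list_prunable are hypothetical, and this is an approximation of find's behavior, not the cron job itself.

```python
import os
import time

SECONDS_PER_DAY = 86400


def would_prune(mtime, now, mtime_plus=1):
    """Mimic `find -mtime +N`: age in whole days must exceed N."""
    age_days = int((now - mtime) // SECONDS_PER_DAY)
    return age_days > mtime_plus


def list_prunable(root, now=None, mtime_plus=1):
    """List direct subdirectories of root that the cleanup would select.

    Corresponds to -mindepth 1 -maxdepth 1 -type d; nothing is removed.
    """
    now = time.time() if now is None else now
    prunable = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                mtime = entry.stat(follow_symlinks=False).st_mtime
                if would_prune(mtime, now, mtime_plus):
                    prunable.append(entry.path)
    return sorted(prunable)
```

With the default of 1, a directory survives for just under 48 hours; with 13 it would survive for just under 14 days.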

There is also a cleanup on /pvt in SVN (sysadmins/puppet/trunk/modules/productdelivery/manifests/pvtbuilds_cron.pp). Both the existing dirs are empty, so I don't think we are publishing anything there, so this can get removed.

Then we'll need to add a cron (.../productdelivery/manifests/upload_cron.pp, or .../productdelivery/files/cron/b2gtry) that cleans up. Note that the path is /mnt/pvt_builds/pub/mozilla.org/b2g/try-builds where this will run (upload-cron), and we should use '-mtime +13' to match the other try expiry. But with the emulator builds we do now, each try run can be 1.5G in size, so let's consider how much space might be used before we exhaust the disk.
Also, we recently got permission to put private files on a private S3 bucket. We'd have to figure out how to keep the existing password and LDAP-based authentication first, though.
Also, the last 2 days' worth of data is 135GB, and we're perennially short of space (bug 855594), so keeping 7 times more data might be an issue from a free space point of view.
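A rough back-of-the-envelope projection of that space concern, assuming the upload rate stays constant at the observed 135GB per 2 days and that '-mtime +13' keeps roughly 14 days of data. The helper name projected_usage_gb is hypothetical.

```python
def projected_usage_gb(observed_gb, observed_days, retention_days):
    """Scale an observed usage sample linearly to a longer retention window."""
    daily_gb = observed_gb / observed_days
    return daily_gb * retention_days


# Figures from the comments above; the 14-day window is an assumption.
current_gb = projected_usage_gb(135, 2, 2)
extended_gb = projected_usage_gb(135, 2, 14)

print(f"current retention:  {current_gb:.0f} GB")
print(f"14-day retention:   {extended_gb:.0f} GB")
print(f"growth factor:      {extended_gb / current_gb:.0f}x")
```

Under those assumptions the try-builds directory would grow from 135GB to a bit under 1TB, which is the "7 times more data" mentioned above.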
Component: Tools → General
Status: REOPENED → RESOLVED
Closed: 11 years ago → 7 years ago
Resolution: --- → WONTFIX