Closed Bug 738557 Opened 12 years ago Closed 8 years ago

Tests on retriggered builds can download the previous build from ftp.m.o

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: philor, Unassigned)

References

Details

(Keywords: sheriffing-untriaged, Whiteboard: [tests][ftp][cachingishard])

It's an ugly muddle because it ended up with more Android builds than I actually needed, but the story behind https://tbpl.mozilla.org/?rev=ab2ff3b5611f&jobname=Android is this:

* I pushed and got the first on-push build

* I realized that the bustage below meant I needed a clobber, so I set it on the clobberer and waited to see if I lucked into a free-space/periodic clobber

* On Android XUL, I did, and its story ends there

* On Android native, I did not, so the test runs were busted from the browser not starting

* I retriggered the build, and it was a clobber (despite confusing me by not saying so because it clobbered earlier while doing a build on another tree); because it had the same buildid, it uploaded to the same URL on ftp.m.o as the previous broken build

* Mochitests and at least one talos run downloaded the first build in response to the sendchange for the second build - those are the second set of orange and red; jsreftests, crashtests, reftests, and most talos downloaded the second build and went green

I presume that means that there's some caching and multiple servers and whatnot that make ftp.m.o not work the way stage always worked, but this is a great big footgun aimed right at the first person who lands something, realizes it needs a clobber, clobbers, and gets the unclobbered results again.
There is caching on ftp.m.o, which I was counting as a good thing from the point of view of not hitting the netapp so much.

dumitru, is Zeus set to expire cache entries in an hour ? That's my best guess from header requests.
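
For reference, one way to sanity-check the expiry from outside is to look at the cache-related response headers. This is purely illustrative: the URL below is a placeholder, and which headers Zeus actually exposes depends on its config.

$ # Age / Cache-Control / X-Cache are the usual suspects, but header names vary by proxy config
$ curl -sI 'http://ftp.mozilla.org/pub/mozilla.org/<path to a build>' | grep -iE 'age|cache'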
Blocks: 735870
(In reply to Nick Thomas [:nthomas] from comment #1)
> There is caching on ftp.m.o, which I was counting as a good thing from the
> point of view of not hitting the netapp so much. 
> 
> dumitru, is Zeus set to expire cache entries in an hour ? That's my best
> guess from header requests.

webcache!time is set to 1800 seconds (30 minutes).
Thanks. That's shorter than the ~40 minute clobber time for Android native, and I see there are 5 opt builds + one nightly on https://tbpl.mozilla.org/?rev=ab2ff3b5611f&jobname=Android, so I'm a bit confused. philor?
Sure: if we uploaded it, the first test slave downloaded it, every test slave finished downloading it within 30 minutes of that, nobody in the entire world (test slave, spider, or human) downloaded it after that, and we only uploaded a new build once those 30 minutes were up, then we'd be fine.

We built the first time, and uploaded it at 21:06. The first thing I see downloading it was at 21:08, so it was cached until 21:38.

https://tbpl.mozilla.org/php/getParsedLog.php?id=10305098&tree=Firefox (a retry after a late-starting talos failed) was at 21:38; say the seconds worked out so that it missed the cache, and it started another 30 minute cache window, out to 22:08.

We built the second time, and uploaded it to the same place, at 22:12.

So, all it would take to extend into a third cache window was anything downloading the build between 22:08 and 22:12. I don't see a likely timed test in all that spew of failure, but I don't even need to: a 30 minute cache and a 40 minute clobber making everything fine assumes that no test will start later than 35 minutes in, which just isn't the case. The first remote-tdhtml started 17 minutes after the first test started, and took 13 minutes to fail and trigger an automatic retry, and there goes the gap between your 30 and your 40. Looking at clean single-build jobs on inbound right now, where it's less painful to dig out which is what, a suite having two 30+ minute jobs both ending in a retry isn't at all unusual, and there's your cache out to 90 minutes.
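
To make the arithmetic concrete, here's the same chain worked out (GNU date syntax; the times are the ones above, and the TTL is the 30 minute webcache!time):

$ # each cache miss re-fetches whatever is on disk at that moment and caches it for another 30 minutes
$ date -d '21:08 + 30 minutes' +%H:%M   # first download of the old build keeps it cached until...
21:38
$ date -d '21:38 + 30 minutes' +%H:%M   # the 21:38 miss re-caches the old build until...
22:08
$ # so any request between 22:08 and the 22:12 upload re-caches the old build well past the
$ # point where the second build's tests start downloading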

The second set of tests did get the first build, and whether it's because of a test that I'm not spotting because it started earlier but was slow to get around to downloading, or because someone other than a test slave downloaded the build during those four minutes doesn't really matter: if we're going to be okay with a 30 minute cache that isn't invalidated by replacing the file that's cached, then we have to be okay with the fact that sometimes (probably frequently) a retriggered build will do no good at all, and will require that we both know to and do sit around for up to 30 minutes before retriggering all the tests.
(In reply to Dumitru Gherman [:dumitru] from comment #2)
> (In reply to Nick Thomas [:nthomas] from comment #1)
> > There is caching on ftp.m.o, which I was counting as a good thing from the
> > point of view of not hitting the netapp so much. 
> > 
> > dumitru, is Zeus set to expire cache entries in an hour ? That's my best
> > guess from header requests.
> 
> webcache!time is set to 1800 seconds (30 minutes).

dumitru, does Zeus/ftp.m.o actually cache the content somewhere?
Priority: -- → P2
Whiteboard: [tests][ftp][cachingishard]
And can we configure it to stat the file on disk on each request?
One thing we could do here is append something unique to the URL when we do the sendchange, since Zeus caches based on the whole URL, e.g. on stage:

$ cd /pub/mozilla.org/firefox/nightly/experimental
$ echo 1 > testfile
$ curl -s 'http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/experimental/testfile?foo'
1

$ echo 2 > testfile
$ curl -s 'http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/experimental/testfile?foo'
1
$ curl -s 'http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/experimental/testfile?bar'
2

So if we add ?now=<current_epoch_timestamp> when we sendchange, we'll have unique URLs for any rebuilds and won't get a cached copy from ftp.m.o.
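
A hypothetical sketch against the testfile above (the ?now= parameter name and the path are just illustrative):

$ # unique query string per rebuild, so Zeus has never cached this exact URL before
$ URL=http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/experimental/testfile
$ curl -s "${URL}?now=$(date +%s)"
2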

Ideally we would configure Zeus to do a HEAD request against a node to see whether the file has changed or not. Is that possible, dumitru?
Product: mozilla.org → Release Engineering
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Component: General Automation → General