Closed Bug 738557 Opened 12 years ago Closed 8 years ago

Tests on retriggered builds can download the previous build from ftp.m.o

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: philor, Unassigned)

References

Details

(Keywords: sheriffing-untriaged, Whiteboard: [tests][ftp][cachingishard])

It's an ugly muddle because it ended up with more Android builds than I actually needed, but the story behind https://tbpl.mozilla.org/?rev=ab2ff3b5611f&jobname=Android is this:

* I pushed and got the first on-push build

* I realized that the bustage below meant I needed a clobber, so I set it on the clobberer and waited to see if I lucked into a free-space/periodic clobber

* On Android XUL, I did, and its story ends there

* On Android native, I did not, so the test runs were busted from the browser not starting

* I retriggered the build, and it was a clobber (despite confusing me by not saying so because it clobbered earlier while doing a build on another tree); because it had the same buildid, it uploaded to the same URL on ftp.m.o as the previous broken build

* Mochitests and at least one talos run downloaded the first build in response to the sendchange for the second build - those are the second set of orange and red; jsreftests, crashtests, reftests, and most talos downloaded the second build and went green

I presume that means that there's some caching and multiple servers and whatnot that make ftp.m.o not work the way stage always worked, but this is a great big footgun aimed right at the first person who lands something, realizes it needs a clobber, clobbers, and gets the unclobbered results again.
There is caching on ftp.m.o, which I was counting as a good thing from the point of view of not hitting the netapp so much.

dumitru, is Zeus set to expire cache entries in an hour ? That's my best guess from header requests.
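
For reference, one way to sanity-check the expiry from outside is to look at the cache-related response headers. This is purely illustrative: the URL below is a placeholder, and which headers Zeus actually exposes depends on its config.

$ # Age / Cache-Control / X-Cache are the usual suspects, but header names vary by proxy config
$ curl -sI 'http://ftp.mozilla.org/pub/mozilla.org/<path to a build>' | grep -iE 'age|cache'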
Blocks: 735870
(In reply to Nick Thomas [:nthomas] from comment #1)
> There is caching on ftp.m.o, which I was counting as a good thing from the
> point of view of not hitting the netapp so much. 
> 
> dumitru, is Zeus set to expire cache entries in an hour ? That's my best
> guess from header requests.

webcache!time is set to 1800 seconds (30 minutes).
Thanks. That's shorter than the ~40 minute clobber time for Android native, and I see there are 5 opt builds + one nightly on https://tbpl.mozilla.org/?rev=ab2ff3b5611f&jobname=Android, so I'm a bit confused. philor?
Sure: if we uploaded it, the first test slave downloaded it, every test slave finished downloading it within 30 minutes of that, nobody in the entire world (test slave, spider, or human) downloaded it after that, and we only uploaded a new build once those 30 minutes were up, then we'd be fine.

We built the first time, and uploaded it at 21:06. The first thing I see downloading it was at 21:08, so it was cached until 21:38.

https://tbpl.mozilla.org/php/getParsedLog.php?id=10305098&tree=Firefox (a retry after a late-starting talos failed) was at 21:38; say the seconds worked out so that it missed the cache, and it started another 30 minute cache window, out to 22:08.

We built the second time, and uploaded it to the same place, at 22:12.

So, all it would take to extend into a third cache window was anything downloading the build between 22:08 and 22:12. I don't see a likely timed test in all that spew of failure, but I don't even need to: a 30 minute cache and a 40 minute clobber making everything fine assumes that no test will start later than 35 minutes in, which just isn't the case. The first remote-tdhtml started 17 minutes after the first test started, and took 13 minutes to fail and trigger an automatic retry, and there goes the gap between your 30 and your 40. Looking at clean single-build jobs on inbound right now, where it's less painful to dig out which is what, a suite having two 30+ minute jobs both ending in a retry isn't at all unusual, and there's your cache out to 90 minutes.
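
To make the arithmetic concrete, here's the same chain worked out (GNU date syntax; the times are the ones above, and the TTL is the 30 minute webcache!time):

$ # each cache miss re-fetches whatever is on disk at that moment and caches it for another 30 minutes
$ date -d '21:08 + 30 minutes' +%H:%M   # first download of the old build keeps it cached until...
21:38
$ date -d '21:38 + 30 minutes' +%H:%M   # the 21:38 miss re-caches the old build until...
22:08
$ # so any request between 22:08 and the 22:12 upload re-caches the old build well past the
$ # point where the second build's tests start downloading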

The second set of tests did get the first build, and whether it's because of a test that I'm not spotting because it started earlier but was slow to get around to downloading, or because someone other than a test slave downloaded the build during those four minutes doesn't really matter: if we're going to be okay with a 30 minute cache that isn't invalidated by replacing the file that's cached, then we have to be okay with the fact that sometimes (probably frequently) a retriggered build will do no good at all, and will require that we both know to and do sit around for up to 30 minutes before retriggering all the tests.
(In reply to Dumitru Gherman [:dumitru] from comment #2)
> (In reply to Nick Thomas [:nthomas] from comment #1)
> > There is caching on ftp.m.o, which I was counting as a good thing from the
> > point of view of not hitting the netapp so much. 
> > 
> > dumitru, is Zeus set to expire cache entries in an hour ? That's my best
> > guess from header requests.
> 
> webcache!time is set to 1800 seconds (30 minutes).

dumitru, does Zeus/ftp.m.o actually cache the content somewhere?
Priority: -- → P2
Whiteboard: [tests][ftp][cachingishard]
And can we configure it to stat the file on disk on each request?
One thing we could do here is append something unique to the URL when we do the sendchange, since Zeus caches based on the whole URL, e.g. on stage:

$ cd /pub/mozilla.org/firefox/nightly/experimental
$ echo 1 > testfile
$ curl -s 'http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/experimental/testfile?foo'
1

$ echo 2 > testfile
$ curl -s 'http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/experimental/testfile?foo'
1
$ curl -s 'http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/experimental/testfile?bar'
2

So if we add ?now=<current_epoch_timestamp> when we sendchange, we'll have unique URLs for any rebuilds and won't get a cached copy from ftp.m.o.
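
A hypothetical sketch against the testfile above (the ?now= parameter name and the path are just illustrative):

$ # unique query string per rebuild, so Zeus has never cached this exact URL before
$ URL=http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/experimental/testfile
$ curl -s "${URL}?now=$(date +%s)"
2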

Ideally we would configure Zeus to do a HEAD request against a node to see whether the file has changed or not. Is that possible, dumitru?
Product: mozilla.org → Release Engineering
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Component: General Automation → General