Open Bug 1155879 Opened 5 years ago Updated 5 years ago

Linux reftest runtimes on trunk have doubled since early April

Categories

(Testing :: Reftest, defect, critical)


Tracking

(firefox40 affected)

People

(Reporter: RyanVM, Unassigned)

References

Details

We've been seeing a large number of timeouts lately on Linux reftests (see bug 1073442 where the majority have been getting starred). Looking at a graph of ASAN reftest runtimes over the last week, the spike looks to have been around early April.
https://www.hostedgraphite.com/da5c920d/af587ddb-3e87-432e-8d10-7ed541694a6a/graphite/render/?width=1441&height=907&_salt=1412956646.781&target=buildtimes.mozilla-central_ubuntu64-asan_vm_test-reftest.p50&from=-4weeks

Aurora runtimes are consistently 60-70min for the same job and a very similar number of tests.

As of now, this is causing the majority of our Linux reftest jobs to fall below visibility standards.
Depends on: 1156426
I get permissions errors when accessing the failure log here:
https://tbpl.mozilla.org/php/getParsedLog.php?id=48949914&tree=Mozilla-Inbound

Can you post the log elsewhere? A log from an earlier successful run would help too.
Flags: needinfo?(ryanvm)
TBPL was decommissioned last month. Use Treeherder instead.

Failing run:
https://treeherder.mozilla.org/logviewer.html#?job_id=9359972&repo=mozilla-inbound

Green run:
https://treeherder.mozilla.org/logviewer.html#?job_id=9360491&repo=mozilla-inbound
Flags: needinfo?(ryanvm)
Failing run: fails after 7200s at 94%
Green run: passes after 6836s at 100%

It seems we're running rather close to the limit here, even when we're passing. Do we have green logs from early April when runtimes were reported doubling?
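For scale, a quick calculation (a sketch, using the elapsed times quoted above and assuming 7200s is the hard job timeout) shows how little headroom the green run had:

```python
limit_s = 7200       # assumed harness timeout for the job
green_run_s = 6836   # elapsed time of the passing run above

headroom_s = limit_s - green_run_s
headroom_pct = 100.0 * headroom_s / limit_s
print(headroom_s, round(headroom_pct, 1))  # 364s, i.e. ~5.1% headroom
```

So even a modest regression in per-test cost is enough to push a passing run over the limit.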
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #7)
> Aurora is pretty representative.
> https://treeherder.mozilla.org/logviewer.html#?job_id=789974&repo=mozilla-aurora

Thanks, Ryan!

The elapsed time for layout/reftests/invalidation/ went from 18s (aurora) to 32s (inbound). Not quite doubled, but it points to bug 994541.

Also:
https://bugzilla.mozilla.org/show_bug.cgi?id=994541#c55
Flags: needinfo?(nical.bugzilla)
With OMTC, reading back from the X server (which reftests do a lot) has become much more expensive. If reftests are timing out too often, we should split them into two chunks until we can remove our dependency on XRender (which requires switching from GTK2 to GTK3).
Flags: needinfo?(nical.bugzilla)
The OMTC landing was one of the points where the times increased, but not the biggest.
I had a closer look at the actual Treeherder data for the inbound pushes. The noise in the test VMs is +/- 20 minutes, which makes the graphs rather useless. It's quite likely that we've been inching up toward the 120-minute ceiling over a longer period than reported. I compiled a report with my findings here:

http://media.junglecode.net/test/1155879/linux_asan.html

Given these findings, I agree that splitting the ASAN tests across 2 machines is the way forward here. I'll post more comments in bug 1156426.
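For reference, chunked test jobs typically divide the ordered test list into contiguous slices, one per machine. A minimal sketch of that scheme (a hypothetical helper for illustration, not the actual reftest harness code):

```python
def chunk_tests(tests, total_chunks, this_chunk):
    """Return the contiguous slice of `tests` for chunk `this_chunk`
    (1-based) out of `total_chunks`. Remainder tests go to the
    earliest chunks, so chunk sizes differ by at most one."""
    per_chunk, remainder = divmod(len(tests), total_chunks)
    start = per_chunk * (this_chunk - 1) + min(this_chunk - 1, remainder)
    end = start + per_chunk + (1 if this_chunk <= remainder else 0)
    return tests[start:end]

# Splitting 7 tests across 2 machines: chunk 1 gets 4, chunk 2 gets 3.
tests = ["t%d" % i for i in range(7)]
print(chunk_tests(tests, 2, 1))  # ['t0', 't1', 't2', 't3']
print(chunk_tests(tests, 2, 2))  # ['t4', 't5', 't6']
```

With two chunks, each job's wall-clock time should land roughly at half the current ~120 minutes, restoring comfortable headroom under the timeout.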