Windows build times exceeding timeout on release branches and non-unified trunk branches

RESOLVED FIXED

Status

Release Engineering
Platform Support

People

(Reporter: jlund, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)


Attachments

(9 attachments)

This bug derives from https://bugzilla.mozilla.org/show_bug.cgi?id=1047621#c44 through comment 46, as the slowdown appears to be unrelated to that SeaMicro bug.

Putting this into buildduty while we hit timeout limits on our build machines; we have not confirmed whether this was the result of an infra change or something in the tree.

Unknowns:

1) Does this affect other Windows builders on trunk branches?
2) On what exact date did we see the jump in build times?
3) Which step(s) are the main culprits for the extra build time?

We need more data points to work with. mshal has done some of the work here: http://people.mozilla.org/~mshal/histograms.4.png

RyanVM reported a rough ~2-hour jump in overall build times for Windows builds on release branches.
One obvious culprit is 'Bug 1019165 - install vs2013 across all our win64 machines and add a junction link for it in our GPO', but the data has not confirmed that to be the case.

some irc snippets from today:

10:03:29 <mshal> arr: yeah Windows - any idea why that might make things slower across the board?
10:03:48 <mshal> eg: looking at good: http://buildbot-master85.srv.releng.scl3.mozilla.com:8001/builders/WINNT%205.2%20mozilla-beta%20leak%20test%20build/builds/187  and bad: http://buildbot-master82.srv.releng.scl3.mozilla.com:8001/builders/WINNT%205.2%20mozilla-beta%20leak%20test%20build/builds/181
10:03:51 <arr> mshal: I refer you to jlund, there, since he was involved in the testing  :}
10:04:17 <mshal> so, if you load up those logs - they are a beta WINNT leak test build a day apart
10:04:41 <mshal> and compare step for step - you can see things like 'hg update' and 'compile' taking longer on the bad build, but those are kinda variable anyway
10:05:12 <mshal> even more curious is 'buildsymbols' is twice as long, pkg_tests ~3 times as long package ~50% longer, etc
10:05:17 <jlund> mshal: k let me look. I should note that we are still using vs2010. but we should be switching to vs2013 very soon which should make things 'better'

11:25:41 <arr> jlund: if we didn't see any regression Friday or over the weekend, not sure it's the VS2013 stuff, since a number of machines had already been upgraded
11:26:22 <jlund> arr: ya, just trying to prove that with data right now. it's hard x-referencing. mshal was looking to just test manually
11:26:46 <mshal> arr: the two beta builds I pointed to in #mozbuild were on Friday and Saturday
11:27:16 <mshal> not a guarantee, but I definitely think it's worth ruling it out first
11:27:57 <arr> mshal: yeah, if we saw issues before that
11:28:33 <jlund> mshal: actually that's a good point!
11:28:49 <jlund> http://buildbot-master82.srv.releng.scl3.mozilla.com:8001/builders/WINNT%205.2%20mozilla-beta%20leak%20test%20build/builds/181 is done with b-2008-ix-0002
11:29:02 <jlund> that machine was only updated yesterday https://bugzilla.mozilla.org/show_bug.cgi?id=1019165#c69
11:29:06 <jlund> that log is from sat
11:29:23 <mshal> ahh, cool

11:37:11 <RyanVM|sheriffduty> and https://bugzilla.mozilla.org/show_bug.cgi?id=1047621#c44 covers the 19800sec timeouts showing up
11:37:12 <jlund> RyanVM|sheriffduty: have you only noticed on release branches?
11:37:17 <RyanVM|sheriffduty> yeah
11:37:22 <RyanVM|sheriffduty> (unified builds on trunk ;))
11:37:44 <RyanVM|sheriffduty> would be fun to see what a non-unified trunk PGO build looked like
11:37:55 <RyanVM|sheriffduty> if we had the capability to do so (short of Try mozconfig hackery)
11:38:04 <RyanVM|sheriffduty> i'm sure we'd see the same problem, though

17:13:02 <RyanVM|afk> mshal: jlund: not sure if it's really a useful data point, but the b2g release branches show similar slowdowns as well recently
17:13:05 <mshal> ix only has 8GB - maybe swapping more?
17:13:17 <RyanVM|afk> and pgo is disabled for windows opt builds there
17:13:28 <mshal> oic
17:13:50 <RyanVM|afk> but I guess it furthers the point that this is not a code-related issue
17:13:58 <RyanVM|afk> seeing as how even b2g28 is affected :)
17:14:07 <RyanVM|afk> (and I'm sure esr24 would be as well if I were to look)
17:14:22 <RyanVM|afk> hah, yeah
17:14:29 <RyanVM|afk> most recent esr24 opt build took 447min
17:14:38 <RyanVM|afk> nightly 360
17:14:55 <RyanVM|afk> the 447min build was a b-2008-ix machine
17:15:28 <RyanVM|afk> oh interesting
17:15:30 <RyanVM|afk> one nice thing about esr24
17:15:38 <RyanVM|afk> it runs nightlies every day even when there have been no changes
17:15:45 <RyanVM|afk> https://tbpl.mozilla.org/?tree=Mozilla-Esr24&rev=35610d8e9a4e
17:16:21 <RyanVM|afk> so thursday took 289min, friday 291, saturday 357
17:16:37 <RyanVM|afk> unfortunately, the first 2 were ix and the latter was sm
17:17:54 <RyanVM|afk> looks like sm slaves taking ~360min on esr24 has been the norm for awhile :(
17:18:32 <RyanVM|afk> but an ix dep build took 290min on saturday still
17:18:41 <RyanVM|afk> 223 on sunday
17:18:51 <RyanVM|afk> then 447 monday
17:19:00 <RyanVM|afk> all ix slaves in those cases
See Also: → bug 1019165, bug 1047621
Does anyone have a good way to get per-step buildbot data for the last month or so? I'd like to get data for, say, Windows XP Opt: times for the compile, buildsymbols, pkg_tests, check, etc. steps individually. I can get a couple of data points from recent builds by scraping the builder screen, but after a few days it no longer seems to be available.

Looking at the overall turnaround time is a bit problematic, since things like 'hg update' and 'compile' can vary wildly depending on when the machine last built a particular tree. Additionally, the random steps of checking_clobber_times and clean_old_builds can make some builds appear to be way slower than others, but they just happened to be the one on cleanup duty. So it's hard to sum all those random numbers up and make conclusions based on that. I think by pulling out data on some of the more consistent steps (like buildsymbols, check, etc) we can see if any patterns emerge.

Based on what I can see so far, no single step is responsible for the slowdown. It seems that the ix disk is just slow, and/or the fact that it only has 8GB of RAM compared to sm's 16 may be causing it to swap more. I don't know if anything has changed recently there to cause ix to behave worse (like some background process using the disk, or whatever).
Created attachment 8475787 [details]
step data, csv.gz

time mysql -e "SELECT builds.id, builders.name AS builder_name, slaves.name AS slave_name,
       steps.name AS step_name, steps.starttime AS step_starttime_UTC,
       unix_timestamp(steps.endtime) - unix_timestamp(steps.starttime) AS step_elapsed_seconds
  FROM builds
  JOIN builders ON builds.builder_id=builders.id
  JOIN slaves ON builds.slave_id=slaves.id
  JOIN steps ON builds.id=steps.build_id
 WHERE builders.name LIKE '%-win32'
   AND builders.name!='try-win32'
   AND builds.starttime > '2014-07-01'
   AND steps.name IN ('compile', 'make_buildsymbols', 'make_pkg_tests', 'check')
   AND steps.status=0
 ORDER BY builds.id ASC;" buildbot | sed 's/\t/,/g' > output.csv

real	5m24.180s
user	0m0.043s
sys	0m0.022s

The names are a little different:
 TBPL:     Windows XP Opt: B
 Buildbot: WINNT 5.2 mozilla-inbound build
 This db:  mozilla-inbound-win32
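Once exported in that shape, the CSV is easy to slice; for example, averaging elapsed seconds per step name (column names follow the SELECT above — an illustrative sketch, not part of the releng tooling):

```python
import csv
from collections import defaultdict
from statistics import mean

def step_averages(path):
    """Average step_elapsed_seconds per step_name from the exported CSV."""
    durations = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            durations[row["step_name"]].append(int(row["step_elapsed_seconds"]))
    return {step: mean(vals) for step, vals in durations.items()}
```

The same grouping trick works for slicing by builder_name or slave_name instead of step_name.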
Thanks Nick! That helps. I put together some graphs of the more "consistent" operations for mozilla-beta-win32 opt builds. The ix machines are on the left, and sm machines on the right - each group is sorted by build date:

http://people.mozilla.org/~mshal/beta-buildsymbols.png
http://people.mozilla.org/~mshal/beta-pkg-tests.png
http://people.mozilla.org/~mshal/beta-check.png

For buildsymbols and pkg-tests, you can see that the runtimes vary quite a lot (sometimes by as much as 2x), but I don't see a particular slowdown recently.

The 'make check' graph is a little more consistent in that it is either fairly quick or really slow. There are more instances of it being slow on ix in recent days, but as you can see, there are similar slow instances as far back as the data goes.

So, I don't think there has been a recent change to ix that affects these run times.

One graph that does look suspect is hg-update:

http://people.mozilla.org/~mshal/beta-hg-update.png

A lot of the "worst" times are on ix and in the last few days, several of which are over an hour. Could that account for much of the overall slowdown we're seeing? I think if you get one of these slow updates, plus the build gets picked for cleanup duty, plus a randomly slow check, you can end up with a terrible turnaround time.
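That pattern can be checked directly against the step data — for example, counting how often hg_update exceeds an hour on each slave class. This assumes an export that also includes the hg_update step (the query in the earlier attachment filtered to four other steps); a hypothetical sketch, not an existing tool:

```python
import csv
from collections import defaultdict

def slow_update_counts(path, limit=3600):
    """Count hg_update steps slower than `limit` seconds, per slave class.

    The slave class is the slave name minus its trailing number, so
    "b-2008-ix-0002" and "b-2008-ix-0005" both count as "b-2008-ix".
    Returns {class: (slow_count, total_count)}.
    """
    counts = defaultdict(lambda: [0, 0])  # class -> [slow, total]
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["step_name"] != "hg_update":
                continue
            cls = row["slave_name"].rsplit("-", 1)[0]
            counts[cls][1] += 1
            if int(row["step_elapsed_seconds"]) > limit:
                counts[cls][0] += 1
    return {cls: tuple(c) for cls, c in counts.items()}
```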
Depends on: 1057022
(In reply to Michael Shal [:mshal] from comment #4)
> The 'make check' graph is a little more consistent in that it is either
> fairly quick or really slow. There are more instances of it being slow on ix
> in recent days, but as you can see, there are similar slow instances as far
> back as the data goes.
> 
> So, I don't think there has been a recent change to ix that affects these
> run times.

I take that back. Looking at more of the data (not just beta), I think there is some indication that a few of these steps (like check, buildsymbols, package-tests - not just hg-update) show signs of slowing down after around 8/16 or 8/17. jlund & Q helped me roll back a build machine to the pre-VS2013 install - one thing I noticed is that the updated machines have a sqlservr.exe process that spikes the CPU (all cores!) about every minute. The previous machine also has an sqlservr.exe process running, but it doesn't ever appear to be active. I'm not sure why it is only active on the updated machines. Both are running what appears to be the same binary.

I checked the Services in msconfig and noticed a few differences, but I'm not sure if they're relevant:

1) The new configuration with VS2013 has a few services that are now stopped:
 - Microsoft .NET Framework NGEN
 - Windows Installer
 - Microsoft Software Shadow Copy Provider
 - WinHTTP Web Proxy Auto-Discovery Service

2) The new configuration with VS2013 has a few services that are now running:
 - Protected Storage
 - Windows Modules Installer

3) The new configuration with VS2013 has a few new services, but all appear to be stopped:
 - Visual Studio ETW Event Collection Service (Stopped)
 - Claims to Windows Token Service (Stopped)
 - Release Management Monitor (Stopped)
 - Te.Service (Stopped)
 - Windows App Certification Kit Fast User Switching Utility Service (Stopped)

I'm not sure if these are relevant to the sqlservr.exe thing or not. Both machines have a "SQL Server (SQLEXPRESS)" and a "SQL Server VSS Writer" running.

If we aren't using SQL Server for anything, can we try uninstalling that component from Visual Studio and see if it makes that periodic CPU bump go away? I'm not entirely sure at the moment, but I'm suspicious that it's what's causing the slowdowns across all build steps.
Flags: needinfo?(q)
Created attachment 8477104 [details]
sqlservr.png

Screenshot of a new machine when idle - the bump goes by every minute.
Created attachment 8477105 [details]
prevs2013.png

Screenshot of a machine with a pre-VS2013 image. It still has sqlservr.exe, but it doesn't chew up the CPU.
Ugh, Windows. Now the pre-VS2013 machine is also spiking sqlservr.exe once a minute. It definitely wasn't doing that yesterday (it's easy to tell, since the Resource Monitor graph is 60s wide and the spike happens every minute). It would still probably be nice to get rid of it, but now I don't think that's causing this bug.

Suggestions are welcome.
Flags: needinfo?(q)
Created attachment 8477574 [details]
check.png
Created attachment 8477575 [details]
compile.png
Created attachment 8477576 [details]
hg_update.png
Created attachment 8477577 [details]
make_buildsymbols.png
Created attachment 8477578 [details]
make_pkg_tests.png
I attached some graphs of the build data for iX machines only (Windows opt builds across all trees, except b2g-inbound). These show why I disagree with my position in #c4. In each graph you can see some difference between builds from (approximately) pre-8/16 and post-8/16. The hg_update, make_buildsymbols, and make_pkg_tests graphs all show a pretty clear difference - the others are a bit harder to see, but the trend is still there. I tried to check whether the long poles were specific to any particular iX machines, but there were no clear outliers.

One thing I want to emphasize is how completely random an individual build time can be, so it's really hard to just look at two individual builds and say that one is "fast" and the other is "slow". In other words, I can't just run a single build (or even a few builds) against the machine we imaged with pre-VS2013 and say that it definitely works or definitely doesn't.

Here are some basic statistics in case the graph isn't convincing (all times in seconds):

pre-8/16:
hg_update: min=8 max=5461 avg=290.405099 stdev=435.447198
compile: min=331 max=15860 avg=2459.543478 stdev=3021.769580
make_buildsymbols: min=221 max=681 avg=329.094631 stdev=81.903703
make_pkg_tests: min=93 max=1872 avg=452.650554 stdev=176.811994
check: min=153 max=1826 avg=487.930565 stdev=149.126536

post-8/16:
hg_update: min=40 max=5950 avg=890.400844 stdev=1281.643107
compile: min=358 max=18720 avg=3961.521277 stdev=3749.037949
make_buildsymbols: min=241 max=1168 avg=481.768421 stdev=204.006352
make_pkg_tests: min=119 max=1612 avg=585.578947 stdev=292.892764
check: min=164 max=1901 avg=748.458101 stdev=317.222819

In all cases the average runtime is up (which is bad, of course), and with a higher std deviation.
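Those summary numbers can be recomputed from the raw step data with the standard library, splitting each step's samples at the cutoff date (the exact cutoff and the ISO date format are assumptions for illustration):

```python
from statistics import mean, stdev

def split_stats(samples, cutoff="2014-08-16"):
    """Summarize (iso_date, seconds) samples before and after a cutoff date.

    ISO-format date strings sort correctly as plain strings, so no
    date parsing is needed for the split.
    """
    pre = [s for d, s in samples if d < cutoff]
    post = [s for d, s in samples if d >= cutoff]
    def summarize(xs):
        return {"min": min(xs), "max": max(xs),
                "avg": mean(xs), "stdev": stdev(xs)}
    return summarize(pre), summarize(post)
```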

I thought the sqlservr.exe thing was promising, but now I have no idea why it runs sometimes and not others. If anyone has thoughts on things to check, please let me know.

My current recommendation is to back out the VS2013 update on a large enough set of machines (say, 1/4 or 1/2 of the iX pool) so that we can get enough samples of build data to say for sure whether something in the VS2013 update is to blame. (Obviously I want to move forward with VS2013 as fast as possible to get the wins there, but if it came with some other crap that's slowing us down, we need to pinpoint it.)

Thoughts?
FYI, I won't have time to continue looking at this. I'd still recommend trying to back out the VS2013 update on a group of machines to see if that group compares favorably to the VS2013 group, so we at least know whether something in those updates is causing the problem.
Created attachment 8478755 [details]
Screenshot of installed apps on b-2008-ix-0007

I was poking around on b-2008-ix-0007 and noticed that VS2013 (and a few other things) were marked as installed on Aug 25, while bug 1019165 comment #69 has it being reimaged on Aug 18. Does that line up with the GPO logs? I'm wondering if installs are taking a lot longer than we thought.
Flags: needinfo?(arich)

Attachment #8478755 - Attachment description: List of installed apps on b-2008-ix-0007 → Screenshot of installed apps on b-2008-ix-0007
Flags: needinfo?(q)
Also, on b-2008-ix-0133 at ~3am Pacific I saw these processes running:
   NDP451-KB2858728-x86-x64-AllOS-ENU.exe
   VS2013.2CTP.exe

A quick google suggests they are installers. I'm confused by the Community Technology Preview; I thought we were on final versions of the updates.
Flags: needinfo?(arich)

Comment 18

After much digging, I can't seem to find any evidence of the GPO reapplying. I refined the WMI query to be more specific, and I added logging to see if I can catch a reinstall on any machine.
Flags: needinfo?(q)
(In reply to Q from comment #18)
> After much I can't seem to find any evidence of the gpo reapplying. I
> refined the wmi query to be more specific and I added logging to see if I
> can catch a reinstall on any machine.

Hmm, so something is eating away at our I/O on Windows machines. It does not appear to happen every time. Looking at the graphs from comments 9-13, it is clear that most jobs take a similar amount of time pre-VS2013, but the frequency of really long jobs has gone up post-VS2013.

best logical guesses are either:
1) vs2013 comes with something that is occasionally eating away at resources during a job run (not just compile step)

2) gpo is doing something that is occasionally eating resources

Q, if your refined query yields no findings, that might rule out 2.

But going with mshal's suggestion of rolling back a subset of machines might not be the best option. We are in the process of switching over to vs2013 and if we can do that soonish for trunk + try, that could be a bigger win overall. rolling back a subset will make it hard to test vs2013.

sheriffs: now that we have disabled the seamicros, are we still hitting timeouts/failures?
ben: how close do you think we are to being able to use VS2013?
Flags: needinfo?(sheriffs)
Flags: needinfo?(benjamin)
When we were discussing this in IRC previously, it was noted that someone found a build that was timing out before the machine it ran on was even updated: http://buildbot-master82.srv.releng.scl3.mozilla.com:8001/builders/WINNT%205.2%20mozilla-beta%20leak%20test%20build/builds/181

Are we sure this is tied to vs2013 at all?
I don't believe the sheriffs@mozilla.bugs alias works for needinfo requests.
Flags: needinfo?(sheriffs)
Flags: needinfo?(ryanvm)
Flags: needinfo?(emorley)
(In reply to Amy Rich [:arich] [:arr] from comment #20)
> When we were discussion this in irc perviously, it was stated that someone
> found issues with a build that was timing out before the machine it ran on
> was even updated:
> http://buildbot-master82.srv.releng.scl3.mozilla.com:8001/builders/WINNT%205.
> 2%20mozilla-beta%20leak%20test%20build/builds/181
> 
> Are we sure this is tied to vs2013 at all?

You're right, there are cases of really long build times prior to VS2013 being installed. But since we have had more timeouts and higher build times post-VS2013, and no other changes line up with that timeframe, it seemed like the likely culprit. It is hard to tell, as it happens seldom.

In any case, unless 1) we are failing intermittently often, or 2) actually being able to use VS2013 is far off, I suggest we not worry about further investigation (for now), smoking gun or not. Instead, let's put effort into getting VS2013 out the door.
There have been 19800s timeouts as recently as today.

https://tbpl.mozilla.org/php/getParsedLog.php?id=46861136&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=46858389&tree=Mozilla-Esr31

Not to mention the spike in gtest 600s timeouts that we've seen at much higher frequency across all trees since this started.

And what Wes said, you can't needinfo? a watcher account.
Flags: needinfo?(ryanvm)
And another: https://tbpl.mozilla.org/php/getParsedLog.php?id=46911668&tree=Try
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #23)
> And what Wes said, you can't needinfo? a watcher account.

(Bug 179701 filed for this)
Flags: needinfo?(emorley)

Comment 26

I don't have much input to give here. It sounds like we should look at one of the stuck machines in process explorer and see what's eating the CPU. Is it possible to send somebody an alert when it gets into this state before it gets killed, so they can remote-desktop in and check it by hand?
Flags: needinfo?(benjamin)
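The watchdog Benjamin describes mostly needs spike detection on top of CPU sampling. The detection logic is pure arithmetic over a once-per-second sample series (how the samples are gathered on the slave — e.g. via psutil or WMI — and any alerting hook are assumptions; nothing like this existed in the releng tooling at the time):

```python
def spike_intervals(samples, threshold=90.0):
    """Seconds between upward CPU-threshold crossings in a 1 Hz sample list."""
    crossings = [i for i, (prev, cur) in enumerate(zip(samples, samples[1:]), 1)
                 if prev < threshold <= cur]
    return [b - a for a, b in zip(crossings, crossings[1:])]

def looks_like_minute_spike(samples, threshold=90.0, tolerance=5):
    """True if CPU spikes recur roughly once a minute, like the observed
    sqlservr.exe pattern (the Resource Monitor window is 60s wide)."""
    gaps = spike_intervals(samples, threshold)
    return bool(gaps) and all(abs(g - 60) <= tolerance for g in gaps)
```

A monitoring loop on the slave could feed this a rolling window of per-process CPU samples and fire an alert before buildbot's timeout kills the job.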
https://tbpl.mozilla.org/php/getParsedLog.php?id=47078483&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=47083920&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=47078341&tree=Mozilla-Aurora
Blocks: 1062877
Nothing actionable by buildduty here, moving to Platform Support.

Hopefully work on stuff from 1062877 helps here
Component: Buildduty → Platform Support
QA Contact: bugspam.Callek → coop
Jordan, per :arr we're wontfixing all SeaMicro-specific (in releng) bugs; it's not clear to me from a 10-second skim whether this covers both SeaMicro and iX systems, though.
Flags: needinfo?(jlund)
This was on all Windows build slaves regardless of type.
The graphs from #c9 through #c14 were all iX-only data, and I think they highlight the issue. So, this isn't a SeaMicro-only bug.
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #32)
> This was on all Windows build slaves regardless of type.

Speaking of which, are you still seeing timeouts? I think if the number of intermittents is decreasing, it is very likely that this bug can be summed up as: https://bugzilla.mozilla.org/show_bug.cgi?id=1062877#c1

We have VS2013 on 5 machines that are imaged differently, and they are running pretty well. I wonder if we can close this bug for now.
Flags: needinfo?(jlund) → needinfo?(ryanvm)
No issues for some time now.
Flags: needinfo?(ryanvm)
Going to close this for now based on comments 34-35.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED