Windows build times exceeding timeout on release branches and non-unified trunk branches

RESOLVED FIXED

Status

Release Engineering
Platform Support

People

(Reporter: jlund, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)


Attachments

(9 attachments)

This bug derives from https://bugzilla.mozilla.org/show_bug.cgi?id=1047621#c44 through comment 46, as the slowdown appears to be unrelated to that SeaMicro bug.

Putting this into buildduty while we hit timeout limits on our build machines; we have not confirmed whether this was the result of an infra change or something in the tree.

Unknowns:

1) Does this affect other Windows builders on trunk branches?
2) On what exact date did we see the jump in build times?
3) Which step(s) are the main culprits for the extra build time?

We need more data points to work with. mshal has done some of the work here: http://people.mozilla.org/~mshal/histograms.4.png

RyanVM reported a rough ~2-hour jump in overall build times for Windows builds on release branches.
One obvious culprit is 'Bug 1019165 - install vs2013 across all our win64 machines and add a junction link for it in our GPO', but the data has not confirmed that to be the case.

some irc snippets from today:

10:03:29 <mshal> arr: yeah Windows - any idea why that might make things slower across the board?
10:03:48 <mshal> eg: looking at good: http://buildbot-master85.srv.releng.scl3.mozilla.com:8001/builders/WINNT%205.2%20mozilla-beta%20leak%20test%20build/builds/187  and bad: http://buildbot-master82.srv.releng.scl3.mozilla.com:8001/builders/WINNT%205.2%20mozilla-beta%20leak%20test%20build/builds/181
10:03:51 <arr> mshal: I refer you to jlund, there, since he was involved in the testing  :}
10:04:17 <mshal> so, if you load up those logs - they are a beta WINNT leak test build a day apart
10:04:41 <mshal> and compare step for step - you can see things like 'hg update' and 'compile' taking longer on the bad build, but those are kinda variable anyway
10:05:12 <mshal> even more curious is 'buildsymbols' is twice as long, pkg_tests ~3 times as long package ~50% longer, etc
10:05:17 <jlund> mshal: k let me look. I should note that we are still using vs2010. but we should be switching to vs2013 very soon which should make things 'better'

11:25:41 <arr> jlund: if we didn't see any regression Friday or over the weekend, not sure it's the VS2013 stuff, since a number of machines had already been upgraded
11:26:22 <jlund> arr: ya, just trying to prove that with data right now. it's hard x-referencing. mshal was looking to just test manually
11:26:46 <mshal> arr: the two beta builds I pointed to in #mozbuild were on Friday and Saturday
11:27:16 <mshal> not a guarantee, but I definitely think it's worth ruling it out first
11:27:57 <arr> mshal: yeah, if we saw issues before that
11:28:33 <jlund> mshal: actually that's a good point!
11:28:49 <jlund> http://buildbot-master82.srv.releng.scl3.mozilla.com:8001/builders/WINNT%205.2%20mozilla-beta%20leak%20test%20build/builds/181 is done with b-2008-ix-0002
11:29:02 <jlund> that machine was only updated yesterday https://bugzilla.mozilla.org/show_bug.cgi?id=1019165#c69
11:29:06 <jlund> that log is from sat
11:29:23 <mshal> ahh, cool

11:37:11 <RyanVM|sheriffduty> and https://bugzilla.mozilla.org/show_bug.cgi?id=1047621#c44 covers the 19800sec timeouts showing up
11:37:12 <jlund> RyanVM|sheriffduty: have you only noticed on release branches?
11:37:17 <RyanVM|sheriffduty> yeah
11:37:22 <RyanVM|sheriffduty> (unified builds on trunk ;))
11:37:44 <RyanVM|sheriffduty> would be fun to see what a non-unified trunk PGO build looked like
11:37:55 <RyanVM|sheriffduty> if we had the capability to do so (short of Try mozconfig hackery)
11:38:04 <RyanVM|sheriffduty> i'm sure we'd see the same problem, though

17:13:02 <RyanVM|afk> mshal: jlund: not sure if it's really a useful data point, but the b2g release branches show similar slowdowns as well recently
17:13:05 <mshal> ix only has 8GB - maybe swapping more?
17:13:17 <RyanVM|afk> and pgo is disabled for windows opt builds there
17:13:28 <mshal> oic
17:13:50 <RyanVM|afk> but I guess it furthers the point that this is not a code-related issue
17:13:58 <RyanVM|afk> seeing as how even b2g28 is affected :)
17:14:07 <RyanVM|afk> (and I'm sure esr24 would be as well if I were to look)
17:14:22 <RyanVM|afk> hah, yeah
17:14:29 <RyanVM|afk> most recent esr24 opt build took 447min
17:14:38 <RyanVM|afk> nightly 360
17:14:55 <RyanVM|afk> the 447min build was a b-2008-ix machine
17:15:28 <RyanVM|afk> oh interesting
17:15:30 <RyanVM|afk> one nice thing about esr24
17:15:38 <RyanVM|afk> it runs nightlies every day even when there have been no changes
17:15:45 <RyanVM|afk> https://tbpl.mozilla.org/?tree=Mozilla-Esr24&rev=35610d8e9a4e
17:16:21 <RyanVM|afk> so thursday took 289min, friday 291, saturday 357
17:16:37 <RyanVM|afk> unfortunately, the first 2 were ix and the latter was sm
17:17:54 <RyanVM|afk> looks like sm slaves taking ~360min on esr24 has been the norm for awhile :(
17:18:32 <RyanVM|afk> but an ix dep build took 290min on saturday still
17:18:41 <RyanVM|afk> 223 on sunday
17:18:51 <RyanVM|afk> then 447 monday
17:19:00 <RyanVM|afk> all ix slaves in those cases
See Also: → bug 1019165, bug 1047621
Does anyone have a good way to get per-step buildbot data for the last month or so? I'd like to get data for, say, Windows XP Opt: times for the compile, buildsymbols, pkg_tests, check, etc. steps individually. I can get a couple of data points from recent builds by scraping the builder screen, but after a few days it no longer seems to be available.

Looking at the overall turnaround time is a bit problematic, since things like 'hg update' and 'compile' can vary wildly depending on when the machine last built a particular tree. Additionally, the random steps of checking_clobber_times and clean_old_builds can make some builds appear to be way slower than others, but they just happened to be the one on cleanup duty. So it's hard to sum all those random numbers up and make conclusions based on that. I think by pulling out data on some of the more consistent steps (like buildsymbols, check, etc) we can see if any patterns emerge.

Based on what I can see so far, no single step is responsible for the slowdown. It seems that the ix disk is just slow, and/or the fact that it only has 8GB of RAM compared to sm's 16 may be causing it to swap more. I don't know if anything has changed recently there to cause ix to behave worse (like some background process using the disk, or whatever).
Created attachment 8475787 [details]
step data, csv.gz

time mysql -e "SELECT builds.id, builders.name AS builder_name, slaves.name AS slave_name,
       steps.name AS step_name, steps.starttime AS step_starttime_UTC,
       unix_timestamp(steps.endtime) - unix_timestamp(steps.starttime) AS step_elapsed_seconds
  FROM builds
  JOIN builders ON builds.builder_id=builders.id
  JOIN slaves ON builds.slave_id=slaves.id
  JOIN steps ON builds.id=steps.build_id
 WHERE builders.name LIKE '%-win32'
   AND builders.name!='try-win32'
   AND builds.starttime > '2014-07-01'
   AND steps.name IN ('compile', 'make_buildsymbols', 'make_pkg_tests', 'check')
   AND steps.status=0
 ORDER BY builds.id ASC;" buildbot | sed 's/\t/,/g' > output.csv

real	5m24.180s
user	0m0.043s
sys	0m0.022s

The names are a little different:
 TBPL:     Windows XP Opt: B
 Buildbot: WINNT 5.2 mozilla-inbound build
 This db:  mozilla-inbound-win32
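Once exported in that shape, the CSV is easy to slice; for example, averaging elapsed seconds per step name (column names follow the SELECT above — an illustrative sketch, not part of the releng tooling):

```python
import csv
from collections import defaultdict
from statistics import mean

def step_averages(path):
    """Average step_elapsed_seconds per step_name from the exported CSV."""
    durations = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            durations[row["step_name"]].append(int(row["step_elapsed_seconds"]))
    return {step: mean(vals) for step, vals in durations.items()}
```

The same grouping trick works for slicing by builder_name or slave_name instead of step_name.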
Thanks Nick! That helps. I put together some graphs of the more "consistent" operations for mozilla-beta-win32 opt builds. The ix machines are on the left, and sm machines on the right - each group is sorted by build date:

http://people.mozilla.org/~mshal/beta-buildsymbols.png
http://people.mozilla.org/~mshal/beta-pkg-tests.png
http://people.mozilla.org/~mshal/beta-check.png

For buildsymbols and pkg-tests, you can see that the runtimes vary quite a lot (sometimes by as much as 2x), but I don't see a particular slowdown recently.

The 'make check' graph is a little more consistent in that it is either fairly quick or really slow. There are more instances of it being slow on ix in recent days, but as you can see, there are similar slow instances as far back as the data goes.

So, I don't think there has been a recent change to ix that affects these run times.

One graph that does look suspect is hg-update:

http://people.mozilla.org/~mshal/beta-hg-update.png

A lot of the "worst" times are on ix and in the last few days, several of which are over an hour. Could that account for much of the overall slowdown we're seeing? I think if you get one of these slow updates, plus the build gets picked for cleanup duty, plus a randomly slow check, you can end up with a terrible turnaround time.
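That pattern can be checked directly against the step data — for example, counting how often hg_update exceeds an hour on each slave class. This assumes an export that also includes the hg_update step (the query in the earlier attachment filtered to four other steps); a hypothetical sketch, not an existing tool:

```python
import csv
from collections import defaultdict

def slow_update_counts(path, limit=3600):
    """Count hg_update steps slower than `limit` seconds, per slave class.

    The slave class is the slave name minus its trailing number, so
    "b-2008-ix-0002" and "b-2008-ix-0005" both count as "b-2008-ix".
    Returns {class: (slow_count, total_count)}.
    """
    counts = defaultdict(lambda: [0, 0])  # class -> [slow, total]
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["step_name"] != "hg_update":
                continue
            cls = row["slave_name"].rsplit("-", 1)[0]
            counts[cls][1] += 1
            if int(row["step_elapsed_seconds"]) > limit:
                counts[cls][0] += 1
    return {cls: tuple(c) for cls, c in counts.items()}
```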
Depends on: 1057022
(In reply to Michael Shal [:mshal] from comment #4)
> The 'make check' graph is a little more consistent in that it is either
> fairly quick or really slow. There are more instances of it being slow on ix
> in recent days, but as you can see, there are similar slow instances as far
> back as the data goes.
> 
> So, I don't think there has been a recent change to ix that affects these
> run times.

I take that back. Looking at more of the data (not just beta), I think there is some indication that a few of these steps (like check, buildsymbols, package-tests - not just hg-update) show signs of slowing down after around 8/16 or 8/17. jlund & Q helped me roll back a build machine to the pre-VS2013 install - one thing I noticed is that the updated machines have a sqlservr.exe process that spikes the CPU (all cores!) about every minute. The previous machine also has an sqlservr.exe process running, but it doesn't ever appear to be active. I'm not sure why it is only active on the updated machines. Both are running what appears to be the same binary.

I checked the Services in msconfig and noticed a few differences, but I'm not sure if they're relevant:

1) The new configuration with VS2013 has a few services that are now stopped:
 - Microsoft .NET Framework NGEN
 - Windows Installer
 - Microsoft Software Shadow Copy Provider
 - WinHTTP Web Proxy Auto-Discovery Service

2) The new configuration with VS2013 has a few services that are now running:
 - Protected Storage
 - Windows Modules Installer

3) The new configuration with VS2013 has a few new services, but all appear to be stopped:
 - Visual Studio ETW Event Collection Service (Stopped)
 - Claims to Windows Token Service (Stopped)
 - Release Management Monitor (Stopped)
 - Te.Service (Stopped)
 - Windows App Certification Kit Fast User Switching Utility Service (Stopped)

I'm not sure if these are relevant to the sqlservr.exe thing or not. Both machines have a "SQL Server (SQLEXPRESS)" and a "SQL Server VSS Writer" running.

If we aren't using SQL Server for anything, can we try uninstalling that component from Visual Studio and see if it makes that periodic CPU bump go away? I'm not entirely sure at the moment, but I'm suspicious that it's what's causing the slowdowns across all build steps.
Flags: needinfo?(q)
Created attachment 8477104 [details]
sqlservr.png

Screenshot of a new machine when idle - the bump goes by every minute.
Created attachment 8477105 [details]
prevs2013.png

Screenshot of a machine with a pre-VS2013 image. It still has sqlservr.exe, but it doesn't chew up the CPU.
Ugh, Windows. Now the pre-VS2013 machine is also spiking sqlservr.exe once a minute. It definitely wasn't doing that yesterday (it's easy to tell, since the Resource Monitor graph is 60s wide and the spike happens every minute). It would still probably be nice to get rid of it, but now I don't think that's causing this bug.

Suggestions are welcome.
Flags: needinfo?(q)
Created attachment 8477574 [details]
check.png
Created attachment 8477575 [details]
compile.png
Created attachment 8477576 [details]
hg_update.png
Created attachment 8477577 [details]
make_buildsymbols.png
Created attachment 8477578 [details]
make_pkg_tests.png
I attached some graphs of the build data for iX machines only (Windows opt builds across all trees, except b2g-inbound). These show why I disagree with my position in #c4. In each graph you can see some difference between builds from (approximately) pre-8/16 and post-8/16. The hg_update, make_buildsymbols, and make_pkg_tests graphs all show a pretty clear difference - the others are a bit harder to see, but the trend is still there. I tried to check whether the long poles were specific to any particular iX machines, but there were no clear outliers.

One thing I want to emphasize is how completely random an individual build time can be, so it's really hard to just look at two individual builds and say that one is "fast" and the other is "slow". In other words, I can't just run a single build (or even a few builds) against the machine we imaged with pre-VS2013 and say that it definitely works or definitely doesn't.

Here are some basic statistics in case the graph isn't convincing (all times in seconds):

pre-8/16:
hg_update: min=8 max=5461 avg=290.405099 stdev=435.447198
compile: min=331 max=15860 avg=2459.543478 stdev=3021.769580
make_buildsymbols: min=221 max=681 avg=329.094631 stdev=81.903703
make_pkg_tests: min=93 max=1872 avg=452.650554 stdev=176.811994
check: min=153 max=1826 avg=487.930565 stdev=149.126536

post-8/16:
hg_update: min=40 max=5950 avg=890.400844 stdev=1281.643107
compile: min=358 max=18720 avg=3961.521277 stdev=3749.037949
make_buildsymbols: min=241 max=1168 avg=481.768421 stdev=204.006352
make_pkg_tests: min=119 max=1612 avg=585.578947 stdev=292.892764
check: min=164 max=1901 avg=748.458101 stdev=317.222819

In all cases the average runtime is up (which is bad, of course), and with a higher std deviation.
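Those summary numbers can be recomputed from the raw step data with the standard library, splitting each step's samples at the cutoff date (the exact cutoff and the ISO date format are assumptions for illustration):

```python
from statistics import mean, stdev

def split_stats(samples, cutoff="2014-08-16"):
    """Summarize (iso_date, seconds) samples before and after a cutoff date.

    ISO-format date strings sort correctly as plain strings, so no
    date parsing is needed for the split.
    """
    pre = [s for d, s in samples if d < cutoff]
    post = [s for d, s in samples if d >= cutoff]
    def summarize(xs):
        return {"min": min(xs), "max": max(xs),
                "avg": mean(xs), "stdev": stdev(xs)}
    return summarize(pre), summarize(post)
```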

I thought the sqlservr.exe thing was promising, but now I have no idea why it runs sometimes and not others. If anyone has thoughts on things to check, please let me know.

My current recommendation is to back out the VS2013 update on a large enough set of machines (say, 1/4 or 1/2 of the iX pool) so that we can get enough samples of build data to say for sure whether something in the VS2013 update is to blame. (Obviously I want to move forward with VS2013 as fast as possible to get the wins there, but if it came with some other crap that's slowing us down, we need to pinpoint it.)

Thoughts?
FYI, I won't have time to continue looking at this. I'd still recommend trying to back out the VS2013 update on a group of machines to see if that group compares favorably to the VS2013 group, so we at least know whether something in those updates is causing the problem.
Created attachment 8478755 [details]
Screenshot of installed apps on b-2008-ix-0007

I was poking around on b-2008-ix-0007 and noticed that VS2013 (and a few other things) were marked as installed on Aug 25, while bug 1019165 comment #69 has it being reimaged on Aug 18. Does that line up with the GPO logs? I'm wondering if installs are taking a lot longer than we thought.
Flags: needinfo?(arich)

Attachment #8478755 - Attachment description: List of installed apps on b-2008-ix-0007 → Screenshot of installed apps on b-2008-ix-0007
Flags: needinfo?(q)
Also, on b-2008-ix-0133 at ~3am Pacific I saw these processes running:
   NDP451-KB2858728-x86-x64-AllOS-ENU.exe
   VS2013.2CTP.exe

A quick google suggests they are installers. I'm confused by the Community Technology Preview; I thought we were on final versions of the updates.
Flags: needinfo?(arich)

Comment 18

After much digging, I can't seem to find any evidence of the GPO reapplying. I refined the WMI query to be more specific, and I added logging to see if I can catch a reinstall on any machine.
Flags: needinfo?(q)
(In reply to Q from comment #18)
> After much I can't seem to find any evidence of the gpo reapplying. I
> refined the wmi query to be more specific and I added logging to see if I
> can catch a reinstall on any machine.

Hmm, so something is eating away at our I/O on Windows machines. It does not appear to happen every time. Looking at the graphs from comments 9-13, it is clear that most jobs take a similar amount of time pre-VS2013, but the frequency of really long jobs has gone up post-VS2013.

best logical guesses are either:
1) vs2013 comes with something that is occasionally eating away at resources during a job run (not just compile step)

2) gpo is doing something that is occasionally eating resources

Q, if your refined query yields no findings, that might rule out 2.

But going with mshal's suggestion of rolling back a subset of machines might not be the best option. We are in the process of switching over to vs2013 and if we can do that soonish for trunk + try, that could be a bigger win overall. rolling back a subset will make it hard to test vs2013.

sheriffs: now that we have disabled the seamicros, are we still hitting timeouts/failures?
ben: how close do you think we are to being able to use VS2013?
Flags: needinfo?(sheriffs)
Flags: needinfo?(benjamin)
When we were discussing this in IRC previously, it was noted that someone found a build that was timing out before the machine it ran on was even updated: http://buildbot-master82.srv.releng.scl3.mozilla.com:8001/builders/WINNT%205.2%20mozilla-beta%20leak%20test%20build/builds/181

Are we sure this is tied to vs2013 at all?
I don't believe the sheriffs@mozilla.bugs alias works for needinfo requests.
Flags: needinfo?(sheriffs)
Flags: needinfo?(ryanvm)
Flags: needinfo?(emorley)
(In reply to Amy Rich [:arich] [:arr] from comment #20)
> When we were discussion this in irc perviously, it was stated that someone
> found issues with a build that was timing out before the machine it ran on
> was even updated:
> http://buildbot-master82.srv.releng.scl3.mozilla.com:8001/builders/WINNT%205.
> 2%20mozilla-beta%20leak%20test%20build/builds/181
> 
> Are we sure this is tied to vs2013 at all?

You're right, there are cases of really long build times prior to VS2013 being installed. But since we have had more timeouts and higher build times post-VS2013, and no other changes line up with that timeframe, it seemed like the likely culprit. It is hard to tell, as it happens seldom.

In any case, unless 1) we are failing intermittently often, or 2) actually being able to use VS2013 is far off, I suggest we not worry about further investigation (for now), smoking gun or not. Instead, let's put effort into getting VS2013 out the door.
There have been 19800s timeouts as recently as today.

https://tbpl.mozilla.org/php/getParsedLog.php?id=46861136&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=46858389&tree=Mozilla-Esr31

Not to mention the spike in gtest 600s timeouts that we've seen at much higher frequency across all trees since this started.

And what Wes said, you can't needinfo? a watcher account.
Flags: needinfo?(ryanvm)
And another: https://tbpl.mozilla.org/php/getParsedLog.php?id=46911668&tree=Try
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #23)
> And what Wes said, you can't needinfo? a watcher account.

(Bug 179701 filed for this)
Flags: needinfo?(emorley)

Comment 26

I don't have much input to give here. It sounds like we should look at one of the stuck machines in process explorer and see what's eating the CPU. Is it possible to send somebody an alert when it gets into this state before it gets killed, so they can remote-desktop in and check it by hand?
Flags: needinfo?(benjamin)
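The watchdog Benjamin describes mostly needs spike detection on top of CPU sampling. The detection logic is pure arithmetic over a once-per-second sample series (how the samples are gathered on the slave — e.g. via psutil or WMI — and any alerting hook are assumptions; nothing like this existed in the releng tooling at the time):

```python
def spike_intervals(samples, threshold=90.0):
    """Seconds between upward CPU-threshold crossings in a 1 Hz sample list."""
    crossings = [i for i, (prev, cur) in enumerate(zip(samples, samples[1:]), 1)
                 if prev < threshold <= cur]
    return [b - a for a, b in zip(crossings, crossings[1:])]

def looks_like_minute_spike(samples, threshold=90.0, tolerance=5):
    """True if CPU spikes recur roughly once a minute, like the observed
    sqlservr.exe pattern (the Resource Monitor window is 60s wide)."""
    gaps = spike_intervals(samples, threshold)
    return bool(gaps) and all(abs(g - 60) <= tolerance for g in gaps)
```

A monitoring loop on the slave could feed this a rolling window of per-process CPU samples and fire an alert before buildbot's timeout kills the job.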
https://tbpl.mozilla.org/php/getParsedLog.php?id=47078483&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=47083920&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=47078341&tree=Mozilla-Aurora
Blocks: 1062877
Nothing actionable by buildduty here, moving to Platform Support.

Hopefully work on stuff from 1062877 helps here
Component: Buildduty → Platform Support
QA Contact: bugspam.Callek → coop
Jordan, per :arr we're wontfixing all SeaMicro-specific (in releng) bugs; it's not clear to me from a 10-second skim whether this covers both SeaMicro and iX systems, though.
Flags: needinfo?(jlund)
This was on all Windows build slaves regardless of type.
The graphs from #c9 through #c14 were all iX-only data, and I think they highlight the issue. So, this isn't a SeaMicro-only bug.
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #32)
> This was on all Windows build slaves regardless of type.

Speaking of which, are you still seeing timeouts? I think if the number of intermittents is decreasing, it is very likely that this bug can be summed up as: https://bugzilla.mozilla.org/show_bug.cgi?id=1062877#c1

We have VS2013 on 5 machines that are imaged differently, and they are running pretty well. I wonder if we can close this bug for now.
Flags: needinfo?(jlund) → needinfo?(ryanvm)
No issues for some time now.
Flags: needinfo?(ryanvm)
Going to close this for now based on comments 34-35.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED