Closed Bug 1004575 Opened 10 years ago Closed 10 years ago

ondemand_update test runs stalling out during recording results.

Categories

(Mozilla QA Graveyard :: Infrastructure, defect)

x86
Windows 8
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: tracy, Assigned: whimboo)

Details

Attachments

(1 file)

During today's ondemand_update test runs for the beta channel, the runs began to stall while writing results. It often took ~30 minutes for a run to complete this step. Nils' initial suspicion is the SurefireArchiver plugin. The end result was an entire run that took nearly 2 hours to complete.

sample console output:
http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_update/66178/console
Summary: ondemand_ppdate test runs stalling out during recording results. → ondemand_update test runs stalling out during recording results.
The last message before the big pause:
"Recording test results"
comes from the SurefireArchiver plugin. I checked a few of the machines which were waiting in this state, but did not see any CPU or I/O load. It looked like the machines were waiting for some network resource or similar.
(In reply to Nils Ohlmeier [:drno] from comment #1)
> "Recording test results"
> comes from the SurefireArchiver plugin. I checked a few of the machines

Not sure where you got this name from, but we are not running such a plugin. :)

In such a case it's best to open the build time trend for this job, which is available here:
http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_update/buildTimeTrend

As you can see, job 66141 is the first one with this massive delay, and it also failed:
http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_update/66141/console

When you check the elapsed time on the left side, you will notice that it took a very long time to install Firefox:

> 00:10:31.786 *** Installing build: c:\jenkins\workspace\ondemand_update\builds\firefox-29.0b8.it.win32.exe
> 00:25:37.846 *** Creating backup of binary: c:\jenkins\workspace\ondemand_update\data\binary_backup

We spent exactly 15 minutes installing Firefox, after already spending 10 minutes trying to download it. So something may have been wrong on mm-win-8-32-3, or there was an additional network glitch.

It would have been best to cancel such a job on that machine and take the machine offline for further investigation.
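If it helps, a stalled node can also be taken offline remotely via the Jenkins CLI. This is only a sketch; the jar location and offline message are placeholders, and the node name is just the machine from this run:

> java -jar jenkins-cli.jar -s http://mm-ci-master.qa.scl3.mozilla.com:8080/ offline-node mm-win-8-32-3 -m "stalled ondemand_update run"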

As said on IRC yesterday, collecting the results depends on the previous job having finished. So if one job takes a very long time, it will delay all subsequent ones.
Component: Mozmill Automation → Infrastructure
Keywords: regression
OS: All → Windows 8
QA Contact: hskupin
Actually the download didn't take that long:

> 00:01:15.832 [EnvInject] - Variables injected successfully.
> 00:04:42.574 Copied 1,954 artifacts from "get_mozmill-environments" build number

It took a rather long time to get the environments copied over, but with a large number of our machines busy, that could be caused by overloading the master with requests. Also, Jenkins still compresses the files each time; we have to wait for a fix.

> 00:04:42.724 [ondemand_update] $ cmd.exe /C '"mozmill-env-%ENV_PLATFORM%\run mozdownload --type=%BUILD_TYPE% --platform=%PLATFORM% --version=%VERSION% --locale=%LOCALE% --build-number=%BUILD_NUMBER% --retry-attempts=10 --retry-delay=30 --directory=builds && exit %%ERRORLEVEL%%"'
> 00:07:36.316   INFO | Downloading from: https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/29.0b8/win32/it/Firefox Setup 29.0b8.exe

Hm, not sure why it took 3 minutes on the machine to get the download started. I assume something was broken here. Andreea, could you please check the event history to see if something is visible on that host which might help us to finally kill those delays? Thanks.
Flags: needinfo?(andreea.matei)
Actually the first broken build on that machine is:
http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_update/66054/console

Interestingly it didn't run any tests, so the pass/fail count is 0. Maybe it had a hung process?
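If a hung process is suspected again, a quick check directly on the node would be something like the following (standard Windows commands; firefox.exe is just the assumed culprit here):

> tasklist /FI "IMAGENAME eq firefox.exe"
> taskkill /F /T /IM firefox.exe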
I wonder if there wasn't an issue like the notifications piling up. Have Tracy or Nils checked that machine? Not sure if they get closed after a while or somehow, but I know they interfere with the testruns. I'll check further what other testruns we had on that machine previously, and the event history.
Flags: needinfo?(andreea.matei)
Attached image Screenshot
I found an event that had been running for 57 minutes without completing at the time that first testrun was in progress: svchost, warning 1336, event ID 910.
It seems this one takes longer from time to time without completing (on April 9th and 10th; on March 10th it took 60 seconds). The warning says this can result in severe performance degradation.
I'll have to learn more about it and see if we can find a solution.
It seems a lot of people encounter this on Win 8: http://bit.ly/1kB6Zcz
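For reference, these entries can also be queried from the command line with wevtutil. This is only a sketch and assumes the warning ends up in the Application log:

> wevtutil qe Application /q:"*[System[(EventID=910)]]" /f:text /rd:true /c:5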
Given that on the referenced page people were talking about possible issues with Windows Update, I followed the steps and did a repair of the Windows Update components. The following was visible afterward:

> Windows Update components must be repaired Fixed Fixed 
> One or more Windows Update components are configured incorrectly. 

Sadly no further details are shown, so there is nothing more we can actually do here.
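In case it has to be done by hand next time, the usual manual reset of the Windows Update components looks roughly like this (a sketch only, using the default Windows paths; it may differ from what the referenced page suggests):

> net stop wuauserv
> net stop cryptSvc
> net stop bits
> ren C:\Windows\SoftwareDistribution SoftwareDistribution.old
> ren C:\Windows\System32\catroot2 catroot2.old
> net start bits
> net start cryptSvc
> net start wuauserv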

Tracy and Nils, next time this happens we should take the affected node offline right away. Now you know how to find the correct one.

I would call this bug fixed now, given that the tests I see running are performing well.
Assignee: nobody → hskupin
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
This could probably be morphed into "make individual test runs independent of each other." It seems quite fragile that subsequent test runs depend on previous runs completing (successfully or not). Having to manually monitor the system to ensure no machines are hanging in individual test runs (and kill them if they are hung) so that a full test run can complete in a reasonable amount of time doesn't seem like a proper fix for an automated system.
This behavior is necessary so the system can identify possible regressions. If we remove that feature, we will no longer see when a regression actually started. For ondemand update tests it mostly doesn't make sense, so we could indeed remove it there. Feel free to file an issue for mozmill-ci here: https://github.com/mozilla/mozmill-ci/issues
Product: Mozilla QA → Mozilla QA Graveyard