Closed Bug 1004575 Opened 10 years ago Closed 10 years ago

ondemand_update test runs stalling out during recording results.

Categories

(Mozilla QA Graveyard :: Infrastructure, defect)

x86
Windows 8
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: tracy, Assigned: whimboo)

Details

Attachments

(1 file)

During today's ondemand_update test runs for the beta channel, the runs began to stall while writing results. It often took ~30 minutes for a run to complete this step. Nils' initial suspicion is the SurefireArchiver plugin. The end result was an entire run that took nearly 2 hours to complete.

sample console output:
http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_update/66178/console
Summary: ondemand_ppdate test runs stalling out during recording results. → ondemand_update test runs stalling out during recording results.
The last message before the big pause:
"Recording test results"
comes from the SurefireArchiver plugin. I checked a few of the machines which were waiting in this state, but did not see any CPU or I/O load. It looked like the machines were waiting for some network resource or similar.
(In reply to Nils Ohlmeier [:drno] from comment #1)
> "Recording test results"
> comes from the SurefireArchiver plugin. I checked a few of the machines

Not sure where you got this name from, but we are not running such a plugin. :)

In such a case it's best to open the build time trend for this job, which is available here:
http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_update/buildTimeTrend

As you can see, job 66141 is the first one with this massive delay, and it also failed:
http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_update/66141/console

When you check the elapsed time on the left side, you will notice that it took a very long time to install Firefox:

> 00:10:31.786 *** Installing build: c:\jenkins\workspace\ondemand_update\builds\firefox-29.0b8.it.win32.exe
> 00:25:37.846 *** Creating backup of binary: c:\jenkins\workspace\ondemand_update\data\binary_backup

We spent exactly 15 minutes installing Firefox, after already spending 10 minutes trying to download it. So something may have been wrong on mm-win-8-32-3, or there was an additional network glitch.

It would have been best to cancel such a job on that machine and take the machine offline for further investigation.
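If it helps, a stalled node can also be taken offline remotely via the Jenkins CLI. This is only a sketch; the jar location and offline message are placeholders, and the node name is just the machine from this run:

> java -jar jenkins-cli.jar -s http://mm-ci-master.qa.scl3.mozilla.com:8080/ offline-node mm-win-8-32-3 -m "stalled ondemand_update run"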

As said on IRC yesterday, collecting the results depends on the previous job having finished. So if one job takes a very long time, it will delay all subsequent ones.
Component: Mozmill Automation → Infrastructure
Keywords: regression
OS: All → Windows 8
QA Contact: hskupin
Actually the download didn't take that long:

> 00:01:15.832 [EnvInject] - Variables injected successfully.
> 00:04:42.574 Copied 1,954 artifacts from "get_mozmill-environments" build number

It took a rather long time to get the environments copied over, but with a large number of our machines busy, that could be caused by overloading the master with requests. Also, Jenkins still compresses the files each time; we have to wait for a fix.

> 00:04:42.724 [ondemand_update] $ cmd.exe /C '"mozmill-env-%ENV_PLATFORM%\run mozdownload --type=%BUILD_TYPE% --platform=%PLATFORM% --version=%VERSION% --locale=%LOCALE% --build-number=%BUILD_NUMBER% --retry-attempts=10 --retry-delay=30 --directory=builds && exit %%ERRORLEVEL%%"'
> 00:07:36.316   INFO | Downloading from: https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/29.0b8/win32/it/Firefox Setup 29.0b8.exe

Hm, not sure why it took 3 minutes on the machine to get the download started. I assume something was broken here. Andreea, could you please check the event history to see if something is visible on that host which might help us to finally kill those delays? Thanks.
Flags: needinfo?(andreea.matei)
Actually the first broken build on that machine is:
http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_update/66054/console

Interestingly it didn't run any tests, so the pass/fail count is 0. Maybe it had a hung process?
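If a hung process is suspected again, a quick check directly on the node would be something like the following (standard Windows commands; firefox.exe is just the assumed culprit here):

> tasklist /FI "IMAGENAME eq firefox.exe"
> taskkill /F /T /IM firefox.exe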
I wonder if there wasn't an issue like the notifications piling up. Have Tracy or Nils checked that machine? Not sure if they get closed after a while or somehow, but I know they interfere with the testruns. I'll check further what other testruns we had on that machine previously, and the event history.
Flags: needinfo?(andreea.matei)
Attached image Screenshot
I found an event that had been running for 57 minutes without completing at the time that first testrun was in progress: svchost, warning 1336, event ID 910.
It seems this one takes longer from time to time without completing (on April 9th and 10th; on March 10th it took 60 seconds). The warning says this can result in severe performance degradation.
I'll have to learn more about it and see if we can find a solution.
It seems a lot of people encounter this on Win 8: http://bit.ly/1kB6Zcz
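For reference, these entries can also be queried from the command line with wevtutil. This is only a sketch and assumes the warning ends up in the Application log:

> wevtutil qe Application /q:"*[System[(EventID=910)]]" /f:text /rd:true /c:5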
Given that on the referenced page people were talking about possible issues with Windows Update, I followed the steps and did a repair of the Windows Update components. The following was visible afterward:

> Windows Update components must be repaired Fixed Fixed 
> One or more Windows Update components are configured incorrectly. 

Sadly no further details are shown, so there is nothing more we can actually do here.
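In case it has to be done by hand next time, the usual manual reset of the Windows Update components looks roughly like this (a sketch only, using the default Windows paths; it may differ from what the referenced page suggests):

> net stop wuauserv
> net stop cryptSvc
> net stop bits
> ren C:\Windows\SoftwareDistribution SoftwareDistribution.old
> ren C:\Windows\System32\catroot2 catroot2.old
> net start bits
> net start cryptSvc
> net start wuauserv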

Tracy and Nils, next time this happens we should take the affected node offline right away. Now you know how to find the correct one.

I would call this bug fixed now, given that the tests I see running are performing well.
Assignee: nobody → hskupin
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
This could probably be morphed into "make individual test runs independent of each other." It seems quite fragile that subsequent test runs depend on previous runs completing (successfully or not). Having to manually monitor the system to ensure no machines are hanging in individual test runs (and kill them if they are hung) so that a full test run can complete in a reasonable amount of time doesn't seem like a proper fix for an automated system.
This behavior is necessary so the system can identify possible regressions. If we remove that feature, we will no longer see when a regression actually started. For ondemand update tests it mostly doesn't make sense, so we could indeed remove it there. Feel free to file an issue for mozmill-ci here: https://github.com/mozilla/mozmill-ci/issues
Product: Mozilla QA → Mozilla QA Graveyard