Closed
Bug 1004575
Opened 10 years ago
Closed 10 years ago
ondemand_update test runs stalling out during recording results.
Categories
(Mozilla QA Graveyard :: Infrastructure, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: tracy, Assigned: whimboo)
Details
Attachments
(1 file)
148.80 KB, image/png
During today's ondemand_update test runs for the beta channel, the runs began to stall while writing results; this step often took ~30 minutes to complete. Nils' initial suspicion is the SurefireArchiver plugin. The end result was an entire run that took nearly 2 hours to complete. Sample console output: http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_update/66178/console
Reporter
Updated•10 years ago
Summary: ondemand_ppdate test runs stalling out during recording results. → ondemand_update test runs stalling out during recording results.
Comment 1•10 years ago
The last message before the big pause, "Recording test results", comes from the SurefireArchiver plugin. I checked a few of the machines which were waiting in this state, but did not see any CPU or other IO load. It looked like the machines were waiting for some network resource.
Assignee
Comment 2•10 years ago
(In reply to Nils Ohlmeier [:drno] from comment #1)
> "Recording test results" comes from the SurefireArchiver plugin. I checked a few of the machines

Not sure where you got this name from, but we are not running such a plugin. :) In such a case it's best to open up the trend for this testrun, which is available here: http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_update/buildTimeTrend

As you can see, job 66141 is the first one with this massive delay, and it also failed: http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_update/66141/console

When you check the elapsed time on the left side, you will notice that it took a very long time to install Firefox:

> 00:10:31.786 *** Installing build: c:\jenkins\workspace\ondemand_update\builds\firefox-29.0b8.it.win32.exe
> 00:25:37.846 *** Creating backup of binary: c:\jenkins\workspace\ondemand_update\data\binary_backup

We spent exactly 15 minutes installing Firefox, after already spending 10 minutes trying to download it. So something may have been wrong on mm-win-8-32-3, or there was additionally a network glitch. It would have been best to cancel such a job on that machine and take the machine offline for further investigation.

As said on IRC yesterday, collecting the results depends on the former job being finished. So if one job takes a very long time, it will delay all successive ones.
Component: Mozmill Automation → Infrastructure
Keywords: regression
OS: All → Windows 8
QA Contact: hskupin
Assignee
Comment 3•10 years ago
Actually the download didn't take that long:

> 00:01:15.832 [EnvInject] - Variables injected successfully.
> 00:04:42.574 Copied 1,954 artifacts from "get_mozmill-environments" build number

A rather long time to get the environments copied over, but given the huge load on our machines it could be caused by overloading the master with requests. Also, Jenkins still compresses the files each time; we have to wait for a fix.

> 00:04:42.724 [ondemand_update] $ cmd.exe /C '"mozmill-env-%ENV_PLATFORM%\run mozdownload --type=%BUILD_TYPE% --platform=%PLATFORM% --version=%VERSION% --locale=%LOCALE% --build-number=%BUILD_NUMBER% --retry-attempts=10 --retry-delay=30 --directory=builds && exit %%ERRORLEVEL%%"'
> 00:07:36.316 INFO | Downloading from: https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/29.0b8/win32/it/Firefox Setup 29.0b8.exe

Hm, not sure why it took 3 minutes on the machine to get the download started. I assume something was broken here. Andreea, could you please check the event history to see if something is visible on that host which might help us finally kill those delays? Thanks.
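These delays were found by eyeballing the elapsed timestamps in the console log. A minimal sketch (not part of the original bug) of automating that check, assuming Timestamper-style `HH:MM:SS.mmm` prefixes like the ones quoted above:

```python
# Sketch: find the largest gaps between consecutive elapsed timestamps
# in a Jenkins console log. Assumes each interesting line starts with a
# Timestamper-style "HH:MM:SS.mmm" prefix, as in the quotes above.
import re
from datetime import timedelta

STAMP = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d{3})\s+(.*)")

def largest_gaps(lines):
    """Return (seconds, line_text) pairs, sorted from largest gap down."""
    prev, gaps = None, []
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue  # skip lines without a timestamp prefix
        h, mi, s, ms, text = m.groups()
        t = timedelta(hours=int(h), minutes=int(mi),
                      seconds=int(s), milliseconds=int(ms))
        if prev is not None:
            gaps.append(((t - prev).total_seconds(), text))
        prev = t
    return sorted(gaps, reverse=True)

# The two install-step lines quoted in comment 2 (paths shortened):
log = [
    "00:10:31.786 *** Installing build: firefox-29.0b8.it.win32.exe",
    "00:25:37.846 *** Creating backup of binary: binary_backup",
]
print(int(largest_gaps(log)[0][0]))  # → 906 seconds, the ~15 minute install step
```

Running this over a full console log would have pointed straight at the install step without manual scanning.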
Flags: needinfo?(andreea.matei)
Assignee
Comment 4•10 years ago
Actually the first broken build on that machine is: http://mm-ci-master.qa.scl3.mozilla.com:8080/job/ondemand_update/66054/console Interestingly, it didn't run any tests, so the pass/fail count is 0. Maybe it had a hung process?
Comment 5•10 years ago
I wonder if there wasn't an issue like the notifications piling up. Have Tracy or Nils checked that machine? Not sure if they get closed after a while or somehow, but I know they interfere with the testruns. I'll check further what other testruns we had on that machine previously, and the event history.
Flags: needinfo?(andreea.matei)
Comment 6•10 years ago
I found an event that ran for 57 minutes without completing while that first testrun was in progress: svchost, warning 1336, event ID 910. From time to time this one seems to take longer without completing (April 9th and 10th); on March 10th it took 60 seconds. The warning says this can result in severe performance degradation. I'll have to learn more about it and see if we can find a solution for it.
Comment 7•10 years ago
It seems a lot of people encounter this on Win 8: http://bit.ly/1kB6Zcz
Assignee
Comment 8•10 years ago
Given that people on the referenced page were talking about possible issues with Windows Update, I followed the steps and did a repair of the Windows Update components. The following was visible afterward:
> Windows Update components must be repaired: Fixed
> One or more Windows Update components are configured incorrectly.
Sadly no further details are shown, so there is nothing more we can actually do here.
Tracy and Nils, next time it happens we should directly take the affected node offline. Now you know how to find the correct one.
I would call this bug fixed now, given that the tests I see running are performing well.
Assignee: nobody → hskupin
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Reporter
Comment 9•10 years ago
This could probably be morphed into "make individual test runs independent of each other." It seems quite fragile that subsequent test runs depend on previous runs completing (successfully or not). Having to manually monitor the system to ensure no machines are hanging in individual test runs (and kill them if they are hung), in order for a full test run to complete in a reasonable amount of time, doesn't seem like a proper fix for an automated system.
Assignee
Comment 10•10 years ago
This behavior is necessary so the system can identify possible regressions. If we remove that feature, we will no longer see when a problem actually started. For ondemand update tests it mostly doesn't make sense, so we could indeed remove it there. Feel free to file an issue for mozmill-ci here: https://github.com/mozilla/mozmill-ci/issues
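The chaining could in principle be replaced by each run inspecting the previous build's outcome on its own. A hypothetical sketch: the payload shape below mirrors what Jenkins serves at `<job-url>/lastCompletedBuild/api/json`, but the field values and the threshold are made up for illustration:

```python
import json

# Hypothetical sketch: instead of blocking on the previous job, a run could
# fetch the last completed build's metadata and flag a likely regression.
# Sample data below is invented; only the field names follow Jenkins' API.
sample = json.loads('{"number": 66141, "result": "FAILURE", "duration": 6900000}')

def looks_like_regression(build, max_duration_ms=30 * 60 * 1000):
    """Flag builds that failed or ran far longer than an assumed 30 min budget."""
    return build["result"] != "SUCCESS" or build["duration"] > max_duration_ms

print(looks_like_regression(sample))  # → True (failed and ~115 min long)
```

This would keep the regression signal without serializing the runs, at the cost of each job doing one extra API call.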
Updated•6 years ago
Product: Mozilla QA → Mozilla QA Graveyard