Closed Bug 859571 Opened 11 years ago Closed 10 years ago

False alarms caused by bimodal data in Talos Ts Paint MAX Dirty on Windows 8, Win7 (WINNT 6.1 ix) - 15 seconds of sleeping during D3D10 layer manager initialization

Categories

(Testing :: Talos, defect)

x86_64
Windows 8
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mbrubeck, Assigned: jmaher)

References

Details

(Whiteboard: [SfN])

Attachments

(3 files, 2 obsolete files)

"Ts Paint, MAX Dirty Profile" on WINNT 6.2 x64 is extremely bimodal, with some results clustered around 560 and other results clustered around 20700 (about 36 times slower!):
http://graphs.mozilla.org/graph.html#tests=[[227,63,31]]&sel=1364860448405,1365465248405&displayrange=7&datatype=running

This causes false positives in regression detection, for example:
https://groups.google.com/d/msg/mozilla.dev.tree-management/C6k9FUkMK5Q/_N2sXsrFNAoJ

Any idea what is causing such a discrepancy from one run to the next?  I don't see any correlation with the machine_id field in the graph server data -- I often see the same machine_id associated with both "fast" and "slow" results.
Summary: Bimonal data in Ts Paint MAX Dirty on Windows 8 → False alarms caused by bimodal data in Talos Ts Paint MAX Dirty on Windows 8
I looked over the last 14 days of history on inbound and didn't find any correlation to specific machines.  In fact the same machine can run a 500 ms startup and a 20000 ms startup.  I do notice that if we are collecting high values the entire data set is high values, the same applies to the lower values.

This affects:
ts_paint
tspaint_generated_med
tspaint_generated_max

I do not see this affecting any other tests on windows 8, nor do I see this affecting other platforms.

Right now this appears to be happening on about 36% of the tests, so this is a serious problem.

Next steps would be to try and reproduce this locally.  Maybe there is something we can observe in the console log, stdout/err to help us figure this out.  I do not have a windows 8 box handy, I could vnc/remotedesktop to one of the ones in the datacenter if that would help.
Depends on: 860022
Depends on: 860322
Assignee: nobody → jmaher
Currently I have a windows 8 slave for the sole purpose of reproducing this problem.  I am focusing on ts_paint as ts_paint exhibits the problem at the same time that tspaint_generated_med|max does.  Knowing that I have downloaded a build that produced bogus numbers on all 3 tests (2 different machines) and was not able to reproduce the problem.

So far I have tried two builds (one that reported 20 seconds and another that reported 500ms) with no luck in reproducing the numbers after many attempts, reboots, etc.  My concern is that we are seeing inflated numbers for a given build which means that we see this across two different machines each time this happens.  Could it be possible the hardware is doing something on two different machines for a given build?

I have seen this 20 second startup time reported for pgo and non pgo builds, likewise I have seen the 500ms startup times reports on both pgo and non pgo builds.  What is a hacker to do, but continue hacking and ask for more ideas.

Would modifying Talos to show the console messages help?  I am not familiar with windows 8 nor am I very familiar with how the tests are run on there.
I can't get back very far on tbpl but it looks like the slaves doing this are random - back to the 10th:

t-w864-ix-092
t-w864-ix-008
t-w864-ix-020
t-w864-ix-051
If this isn't showing up in other paint tests, has to be a startup specific problem. Is 'Ts, Paint' currently a warm start or cold start test?
here is some data from 14 days of inbound:
https://docs.google.com/spreadsheet/ccc?key=0ArS97F99-BEZdFVreS1GNExIWVR5ZEZwSTR3c0JobWc#gid=0

ts, paint is a warm test, we fail on all varieties of this.
Not sure if this info helps in any way (I could get more info from slaves tomorrow):

t-w864-ix-046 took a debug jetpack job and after 28 mins it took a talos other job which reported high.
There's no way for me to tell that there was a reboot in between.

t-w864-ix-063 took a mochitest-2 job and after 20 mins it took a talos other job which reported high.
There's no way for me to tell that there was a reboot in between.
Ok, so sounds like we can rule out individual slave problems.
Is there any chance that these slow startups are timing out (without the browser being launched) and getting reported as a long load value?
good question- the test is supposed to work by printing out the endtime after the mozafterpaint is received.  the start time is determined by the test harness right before we launch the process.  While I don't think it is possible, you never know- I could fiddle with that a bit and see if I can make that scenario happen.  I know on mobile sometimes we never launch the browser with the webpage specified and the test times out...so it is slim, but something I should double check.
(In reply to Joel Maher (:jmaher) from comment #9)
> good question- the test is supposed to work by printing out the endtime
> after the mozafterpaint is received.  the start time is determined by the
> test harness right before we launch the process.  While I don't think it is
> possible, you never know- I could fiddle with that a bit and see if I can
> make that scenario happen.  I know on mobile sometimes we never launch the
> browser with the webpage specified and the test times out...so it is slim,
> but something I should double check.

I would expect timeout as well. Just throwing out some ideas. 

The persistent 20 seconds is interesting. It implies we are blocked on something with a constant timeout. One thought I had was the application of updates, but I'd expect that to be more random if it were the cause.

Are there any logs generated from the tests runs?
is there a pref I could set to toggle the application updates?
app.update.enabled 
set to false to disable
Also maybe there could be an update staged here?
C:\Users\Brian\AppData\Local\Mozilla\Firefox\Nightly
I guess we could turn on / add startup logging and throw a build at try next.
It looks like this problem affects the WINNT 6.1 ix slaves too:
https://groups.google.com/d/topic/mozilla.dev.tree-management/ohBi_eHPAQk/discussion
Summary: False alarms caused by bimodal data in Talos Ts Paint MAX Dirty on Windows 8 → False alarms caused by bimodal data in Talos Ts Paint MAX Dirty on Windows 8, Win7 (WINNT 6.1 ix)
Depends on: win7-ix-releng
FYI, I can loan a Win7 32-bit machine if it helps.

It seems that we do not have this issue for WinXP on iX:
http://graphs.mozilla.org/graph.html#tests=[[227,63,25],[227,63,31],[227,63,37]]&sel=none&displayrange=7&datatype=running

This has been going since the beginning:
http://graphs.mozilla.org/graph.html#tests=[[227,63,25],[227,63,31],[227,63,37],[227,63,1],[227,63,12]]&sel=none&displayrange=90&datatype=running

We did not have this issue with rev3 minis:
http://graphs.mozilla.org/graph.html#tests=[[227,63,25],[227,63,31],[227,63,37],[227,63,1],[227,63,12]]&sel=none&displayrange=7&datatype=running

If we look at this job, it shows both opt and pgo results:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=Windows%207.*talos&rev=c5ac92c2881e

If we look at the log of one of the bad results I noticed this:
> Running test tspaint_places_generated_med: 
> 		Started Thu, 23 May 2013 07:50:23
> INFO : __start_report20858__end_report
> 
> __startTimestamp1369320689315__endTimestamp
> 
> INFO : __startBeforeLaunchTimestamp1369320668457__endBeforeLaunchTimestamp
> __startAfterTerminationTimestamp1369320689781__endAfterTerminationTimestamp

If we subtract 1369320689781-1369320668457 we get 21324 milliseconds.

If we do the same with a good log we get this:
1369342564689-1369342563364=1325 milliseconds.
> __startTimestamp1369342564222__endTimestamp
> __startBeforeLaunchTimestamp1369342563364__endBeforeLaunchTimestamp
> __startAfterTerminationTimestamp1369342564689__endAfterTerminationTimestamp

Can we run this job in debug mode? Can we take a snapshot to see what is going on?
What happens at the beginning? Even the start report line shows something wonky:
INFO : __start_report20858__end_report  <---- BAD
INFO : __start_report858__end_report    <---- GOOD

I have tried looking at the Windows logs and I did not find anything relevant.
No longer depends on: win7-ix-releng
When I had a loaner box, I was unable to reproduce this.  There is something happening which is causing a 20 second delay, probably some file system issue.
(In reply to Joel Maher (:jmaher) from comment #17)
> When I had a loaner box, I was unable to reproduce this.  There is something
> happening which is causing a 20 second delay, probably some file system
> issue.

It's a perfect twenty second delay every time so my guess would be some sort of networking related hold up.
would this most likely be a system wide networking delay vs something introduced by firefox?  could be a timeout based on dns ?  Nick, any thoughts on this?
Flags: needinfo?(hurley)
(In reply to Joel Maher (:jmaher) from comment #19)
> would this most likely be a system wide networking delay vs something
> introduced by firefox?  could be a timeout based on dns ?  Nick, any
> thoughts on this?

Maybe a slave config problem that shows up when the test harness or app start, for example a dns lookup on an address that times out after twenty seconds? I don't think you would get a reliable delay like this for os functions that can cause latency issues like disk io or some service starting up in the background.
Is there something I could add to talos to make it more verbose?
Blocks: 870453
No longer blocks: win7-ix-releng
I don't know what we could add to talos.  I am sure there is some debugging we could do by turning on prefs that would help us determine this.  What is odd is that this happens for the entire ts run for a given build, instead of just some random startups.  

Could it be machine specific?
(In reply to Joel Maher (:jmaher) from comment #22)
> I don't know what we could add to talos.  I am sure there is some debugging
> we could do by turning on prefs that would help us determine this.  What is
> odd is that this happens for the entire ts run for a given build, instead of
> just some random startups.  
> 
> Could it be machine specific?

See comment 6 and comment 3, doesn't look like it.
(In reply to Joel Maher (:jmaher) from comment #19)
> would this most likely be a system wide networking delay vs something
> introduced by firefox?  could be a timeout based on dns ?  Nick, any
> thoughts on this?

Sorry for the delay, I was on PTO last week and failed to change my bugzilla info :)

After a quick grep through the firefox source, I do see that we have a 20-second timeout in our DNS resolver, but that timeout is only during shutdown of the resolver, and even then, only in debug builds, and on all platforms, so (based on comments above) this is almost certainly not the timeout we're looking for.

It would be reasonable to guess that there could be some os-specific timeout for DNS or some other network interaction, but I'm not a master of OS timeout lengths (and 20 seconds seems awfully long for most network timeouts I can think of).
Flags: needinfo?(hurley)
Thanks Nick!

Maybe we can do a platform post? Who else could have some insight about this?
(In reply to Armen Zambrano G. [:armenzg] (Release Enginerring) from comment #25)
> Maybe we can do a platform post? Who else could have some insight about this?

Patrick McManus might know more about os-level network timeouts that might be 20 seconds long, but I would probably just ask on dev-platform to see if anyone has any ideas.
I wonder if this could be caused by broken internal timing functions? Over in bug 886109, we're getting weird numbers from StartupTimeline data. In mochitest-metro-chrome logs, the clock in the mozharness log shows that the browser is running for around six minutes, while various numbers returned by StartupTimeline have a twenty second offset including firstPaint, sessionRestored, and firstLoadURI. These test runs also don't fire MozAfterPaints for some reason.
Blocks: 886109
Possibly related: bug 869285
Per earlier feedback, integrating resource collection into BaseScript
directly won't be possible because of a classic chicken-or-egg problem.
So, it will need to be implemented in a derived class.

To correlate resource usage with build steps, resource monitoring will
need to integrate with the "dispatch loop" in BaseScript.run().
Currently, there's no way to do that aside from reimplementing that
method. This patch changes that.

This patch adds two functions to mozharness.base that can be used as
decorators - PreScriptStep and PostScriptStep. If a BaseScript-derived
class contains methods with these decorators, those methods will be
called before or after execution of each step. More detailed info is in
source docs.

These "listeners" are generic and can be used for anything. I intend to
use them for resource monitoring, but others may very well come up with
other clever uses for them.

In a subsequent patch, I'd like to add pre-run and post-run hook
points. But, I'd like to run this implementation by you first to make
sure we're on the same page.

I'm also open to the idea of adding decorators that can be registered
against a specific action (perhaps by passing an argument into the
decorators I've defined so far) and replacing the preflight and
postflight functions with these.
Attachment #771033 - Flags: review?(aki)
Assignee: jmaher → gps
Comment on attachment 771033 [details] [diff] [review]
Part 1: Register action listeners via decorators

A clear sign it's time to start my weekend.
Attachment #771033 - Flags: review?(aki)
Attachment #771033 - Attachment is obsolete: true
Assignee: gps → jmaher
I managed to get a profile of this using https://github.com/mstange/analyze-tryserver-profiles on a Windows 7 ts_paint run (https://tbpl.mozilla.org/php/getParsedLog.php?id=25433830&tree=Try).

Here's the profile:
http://tests.themasta.com/cleopatra/?report=831be0e46b580372116c7b790c14d68594093014

Every run looks the same and spends a solid 15 seconds sleeping with this callstack:

KiFastSystemCallRet
KiFastSystemCallRet
Sleep
0x8885 (in NvSCPAPI.pdb)
0x19d12 (in NvSCPAPI.pdb)
0x16b74 (in NvSCPAPI.pdb)
0x1b411 (in NvSCPAPI.pdb)
0xaa03 (in nvStereoApiI.pdb)
0x5bd8 (in nvStereoApiI.pdb)
0x194284 (in nvapi.pdb)
0x190c00 (in nvapi.pdb)
0x5260e (in Nv3DVStreaming.pdb)
0x186d (in Nv3DVStreaming.pdb)
mozilla::layers::LayerManagerD3D10::Initialize(bool,long *)
nsWindow::GetLayerManager(mozilla::layers::PLayerTransactionChild *,mozilla::layers::LayersBackend,nsIWidget::LayerManagerPersistence,bool *)
PresShell::GetLayerManager()
nsRefreshDriver::Tick
mozilla::RefreshDriverTimer::TickDriver(nsRefreshDriver *,__int64,mozilla::TimeStamp)
mozilla::RefreshDriverTimer::Tick()
nsTimerImpl::Fire()
Timer::Fire
nsTimerEvent::Run()
nsThread::ProcessNextEvent(bool,bool *)
NS_ProcessNextEvent(nsIThread *,bool)
mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate *)
MessageLoop::RunHandler()
MessageLoop::Run()
nsBaseAppShell::Run()
nsAppShell::Run()
nsAppStartup::Run()
XREMain::XRE_mainRun()
XREMain::XRE_main(int,char * * const,nsXREAppData const *)
Startup::XRE_Main
XRE_main
do_main
NS_internal_main(int,char * *)
wmain
__tmainCRTStartup
BaseThreadInitThunk
__RtlUserThreadStart
_RtlUserThreadStart
Summary: False alarms caused by bimodal data in Talos Ts Paint MAX Dirty on Windows 8, Win7 (WINNT 6.1 ix) → False alarms caused by bimodal data in Talos Ts Paint MAX Dirty on Windows 8, Win7 (WINNT 6.1 ix) - 15 seconds of sleeping during D3D10 layer manager initialization
NvSCPAPI/nvStereoApiI/Nv3DVStreaming are all related to 3D vision.

It would be good to rule out that this problem doesn't go away once a monitor is plugged in. jmaher would it be hard to reproduce a run with a monitor plugged in? If it is we can instead just push to try with the 3D rendering feature disabled.
Flags: needinfo?(jmaher)
:BenWa - good idea;  I don't know how we could realistically get a monitor hooked up.  It would be possible to get a machine, go to the data center and hook a monitor up.  It wouldn't be guaranteed that you could reproduce it on that machine.

:armenzg, I wonder if we have dongles for these machines?

:BenWa, what prefs could we send to disable 3D rendering?
Flags: needinfo?(jmaher) → needinfo?(armenzg)
I just looked. We don't have any prefs. I can prepare a patch. Joe tells me what some machine have a 'resistor dongle' that simulates a monitor. Which machines have these? All, none or a subset of windows machine?
All of the Win7 and Win8 machines should have a dongle.

Could we dump into the logs some info to determine if we're running an incorrect screen resolution? Or lack of dongle? Or dump dxdirect info? I'm just throwing ideas out there.
Flags: needinfo?(armenzg)
Attached patch patchSplinter Review
mstange can you get another profiling run with this patch applied?
Flags: needinfo?(mstange)
Sure. I've started a try run at https://tbpl.mozilla.org/?tree=Try&rev=caeadd69b4b7
Flags: needinfo?(mstange)
This looks much better! All the runs had ts_paint results in the 1.9 seconds range, instead of 20 seconds.

Here's a profile: http://people.mozilla.com/~bgirard/cleopatra/#report=8b9530a5fb2506a691913979423a297da3a9ee07
I've contacted NVidia about the issue and they would like clarrification about the kind of display used in our slave.

Amy I'm told you can provide more details about the kind of display that's connected to these slaves:
t-w732-ix-095
t-w732-ix-022
t-w732-ix-104
t-w732-ix-068

I hear it's some kind of dongle/special attachment that simulates the monitor. I'd like more information so that I can forward to NVidia.

If it's not to much trouble it would be nice knowing the GPU used while running these tests and send me a copy of Nv3DVStreaming.*, nvStereoApiI.*, nvapi.* privately.
Flags: needinfo?(arich)
There are no dongles; there's a physical graphics card.  I've pinged Q to give you the details on how that card is selected as the primary display.
Flags: needinfo?(arich) → needinfo?(q)
(In reply to Amy Rich [:arich] [:arr] from comment #40)
> There are no dongles; there's a physical graphics card.

Is there anything attach to the physical graphics card? If so what?
No, there are no physical attachments to the card.
We are attaching a virtual vga display. Via the win api.
Flags: needinfo?(q)
Not sure what we are waiting for here. Should I put together a patch that hides this nvidia stereoscopic stuff behind a pref and disable it by default on mc?
(In reply to Jim Mathies [:jimm] from comment #44)
> Not sure what we are waiting for here. Should I put together a patch that
> hides this nvidia stereoscopic stuff behind a pref and disable it by default
> on mc?

Not yet. We've committed to NVidia to support the stereoscopic feature. I'm discussing this with NVidia and send a follow up email.
Here's the exact hardware config for all xp, w7, and w8 iX test machines.

2U Neutron "Gemini" Series Four Node Hot-Pluggable Server 12 x SAS/SATA Hard Drive Bays, 920W High-Efficiency redundant (1+1) Power Supply - Quick Release Rackmount Rail Kit.
4 Server Nodes Per System – 4 of 4 Nodes contains the following components:

* Single Socket Xeon Motherboard (Intel 3420 Tylersburg Chipset) - 4 DIMM Slots (32GB Max Memory) - SATAII + Dual Gigabit Intel 82574L LAN, (1) PCI- E + IPMI Dedicated LAN
* Intel Quad Core 2.66Ghz Xeon (LGA1156) 8MB cache (X3450) 
* Heatsink: Multi Core X8 Passive Aluminum XEON 1U Heatsink 
* 2x4GB DDR3 1333Mhz ECC/REG (8GB total) 
* Western Digital 500GB Enterprise (RAID Duty) SATA2 7200RPM 64MB Cache RE4
* RAID: None
Graphics: EVGA GeForce GT 610 1GB 64-bit DDR3 PCI Express 2.0 x16 HDCP Ready Video Card (Low-Profile)
* Integrated IPMI 2.0 with virtual media over LAN and KVM-over-LAN Dedicated LAN port
* 4-Post Slide Rail Kit
(In reply to Benoit Girard (:BenWa) from comment #45)
> (In reply to Jim Mathies [:jimm] from comment #44)
> > Not sure what we are waiting for here. Should I put together a patch that
> > hides this nvidia stereoscopic stuff behind a pref and disable it by default
> > on mc?
> 
> Not yet. We've committed to NVidia to support the stereoscopic feature. I'm
> discussing this with NVidia and send a follow up email.

Any update on this?
No, the last email contact was on August 15 where they asked for more information which I provided that they. Their last theory was:

'The delay in stereo may be occurring during  the initialization time when we check hardware info and get EDID.  At this moment there may be delay because DD is not ready to provide this info and some NVAPI failed.'
It seems like the problem stopped on m-c on 2013-10-21. Do we know why?
I was working on talos updates around then, but they landed in November- must be something in the product.
(In reply to Joel Maher (:jmaher) from comment #50)
> I was working on talos updates around then, but they landed in November-
> must be something in the product.

Is it possible there were driver updates on the talos machines that eliminated this 15s wait that other comments discussed?
(In reply to Matthew N. [:MattN] (catching up on reviews) from comment #49)
> Created attachment 828569 [details]
> Screenshot of graph server showing the problem stop
> 
> It seems like the problem stopped on m-c on 2013-10-21. Do we know why?

Am I reading this right? Looks like all data points now read zero.
(In reply to Jim Mathies [:jimm] from comment #52)
> (In reply to Matthew N. [:MattN] (catching up on reviews) from comment #49)
> > Created attachment 828569 [details]
> > Screenshot of graph server showing the problem stop
> > 
> > It seems like the problem stopped on m-c on 2013-10-21. Do we know why?
> 
> Am I reading this right? Looks like all data points now read zero.

Hmm, maybe we do have some data there but it's not visible due to the range.
the data is still reporting normal, I looked at more data on datazilla, it shows the same behavior where around oct 22nd we stopped reporting this bi-modal data on win7 and win8.

I recall changing the screen saver settings on win8 machines to not kick in during talos tests.  We were seeing issues in metro mode and this took place in the general time window: https://bugzilla.mozilla.org/show_bug.cgi?id=929473.

:armenzg, can you help is figure out if anything changed on either mozharness or windows updates on october 21/22?  I know these machines have windows update enabled, so that is likely- also any manual updates would be noticed.
Flags: needinfo?(armenzg)
Side note: there's a regression introduced in the 31st if anyone could follow that.

################################
I looked at this:
https://wiki.mozilla.org/ReleaseEngineering/Changes/2013

The only changes from releng deployed on that day are from bug 842870.
None of them had anything to do with Windows and talos.

Context: a while ago we noticed that our Windows test machines started reporting performance numbers (for a specific suite) with a bimodal distribution and since October 21st it has normalized.

This URL shows how the issue dissapears after Oct. 21st:
https://datazilla.mozilla.org/?start=1382155200&stop=1383838693&product=Firefox&repository=Mozilla-Inbound&os=win&os_version=6.1.7601&test=ts_paint&x86_64=false&error_bars=false&project=talos

This URL shows the normal behaviour that should have always been:
https://datazilla.mozilla.org/?start=1382672520&stop=1383838693&product=Firefox&repository=Mozilla-Inbound&os=win&os_version=6.1.7601&test=ts_paint&x86_64=false&error_bars=false&project=talos

Q: is there a way to know if other GPO changes went it for Win7 and Win8 machines?

I know that we disabled the screen saver on Win8 machines on the 22nd (bug 929473), however, I'm not aware of Win7 changes on that bug.
Flags: needinfo?(armenzg) → needinfo?(q)
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #55)
> Side note: there's a regression introduced in the 31st if anyone could
> follow that.

I think that's bug 932389[1] AFAICT.

[1] https://groups.google.com/forum/#!searchin/mozilla.dev.platform/ts_paint/mozilla.dev.platform/DltrLRix4F8/muMt33jKKEsJ
Actually that was backed out and landed again on Nov 1:
http://hg.mozilla.org/mozilla-central/filelog/545887140a1b/testing/talos/talos.json

I wish it were the end of the mystery!
Attached patch re-enable regression analysis (obsolete) — Splinter Review
We had disabled regression analysis for Ts Paint on Windows 7/8 because this bug was causing false positives and preventing real regressions from being detected.  This graph server config change re-enables the analysis.
Attachment #828922 - Flags: review?(catlee)
Armen, I can't find any all MS platform wide changes during those dates.
Flags: needinfo?(q)
Whiteboard: [SfN] → [SfN][alarms disabled]
Forgot to update unit tests.
Attachment #828922 - Attachment is obsolete: true
Attachment #828922 - Flags: review?(catlee)
Attachment #8337875 - Flags: review?(catlee)
Attachment #8337875 - Flags: review?(catlee) → review+
Comment on attachment 8337875 [details] [diff] [review]
re-enable regression analysis (v2)

http://hg.mozilla.org/graphs/rev/15609490c297
(needs an update to analysis.cfg in production)
Whiteboard: [SfN][alarms disabled] → [SfN]
deployed
can we close this bug?  I don't see any reason to keep it open.
we dont run ts paint max dirty anymore and haven't for a long time.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: