859571 - False alarms caused by bimodal data in Talos Ts Paint MAX Dirty on Windows 8, Win7 (WINNT 6.1 ix) - 15 seconds of sleeping during D3D10 layer manager initialization

Reporter

Description

•

12 years ago

"Ts Paint, MAX Dirty Profile" on WINNT 6.2 x64 is extremely bimodal, with some results clustered around 560 and other results clustered around 20700 (about 36 times slower!): http://graphs.mozilla.org/graph.html#tests=[[227,63,31]]&sel=1364860448405,1365465248405&displayrange=7&datatype=running This causes false positives in regression detection, for example: https://groups.google.com/d/msg/mozilla.dev.tree-management/C6k9FUkMK5Q/_N2sXsrFNAoJ Any idea what is causing such a discrepancy from one run to the next? I don't see any correlation with the machine_id field in the graph server data -- I often see the same machine_id associated with both "fast" and "slow" results.

Matt Brubeck (:mbrubeck)

Reporter

Updated

•

12 years ago

Summary: Bimonal data in Ts Paint MAX Dirty on Windows 8 → False alarms caused by bimodal data in Talos Ts Paint MAX Dirty on Windows 8

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 1

•

12 years ago

I looked over the last 14 days of history on inbound and didn't find any correlation to specific machines. In fact the same machine can run a 500 ms startup and a 20000 ms startup. I do notice that if we are collecting high values the entire data set is high values, the same applies to the lower values. This affects: ts_paint tspaint_generated_med tspaint_generated_max I do not see this affecting any other tests on windows 8, nor do I see this affecting other platforms. Right now this appears to be happening on about 36% of the tests, so this is a serious problem. Next steps would be to try and reproduce this locally. Maybe there is something we can observe in the console log, stdout/err to help us figure this out. I do not have a windows 8 box handy, I could vnc/remotedesktop to one of the ones in the datacenter if that would help.

Armen [:armenzg]

Updated

•

12 years ago

Depends on: 860022

Armen [:armenzg]

Updated

•

12 years ago

Depends on: 860322

Armen [:armenzg]

Updated

•

12 years ago

Assignee: nobody → jmaher

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 2

•

12 years ago

Currently I have a windows 8 slave for the sole purpose of reproducing this problem. I am focusing on ts_paint as ts_paint exhibits the problem at the same time that tspaint_generated_med|max does. Knowing that I have downloaded a build that produced bogus numbers on all 3 tests (2 different machines) and was not able to reproduce the problem. So far I have tried two builds (one that reported 20 seconds and another that reported 500ms) with no luck in reproducing the numbers after many attempts, reboots, etc. My concern is that we are seeing inflated numbers for a given build which means that we see this across two different machines each time this happens. Could it be possible the hardware is doing something on two different machines for a given build? I have seen this 20 second startup time reported for pgo and non pgo builds, likewise I have seen the 500ms startup times reports on both pgo and non pgo builds. What is a hacker to do, but continue hacking and ask for more ideas. Would modifying Talos to show the console messages help? I am not familiar with windows 8 nor am I very familiar with how the tests are run on there.

Jim Mathies [:jimm]

Comment 3

•

12 years ago

I can't get back very far on tbpl but it looks like the slaves doing this are random - back to the 10th: t-w864-ix-092 t-w864-ix-008 t-w864-ix-020 t-w864-ix-051

Jim Mathies [:jimm]

Comment 4

•

12 years ago

If this isn't showing up in other paint tests, has to be a startup specific problem. Is 'Ts, Paint' currently a warm start or cold start test?

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 5

•

12 years ago

here is some data from 14 days of inbound: https://docs.google.com/spreadsheet/ccc?key=0ArS97F99-BEZdFVreS1GNExIWVR5ZEZwSTR3c0JobWc#gid=0 ts, paint is a warm test, we fail on all varieties of this.

Armen [:armenzg]

Comment 6

•

12 years ago

Not sure if this info helps in any way (I could get more info from slaves tomorrow): t-w864-ix-046 took a debug jetpack job and after 28 mins it took a talos other job which reported high. There's no way for me to tell that there was a reboot in between. t-w864-ix-063 took a mochitest-2 job and after 20 mins it took a talos other job which reported high. There's no way for me to tell that there was a reboot in between.

Jim Mathies [:jimm]

Comment 7

•

12 years ago

Ok, so sounds like we can rule out individual slave problems.

Jim Mathies [:jimm]

Comment 8

•

12 years ago

Is there any chance that these slow startups are timing out (without the browser being launched) and getting reported as a long load value?

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 9

•

12 years ago

good question- the test is supposed to work by printing out the endtime after the mozafterpaint is received. the start time is determined by the test harness right before we launch the process. While I don't think it is possible, you never know- I could fiddle with that a bit and see if I can make that scenario happen. I know on mobile sometimes we never launch the browser with the webpage specified and the test times out...so it is slim, but something I should double check.

Jim Mathies [:jimm]

Comment 10

•

12 years ago

(In reply to Joel Maher (:jmaher) from comment #9) > good question- the test is supposed to work by printing out the endtime > after the mozafterpaint is received. the start time is determined by the > test harness right before we launch the process. While I don't think it is > possible, you never know- I could fiddle with that a bit and see if I can > make that scenario happen. I know on mobile sometimes we never launch the > browser with the webpage specified and the test times out...so it is slim, > but something I should double check. I would expect timeout as well. Just throwing out some ideas. The persistent 20 seconds is interesting. It implies we are blocked on something with a constant timeout. One thought I had was the application of updates, but I'd expect that to be more random if it were the cause. Are there any logs generated from the tests runs?

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 11

•

12 years ago

is there a pref I could set to toggle the application updates?

Brian R. Bondy [:bbondy]

Comment 12

•

12 years ago

app.update.enabled set to false to disable

Brian R. Bondy [:bbondy]

Comment 13

•

12 years ago

Also maybe there could be an update staged here? C:\Users\Brian\AppData\Local\Mozilla\Firefox\Nightly

Jim Mathies [:jimm]

Comment 14

•

12 years ago

I guess we could turn on / add startup logging and throw a build at try next.

Matt Brubeck (:mbrubeck)

Reporter

Comment 15

•

12 years ago

It looks like this problem affects the WINNT 6.1 ix slaves too: https://groups.google.com/d/topic/mozilla.dev.tree-management/ohBi_eHPAQk/discussion

Summary: False alarms caused by bimodal data in Talos Ts Paint MAX Dirty on Windows 8 → False alarms caused by bimodal data in Talos Ts Paint MAX Dirty on Windows 8, Win7 (WINNT 6.1 ix)

Armen [:armenzg]

Updated

•

12 years ago

Depends on: win7-ix-releng

Armen [:armenzg]

Comment 16

•

12 years ago

FYI, I can loan a Win7 32-bit machine if it helps. It seems that we do not have this issue for WinXP on iX: http://graphs.mozilla.org/graph.html#tests=[[227,63,25],[227,63,31],[227,63,37]]&sel=none&displayrange=7&datatype=running This has been going since the beginning: http://graphs.mozilla.org/graph.html#tests=[[227,63,25],[227,63,31],[227,63,37],[227,63,1],[227,63,12]]&sel=none&displayrange=90&datatype=running We did not have this issue with rev3 minis: http://graphs.mozilla.org/graph.html#tests=[[227,63,25],[227,63,31],[227,63,37],[227,63,1],[227,63,12]]&sel=none&displayrange=7&datatype=running If we look at this job, it shows both opt and pgo results: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=Windows%207.*talos&rev=c5ac92c2881e If we look at the log of one of the bad results I noticed this: > Running test tspaint_places_generated_med: > Started Thu, 23 May 2013 07:50:23 > INFO : __start_report20858__end_report > > __startTimestamp1369320689315__endTimestamp > > INFO : __startBeforeLaunchTimestamp1369320668457__endBeforeLaunchTimestamp > __startAfterTerminationTimestamp1369320689781__endAfterTerminationTimestamp If we subtract 1369320689781-1369320668457 we get 21324 milliseconds. If we do the same with a good log we get this: 1369342564689-1369342563364=1325 milliseconds. > __startTimestamp1369342564222__endTimestamp > __startBeforeLaunchTimestamp1369342563364__endBeforeLaunchTimestamp > __startAfterTerminationTimestamp1369342564689__endAfterTerminationTimestamp Can we run this job in debug mode? Can we take a snapshot to see what is going on? What happens at the beginning? Even the start report line shows something wonky: INFO : __start_report20858__end_report <---- BAD INFO : __start_report858__end_report <---- GOOD I have tried looking at the Windows logs and I did not find anything relevant.

Blocks: win7-ix-releng

No longer depends on: win7-ix-releng

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 17

•

12 years ago

When I had a loaner box, I was unable to reproduce this. There is something happening which is causing a 20 second delay, probably some file system issue.

Jim Mathies [:jimm]

Comment 18

•

12 years ago

(In reply to Joel Maher (:jmaher) from comment #17) > When I had a loaner box, I was unable to reproduce this. There is something > happening which is causing a 20 second delay, probably some file system > issue. It's a perfect twenty second delay every time so my guess would be some sort of networking related hold up.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 19

•

12 years ago

would this most likely be a system wide networking delay vs something introduced by firefox? could be a timeout based on dns ? Nick, any thoughts on this?

Flags: needinfo?(hurley)

Jim Mathies [:jimm]

Comment 20

•

12 years ago

(In reply to Joel Maher (:jmaher) from comment #19) > would this most likely be a system wide networking delay vs something > introduced by firefox? could be a timeout based on dns ? Nick, any > thoughts on this? Maybe a slave config problem that shows up when the test harness or app start, for example a dns lookup on an address that times out after twenty seconds? I don't think you would get a reliable delay like this for os functions that can cause latency issues like disk io or some service starting up in the background.

Armen [:armenzg]

Comment 21

•

12 years ago

Is there something I could add to talos to make it more verbose?

Armen [:armenzg]

Updated

•

12 years ago

Blocks: 870453
No longer blocks: win7-ix-releng

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 22

•

12 years ago

I don't know what we could add to talos. I am sure there is some debugging we could do by turning on prefs that would help us determine this. What is odd is that this happens for the entire ts run for a given build, instead of just some random startups. Could it be machine specific?

Jim Mathies [:jimm]

Comment 23

•

12 years ago

(In reply to Joel Maher (:jmaher) from comment #22) > I don't know what we could add to talos. I am sure there is some debugging > we could do by turning on prefs that would help us determine this. What is > odd is that this happens for the entire ts run for a given build, instead of > just some random startups. > > Could it be machine specific? See comment 6 and comment 3, doesn't look like it.

u408661

Comment 24

•

12 years ago

(In reply to Joel Maher (:jmaher) from comment #19) > would this most likely be a system wide networking delay vs something > introduced by firefox? could be a timeout based on dns ? Nick, any > thoughts on this? Sorry for the delay, I was on PTO last week and failed to change my bugzilla info :) After a quick grep through the firefox source, I do see that we have a 20-second timeout in our DNS resolver, but that timeout is only during shutdown of the resolver, and even then, only in debug builds, and on all platforms, so (based on comments above) this is almost certainly not the timeout we're looking for. It would be reasonable to guess that there could be some os-specific timeout for DNS or some other network interaction, but I'm not a master of OS timeout lengths (and 20 seconds seems awfully long for most network timeouts I can think of).

Flags: needinfo?(hurley)

Armen [:armenzg]

Comment 25

•

12 years ago

Thanks Nick! Maybe we can do a platform post? Who else could have some insight about this?

u408661

Comment 26

•

12 years ago

(In reply to Armen Zambrano G. [:armenzg] (Release Enginerring) from comment #25) > Maybe we can do a platform post? Who else could have some insight about this? Patrick McManus might know more about os-level network timeouts that might be 20 seconds long, but I would probably just ask on dev-platform to see if anyone has any ideas.

Jim Mathies [:jimm]

Comment 27

•

12 years ago

I wonder if this could be caused by broken internal timing functions? Over in bug 886109, we're getting weird numbers from StartupTimeline data. In mochitest-metro-chrome logs, the clock in the mozharness log shows that the browser is running for around six minutes, while various numbers returned by StartupTimeline have a twenty second offset including firstPaint, sessionRestored, and firstLoadURI. These test runs also don't fire MozAfterPaints for some reason.

Blocks: 886109

Jim Mathies [:jimm]

Comment 28

•

12 years ago

Possibly related: bug 869285

Gregory Szorc [:gps]

Comment 29

•

12 years ago

Attached patch Part 1: Register action listeners via decorators (obsolete) — Details — Splinter Review

Per earlier feedback, integrating resource collection into BaseScript directly won't be possible because of a classic chicken-or-egg problem. So, it will need to be implemented in a derived class. To correlate resource usage with build steps, resource monitoring will need to integrate with the "dispatch loop" in BaseScript.run(). Currently, there's no way to do that aside from reimplementing that method. This patch changes that. This patch adds two functions to mozharness.base that can be used as decorators - PreScriptStep and PostScriptStep. If a BaseScript-derived class contains methods with these decorators, those methods will be called before or after execution of each step. More detailed info is in source docs. These "listeners" are generic and can be used for anything. I intend to use them for resource monitoring, but others may very well come up with other clever uses for them. In a subsequent patch, I'd like to add pre-run and post-run hook points. But, I'd like to run this implementation by you first to make sure we're on the same page. I'm also open to the idea of adding decorators that can be registered against a specific action (perhaps by passing an argument into the decorators I've defined so far) and replacing the preflight and postflight functions with these.

Attachment #771033 - Flags: review?(aki)

Gregory Szorc [:gps]

Updated

•

12 years ago

Assignee: jmaher → gps

Gregory Szorc [:gps]

Comment 30

•

12 years ago

Comment on attachment 771033 [details] [diff] [review] Part 1: Register action listeners via decorators A clear sign it's time to start my weekend.

Attachment #771033 - Flags: review?(aki)

Gregory Szorc [:gps]

Updated

•

12 years ago

Attachment #771033 - Attachment is obsolete: true

Gregory Szorc [:gps]

Updated

•

12 years ago

Assignee: gps → jmaher

Markus Stange [:mstange]

Comment 31

•

12 years ago

I managed to get a profile of this using https://github.com/mstange/analyze-tryserver-profiles on a Windows 7 ts_paint run (https://tbpl.mozilla.org/php/getParsedLog.php?id=25433830&tree=Try). Here's the profile: http://tests.themasta.com/cleopatra/?report=831be0e46b580372116c7b790c14d68594093014 Every run looks the same and spends a solid 15 seconds sleeping with this callstack: KiFastSystemCallRet KiFastSystemCallRet Sleep 0x8885 (in NvSCPAPI.pdb) 0x19d12 (in NvSCPAPI.pdb) 0x16b74 (in NvSCPAPI.pdb) 0x1b411 (in NvSCPAPI.pdb) 0xaa03 (in nvStereoApiI.pdb) 0x5bd8 (in nvStereoApiI.pdb) 0x194284 (in nvapi.pdb) 0x190c00 (in nvapi.pdb) 0x5260e (in Nv3DVStreaming.pdb) 0x186d (in Nv3DVStreaming.pdb) mozilla::layers::LayerManagerD3D10::Initialize(bool,long *) nsWindow::GetLayerManager(mozilla::layers::PLayerTransactionChild *,mozilla::layers::LayersBackend,nsIWidget::LayerManagerPersistence,bool *) PresShell::GetLayerManager() nsRefreshDriver::Tick mozilla::RefreshDriverTimer::TickDriver(nsRefreshDriver *,__int64,mozilla::TimeStamp) mozilla::RefreshDriverTimer::Tick() nsTimerImpl::Fire() Timer::Fire nsTimerEvent::Run() nsThread::ProcessNextEvent(bool,bool *) NS_ProcessNextEvent(nsIThread *,bool) mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate *) MessageLoop::RunHandler() MessageLoop::Run() nsBaseAppShell::Run() nsAppShell::Run() nsAppStartup::Run() XREMain::XRE_mainRun() XREMain::XRE_main(int,char * * const,nsXREAppData const *) Startup::XRE_Main XRE_main do_main NS_internal_main(int,char * *) wmain __tmainCRTStartup BaseThreadInitThunk __RtlUserThreadStart _RtlUserThreadStart

Summary: False alarms caused by bimodal data in Talos Ts Paint MAX Dirty on Windows 8, Win7 (WINNT 6.1 ix) → False alarms caused by bimodal data in Talos Ts Paint MAX Dirty on Windows 8, Win7 (WINNT 6.1 ix) - 15 seconds of sleeping during D3D10 layer manager initialization

Benoit Girard (:BenWa)

Comment 32

•

12 years ago

NvSCPAPI/nvStereoApiI/Nv3DVStreaming are all related to 3D vision. It would be good to rule out that this problem doesn't go away once a monitor is plugged in. jmaher would it be hard to reproduce a run with a monitor plugged in? If it is we can instead just push to try with the 3D rendering feature disabled.

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 33

•

12 years ago

:BenWa - good idea; I don't know how we could realistically get a monitor hooked up. It would be possible to get a machine, go to the data center and hook a monitor up. It wouldn't be guaranteed that you could reproduce it on that machine. :armenzg, I wonder if we have dongles for these machines? :BenWa, what prefs could we send to disable 3D rendering?

Flags: needinfo?(jmaher) → needinfo?(armenzg)

Benoit Girard (:BenWa)

Comment 34

•

12 years ago

I just looked. We don't have any prefs. I can prepare a patch. Joe tells me what some machine have a 'resistor dongle' that simulates a monitor. Which machines have these? All, none or a subset of windows machine?

Armen [:armenzg]

Comment 35

•

12 years ago

All of the Win7 and Win8 machines should have a dongle. Could we dump into the logs some info to determine if we're running an incorrect screen resolution? Or lack of dongle? Or dump dxdirect info? I'm just throwing ideas out there.

Flags: needinfo?(armenzg)

Benoit Girard (:BenWa)

Comment 36

•

12 years ago

Attached patch patch — Details — Splinter Review

mstange can you get another profiling run with this patch applied?

Flags: needinfo?(mstange)

Markus Stange [:mstange]

Comment 37

•

12 years ago

Sure. I've started a try run at https://tbpl.mozilla.org/?tree=Try&rev=caeadd69b4b7

Flags: needinfo?(mstange)

Markus Stange [:mstange]

Comment 38

•

12 years ago

This looks much better! All the runs had ts_paint results in the 1.9 seconds range, instead of 20 seconds. Here's a profile: http://people.mozilla.com/~bgirard/cleopatra/#report=8b9530a5fb2506a691913979423a297da3a9ee07

Benoit Girard (:BenWa)

Comment 39

•

12 years ago

I've contacted NVidia about the issue and they would like clarrification about the kind of display used in our slave. Amy I'm told you can provide more details about the kind of display that's connected to these slaves: t-w732-ix-095 t-w732-ix-022 t-w732-ix-104 t-w732-ix-068 I hear it's some kind of dongle/special attachment that simulates the monitor. I'd like more information so that I can forward to NVidia. If it's not to much trouble it would be nice knowing the GPU used while running these tests and send me a copy of Nv3DVStreaming.*, nvStereoApiI.*, nvapi.* privately.

Flags: needinfo?(arich)

Amy Rich [:arr] [:arich]

Comment 40

•

12 years ago

There are no dongles; there's a physical graphics card. I've pinged Q to give you the details on how that card is selected as the primary display.

Flags: needinfo?(arich) → needinfo?(q)

Benoit Girard (:BenWa)

Comment 41

•

12 years ago

(In reply to Amy Rich [:arich] [:arr] from comment #40) > There are no dongles; there's a physical graphics card. Is there anything attach to the physical graphics card? If so what?

Amy Rich [:arr] [:arich]

Comment 42

•

12 years ago

No, there are no physical attachments to the card.

Q

Comment 43

•

12 years ago

We are attaching a virtual vga display. Via the win api.

Flags: needinfo?(q)

Jim Mathies [:jimm]

Comment 44

•

12 years ago

Not sure what we are waiting for here. Should I put together a patch that hides this nvidia stereoscopic stuff behind a pref and disable it by default on mc?

Benoit Girard (:BenWa)

Comment 45

•

12 years ago

(In reply to Jim Mathies [:jimm] from comment #44) > Not sure what we are waiting for here. Should I put together a patch that > hides this nvidia stereoscopic stuff behind a pref and disable it by default > on mc? Not yet. We've committed to NVidia to support the stereoscopic feature. I'm discussing this with NVidia and send a follow up email.

Amy Rich [:arr] [:arich]

Comment 46

•

12 years ago

Here's the exact hardware config for all xp, w7, and w8 iX test machines. 2U Neutron "Gemini" Series Four Node Hot-Pluggable Server 12 x SAS/SATA Hard Drive Bays, 920W High-Efficiency redundant (1+1) Power Supply - Quick Release Rackmount Rail Kit. 4 Server Nodes Per System – 4 of 4 Nodes contains the following components: * Single Socket Xeon Motherboard (Intel 3420 Tylersburg Chipset) - 4 DIMM Slots (32GB Max Memory) - SATAII + Dual Gigabit Intel 82574L LAN, (1) PCI- E + IPMI Dedicated LAN * Intel Quad Core 2.66Ghz Xeon (LGA1156) 8MB cache (X3450) * Heatsink: Multi Core X8 Passive Aluminum XEON 1U Heatsink * 2x4GB DDR3 1333Mhz ECC/REG (8GB total) * Western Digital 500GB Enterprise (RAID Duty) SATA2 7200RPM 64MB Cache RE4 * RAID: None Graphics: EVGA GeForce GT 610 1GB 64-bit DDR3 PCI Express 2.0 x16 HDCP Ready Video Card (Low-Profile) * Integrated IPMI 2.0 with virtual media over LAN and KVM-over-LAN Dedicated LAN port * 4-Post Slide Rail Kit

Jim Mathies [:jimm]

Comment 47

•

12 years ago

(In reply to Benoit Girard (:BenWa) from comment #45) > (In reply to Jim Mathies [:jimm] from comment #44) > > Not sure what we are waiting for here. Should I put together a patch that > > hides this nvidia stereoscopic stuff behind a pref and disable it by default > > on mc? > > Not yet. We've committed to NVidia to support the stereoscopic feature. I'm > discussing this with NVidia and send a follow up email. Any update on this?

Benoit Girard (:BenWa)

Comment 48

•

12 years ago

No, the last email contact was on August 15 where they asked for more information which I provided that they. Their last theory was: 'The delay in stereo may be occurring during the initialization time when we check hardware info and get EDID. At this moment there may be delay because DD is not ready to provide this info and some NVAPI failed.'

Matthew N. [:MattN]

Comment 49

•

12 years ago

Attached image Screenshot of graph server showing the problem stop — Details

It seems like the problem stopped on m-c on 2013-10-21. Do we know why?

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 50

•

12 years ago

I was working on talos updates around then, but they landed in November- must be something in the product.

:Gijs (he/him)

Comment 51

•

12 years ago

(In reply to Joel Maher (:jmaher) from comment #50) > I was working on talos updates around then, but they landed in November- > must be something in the product. Is it possible there were driver updates on the talos machines that eliminated this 15s wait that other comments discussed?

Jim Mathies [:jimm]

Comment 52

•

12 years ago

(In reply to Matthew N. [:MattN] (catching up on reviews) from comment #49) > Created attachment 828569 [details] > Screenshot of graph server showing the problem stop > > It seems like the problem stopped on m-c on 2013-10-21. Do we know why? Am I reading this right? Looks like all data points now read zero.

Jim Mathies [:jimm]

Comment 53

•

12 years ago

(In reply to Jim Mathies [:jimm] from comment #52) > (In reply to Matthew N. [:MattN] (catching up on reviews) from comment #49) > > Created attachment 828569 [details] > > Screenshot of graph server showing the problem stop > > > > It seems like the problem stopped on m-c on 2013-10-21. Do we know why? > > Am I reading this right? Looks like all data points now read zero. Hmm, maybe we do have some data there but it's not visible due to the range.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 54

•

12 years ago

the data is still reporting normal, I looked at more data on datazilla, it shows the same behavior where around oct 22nd we stopped reporting this bi-modal data on win7 and win8. I recall changing the screen saver settings on win8 machines to not kick in during talos tests. We were seeing issues in metro mode and this took place in the general time window: https://bugzilla.mozilla.org/show_bug.cgi?id=929473. :armenzg, can you help is figure out if anything changed on either mozharness or windows updates on october 21/22? I know these machines have windows update enabled, so that is likely- also any manual updates would be noticed.

Flags: needinfo?(armenzg)

Armen [:armenzg]

Comment 55

•

12 years ago

Side note: there's a regression introduced in the 31st if anyone could follow that. ################################ I looked at this: https://wiki.mozilla.org/ReleaseEngineering/Changes/2013 The only changes from releng deployed on that day are from bug 842870. None of them had anything to do with Windows and talos. Context: a while ago we noticed that our Windows test machines started reporting performance numbers (for a specific suite) with a bimodal distribution and since October 21st it has normalized. This URL shows how the issue dissapears after Oct. 21st: https://datazilla.mozilla.org/?start=1382155200&stop=1383838693&product=Firefox&repository=Mozilla-Inbound&os=win&os_version=6.1.7601&test=ts_paint&x86_64=false&error_bars=false&project=talos This URL shows the normal behaviour that should have always been: https://datazilla.mozilla.org/?start=1382672520&stop=1383838693&product=Firefox&repository=Mozilla-Inbound&os=win&os_version=6.1.7601&test=ts_paint&x86_64=false&error_bars=false&project=talos Q: is there a way to know if other GPO changes went it for Win7 and Win8 machines? I know that we disabled the screen saver on Win8 machines on the 22nd (bug 929473), however, I'm not aware of Win7 changes on that bug.

Flags: needinfo?(armenzg) → needinfo?(q)

Matthew N. [:MattN]

Comment 56

•

12 years ago

(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #55) > Side note: there's a regression introduced in the 31st if anyone could > follow that. I think that's bug 932389[1] AFAICT. [1] https://groups.google.com/forum/#!searchin/mozilla.dev.platform/ts_paint/mozilla.dev.platform/DltrLRix4F8/muMt33jKKEsJ

Armen [:armenzg]

Comment 57

•

12 years ago

Landed on the 22nd: http://hg.mozilla.org/integration/mozilla-inbound/rev/1ae5d58532ad which started to point to: http://hg.mozilla.org/build/talos/rev/58bd513cc54e which also includes this: http://hg.mozilla.org/build/talos/rev/49fe0b4c3014 End of mystery? :)

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 58

•

12 years ago

Actually that was backed out and landed again on Nov 1: http://hg.mozilla.org/mozilla-central/filelog/545887140a1b/testing/talos/talos.json I wish it were the end of the mystery!

Matt Brubeck (:mbrubeck)

Reporter

Comment 59

•

12 years ago

Attached patch re-enable regression analysis (obsolete) — Details — Splinter Review

We had disabled regression analysis for Ts Paint on Windows 7/8 because this bug was causing false positives and preventing real regressions from being detected. This graph server config change re-enables the analysis.

Attachment #828922 - Flags: review?(catlee)

Q

Comment 60

•

12 years ago

Armen, I can't find any all MS platform wide changes during those dates.

Flags: needinfo?(q)

Matt Brubeck (:mbrubeck)

Reporter

Updated

•

12 years ago

Whiteboard: [SfN] → [SfN][alarms disabled]

Matt Brubeck (:mbrubeck)

Reporter

Comment 61

•

12 years ago

Attached patch re-enable regression analysis (v2) — Details — Splinter Review

Forgot to update unit tests.

Attachment #828922 - Attachment is obsolete: true

Attachment #828922 - Flags: review?(catlee)

Attachment #8337875 - Flags: review?(catlee)

Chris AtLee [:catlee]

Updated

•

12 years ago

Attachment #8337875 - Flags: review?(catlee) → review+

Matt Brubeck (:mbrubeck)

Reporter

Comment 62

•

12 years ago

Comment on attachment 8337875 [details] [diff] [review] re-enable regression analysis (v2) http://hg.mozilla.org/graphs/rev/15609490c297 (needs an update to analysis.cfg in production)

Matt Brubeck (:mbrubeck)

Reporter

Updated

•

12 years ago

Whiteboard: [SfN][alarms disabled] → [SfN]

Chris AtLee [:catlee]

Comment 63

•

12 years ago

deployed

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 64

•

12 years ago

can we close this bug? I don't see any reason to keep it open.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 65

•

11 years ago

we dont run ts paint max dirty anymore and haven't for a long time.

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Part 1: Register action listeners via decorators 12 years ago Gregory Szorc [:gps] 8.99 KB, patch		Details \| Diff \| Splinter Review
patch 12 years ago Benoit Girard (:BenWa) 926 bytes, patch		Details \| Diff \| Splinter Review
Screenshot of graph server showing the problem stop 12 years ago Matthew N. [:MattN] 372.18 KB, image/png		Details
re-enable regression analysis 12 years ago Matt Brubeck (:mbrubeck) 1.13 KB, patch		Details \| Diff \| Splinter Review
re-enable regression analysis (v2) 12 years ago Matt Brubeck (:mbrubeck) 2.68 KB, patch	catlee : review+	Details \| Diff \| Splinter Review