Bug 519893 - CGContextFillPath sometimes takes over 100ms on Talos, Talos machines should behave as if there was a screen connected to them
Status: RESOLVED DUPLICATE of bug 563836
Whiteboard: [talos][hardware]
Keywords: perf
Product: Release Engineering
Classification: Other
Component: Other
Version: other
Platform: All Mac OS X
Importance: P3 normal
Target Milestone: ---
Assigned To: Nobody; OK to take it and work on it
Duplicates: 563187
Depends on: 520512 563836
Blocks: 564125 334697
Reported: 2009-09-30 20:05 PDT by Markus Stange [:mstange]
Modified: 2013-08-12 21:54 PDT

Description Markus Stange [:mstange] 2009-09-30 20:05:18 PDT
Sometimes, CGContextFillPath takes over 100ms instead of the usual 0 to 20ms. When this happens, it only happens during the first CGContextFillPath of the current painting cycle (i.e. in the first call to it since the beginning of drawRect).

I haven't found a real consistency of when this happens. I only know that during the startup performance test it consistently happens during the 3rd and 6th repaint of every cycle.
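
One way to spot this pattern in logged timings is sketched below. It assumes a hypothetical log of fill durations (in milliseconds) grouped per repaint; the numbers and the function name are invented for illustration and are not Talos measurements.

```python
# Illustrative sketch only: the durations below are invented, not measured on Talos.
STALL_THRESHOLD_MS = 100.0  # "over 100ms" from the description


def stalled_repaints(repaints, threshold_ms=STALL_THRESHOLD_MS):
    """Return 1-based indices of repaints whose FIRST fill exceeds the threshold.

    `repaints` is a list of repaints; each repaint is a list of fill
    durations in milliseconds, in call order.
    """
    stalled = []
    for index, fills in enumerate(repaints, start=1):
        # Only the first fill of a repaint is checked, matching the observation
        # that the slowdown hits the first call since the start of drawRect.
        if fills and fills[0] > threshold_ms:
            stalled.append(index)
    return stalled


# Hypothetical log: six repaints, with the 3rd and 6th stalling on their first fill.
example_repaints = [
    [12.0, 3.0],
    [8.0, 15.0, 2.0],
    [118.0, 4.0],
    [5.0],
    [19.0, 7.0],
    [104.0, 6.0, 1.0],
]

print(stalled_repaints(example_repaints))
```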

And the worst thing is that I can't reproduce this effect locally. Maybe it's a general Talos weirdness, maybe it's a bug in Mac OS 10.5.2 that's fixed in later versions, or maybe I'm just not emulating Talos conditions on my machine closely enough.

This phenomenon has existed independently of the patch in bug 517804, but it's responsible for the ts_shutdown "regression" that that patch caused. The patch in bug 517804 reduces the number of repaints before onload from 6 to 5, so the 6th repaint moves from before onload to after onload, so the additional 100ms move into the ts_shutdown numbers.
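
The bookkeeping behind that shift can be sketched with a toy model. It assumes, as the paragraph above implies, that paint time before onload lands in ts and paint time after onload lands in ts_shutdown. The 10ms cost of a normal repaint and the function name are made up for illustration; only the stall on the 3rd and 6th repaint and the change from 6 to 5 repaints come from this report.

```python
# Illustrative model only: NORMAL_MS is invented; the stalled repaints (3rd and
# 6th) and the 6 -> 5 change in repaints before onload come from the report.
NORMAL_MS = 10.0
STALL_MS = 100.0


def split_paint_cost(repaints_before_onload, total_repaints=6, stalled=(3, 6)):
    """Return (ms painted before onload, ms painted after onload)."""
    before = 0.0
    after = 0.0
    for n in range(1, total_repaints + 1):
        cost = NORMAL_MS + (STALL_MS if n in stalled else 0.0)
        if n <= repaints_before_onload:
            before += cost
        else:
            after += cost
    return before, after


without_patch = split_paint_cost(repaints_before_onload=6)
with_patch = split_paint_cost(repaints_before_onload=5)
print("without patch:", without_patch)
print("with patch:   ", with_patch)
```

In this model the total paint time is the same in both cases; the only difference is which side of onload the stalled 6th repaint falls on.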
Comment 1 Markus Stange [:mstange] 2009-10-07 15:33:06 PDT
What I've found by now is a little unsettling: The phenomenon goes away as soon as I watch the machine via VNC. If I close the VNC window (and watch browser_output.txt through the SSH shell), the phenomenon occurs again.

So maybe we should just have a connected VNC session for every Mac Talos machine all the time? Just kidding ;-)

I've even got Shark to take a profile, by launching the test from the Terminal in the VNC session and quickly closing the VNC window afterwards. The only problem is that this Shark session isn't helpful at all because my build doesn't have any symbols. :(

I'm currently working on getting a build with proper symbols.
Comment 2 Markus Stange [:mstange] 2009-10-08 18:21:32 PDT
Fun stuff!

The profile looks like this:

0.0%  7.0%  CGContextFillRect  
0.0%  7.0%   CGContextFillRects  
0.0%  7.0%    ripc_DrawRects  
0.0%  7.0%     ripc_Render  
0.0%  7.0%      ripl_BltShape  
0.0%  7.0%       ripd_Lock  
0.0%  7.0%        CGSDeviceLock  
0.0%  7.0%         _CGSLockWindow  
0.0%  7.0%          _CGSSynchronizeWindowBackingStore  
0.0%  7.0%            mach_msg  
7.0%  7.0%             mach_msg_trap  

In other words, we're synchronizing something with the window server.
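
A small sketch of how to read that off the listing mechanically: it parses the excerpt above, assuming the columns are self time, total time, and symbol, and reports the frames that carry self time. It is not a general parser for Shark output.

```python
# Sketch: find where the self time sits in the call tree quoted above.
# The "self%  total%  symbol" column layout is read off that excerpt.
PROFILE = """\
0.0%  7.0%  CGContextFillRect
0.0%  7.0%   CGContextFillRects
0.0%  7.0%    ripc_DrawRects
0.0%  7.0%     ripc_Render
0.0%  7.0%      ripl_BltShape
0.0%  7.0%       ripd_Lock
0.0%  7.0%        CGSDeviceLock
0.0%  7.0%         _CGSLockWindow
0.0%  7.0%          _CGSSynchronizeWindowBackingStore
0.0%  7.0%            mach_msg
7.0%  7.0%             mach_msg_trap
"""


def parse_profile(text):
    """Yield (self_percent, total_percent, symbol) for each well-formed line."""
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        self_pct, total_pct, symbol = parts
        yield float(self_pct.rstrip("%")), float(total_pct.rstrip("%")), symbol


frames = list(parse_profile(PROFILE))
# Frames with non-zero self time are where the time is actually spent.
hot = [symbol for self_pct, _total_pct, symbol in frames if self_pct > 0.0]
print(hot)
```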

And why couldn't I reproduce it on the Mac Mini in our office? Because I had a screen connected to it. As soon as I disconnect the screen, the synchronization phenomenon kicks in.

So there are three things we could do now:
 1. Leave things as they are and land the patch in bug 517804 anyway. The
    ts_shutdown regression is just an artifact of unrealistic testing
    conditions.
 2. Buy lots of screens and attach them to all of our Talos machines.
    Probably not the cheapest idea.
 3. Run a VNC client on all of our Talos machines, watching themselves, in
    order to emulate screen-attachedness. I've tried this, it works.

I'd like to argue for 1 now, and 3 when people think it's necessary.
Comment 3 Markus Stange [:mstange] 2009-10-08 18:30:58 PDT
For the record:
This bug has hit two bugs independently: bug 517804 and bug 334697. Both patches changed paint timing in subtle ways, causing the second occurrence of this bug during a ts cycle to move from before onload to after onload, and thus moving 100ms from ts to ts_shutdown.
Comment 4 Boris Zbarsky [:bz] (Out June 25-July 6) 2009-10-08 18:41:11 PDT
So... couldn't this exact phenomenon be biting Tp and such?  It seems to me that if we're going to take any of our Mac T numbers seriously, we need 3 (which need not interfere with 1).  Or something.  100ms moving around between pages on Tp or between tests on Tdhtml is a huge number.

ccing some folks who might be interested.
Comment 5 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2009-10-08 18:49:41 PDT
Yes, it seems pretty clear we have to do #3.
Comment 6 alice nodelman [:alice] [:anode] 2009-10-08 18:52:11 PDT
We've only reproduced this on leopard, so we'd need far more testing on all platforms before making any radical changes to slaves.
Comment 7 Markus Stange [:mstange] 2009-10-08 18:55:44 PDT
Make that "all of our Leopard Talos machines" instead of "all of our Talos machines". I don't think anybody is asking for doing this on non-Leopard machines.
Comment 8 Boris Zbarsky [:bz] (Out June 25-July 6) 2009-10-08 19:02:38 PDT
All platforms as in all mac platforms?  This is pretty likely to be a very mac-specific issue...
Comment 9 Josh Aas 2009-10-08 20:04:33 PDT
We should file a bug with Apple about this. They might not consider it to be a bug but we should make sure they know about it.
Comment 10 Vladimir Vukicevic [:vlad] [:vladv] 2009-10-13 17:58:09 PDT
There's another option -- we should be able to get dongles that pretend that a monitor is connected (I think there's a paperclip solution as well?) for pretty cheap (like single-digit $), and should just plug them into all our slaves.  Running a local VNC client might introduce additional overhead (like if the VNC screen capture/update happens right during a page), though that should be far far less than the 100ms seen here.
Comment 11 Chris AtLee [:catlee] 2009-10-15 06:48:49 PDT
I thought we had dongles on all our talos slaves already?
Comment 12 John Ford [:jhford] 2009-10-19 13:04:56 PDT
IT: Is it correct that all of our talos slaves have vga adapters with resistors in them?
Comment 13 matthew zeier [:mrz] 2009-10-19 14:22:20 PDT
Phong will know for sure but I think it's mostly the ones running Windows.
Comment 14 Phong Tran [:phong] 2009-10-19 14:27:55 PDT
only windows and vista minis have them.
Comment 15 Phong Tran [:phong] 2009-10-19 14:40:20 PDT
I mean windows (vista & XP) and Linux minis have them.
Comment 16 matthew zeier [:mrz] 2009-10-21 15:02:11 PDT
IT action - resistors on every Mini.
Comment 17 Markus Stange [:mstange] 2009-11-01 13:56:52 PST
What's the ETA here?
Comment 18 Markus Stange [:mstange] 2009-11-10 01:40:12 PST
We just had another spurious 100ms regression in the startup test:
http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/2a79d5e6c9fcb517
Comment 19 Paul O'Shannessy [:zpao] (not reading much bugmail, email directly) 2009-11-12 12:08:39 PST
It's been suggested this is potentially the cause of another recent regression. See bug 514490 comment #41
Comment 20 matthew zeier [:mrz] 2009-11-18 11:32:51 PST
Been having problems sourcing a lot of resistors - found them at digikey.

For future reference, 
http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=100XBK-ND
Comment 21 Phong Tran [:phong] 2009-12-01 12:50:02 PST
added resistor to all the Talos Leopard minis.
Comment 22 alice nodelman [:alice] [:anode] 2009-12-01 16:51:18 PST
Hopefully this includes the talos try machines and any machines currently hosed at MV?
Comment 23 Chris Cooper [:coop] 2010-01-08 09:37:51 PST
Re-assigning to IT since that's where the work is happening.

(In reply to comment #22)
> Hopefully this includes the talos try machines and any machines currently hosed
> at MV?

Alice: which machines are currently hosed, or are they fixed by now?

Phong: did resistors get added to try talos machines also?
Comment 24 Phong Tran [:phong] 2010-01-20 14:47:17 PST
can you give me a list of which machines need to be double checked?
Comment 25 matthew zeier [:mrz] 2010-02-03 10:51:46 PST
Re-open if there's a list.
Comment 26 Dão Gottwald [:dao] 2010-03-14 09:48:31 PDT
I seem to be running into this on the try server, e.g. qm-pleopard-try07.
Comment 27 Phong Tran [:phong] 2010-03-18 10:19:13 PDT
(In reply to comment #26)
> I seem to be running into this on the try server, e.g. qm-pleopard-try07.

This mini has the resistor installed.  Not sure if there is much else we can do on our end.
Comment 28 Chris Cooper [:coop] 2010-03-25 12:48:51 PDT
(In reply to comment #27)
> (In reply to comment #26)
> > I seem to be running into this on the try server, e.g. qm-pleopard-try07.
> 
> This mini has the resistor installed.  Not sure if there is much else we can do
> on our end.

Dão: has this recurred on qm-pleopard-try07 (or any other mac try slave for that matter)?
Comment 29 Dão Gottwald [:dao] 2010-03-25 14:24:03 PDT
I don't know. The patch I was using landed on mozilla-central (without problems).
Comment 30 Chris Cooper [:coop] 2010-03-25 14:46:39 PDT
(In reply to comment #27)
> This mini has the resistor installed.  Not sure if there is much else we can do
> on our end.

(In reply to comment #29)
> I don't know. The patch I was using landed on mozilla-central (without
> problems).

Given these 2 points, we'll flag qm-pleopard-try07 as potentially bad and move on.
Comment 31 alice nodelman [:alice] [:anode] 2010-05-04 17:21:15 PDT
This issue was re-introduced when we switched to the rev3 talos boxes.

All talos rev3 leopard + snow leopard boxes should have resistors installed.
Comment 32 Justin Dow [:jabba] 2010-05-05 11:17:51 PDT
>  3. Run a VNC client on all of our Talos machines, watching themselves, in
>     order to emulate screen-attachedness. I've tried this, it works.
> 

The rev3 minis don't have resistors. The reason is that they have a different DVI output and we'd need to buy adapters for each one. I think the solution of running a local VNC client on each one is the correct solution here. My understanding is that the extra overhead is acceptable as long as it is the same on all boxes in the pool?
Comment 33 Chris AtLee [:catlee] 2010-05-07 14:17:51 PDT
Blocked on IT adding new dongles.
Comment 34 Smokey Ardisson (offline for a while; not following bugs - do not email) 2010-05-10 09:32:23 PDT
(In reply to comment #9)
> We should file a bug with Apple about this. They might not consider it to be a
> bug but we should make sure they know about it.

Did anyone ever rdar:// this?
Comment 35 Markus Stange [:mstange] 2010-05-10 10:08:44 PDT
Apparently I didn't. I will.
Comment 36 Dave Townsend [:mossop] 2010-05-11 11:37:49 PDT
*** Bug 563187 has been marked as a duplicate of this bug. ***
Comment 37 Justin Dow [:jabba] 2010-05-14 10:57:02 PDT
The resistors have been installed on all Leopard Talos machines. Can this bug be closed?
Comment 38 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2010-05-18 19:51:06 PDT
(In reply to comment #37)
> The resistors have been installed on all Leopard Talos machines. Can this bug be
> closed?

I believe this is a DUP of bug 563836, which is now fixed.

If you are still seeing problems, then obviously it's not a DUP, so please reopen.

*** This bug has been marked as a duplicate of bug 563836 ***
