CGContextFillPath sometimes takes over 100ms on Talos, Talos machines should behave as if there was a screen connected to them

RESOLVED DUPLICATE of bug 563836

Status

Release Engineering
Other
P3
normal
RESOLVED DUPLICATE of bug 563836
8 years ago
4 years ago

People

(Reporter: mstange, Unassigned)

Tracking

(Blocks: 1 bug, {perf})

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [talos][hardware])

(Reporter)

Description

8 years ago
Sometimes, CGContextFillPath takes over 100ms instead of the usual 0 to 20ms. When this happens, it only happens during the first CGContextFillPath of the current painting cycle (i.e. in the first call to it since the beginning of drawRect).

I haven't found a consistent pattern for when this happens. I only know that during the startup performance test it consistently happens during the 3rd and 6th repaint of every cycle.
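A minimal sketch of how such stalls could be picked out of logged timings. It assumes each repaint has been logged as a list of per-call fill durations in milliseconds. The log values and the helper name `stalled_repaints` are hypothetical. Only the 100ms threshold and the 3rd/6th pattern come from the report above.

```python
from typing import List

# From the report: a stalled fill takes over 100 ms, a normal one 0 to 20 ms.
STALL_THRESHOLD_MS = 100.0


def stalled_repaints(repaints: List[List[float]],
                     threshold_ms: float = STALL_THRESHOLD_MS) -> List[int]:
    """Return 1-based indices of repaints whose FIRST fill exceeded the threshold."""
    return [
        index
        for index, fills in enumerate(repaints, start=1)
        if fills and fills[0] > threshold_ms
    ]


# Hypothetical timings for one startup cycle (not measured data).
repaints = [
    [4.0, 2.5],
    [12.0],
    [118.0, 3.0, 1.0],
    [7.0],
    [0.5, 19.0],
    [104.0, 6.0],
]

print(stalled_repaints(repaints))  # [3, 6]
```

Only the first fill of each repaint is checked, since that is the only call where the stall has been observed.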

And the worst thing is that I can't reproduce this effect locally. Maybe it's a general Talos weirdness, maybe it's a bug in Mac OS 10.5.2 that's fixed in later versions, or maybe I'm just not emulating Talos conditions on my machine closely enough.

This phenomenon has existed independently of the patch in bug 517804, but it's responsible for the ts_shutdown "regression" that the patch caused. The patch in bug 517804 reduces the number of repaints before onload from 6 to 5, so the 6th repaint moves from before onload to after onload, and its additional 100ms moves into the ts_shutdown numbers.
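The accounting can be sketched with a few lines of arithmetic. This assumes the stall is exactly 100ms, that it still hits the 3rd and 6th repaint after the patch, and that everything after onload is charged to ts_shutdown. The function name `split_stall_cost` is made up for illustration.

```python
STALL_MS = 100  # approximate stall size from the report


def split_stall_cost(repaints_before_onload, stalled=(3, 6), stall_ms=STALL_MS):
    """Return (stall ms charged before onload, stall ms charged after onload)."""
    before = sum(stall_ms for r in stalled if r <= repaints_before_onload)
    after = sum(stall_ms for r in stalled if r > repaints_before_onload)
    return before, after


print(split_stall_cost(6))  # (200, 0): without the patch, both stalls land before onload
print(split_stall_cost(5))  # (100, 100): with the patch, the 6th repaint's stall lands after onload
```

The total stall cost is the same in both cases. Only the bucket it is reported in changes.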

Updated

8 years ago
Blocks: 334697

Updated

8 years ago
blocking2.0: --- → ?
(Reporter)

Updated

8 years ago
Depends on: 520512
(Reporter)

Comment 1

8 years ago
What I've found by now is a little unsettling: The phenomenon goes away as soon as I watch the machine via VNC. If I close the VNC window (and watch browser_output.txt through the SSH shell), the phenomenon occurs again.

So maybe we should just have a connected VNC session for every Mac Talos machine all the time? Just kidding ;-)

I've even got Shark to take a profile, by launching the test from the Terminal in the VNC session and quickly closing the VNC window afterwards. The only problem is that this Shark session isn't helpful at all because my build doesn't have any symbols. :(

I'm currently working on getting a build with proper symbols.
(Reporter)

Comment 2

8 years ago
Fun stuff!

The profile looks like this:

0.0%  7.0%  CGContextFillRect  
0.0%  7.0%   CGContextFillRects  
0.0%  7.0%    ripc_DrawRects  
0.0%  7.0%     ripc_Render  
0.0%  7.0%      ripl_BltShape  
0.0%  7.0%       ripd_Lock  
0.0%  7.0%        CGSDeviceLock  
0.0%  7.0%         _CGSLockWindow  
0.0%  7.0%          _CGSSynchronizeWindowBackingStore  
0.0%  7.0%            mach_msg  
7.0%  7.0%             mach_msg_trap  

In other words, we're synchronizing something with the window server.
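For anyone who wants to pull the hot frame out of a tree like this mechanically, here is a rough sketch. It assumes the first column is self time and the second is total time, and it is plain text parsing, not part of Shark.

```python
PROFILE = """\
0.0%  7.0%  CGContextFillRect
0.0%  7.0%   CGContextFillRects
0.0%  7.0%    ripc_DrawRects
0.0%  7.0%     ripc_Render
0.0%  7.0%      ripl_BltShape
0.0%  7.0%       ripd_Lock
0.0%  7.0%        CGSDeviceLock
0.0%  7.0%         _CGSLockWindow
0.0%  7.0%          _CGSSynchronizeWindowBackingStore
0.0%  7.0%            mach_msg
7.0%  7.0%             mach_msg_trap
"""


def parse_profile(text):
    """Parse 'self% total% symbol' lines into (self_pct, total_pct, symbol) tuples."""
    frames = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        self_pct, total_pct, symbol = parts
        frames.append((float(self_pct.rstrip("%")), float(total_pct.rstrip("%")), symbol))
    return frames


frames = parse_profile(PROFILE)
# Frames with non-zero self time are where the time is actually spent.
hot = [symbol for self_pct, _, symbol in frames if self_pct > 0]
print(hot)  # ['mach_msg_trap']
```

All of the time sits in `mach_msg_trap`, i.e. waiting on a Mach message, which fits the window server synchronization reading above.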

And why couldn't I reproduce it on the Mac Mini in our office? Because I had a screen connected to it. As soon as I disconnect the screen, the synchronization phenomenon kicks in.

So there are three things we could do now:
 1. Leave things as they are and land the patch in bug 517804 anyway. The
    ts_shutdown regression is just an artifact of unrealistic testing
    conditions.
 2. Buy lots of screens and attach them to all of our Talos machines.
    Probably not the cheapest idea.
 3. Run a VNC client on all of our Talos machines, watching themselves, in
    order to emulate screen-attachedness. I've tried this, it works.

I'd like to argue for 1 now, and 3 when people think it's necessary.
(Reporter)

Comment 3

8 years ago
For the record:
This bug has hit two bugs independently: bug 517804 and bug 334697. Both patches changed paint timing in subtle ways, causing the second occurrence of this bug during a ts cycle to move from before onload to after onload, and thus moving 100ms from ts to ts_shutdown.
Component: Graphics → Release Engineering
Product: Core → mozilla.org
QA Contact: thebes → release
Summary: CGContextFillPath sometimes takes over 100ms on Talos → CGContextFillPath sometimes takes over 100ms on Talos, Talos machines should behave as if there was a screen connected to them
Version: Trunk → other
So... couldn't this exact phenomenon be biting Tp and such?  It seems to me that if we're going to take any of our Mac T numbers seriously, we need 3 (which need not interfere with 1).  Or something.  100ms moving around between pages on Tp or between tests on Tdhtml is a huge number.

ccing some folks who might be interested.
Yes, it seems pretty clear we have to do #3.
We've only reproduced this on leopard, so we'd need far more testing on all platforms before making any radical changes to slaves.
(Reporter)

Comment 7

8 years ago
Change "all of our Talos machines" to "all of our Leopard Talos machines". I don't think anybody is asking to do this on non-Leopard machines.
All platforms as in all mac platforms?  This is pretty likely to be a very mac-specific issue...

Comment 9

8 years ago
We should file a bug with Apple about this. They might not consider it to be a bug but we should make sure they know about it.
There's another option -- we should be able to get dongles that pretend that a monitor is connected (I think there's a paperclip solution as well?) for pretty cheap (like single-digit $), and should just plug them into all our slaves.  Running a local VNC client might introduce additional overhead (like if the VNC screen capture/update happens right during a page), though that should be far far less than the 100ms seen here.
I thought we had dongles on all our talos slaves already?
IT: Is it correct that all of our talos slaves have vga adapters with resistors in them?
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Phong will know for sure but I think it's mostly the ones running Windows.
Assignee: server-ops → phong

Comment 14

8 years ago
only windows and vista minis have them.

Comment 15

8 years ago
I mean windows (vista & XP) and Linux minis have them.
IT action - resistors on every Mini.
(Reporter)

Comment 17

8 years ago
What's the ETA here?
(Reporter)

Comment 18

8 years ago
We just had another spurious 100ms regression in the startup test:
http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/2a79d5e6c9fcb517
It's been suggested this is potentially the cause of another recent regression. See bug 514490 comment #41
Been having problems sourcing a lot of resistors - found them at digikey.

For future reference, 
http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=100XBK-ND

Comment 21

8 years ago
Added resistors to all the Talos Leopard minis.
Assignee: phong → nobody
Component: Server Operations → Release Engineering
QA Contact: mrz → release
Hopefully this includes the talos try machines and any machines currently hosed at MV?
Re-assigning to IT since that's where the work is happening.

(In reply to comment #22)
> Hopefully this includes the talos try machines and any machines currently hosed
> at MV?

Alice: which machines are currently hosed, or are they fixed by now?

Phong: did resistors get added to try talos machines also?
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz

Updated

7 years ago
Assignee: server-ops → phong

Comment 24

7 years ago
can you give me a list of which machines need to be double checked?
Re-open if there's a list.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → INCOMPLETE
I seem to be running into this on the try server, e.g. qm-pleopard-try07.
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

Comment 27

7 years ago
(In reply to comment #26)
> I seem to be running into this on the try server, e.g. qm-pleopard-try07.

This mini has the resistor installed.  Not sure if there is much else we can do on our end.
Assignee: phong → nobody
Component: Server Operations → Release Engineering
QA Contact: mrz → release
(In reply to comment #27)
> (In reply to comment #26)
> > I seem to be running into this on the try server, e.g. qm-pleopard-try07.
> 
> This mini has the resistor installed.  Not sure if there is much else we can do
> on our end.

Dão: has this recurred on qm-pleopard-try07 (or any other mac try slave for that matter)?
I don't know. The patch I was using landed on mozilla-central (without problems).
(In reply to comment #27)
> This mini has the resistor installed.  Not sure if there is much else we can do
> on our end.

(In reply to comment #29)
> I don't know. The patch I was using landed on mozilla-central (without
> problems).

Given these 2 points, we'll flag qm-pleopard-try07 as potentially bad and move on.
Status: REOPENED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED

Updated

7 years ago
Whiteboard: [badslave?]
This issue was re-introduced when we switched to the rev3 talos boxes.

All talos rev3 leopard + snow leopard boxes should have resistors installed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 563836
Blocks: 563187

Comment 32

7 years ago
>  3. Run a VNC client on all of our Talos machines, watching themselves, in
>     order to emulate screen-attachedness. I've tried this, it works.
> 

The rev3 minis don't have resistors because they have a different DVI output, and we'd need to buy adapters for each one. I think running a local VNC client on each one is the right solution here. My understanding is that the extra overhead is acceptable as long as it is the same on all boxes in the pool?
Blocks: 564125
Blocked on IT adding new dongles.
Priority: -- → P3
Whiteboard: [badslave?]
(In reply to comment #9)
> We should file a bug with Apple about this. They might not consider it to be a
> bug but we should make sure they know about it.

Did anyone ever rdar:// this?
(Reporter)

Comment 35

7 years ago
Apparently I didn't. I will.
No longer blocks: 563187
Duplicate of this bug: 563187

Updated

7 years ago
Whiteboard: [talos][hardware]

Comment 37

7 years ago
The resistors have been installed on all Leopard Talos machines. Can this bug be closed?
(In reply to comment #37)
> The resistors have been installed on all Leopard Talos machines. Can this bug be
> closed?

I believe this is a DUP of bug 563836, which is now fixed.

If you are still seeing problems, then obviously it's not a DUP, so please reopen.
Status: REOPENED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 563836
blocking2.0: ? → ---
(Assignee)

Updated

4 years ago
Product: mozilla.org → Release Engineering