519893 - CGContextFillPath sometimes takes over 100ms on Talos, Talos machines should behave as if there was a screen connected to them

Reporter

Description

•

16 years ago

Sometimes CGContextFillPath takes over 100ms instead of the usual 0 to 20ms. When this happens, it only affects the first CGContextFillPath call of the current painting cycle (i.e. the first call since the beginning of drawRect). I haven't found a consistent pattern for when this happens; I only know that during the startup performance test it consistently happens during the 3rd and 6th repaint of every cycle.

The worst thing is that I can't reproduce this effect locally. Maybe it's general Talos weirdness, maybe it's a bug in Mac OS X 10.5.2 that's fixed in later versions, or maybe I'm just not emulating Talos conditions on my machine closely enough.

This phenomenon has existed independently of the patch in bug 517804, but it's responsible for the ts_shutdown "regression" that that patch caused. The patch in bug 517804 reduces the number of repaints before onload from 6 to 5, so the 6th repaint moves from before onload to after onload, and its additional 100ms moves into the ts_shutdown numbers.
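The pattern described above (a stall only on the first fill of a paint cycle) could be checked for in timing data with something like the following. This is only a sketch: the per-cycle lists of fill durations, the sample numbers, and the function name are hypothetical and are not actual Talos output or Gecko code. Only the 100ms threshold and the "first fill of a paint cycle" criterion come from the description.

```python
# Hypothetical helper: flag paint cycles whose *first* fill call stalled.
# Each inner list holds the fill durations (in ms) recorded during one
# drawRect cycle, in call order. The data layout is an assumption.

STALL_THRESHOLD_MS = 100  # "over 100ms" per the description


def cycles_with_stalled_first_fill(cycles, threshold_ms=STALL_THRESHOLD_MS):
    """Return 1-based indices of paint cycles whose first fill exceeded the threshold."""
    return [
        index
        for index, fills in enumerate(cycles, start=1)
        if fills and fills[0] > threshold_ms
    ]


# Made-up sample shaped like the report: repaints 3 and 6 stall on their first fill.
sample = [[5, 3], [12, 4], [118, 6], [8], [15, 2], [104, 9]]
print(cycles_with_stalled_first_fill(sample))  # [3, 6]
```

Later fills in a cycle are deliberately ignored, since the report says only the first call after the start of drawRect is affected.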

Dão Gottwald [:dao]

Updated

•

16 years ago

Blocks: 334697

Dão Gottwald [:dao]

Updated

•

16 years ago

blocking2.0: --- → ?

Markus Stange [:mstange]

Reporter

Updated

•

16 years ago

Depends on: 520512

Markus Stange [:mstange]

Reporter

Comment 1

•

16 years ago

What I've found by now is a little unsettling: The phenomenon goes away as soon as I watch the machine via VNC. If I close the VNC window (and watch browser_output.txt through the SSH shell), the phenomenon occurs again. So maybe we should just have a connected VNC session for every Mac Talos machine all the time? Just kidding ;-) I've even got Shark to take a profile, by launching the test from the Terminal in the VNC session and quickly closing the VNC window afterwards. The only problem is that this Shark session isn't helpful at all because my build doesn't have any symbols. :( I'm currently working on getting a build with proper symbols.

Markus Stange [:mstange]

Reporter

Comment 2

•

16 years ago

Fun stuff! The profile looks like this:

0.0% 7.0% CGContextFillRect
0.0% 7.0% CGContextFillRects
0.0% 7.0% ripc_DrawRects
0.0% 7.0% ripc_Render
0.0% 7.0% ripl_BltShape
0.0% 7.0% ripd_Lock
0.0% 7.0% CGSDeviceLock
0.0% 7.0% _CGSLockWindow
0.0% 7.0% _CGSSynchronizeWindowBackingStore
0.0% 7.0% mach_msg
7.0% 7.0% mach_msg_trap

In other words, we're synchronizing something with the window server.

And why couldn't I reproduce it on the Mac Mini in our office? Because I had a screen connected to it. As soon as I disconnect the screen, the synchronization phenomenon kicks in.

So there are three things we could do now:

1. Leave things as they are and land the patch in bug 517804 anyway. The ts_shutdown regression is just an artifact of unrealistic testing conditions.
2. Buy lots of screens and attach them to all of our Talos machines. Probably not the cheapest idea.
3. Run a VNC client on all of our Talos machines, watching themselves, in order to emulate screen-attachedness. I've tried this, it works.

I'd like to argue for 1 now, and 3 when people think it's necessary.

Markus Stange [:mstange]

Reporter

Comment 3

•

16 years ago

For the record: This bug has hit two bugs independently: bug 517804 and bug 334697. Both patches changed paint timing in subtle ways, causing the second occurrence of this bug during a ts cycle to move from before onload to after onload, and thus moving 100ms from ts to ts_shutdown.
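The accounting behind this shift can be illustrated with a toy model. It is a sketch under stated assumptions, not how Talos computes its numbers: the 10ms base cost per repaint and the total of 7 repaints are made up, and "before onload" / "after onload" are treated as two simple buckets. Only the 100ms stall, the stalling 3rd and 6th repaints, and the change from 6 to 5 repaints before onload come from this bug.

```python
# Toy model of how a fixed stall migrates between two timing buckets.
BASE_MS = 10            # hypothetical cost of a normal repaint
STALL_MS = 100          # extra cost of a stalled repaint (from this bug)
SLOW_REPAINTS = {3, 6}  # repaints that stall during a startup cycle


def split_paint_time(total_repaints, repaints_before_onload):
    """Return (ms spent before onload, ms spent after onload) for one cycle."""
    before = after = 0
    for n in range(1, total_repaints + 1):
        cost = BASE_MS + (STALL_MS if n in SLOW_REPAINTS else 0)
        if n <= repaints_before_onload:
            before += cost
        else:
            after += cost
    return before, after


# total_repaints=7 is an arbitrary illustrative value.
unpatched = split_paint_time(total_repaints=7, repaints_before_onload=6)
patched = split_paint_time(total_repaints=7, repaints_before_onload=5)
print(unpatched)  # (260, 10)
print(patched)    # (150, 120)
```

The total is identical in both cases (270ms in this model); only the bucket containing the 6th repaint changes, which is why the patch shows up as an apparent improvement in one number and a regression in the other.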

Component: Graphics → Release Engineering

Product: Core → mozilla.org

QA Contact: thebes → release

Summary: CGContextFillPath sometimes takes over 100ms on Talos → CGContextFillPath sometimes takes over 100ms on Talos, Talos machines should behave as if there was a screen connected to them

Version: Trunk → other

Boris Zbarsky [:bzbarsky]

Comment 4

•

16 years ago

So... couldn't this exact phenomenon be biting Tp and such? It seems to me that if we're going to take any of our Mac T numbers seriously, we need 3 (which need not interfere with 1). Or something. 100ms moving around between pages on Tp or between tests on Tdhtml is a huge number. ccing some folks who might be interested.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 5

•

16 years ago

Yes, it seems pretty clear we have to do #3.

alice nodelman [:alice] [:anode]

Comment 6

•

16 years ago

We've only reproduced this on leopard, so we'd need far more testing on all platforms before making any radical changes to slaves.

Markus Stange [:mstange]

Reporter

Comment 7

•

16 years ago

Make that "all of our Leopard Talos machines" rather than "all of our Talos machines". I don't think anybody is asking for this to be done on non-Leopard machines.

Boris Zbarsky [:bzbarsky]

Comment 8

•

16 years ago

All platforms as in all mac platforms? This is pretty likely to be a very mac-specific issue...

Josh Aas

Comment 9

•

16 years ago

We should file a bug with Apple about this. They might not consider it to be a bug but we should make sure they know about it.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Comment 10

•

16 years ago

There's another option -- we should be able to get dongles that pretend that a monitor is connected (I think there's a paperclip solution as well?) for pretty cheap (like single-digit $), and should just plug them into all our slaves. Running a local VNC client might introduce additional overhead (like if the VNC screen capture/update happens right during a page), though that should be far far less than the 100ms seen here.

Chris AtLee [:catlee]

Comment 11

•

16 years ago

I thought we had dongles on all our talos slaves already?

John Ford [:jhford] CET/CEST Berlin Time

Comment 12

•

16 years ago

IT: Is it correct that all of our Talos slaves have VGA adapters with resistors in them?

Assignee: nobody → server-ops

Component: Release Engineering → Server Operations

QA Contact: release → mrz

matthew zeier [:mrz]

Comment 13

•

16 years ago

Phong will know for sure but I think it's mostly the ones running Windows.

Assignee: server-ops → phong

Phong Tran [:phong]

Comment 14

•

16 years ago

only windows and vista minis have them.

Phong Tran [:phong]

Comment 15

•

16 years ago

I mean windows (vista & XP) and Linux minis have them.

matthew zeier [:mrz]

Comment 16

•

16 years ago

IT action - resistors on every Mini.

Markus Stange [:mstange]

Reporter

Comment 17

•

16 years ago

What's the ETA here?

Markus Stange [:mstange]

Reporter

Comment 18

•

15 years ago

We just had another spurious 100ms regression in the startup test: http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/2a79d5e6c9fcb517

Paul O'Shannessy [:zpao] (not bugmail, email directly)

Comment 19

•

15 years ago

It's been suggested this is potentially the cause of another recent regression. See bug 514490 comment #41

matthew zeier [:mrz]

Comment 20

•

15 years ago

Been having problems sourcing a lot of resistors - found them at Digi-Key. For future reference, http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=100XBK-ND

Phong Tran [:phong]

Comment 21

•

15 years ago

Added resistors to all the Talos Leopard minis.

Assignee: phong → nobody

Component: Server Operations → Release Engineering

QA Contact: mrz → release

alice nodelman [:alice] [:anode]

Comment 22

•

15 years ago

Hopefully this includes the talos try machines and any machines currently hosed at MV?

Chris Cooper [:coop] (he/him)

Comment 23

•

15 years ago

Re-assigning to IT since that's where the work is happening.

(In reply to comment #22)
> Hopefully this includes the talos try machines and any machines currently hosed
> at MV?

Alice: which machines are currently hosed, or are they fixed by now?

Phong: did resistors get added to try talos machines also?

Assignee: nobody → server-ops

Component: Release Engineering → Server Operations

QA Contact: release → mrz

chizu

Updated

•

15 years ago

Assignee: server-ops → phong

Phong Tran [:phong]

Comment 24

•

15 years ago

can you give me a list of which machines need to be double checked?

matthew zeier [:mrz]

Comment 25

•

15 years ago

Re-open if there's a list.

Status: NEW → RESOLVED

Closed: 15 years ago

Resolution: --- → INCOMPLETE

Dão Gottwald [:dao]

Comment 26

•

15 years ago

I seem to be running into this on the try server, e.g. qm-pleopard-try07.

Status: RESOLVED → REOPENED

Resolution: INCOMPLETE → ---

Phong Tran [:phong]

Comment 27

•

15 years ago

(In reply to comment #26)
> I seem to be running into this on the try server, e.g. qm-pleopard-try07.

This mini has the resistor installed. Not sure if there is much else we can do on our end.

Assignee: phong → nobody

Component: Server Operations → Release Engineering

QA Contact: mrz → release

Chris Cooper [:coop] (he/him)

Comment 28

•

15 years ago

(In reply to comment #27)
> (In reply to comment #26)
> > I seem to be running into this on the try server, e.g. qm-pleopard-try07.
>
> This mini has the resistor installed. Not sure if there is much else we can do
> on our end.

Dão: has this recurred on qm-pleopard-try07 (or any other mac try slave for that matter)?

Dão Gottwald [:dao]

Comment 29

•

15 years ago

I don't know. The patch I was using landed on mozilla-central (without problems).

Chris Cooper [:coop] (he/him)

Comment 30

•

15 years ago

(In reply to comment #27)
> This mini has the resistor installed. Not sure if there is much else we can do
> on our end.

(In reply to comment #29)
> I don't know. The patch I was using landed on mozilla-central (without
> problems).

Given these 2 points, we'll flag qm-pleopard-try07 as potentially bad and move on.

Status: REOPENED → RESOLVED

Closed: 15 years ago → 15 years ago

Resolution: --- → FIXED

Chris Cooper [:coop] (he/him)

Updated

•

15 years ago

Whiteboard: [badslave?]

alice nodelman [:alice] [:anode]

Comment 31

•

15 years ago

This issue was re-introduced when we switched to the rev3 Talos boxes. All Talos rev3 Leopard and Snow Leopard boxes should have resistors installed.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

alice nodelman [:alice] [:anode]

Updated

•

15 years ago

Depends on: 563836

Dave Townsend [:mossop]

Updated

•

15 years ago

Blocks: 563187

Justin Dow [:jabba]

Comment 32

•

15 years ago

> 3. Run a VNC client on all of our Talos machines, watching themselves, in
> order to emulate screen-attachedness. I've tried this, it works.

The rev3 minis don't have resistors. The reason is that they have a different DVI output and we'd need to buy adapters for each one. I think running a local VNC client on each one is the correct solution here. My understanding is that the extra overhead is acceptable as long as it is the same on all boxes in the pool?

Henri Sivonen (:hsivonen)

Updated

•

15 years ago

Blocks: 564125

Chris AtLee [:catlee]

Comment 33

•

15 years ago

Blocked on IT adding new dongles.

Priority: -- → P3

Whiteboard: [badslave?]

Smokey Ardisson (offline for a while; not following bugs - do not email)

Comment 34

•

15 years ago

(In reply to comment #9)
> We should file a bug with Apple about this. They might not consider it to be a
> bug but we should make sure they know about it.

Did anyone ever rdar:// this?

Markus Stange [:mstange]

Reporter

Comment 35

•

15 years ago

Apparently I didn't. I will.

Dave Townsend [:mossop]

Updated

•

15 years ago

No longer blocks: 563187

Chris Cooper [:coop] (he/him)

Updated

•

15 years ago

Whiteboard: [talos][hardware]

Justin Dow [:jabba]

Comment 37

•

15 years ago

The resistors have been installed on all Leopard Talos machines. Can this bug be closed?

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 38

•

15 years ago

(In reply to comment #37)
> The resistors have been installed on all Leopard Talos machines. Can this bug be
> closed?

I believe this is a DUP of bug 563836, which is now fixed. If you are still seeing problems, then obviously it's not a DUP, so please reopen.

Status: REOPENED → RESOLVED

Closed: 15 years ago → 15 years ago

Resolution: --- → DUPLICATE

Benjamin Smedberg

Updated

•

15 years ago

blocking2.0: ? → ---

Nobody; OK to take it and work on it

Assignee

Updated

•

12 years ago

Product: mozilla.org → Release Engineering