Sometimes CGContextFillPath takes over 100ms instead of the usual 0 to 20ms. When this happens, it only affects the first CGContextFillPath call of the current painting cycle (i.e. the first call to it since the beginning of drawRect). I haven't found a consistent pattern for when this happens; I only know that during the startup performance test it consistently happens during the 3rd and 6th repaint of every cycle. Worst of all, I can't reproduce this effect locally. Maybe it's a general Talos weirdness, maybe it's a bug in Mac OS 10.5.2 that's fixed in later versions, or maybe I'm just not emulating Talos conditions on my machine closely enough. This phenomenon has existed independently of the patch in bug 517804, but it's responsible for the ts_shutdown "regression" that the patch caused. The patch in bug 517804 reduces the number of repaints before onload from 6 to 5, so the 6th repaint moves from before onload to after onload, and the additional 100ms moves into the ts_shutdown numbers.
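To make the symptom described above easier to spot in timing logs, here is a minimal sketch that flags paint cycles whose first fill call exceeds 100ms. The record layout (cycle, call index, duration) and all sample values are hypothetical and invented for illustration; they are not Talos measurements, and nothing here reflects how the original timings were actually collected.

```python
# Hypothetical log records: (paint_cycle, call_index_within_cycle, duration_ms).
# These numbers are made up to mirror the description (slow 3rd and 6th repaint).
SAMPLES = [
    (1, 0, 12.0), (1, 1, 3.0),
    (2, 0, 8.0), (2, 1, 15.0),
    (3, 0, 104.0), (3, 1, 5.0),
    (4, 0, 9.0), (4, 1, 2.0),
    (5, 0, 18.0), (5, 1, 7.0),
    (6, 0, 110.0), (6, 1, 4.0),
]


def slow_first_fills(samples, threshold_ms=100.0):
    """Return the paint cycles whose first fill call took more than threshold_ms."""
    first_call = {}
    for cycle, index, duration_ms in samples:
        if index == 0:  # only the first fill of each cycle is of interest
            first_call[cycle] = duration_ms
    return sorted(c for c, d in first_call.items() if d > threshold_ms)


print(slow_first_fills(SAMPLES))  # [3, 6]
```

Restricting the check to call index 0 matches the observation that only the first call after the start of drawRect is affected.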
What I've found so far is a little unsettling: the phenomenon goes away as soon as I watch the machine via VNC. If I close the VNC window (and watch browser_output.txt through the SSH shell), the phenomenon occurs again. So maybe we should just have a connected VNC session for every Mac Talos machine all the time? Just kidding ;-) I've even got Shark to take a profile, by launching the test from the Terminal in the VNC session and quickly closing the VNC window afterwards. The only problem is that this Shark profile isn't helpful at all because my build doesn't have any symbols. :( I'm currently working on getting a build with proper symbols.
Fun stuff! The profile looks like this:
    0.0%  7.0%  CGContextFillRect
    0.0%  7.0%  CGContextFillRects
    0.0%  7.0%  ripc_DrawRects
    0.0%  7.0%  ripc_Render
    0.0%  7.0%  ripl_BltShape
    0.0%  7.0%  ripd_Lock
    0.0%  7.0%  CGSDeviceLock
    0.0%  7.0%  _CGSLockWindow
    0.0%  7.0%  _CGSSynchronizeWindowBackingStore
    0.0%  7.0%  mach_msg
    7.0%  7.0%  mach_msg_trap
In other words, we're synchronizing something with the window server.
And why couldn't I reproduce it on the Mac Mini in our office? Because I had a screen connected to it. As soon as I disconnect the screen, the synchronization phenomenon kicks in.
So there are three things we could do now:
1. Leave things as they are and land the patch in bug 517804 anyway. The ts_shutdown regression is just an artifact of unrealistic testing conditions.
2. Buy lots of screens and attach them to all of our Talos machines. Probably not the cheapest idea.
3. Run a VNC client on all of our Talos machines, watching themselves, in order to emulate screen-attachedness. I've tried this, it works.
I'd like to argue for 1 now, and 3 when people think it's necessary.
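For anyone who wants to read the Shark excerpt above mechanically, the following is a rough sketch that parses the quoted lines and picks out the frame where the time is actually spent. It assumes the two percentage columns are self time and total time, in that order; that reading is an assumption based on the excerpt, not something Shark's output is confirmed to guarantee. The helper names are made up for this example.

```python
# The profile lines quoted in the comment above, one frame per line.
SHARK_EXCERPT = """\
0.0% 7.0% CGContextFillRect
0.0% 7.0% CGContextFillRects
0.0% 7.0% ripc_DrawRects
0.0% 7.0% ripc_Render
0.0% 7.0% ripl_BltShape
0.0% 7.0% ripd_Lock
0.0% 7.0% CGSDeviceLock
0.0% 7.0% _CGSLockWindow
0.0% 7.0% _CGSSynchronizeWindowBackingStore
0.0% 7.0% mach_msg
7.0% 7.0% mach_msg_trap
"""


def parse_frames(text):
    """Parse 'self% total% symbol' lines into (self_pct, total_pct, symbol) tuples."""
    frames = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        self_pct, total_pct, symbol = parts
        frames.append((float(self_pct.rstrip("%")), float(total_pct.rstrip("%")), symbol))
    return frames


def hottest_self_frame(frames):
    """Return the symbol with the largest self-time percentage."""
    return max(frames, key=lambda frame: frame[0])[2]


frames = parse_frames(SHARK_EXCERPT)
print(hottest_self_frame(frames))  # mach_msg_trap
```

Under that column assumption, all of the 7.0% lands in mach_msg_trap, which is consistent with the conclusion that the drawing call is waiting on a message exchange with the window server rather than doing rendering work itself.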
For the record: This bug has hit two bugs independently: bug 517804 and bug 334697. Both patches changed paint timing in subtle ways, causing the second occurrence of this bug during a ts cycle to move from before onload to after onload, thus moving 100ms from ts to ts_shutdown.
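The bookkeeping behind that shift can be sketched with a toy model. It assumes, as a simplification, that the stall always hits the 3rd and 6th repaint and costs exactly 100ms each time, and that everything before onload is charged to ts while everything after is charged to ts_shutdown. The function name and parameters are hypothetical.

```python
def split_stall_cost(repaints_before_onload, stalled_repaints=(3, 6), stall_ms=100):
    """Return (ms charged before onload, ms charged after onload) for the stalls."""
    before = sum(stall_ms for r in stalled_repaints if r <= repaints_before_onload)
    after = sum(stall_ms for r in stalled_repaints if r > repaints_before_onload)
    return before, after


# Six repaints before onload: both stalls are counted before onload.
print(split_stall_cost(6))  # (200, 0)

# Five repaints before onload (as with the patch in bug 517804):
# the 6th repaint, and its stall, now falls after onload.
print(split_stall_cost(5))  # (100, 100)
```

The total stall cost is unchanged in both cases; only the bucket it is reported in differs, which is why the ts_shutdown change looks like a regression without any real slowdown.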
So... couldn't this exact phenomenon be biting Tp and such? It seems to me that if we're going to take any of our Mac T numbers seriously, we need 3 (which need not interfere with 1). Or something. 100ms moving around between pages on Tp or between tests on Tdhtml is a huge number. ccing some folks who might be interested.
Yes, it seems pretty clear we have to do #3.
We've only reproduced this on leopard, so we'd need far more testing on all platforms before making any radical changes to slaves.
Make "all of our Talos machines" "all of our Leopard Talos machines". I don't think anybody is asking for doing this on non-Leopard machines.
All platforms as in all mac platforms? This is pretty likely to be a very mac-specific issue...
We should file a bug with Apple about this. They might not consider it to be a bug but we should make sure they know about it.
There's another option -- we should be able to get dongles that pretend that a monitor is connected (I think there's a paperclip solution as well?) for pretty cheap (like single-digit $), and should just plug them into all our slaves. Running a local VNC client might introduce additional overhead (like if the VNC screen capture/update happens right during a page), though that should be far far less than the 100ms seen here.
I thought we had dongles on all our talos slaves already?
IT: Is it correct that all of our talos slaves have vga adapters with resistors in them?
Phong will know for sure but I think it's mostly the ones running Windows.
only windows and vista minis have them.
I mean windows (vista & XP) and Linux minis have them.
IT action - resistors on every Mini.
What's the ETA here?
We just had another spurious 100ms regression in the startup test: http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/2a79d5e6c9fcb517
It's been suggested this is potentially the cause of another recent regression. See bug 514490 comment #41
Been having problems sourcing a lot of resistors; found them at Digi-Key. For future reference, http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=100XBK-ND
added resistor to all the Talos Leopard minis.
Hopefully this includes the talos try machines and any machines currently hosed at MV?
Re-assigning to IT since that's where the work is happening.
(In reply to comment #22)
> Hopefully this includes the talos try machines and any machines currently hosed at MV?
Alice: which machines are currently hosed, or are they fixed by now?
Phong: did resistors get added to try talos machines also?
can you give me a list of which machines need to be double checked?
Re-open if there's a list.
I seem to be running into this on the try server, e.g. qm-pleopard-try07.
(In reply to comment #26)
> I seem to be running into this on the try server, e.g. qm-pleopard-try07.
This mini has the resistor installed. Not sure if there is much else we can do on our end.
(In reply to comment #27)
> (In reply to comment #26)
> > I seem to be running into this on the try server, e.g. qm-pleopard-try07.
>
> This mini has the resistor installed. Not sure if there is much else we can do on our end.
Dão: has this recurred on qm-pleopard-try07 (or any other mac try slave for that matter)?
I don't know. The patch I was using landed on mozilla-central (without problems).
(In reply to comment #27)
> This mini has the resistor installed. Not sure if there is much else we can do on our end.
(In reply to comment #29)
> I don't know. The patch I was using landed on mozilla-central (without problems).
Given these 2 points, we'll flag qm-pleopard-try07 as potentially bad and move on.
This issue was re-introduced when we switched to the rev3 talos boxes. All talos rev3 leopard + snow leopard boxes should have resistors installed.
> 3. Run a VNC client on all of our Talos machines, watching themselves, in order to emulate screen-attachedness. I've tried this, it works.
The rev3 minis don't have resistors. The reason is that they have a different DVI output and we'd need to buy adapters for each one. I think running a local VNC client on each one is the correct solution here. My understanding is that the extra overhead is acceptable as long as it is the same on all boxes in the pool?
Blocked on IT adding new dongles.
(In reply to comment #9)
> We should file a bug with Apple about this. They might not consider it to be a bug but we should make sure they know about it.
Did anyone ever rdar:// this?
Apparently I didn't. I will.
The resistors have been installed on all Leopard Talos machines. Can this bug be closed?
(In reply to comment #37)
> The resistors have been installed on all Leopard Talos machines. Can this bug be closed?
I believe this is a DUP of bug#563836, which is now fixed. If you are still seeing problems, then obviously it's not a DUP, and please reopen.