http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269641898.1269643459.22416.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269645992.1269646696.30869.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269643193.1269644577.25489.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269645640.1269646793.31038.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269647992.1269649186.4685.gz

The failure mode on x64 is particularly ugly, since the hang is detected, the harness crashes the browser, and then, what, fails to find symbols and turns itself red?

NOISE: Found crashdump: /tmp/tmp0BLv5c/profile/minidumps/3cbfc11e-b685-3a54-0d89ca94-086ff1dd.dmp
Traceback (most recent call last):
  File "run_tests.py", line 462, in <module>
    test_file(arg)
  File "run_tests.py", line 424, in test_file
    browser_dump, counter_dump, print_format = mytest.runTest(browser_config, test)
  File "/home/cltbld/talos-slave/talos-data/talos/ttest.py", line 382, in runTest
    self.checkForCrashes(browser_config, profile_dir)
  File "/home/cltbld/talos-slave/talos-data/talos/ttest.py", line 160, in checkForCrashes
    subprocess.call([stackwalkbin, dump, browser_config['symbols_path']], stderr=nullfd)
  File "/home/cltbld/lib/python2.5/subprocess.py", line 444, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/home/cltbld/lib/python2.5/subprocess.py", line 594, in __init__
    errread, errwrite)
  File "/home/cltbld/lib/python2.5/subprocess.py", line 1097, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
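That OSError is subprocess raising "No such file or directory" because the stackwalk binary it was told to exec doesn't exist on the slave. A minimal sketch of a friendlier guard (the helper name `run_stackwalk` is hypothetical; the real harness calls subprocess.call directly in ttest.py's checkForCrashes):

```python
import os
import subprocess

def run_stackwalk(stackwalk_bin, dump_path, symbols_path):
    """Run the minidump stackwalker, failing gracefully if the binary is
    missing (hypothetical helper; the real talos code calls subprocess.call
    directly, which raises a bare OSError when the executable is absent)."""
    if not os.path.exists(stackwalk_bin):
        # Report a readable error instead of an OSError traceback.
        return "stackwalk binary not found: %s" % stackwalk_bin
    with open(os.devnull, "w") as nullfd:
        subprocess.call([stackwalk_bin, dump_path, symbols_path],
                        stderr=nullfd)
    return None
```

With a check like this the log would name the missing binary instead of turning the run red with a raw traceback.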
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269668380.1269669071.12651.gz Rev3 Fedora 12x64 mozilla-central talos scroll on 2010/03/26 22:39:40
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269677274.1269678029.31212.gz Rev3 Fedora 12x64 mozilla-central talos scroll on 2010/03/27 01:07:54
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269664939.1269666243.6015.gz Rev3 Fedora 12 mozilla-central talos scroll on 2010/03/26 21:42:19
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269665613.1269666716.6915.gz Rev3 Fedora 12 mozilla-central talos scroll on 2010/03/26 21:53:33
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269674822.1269675918.26317.gz Rev3 Fedora 12 mozilla-central talos scroll on 2010/03/27 00:27:02
Just to be clear, was the Talos scroll testsuite just added? (and started hitting this bug immediately?) It looks like that to me from a quick glance at tbpl, just want to double-check & establish that.
(In reply to comment #4)
> Just to be clear, was the Talos scroll testsuite just added? (and started
> hitting this bug immediately?) It looks like that to me from a quick glance at
> tbpl, just want to double-check & establish that.

Yes, it was just turned on yesterday. I believe Jeff developed the test in bug 534819. Jeff, any ideas why it's failing?
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269792339.1269793414.10466.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269767043.1269768201.2057.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269771752.1269772886.13503.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269766834.1269767516.32537.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269774944.1269775706.26398.gz
This is permanently red on Linux64.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269821360.1269822802.14306.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269821421.1269822144.13052.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269819462.1269820657.10081.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269820401.1269821070.11114.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269817840.1269818945.5288.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269817882.1269818568.4529.gz
I've hidden this column Rev3 Fedora 12x64 mozilla-central talos scroll on Firefox and AddonsMgr trees, and hopefully Alice can take a look.
Thanks Nick. Tscroll is permanently orange on Linux-32 as well.
There were a couple of green runs on m-c when it first landed, but point taken. Hid the linux 32 scroll column on Firefox.
We don't generate symbol packages for linux64 builds yet, so it's expected (though not very graceful) that we go red when the harness attempts to run the stackwalker. I would concentrate on the linux32 errors, where the stack is readable.
Unless I'm looking at the wrong logs, it doesn't look like the linux32 stacks have symbols either.
We are getting proper stack traces with symbols now.
Looks like it's crashing in X11.
Hidden for Linux (and turned on scraping while I was there) on 1.9.2.
I staged these tests on 1.9.0 primarily. Runs great (always green) on linux 1.9.0 builds. This leads me to believe that this is a browser issue.
(In reply to comment #23) > I staged these tests on 1.9.0 primarily. Runs great (always green) on linux > 1.9.0 builds. This suite is also green on linux on 1.9.1 (FF3.5) for both fedora and fedora64. This suite is orange/red for 1.9.2, mozilla-central, places and lorentz. > This leads me to believe that this is a browser issue. As these are all shared identical machines, and the failures are branch-specific, +1 to the cause being either a browser or testware problem. jrmuizel, I dont know what component to move this to, but can you have a look?
Assignee: nobody → jmuizelaar
Attachment #439030 - Flags: review?(catlee)
Comment on attachment 439030 [details] [diff] [review]
increase timeout, let the test run longer

Checking in bcontroller.py;
/cvsroot/mozilla/testing/performance/talos/bcontroller.py,v <-- bcontroller.py
new revision: 1.15; previous revision: 1.14
done
Attachment #439030 - Flags: checked-in+
Another crash, with stack, under X11.

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1272534858.1272536148.32756.gz

This looks like a code problem. Release Engineering isn't the right component. Who should this be assigned to?
Hidden on TraceMonkey.
Karl could probably look at it.
Assuming (unupdated) libxcb-1.4-1.fc12.i686.rpm, libX11-1.3-1.fc12.i686.rpm, libXScrnSaver-1.2.0-1.fc12.i686.rpm:

Crash reason:  SIGSEGV
Crash address: 0x869

Thread 0 (crashed)
 0  linux-gate.so + 0x424
    eip = 0x008ac424   esp = 0xbf9edbe4   ebp = 0xbf9edc08   ebx = 0xbf9edc48
    esi = 0x00000000   edi = 0x00c58ff4   eax = 0xfffffffc   ecx = 0x00000001
    edx = 0xffffffff   efl = 0x00200293
 1  _xcb_conn_wait /usr/src/debug/libxcb-1.4/src/xcb_conn.c:306
 2  xcb_wait_for_reply /usr/src/debug/libxcb-1.4/src/xcb_in.c:390
 3  _XReply /usr/src/debug/libX11-1.3/src/xcb_io.c:454
 4  XScreenSaverQueryInfo /usr/src/debug/libXScrnSaver-1.2.0/src/XScrnSaver.c:224

IIUC this looks like SIGSEGV was received while in the poll system call. I can't see why the call might generate SIGSEGV:
http://cgit.freedesktop.org/xcb/libxcb/tree/src/xcb_conn.c?id=96ff5fb635bc333e80ebce6cc03008a1b9992ff6

Could SIGSEGV be a kernel code for an invalid fd? Or could something else be sending the SIGSEGV?
Although I am quite often wrong, nobody has challenged my claim in comment 0 that this isn't a crash per se: it's a "hang" as determined by the talos harness not seeing the output it wants within its timeout (hence "__FAILbrowser frozen__FAIL"), at which point it crashes the browser to give you a stack telling you where it was hung.
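The policy described above can be sketched in a few lines: run the browser, and if it hasn't produced a result within the timeout, kill it so a crash reporter can capture where it was stuck. This is a minimal illustration, not the actual talos code (which lives in bcontroller.py and the ffprocess modules), and the function name is made up:

```python
import subprocess
import time

def wait_or_declare_hang(cmd, timeout):
    """Run cmd; declare a hang if it has not finished within `timeout`
    seconds.  A minimal sketch (not the real talos code) of the policy
    described above: on a hang the harness reports
    "__FAILbrowser frozen__FAIL" and crashes the browser so the resulting
    minidump shows where it was stuck."""
    proc = subprocess.Popen(cmd)
    deadline = time.time() + timeout
    while time.time() < deadline:
        if proc.poll() is not None:
            return "exited"        # finished within the window
        time.sleep(0.05)
    proc.kill()                    # stand-in for "crash the browser"
    proc.wait()                    # reap the child so no zombie is left
    return "hang"
```

The key point is that a "hang" here is purely the harness's judgement about elapsed time, which is why the resulting stack shows where the browser happened to be, not a faulting instruction.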
Attachment #442597 - Flags: review?(anodelman)
Attachment #442597 - Flags: review?(anodelman) → review+
In the subset of logs at which I looked, all the "frozen__FAIL"s were after:

NOISE: Cycle 1: loaded http://localhost/page_load_test/scroll/tiled-downscale.html (next: http://localhost/page_load_test/scroll/tiled-fixed-downscale.html)

Even in the successful runs there were some unusually long runs in the middle for tiled-fixed.html, tiled-downscale.html, sometimes iframe.svg, and most significantly tiled-fixed-downscale.html. The long runs were all equally long, even those that were 183 times longer than the short runs.

NOISE: |i|pagename|median|mean|min|max|runs|
NOISE: |0;tiled.html;10059;10059;10059;10064;10059;10059;10059;10064;10059
NOISE: |1;tiled-fixed.html;10066;10066;10064;12279;12279;10068;10068;10064;10064
NOISE: |2;tiled-downscale.html;10066;10064.75;10059;61377;61377;10068;10068;10059;10064
NOISE: |3;tiled-fixed-downscale.html;10061.5;10061.5;10059;1833051;1833051;10064;10059;10059;10064
NOISE: |4;iframe.svg;10441;10452.5;10441;10624;10624;10441;10487;10441;10441
NOISE: |5;reader.htm;10060;10061.25;10060;10065;10060;10065;10065;10060;10060
NOISE: |0;tiled.html;10064;10063.75;10059;10068;10064;10068;10059;10068;10064
NOISE: |1;tiled-fixed.html;10061.5;10061.5;10059;13074;13074;10064;10059;10059;10064
NOISE: |2;tiled-downscale.html;10064;10065;10064;61526;61526;10068;10064;10064;10064
NOISE: |3;tiled-fixed-downscale.html;10064;10062.75;10059;918308;918308;10064;10064;10064;10059
NOISE: |4;iframe.svg;10486.5;10486.5;10441;11718;11718;10441;10441;10532;10532
NOISE: |5;reader.htm;10065;10065.25;10060;10071;10065;10071;10071;10060;10065

Why are most of the numbers so close to 10 seconds?
The 64-bit results seem much more successful, though their results seem to have a bimodal distribution around 10 or 19 ms. The larger result is due to all the runs being equally slow. e.g.

NOISE: |i|pagename|median|mean|min|max|runs|
NOISE: |0;tiled.html;10122;10122.25;10064;10181;10064;10068;10181;10181;10176
NOISE: |1;tiled-fixed.html;10064;10063.75;10059;12613;12613;10064;10059;10064;10068
NOISE: |2;tiled-downscale.html;54752.5;54741.25;54635;55339;55339;54825;54721;54635;54784
NOISE: |3;tiled-fixed-downscale.html;5093274.5;5104638.5;5069121;5208579;5162884;5070574;5069121;5208579;5115975
NOISE: |4;iframe.svg;11809;11911.5;11581;12447;11581;11946;11672;12447;12447
NOISE: |5;reader.htm;10065;10063.75;10060;10065;10060;10065;10065;10065;10065
Almost (but not quite) all of the stacks are in XScreenSaverQueryInfo, which implies either: a) the X server is taking an awfully long time to respond to that query, or b) nsIdleService::GetIdleTime is being called awfully often.
Comment on attachment 442597 [details] [diff] [review]
Use SIGABRT to trigger breakpad in hangs to make these easier to distinguish from SIGSEGV crashes

Checking in ffprocess_linux.py;
/cvsroot/mozilla/testing/performance/talos/ffprocess_linux.py,v <-- ffprocess_linux.py
new revision: 1.20; previous revision: 1.19
done
Attachment #442597 - Flags: checked-in+
Recent logs for linux32 have SIGABRT rather than SIGSEGV, so we're killing a hang?
Yes, philor is also quite often right.
Hidden on MozillaTry
One possibility here is that scrolling was all just requests sent to the server, with usually no responses required from the server. This made the scrolling appear very fast but gave the server plenty of work to do. When the browser eventually waits for a response from the server (XScreenSaverQueryInfo), it has to wait for the server to complete its processing, and the times get noticed. I haven't seen this timeout since retained layers landed. Retained layers now does scrolling asynchronously, so multiple scroll requests are coalesced, giving the server less work.
This bug hasn't occurred since bug 564991 landed, so tscroll can now be unhidden from tinderbox.
However, I think we should ensure that the test waits for the server to complete the scrolling. Among other things, this will give more consistent results that don't depend on whether sessionstore triggers XScreenSaverQueryInfo.
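The measurement hazard described in the last two comments can be shown with a toy model (`FakeXServer` is hypothetical, purely for illustration): async requests return immediately, so a timer stopped before any round-trip sees almost nothing, while the full cost lands on whichever later call happens to need a reply from the server.

```python
class FakeXServer:
    """Toy model of an X server that queues asynchronous requests.
    Scroll requests return immediately; a round-trip request (like
    reading screenX, which XScreenSaverQueryInfo also is) must first
    drain the queue, so the accumulated cost lands on whichever call
    happens to need a reply."""

    def __init__(self, cost_per_request):
        self.cost = cost_per_request  # server-side time per queued request
        self.pending = 0              # requests queued but not processed
        self.clock = 0.0              # simulated elapsed server time

    def scroll(self):
        self.pending += 1             # async: the client doesn't wait

    def round_trip(self):
        # The server must finish all queued work before it can reply.
        self.clock += self.pending * self.cost
        self.pending = 0
        return self.clock

server = FakeXServer(cost_per_request=0.01)
for _ in range(1000):
    server.scroll()
time_before_sync = server.clock        # measured with no round-trip
time_after_sync = server.round_trip()  # queued work becomes visible
```

In the toy model the pre-sync measurement is zero even though the server still owes ten simulated seconds of work, which is the shape of the "blazingly fast" runs: forcing a round-trip before stopping the timer gives results that don't depend on whether sessionstore's idle query drains the queue first.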
Assignee: jmuizelaar → karlt
Attachment #460192 - Flags: review?(jmuizelaar)
(follow existing tab-indentation better)
Attachment #460193 - Flags: review?(jmuizelaar) → review+
Comment on attachment 460193 [details] [diff] [review]
wait for the server to complete the scrolling before recording the completion time

Alice, would you like to put this on staging first, or shall we include it in the releng downtime?
Attachment #460193 - Flags: feedback?(anodelman)
(In reply to comment #24)
> (In reply to comment #23)
> > I staged these tests on 1.9.0 primarily. Runs great (always green) on linux
> > 1.9.0 builds.
> This suite is also green on linux on 1.9.1 (FF3.5) for both fedora and
> fedora64. This suite is orange/red for 1.9.2, mozilla-central, places and
> lorentz.

The slow tests, tiled-downscale.html and tiled-fixed-downscale.html, use -moz-background-size, which wasn't implemented until bug 189519 landed for 1.9.2. Without -moz-background-size, these tests are the same as their non-downscale variants and run quickly.
The tests no longer take so long as to time out because of:
http://hg.mozilla.org/mozilla-central/rev/83728aa66312
In tiled-fixed-downscale.html, the tiled downscaled background is only painted once (after page load) instead of on each scroll step.
I've unhidden these:
Rev3 Fedora 12 mozilla-central talos scroll
Rev3 Fedora 12x64 mozilla-central talos scroll
Rev3 Fedora 12 tryserver talos scroll
Rev3 Fedora 12x64 tryserver talos scroll
Rev3 Fedora 12 tracemonkey talos scroll
Rev3 Fedora 12x64 tracemonkey talos scroll
Attachment #460193 - Flags: feedback?(anodelman) → feedback+
Whiteboard: [attachment 460193 needs check-in
Whiteboard: [attachment 460193 needs check-in → [attachment 460193 needs CVS check-in]
Comment on attachment 460193 [details] [diff] [review]
wait for the server to complete the scrolling before recording the completion time

Checking in page_load_test/scroll/reader-scroll.js;
/cvsroot/mozilla/testing/performance/talos/page_load_test/scroll/reader-scroll.js,v <-- reader-scroll.js
new revision: 1.2; previous revision: 1.1
done
Checking in page_load_test/scroll/scroll-frame.js;
/cvsroot/mozilla/testing/performance/talos/page_load_test/scroll/scroll-frame.js,v <-- scroll-frame.js
new revision: 1.2; previous revision: 1.1
done
Checking in page_load_test/scroll/scroll.js;
/cvsroot/mozilla/testing/performance/talos/page_load_test/scroll/scroll.js,v <-- scroll.js
new revision: 1.2; previous revision: 1.1
done
Attachment #460193 - Flags: checked-in+
Whiteboard: [attachment 460193 needs CVS check-in]
I'm not clear that the last patch made much difference, but it is a little reassuring to have it there.

32-bit 1.9.2 timeouts are still killing the process at points where it communicates with the X server (but I haven't seen this happen when fetching screenX at the completion of scrolling). I saw the process killed at: nsGlobalWindow::GetOuterWidth (from nsSessionStore I'm guessing), nsIdleServiceGTK::GetIdleTime, gdk_display_flush from nsViewManager::UpdateViewAfterScroll.

Some 32 and 64-bit 1.9.2 runs have still completed blazingly fast. Most 64-bit 1.9.2 runs are just slow (normal). Most 32-bit 1.9.2 runs have timed out since ~17-18 May. There was also a change in *64-bit* m-c results ~17-18 May, after which there were only blazingly fast or timed-out runs. (No slow runs completed.)

Maybe the blazingly fast runs are when the screen saver is on, though that doesn't really explain things. I could speculate that maybe video memory is sometimes/usually too full to allocate memory for the tiled downscale in vram, but then there also must be plenty of other parameters affecting what the video driver does.

Since http://hg.mozilla.org/mozilla-central/rev/83728aa66312, this has been fixed on m-c anyway, so marking WFM.
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → Release Engineering