Closed
Bug 555352
Opened 15 years ago
Closed 15 years ago
tscroll mostly hangs on Linux
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: philor, Assigned: karlt)
References
Details
Attachments
(3 files, 1 obsolete file)
989 bytes,
patch
|
catlee
:
review+
mozilla
:
checked-in+
|
Details | Diff | Splinter Review |
1.26 KB,
patch
|
anodelman
:
review+
nthomas
:
checked-in+
|
Details | Diff | Splinter Review |
3.31 KB,
patch
|
jrmuizel
:
review+
anodelman
:
feedback+
coop
:
checked-in+
|
Details | Diff | Splinter Review |
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269641898.1269643459.22416.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269645992.1269646696.30869.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269643193.1269644577.25489.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269645640.1269646793.31038.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269647992.1269649186.4685.gz
The failure mode on x64 is particularly ugly, since the hang is detected, the harness crashes the browser, and then, what, fails to find symbols and turns itself red?
NOISE: Found crashdump: /tmp/tmp0BLv5c/profile/minidumps/3cbfc11e-b685-3a54-0d89ca94-086ff1dd.dmp
Traceback (most recent call last):
File "run_tests.py", line 462, in <module>
test_file(arg)
File "run_tests.py", line 424, in test_file
browser_dump, counter_dump, print_format = mytest.runTest(browser_config, test)
File "/home/cltbld/talos-slave/talos-data/talos/ttest.py", line 382, in runTest
self.checkForCrashes(browser_config, profile_dir)
File "/home/cltbld/talos-slave/talos-data/talos/ttest.py", line 160, in checkForCrashes
subprocess.call([stackwalkbin, dump, browser_config['symbols_path']], stderr=nullfd)
File "/home/cltbld/lib/python2.5/subprocess.py", line 444, in call
return Popen(*popenargs, **kwargs).wait()
File "/home/cltbld/lib/python2.5/subprocess.py", line 594, in __init__
errread, errwrite)
File "/home/cltbld/lib/python2.5/subprocess.py", line 1097, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
Reporter | ||
Comment 1•15 years ago
|
||
Comment 2•15 years ago
|
||
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269668380.1269669071.12651.gz
Rev3 Fedora 12x64 mozilla-central talos scroll on 2010/03/26 22:39:40
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269677274.1269678029.31212.gz
Rev3 Fedora 12x64 mozilla-central talos scroll on 2010/03/27 01:07:54
Comment 3•15 years ago
|
||
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269664939.1269666243.6015.gz
Rev3 Fedora 12 mozilla-central talos scroll on 2010/03/26 21:42:19
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269665613.1269666716.6915.gz
Rev3 Fedora 12 mozilla-central talos scroll on 2010/03/26 21:53:33
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269674822.1269675918.26317.gz
Rev3 Fedora 12 mozilla-central talos scroll on 2010/03/27 00:27:02
Comment 4•15 years ago
|
||
Just to be clear, was the Talos scroll testsuite just added? (and started hitting this bug immediately?) It looks like that to me from a quick glance at tbpl, just want to double-check & establish that.
Comment 5•15 years ago
|
||
(In reply to comment #4)
> Just to be clear, was the Talos scroll testsuite just added? (and started
> hitting this bug immediately?) It looks like that to me from a quick glance at
> tbpl, just want to double-check & establish that.
Yes, it was just turned on yesterday. I believe Jeff developed the test in bug 534819.
Jeff, any ideas why it's failing?
Reporter | ||
Comment 6•15 years ago
|
||
Comment 7•15 years ago
|
||
Comment 8•15 years ago
|
||
Reporter | ||
Comment 9•15 years ago
|
||
Reporter | ||
Comment 10•15 years ago
|
||
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269792339.1269793414.10466.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269767043.1269768201.2057.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269771752.1269772886.13503.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269766834.1269767516.32537.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269774944.1269775706.26398.gz
This is permanently red on Linux64.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269821360.1269822802.14306.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269821421.1269822144.13052.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269819462.1269820657.10081.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269820401.1269821070.11114.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269817840.1269818945.5288.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269817882.1269818568.4529.gz
Comment 14•15 years ago
|
||
I've hidden this column
Rev3 Fedora 12x64 mozilla-central talos scroll
on Firefox and AddonsMgr trees, and hopefully Alice can take a look.
Thanks Nick. Tscroll is permanently orange on Linux-32 as well.
Comment 16•15 years ago
|
||
There were a couple of green runs on m-c when it first landed, but point taken. Hid the linux 32 scroll column on Firefox.
Comment 17•15 years ago
|
||
We don't generate symbol packages for linux64 builds yet - so that we get red when it attempts to run the stackwalker is expected (though not very graceful). I would concentrate on the linux32 errors where the stack is readable.
Comment 18•15 years ago
|
||
Unless I'm looking at the wrong logs. It doesn't look like the linux32 stacks have symbols either.
Comment 20•15 years ago
|
||
We are getting proper stack traces with symbols now.
Looks like it's crashing in X11.
Reporter | ||
Comment 22•15 years ago
|
||
Hidden for Linux (and turned on scraping while I was there) on 1.9.2.
Comment 23•15 years ago
|
||
I staged these tests on 1.9.0 primarily. Runs great (always green) on linux 1.9.0 builds.
This leads me to believe that this is a browser issue.
Comment 24•15 years ago
|
||
(In reply to comment #23)
> I staged these tests on 1.9.0 primarily. Runs great (always green) on linux
> 1.9.0 builds.
This suite is also green on linux on 1.9.1 (FF3.5) for both fedora and fedora64. This suite is orange/red for 1.9.2, mozilla-central, places and lorentz.
> This leads me to believe that this is a browser issue.
As these are all shared identical machines, and the failures are branch-specific, +1 to the cause being either a browser or testware problem.
jrmuizel, I dont know what component to move this to, but can you have a look?
Assignee: nobody → jmuizelaar
Comment 26•15 years ago
|
||
Attachment #439030 -
Flags: review?(catlee)
Updated•15 years ago
|
Attachment #439030 -
Flags: review?(catlee) → review+
Comment 27•15 years ago
|
||
Comment on attachment 439030 [details] [diff] [review]
increase timeout, let the test run longer
Checking in bcontroller.py;
/cvsroot/mozilla/testing/performance/talos/bcontroller.py,v <-- bcontroller.py
new revision: 1.15; previous revision: 1.14
done
Attachment #439030 -
Flags: checked-in+
Comment 28•15 years ago
|
||
Another crash, with stack, under X11.
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1272534858.1272536148.32756.gz
This looks like a code problem. Release Engineering isn't the right component. Who should this be assigned to?
Reporter | ||
Comment 29•15 years ago
|
||
Hidden on TraceMonkey.
Karl could probably look at it.
Assignee | ||
Comment 31•15 years ago
|
||
Assuming (unupdated) libxcb-1.4-1.fc12.i686.rpm, libX11-1.3-1.fc12.i686.rpm,
libXScrnSaver-1.2.0-1.fc12.i686.rpm,
Crash reason: SIGSEGV
Crash address: 0x869
Thread 0 (crashed)
0 linux-gate.so + 0x424
eip = 0x008ac424 esp = 0xbf9edbe4 ebp = 0xbf9edc08 ebx = 0xbf9edc48
esi = 0x00000000 edi = 0x00c58ff4 eax = 0xfffffffc ecx = 0x00000001
edx = 0xffffffff efl = 0x00200293
1 _xcb_conn_wait
/usr/src/debug/libxcb-1.4/src/xcb_conn.c:306
2 xcb_wait_for_reply
/usr/src/debug/libxcb-1.4/src/xcb_in.c:390
3 _XReply
/usr/src/debug/libX11-1.3/src/xcb_io.c:454
4 XScreenSaverQueryInfo
/usr/src/debug/libXScrnSaver-1.2.0/src/XScrnSaver.c:224
IIUC this looks like SIGSEGV was received while in the poll system call.
I can't see why the call might generate SIGSEGV:
http://cgit.freedesktop.org/xcb/libxcb/tree/src/xcb_conn.c?id=96ff5fb635bc333e80ebce6cc03008a1b9992ff6
Could SIGSEGV be a kernel code for invalid fd?
Or could something else be sending the SIGSEGV.
Assignee | ||
Comment 32•15 years ago
|
||
Reporter | ||
Comment 33•15 years ago
|
||
Although I am quite often wrong, nobody has challenged my claim in comment 0 that this isn't a crash per se, it's a "hang" as determined by the talos harness not seeing any output like it wants to within its timeout (hence "__FAILbrowser frozen__FAIL"), at which point it crashes the browser to give you a stack telling you where it was hung.
Assignee | ||
Comment 34•15 years ago
|
||
Attachment #442597 -
Flags: review?(anodelman)
Updated•15 years ago
|
Attachment #442597 -
Flags: review?(anodelman) → review+
Assignee | ||
Comment 35•15 years ago
|
||
In the subset of logs at which I looked, all the "frozen__FAIL"s were after
NOISE: Cycle 1: loaded http://localhost/page_load_test/scroll/tiled-downscale.html (next: http://localhost/page_load_test/scroll/tiled-fixed-downscale.html)
Even in the successful runs there were some unusually long runs in
the middle for tiled-fixed.html, tiled-downscale.html, sometimes iframe.svg,
and most significantly tiled-fixed-downscale.html. The long runs were all
equally long, even those that were 183 times longer than the short runs.
NOISE: |i|pagename|median|mean|min|max|runs|
NOISE: |0;tiled.html;10059;10059;10059;10064;10059;10059;10059;10064;10059
NOISE: |1;tiled-fixed.html;10066;10066;10064;12279;12279;10068;10068;10064;10064
NOISE: |2;tiled-downscale.html;10066;10064.75;10059;61377;61377;10068;10068;10059;10064
NOISE: |3;tiled-fixed-downscale.html;10061.5;10061.5;10059;1833051;1833051;10064;10059;10059;10064
NOISE: |4;iframe.svg;10441;10452.5;10441;10624;10624;10441;10487;10441;10441
NOISE: |5;reader.htm;10060;10061.25;10060;10065;10060;10065;10065;10060;10060
NOISE: |0;tiled.html;10064;10063.75;10059;10068;10064;10068;10059;10068;10064
NOISE: |1;tiled-fixed.html;10061.5;10061.5;10059;13074;13074;10064;10059;10059;10064
NOISE: |2;tiled-downscale.html;10064;10065;10064;61526;61526;10068;10064;10064;10064
NOISE: |3;tiled-fixed-downscale.html;10064;10062.75;10059;918308;918308;10064;10064;10064;10059
NOISE: |4;iframe.svg;10486.5;10486.5;10441;11718;11718;10441;10441;10532;10532
NOISE: |5;reader.htm;10065;10065.25;10060;10071;10065;10071;10071;10060;10065
Why are most of the numbers so close to 10 seconds?
Assignee | ||
Comment 36•15 years ago
|
||
The 64-bit results seem much more successful, though their results seem to
have a bimodal distribution around 10 or 19 ms. The larger result is due to
all the runs being equally slow. e.g.
NOISE: |i|pagename|median|mean|min|max|runs|
NOISE: |0;tiled.html;10122;10122.25;10064;10181;10064;10068;10181;10181;10176
NOISE: |1;tiled-fixed.html;10064;10063.75;10059;12613;12613;10064;10059;10064;10068
NOISE: |2;tiled-downscale.html;54752.5;54741.25;54635;55339;55339;54825;54721;54635;54784
NOISE: |3;tiled-fixed-downscale.html;5093274.5;5104638.5;5069121;5208579;5162884;5070574;5069121;5208579;5115975
NOISE: |4;iframe.svg;11809;11911.5;11581;12447;11581;11946;11672;12447;12447
NOISE: |5;reader.htm;10065;10063.75;10060;10065;10060;10065;10065;10065;10065
Assignee | ||
Comment 37•15 years ago
|
||
Almost (but not quite) all of the stacks are in XScreenSaverQueryInfo,
which implies either:
a) the X server is taking an awfully long time to respond to that query, or
b) nsIdleService::GetIdleTime is being called awfully often.
Updated•15 years ago
|
Comment 38•15 years ago
|
||
Comment on attachment 442597 [details] [diff] [review]
Use SIGABRT to trigger breakpad in hangs to make these easier to distinguish from SIGSEGV crashes
Checking in ffprocess_linux.py;
/cvsroot/mozilla/testing/performance/talos/ffprocess_linux.py,v <-- ffprocess_linux.py
new revision: 1.20; previous revision: 1.19
done
Attachment #442597 -
Flags: checked-in+
Comment 39•15 years ago
|
||
Recent logs for linux32 have SIGABRT rather than SIGSEGV, so we're killing a hang?
Assignee | ||
Comment 40•15 years ago
|
||
Yes, philor is also quite often right.
Reporter | ||
Comment 41•15 years ago
|
||
Hidden on MozillaTry
Assignee | ||
Comment 42•15 years ago
|
||
One possibility here is that scrolling was all just requests sent to the server with usually no responses required from the server. This made the scrolling appear very fast but gave the server plenty of work to do.
When the browser eventually waits for a response from the server (XScreenSaverQueryInfo) it has to wait for the server to complete its processing, and the times get noticed.
I haven't seen this timeout since retained layers landed.
Retained layers now does scrolling asynchronously, so multiple scroll requests are coalesced, giving the server less work.
Assignee | ||
Comment 43•15 years ago
|
||
This bug hasn't occurred since bug 564991 landed,
so tscroll can now be unhidden from tinderbox.
Assignee | ||
Comment 44•15 years ago
|
||
However, I think we should ensure that the test waits for the server to complete the scrolling. Among other things, this will give more consistent results that don't depend on whether sessionstore triggers XScreenSaverQueryInfo.
Assignee: jmuizelaar → karlt
Attachment #460192 -
Flags: review?(jmuizelaar)
Assignee | ||
Comment 45•15 years ago
|
||
(follow existing tab-indentation better)
Attachment #460192 -
Attachment is obsolete: true
Attachment #460193 -
Flags: review?(jmuizelaar)
Attachment #460192 -
Flags: review?(jmuizelaar)
Updated•15 years ago
|
Attachment #460193 -
Flags: review?(jmuizelaar) → review+
Assignee | ||
Comment 46•15 years ago
|
||
Comment on attachment 460193 [details] [diff] [review]
wait for the server to complete the scrolling before recording the completion time
Alice, would you like to put this on staging first, or shall we include in releng-downtime?
Attachment #460193 -
Flags: feedback?(anodelman)
Assignee | ||
Comment 47•15 years ago
|
||
(In reply to comment #24)
> (In reply to comment #23)
> > I staged these tests on 1.9.0 primarily. Runs great (always green) on linux
> > 1.9.0 builds.
> This suite is also green on linux on 1.9.1 (FF3.5) for both fedora and
> fedora64. This suite is orange/red for 1.9.2, mozilla-central, places and
> lorentz.
The slow tests, tiled-downscale.html and tiled-fixed-downscale.html use -moz-background-size, which wasn't implemented until bug 189519 landed for 1.9.2.
Without -moz-background-size, these tests are the same as their non-downscale variants and run quickly.
Assignee | ||
Comment 48•15 years ago
|
||
The tests no longer take so long as to timeout because of:
http://hg.mozilla.org/mozilla-central/rev/83728aa66312
In tiled-fixed-downscale.html, the tiled downscaled background is only painted once (after page load) instead of on each scroll step.
Assignee | ||
Comment 49•15 years ago
|
||
I've unhidden these:
Rev3 Fedora 12 mozilla-central talos scroll
Rev3 Fedora 12x64 mozilla-central talos scroll
Rev3 Fedora 12 tryserver talos scroll
Rev3 Fedora 12x64 tryserver talos scroll
Rev3 Fedora 12 tracemonkey talos scroll
Rev3 Fedora 12x64 tracemonkey talos scroll
Updated•15 years ago
|
Attachment #460193 -
Flags: feedback?(anodelman) → feedback+
Assignee | ||
Updated•15 years ago
|
Whiteboard: [attachment 460193 needs check-in
Assignee | ||
Updated•15 years ago
|
Blocks: 583819
Whiteboard: [attachment 460193 needs check-in → [attachment 460193 needs CVS check-in]
Comment 50•15 years ago
|
||
Comment on attachment 460193 [details] [diff] [review]
wait for the server to complete the scrolling before recording the completion time
Checking in page_load_test/scroll/reader-scroll.js;
/cvsroot/mozilla/testing/performance/talos/page_load_test/scroll/reader-scroll.js,v <-- reader-scroll.js
new revision: 1.2; previous revision: 1.1
done
Checking in page_load_test/scroll/scroll-frame.js;
/cvsroot/mozilla/testing/performance/talos/page_load_test/scroll/scroll-frame.js,v <-- scroll-frame.js
new revision: 1.2; previous revision: 1.1
done
Checking in page_load_test/scroll/scroll.js;
/cvsroot/mozilla/testing/performance/talos/page_load_test/scroll/scroll.js,v <-- scroll.js
new revision: 1.2; previous revision: 1.1
done
Attachment #460193 -
Flags: checked-in+
Updated•15 years ago
|
Whiteboard: [attachment 460193 needs CVS check-in]
Assignee | ||
Comment 51•15 years ago
|
||
I'm not clear that the last patch made much difference, but it is a little reassuring to have it there.
32-bit 1.9.2 timeouts are still killing the process at points where it communicates with the X server (but I haven't seen this happen when fetching screenX at the completion of scrolling). I saw the process killed at:
nsGlobalWindow::GetOuterWidth (from nsSessionStore I'm guessing),
nsIdleServiceGTK::GetIdleTime,
gdk_display_flush from nsViewManager::UpdateViewAfterScroll.
Some 32 and 64-bit 1.9.2 runs have still completed blazingly fast. Most 64-bit 1.9.2 runs are just slow (normal). Most 32-bit 1.9.2 runs have timed out since ~17-18 May.
There was also a change in *64-bit* m-c results ~17-18 May, after which there were only blazingly fast or timed-out runs. (No slow runs completed.)
Maybe the blazingly fast runs are when the screen saver is on, though that doesn't really explain things. I could speculate that maybe video memory is sometimes/usually too full to allocate memory for the tiled downscale in vram, but then there also must be plenty of other parameters affecting what the video driver does.
Since http://hg.mozilla.org/mozilla-central/rev/83728aa66312, this has been fixed on m-c anyway, so marking WFM.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → WORKSFORME
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
You need to log in
before you can comment on or make changes to this bug.
Description
•