tscroll mostly hangs on Linux

RESOLVED WORKSFORME

Status

defect
RESOLVED WORKSFORME
9 years ago
6 years ago

People

(Reporter: philor, Assigned: karlt)

Tracking

(Depends on 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 1 obsolete attachment)

Reporter

Description

9 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269641898.1269643459.22416.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269645992.1269646696.30869.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269643193.1269644577.25489.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269645640.1269646793.31038.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269647992.1269649186.4685.gz

The failure mode on x64 is particularly ugly, since the hang is detected, the harness crashes the browser, and then, what, fails to find symbols and turns itself red?

NOISE: Found crashdump: /tmp/tmp0BLv5c/profile/minidumps/3cbfc11e-b685-3a54-0d89ca94-086ff1dd.dmp
Traceback (most recent call last):
  File "run_tests.py", line 462, in <module>
    test_file(arg)
  File "run_tests.py", line 424, in test_file
    browser_dump, counter_dump, print_format = mytest.runTest(browser_config, test)
  File "/home/cltbld/talos-slave/talos-data/talos/ttest.py", line 382, in runTest
    self.checkForCrashes(browser_config, profile_dir)
  File "/home/cltbld/talos-slave/talos-data/talos/ttest.py", line 160, in checkForCrashes
    subprocess.call([stackwalkbin, dump, browser_config['symbols_path']], stderr=nullfd)
  File "/home/cltbld/lib/python2.5/subprocess.py", line 444, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/home/cltbld/lib/python2.5/subprocess.py", line 594, in __init__
    errread, errwrite)
  File "/home/cltbld/lib/python2.5/subprocess.py", line 1097, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
Reporter

Updated

9 years ago
Blocks: 534819
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269668380.1269669071.12651.gz
Rev3 Fedora 12x64 mozilla-central talos scroll on 2010/03/26 22:39:40

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269677274.1269678029.31212.gz
Rev3 Fedora 12x64 mozilla-central talos scroll on 2010/03/27 01:07:54
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269664939.1269666243.6015.gz
Rev3 Fedora 12 mozilla-central talos scroll on 2010/03/26 21:42:19
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269665613.1269666716.6915.gz
Rev3 Fedora 12 mozilla-central talos scroll on 2010/03/26 21:53:33
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269674822.1269675918.26317.gz
Rev3 Fedora 12 mozilla-central talos scroll on 2010/03/27 00:27:02
Just to be clear, was the Talos scroll testsuite just added? (and started hitting this bug immediately?)  It looks like that to me from a quick glance at tbpl, just want to double-check & establish that.
(In reply to comment #4)
> Just to be clear, was the Talos scroll testsuite just added? (and started
> hitting this bug immediately?)  It looks like that to me from a quick glance at
> tbpl, just want to double-check & establish that.

Yes, it was just turned on yesterday.  I believe Jeff developed the test in bug 534819.

Jeff, any ideas why it's failing?
I've hidden this column
 Rev3 Fedora 12x64 mozilla-central talos scroll
on Firefox and AddonsMgr trees, and hopefully Alice can take a look.
Thanks Nick. Tscroll is permanently orange on Linux-32 as well.
There were a couple of green runs on m-c when it first landed, but point taken. Hid the linux 32 scroll column on Firefox.
We don't generate symbol packages for linux64 builds yet - so that we get red when it attempts to run the stackwalker is expected (though not very graceful).  I would concentrate on the linux32 errors where the stack is readable.
Unless I'm looking at the wrong logs. It doesn't look like the linux32 stacks have symbols either.
We are getting proper stack traces with symbols now.
Reporter

Comment 22

9 years ago
Hidden for Linux (and turned on scraping while I was there) on 1.9.2.
I staged these tests on 1.9.0 primarily.  Runs great (always green) on linux 1.9.0 builds.

This leads me to believe that this is a browser issue.
(In reply to comment #23)
> I staged these tests on 1.9.0 primarily.  Runs great (always green) on linux
> 1.9.0 builds.
This suite is also green on linux on 1.9.1 (FF3.5) for both fedora and fedora64. This suite is orange/red for 1.9.2, mozilla-central, places and lorentz. 

> This leads me to believe that this is a browser issue.
As these are all shared identical machines, and the failures are branch-specific, +1 to the cause being either a browser or testware problem.

jrmuizel, I dont know what component to move this to, but can you have a look?
Assignee: nobody → jmuizelaar
Duplicate of this bug: 558073
Attachment #439030 - Flags: review?(catlee) → review+
Comment on attachment 439030 [details] [diff] [review]
increase timeout, let the test run longer

Checking in bcontroller.py;
/cvsroot/mozilla/testing/performance/talos/bcontroller.py,v  <--  bcontroller.py
new revision: 1.15; previous revision: 1.14
done
Attachment #439030 - Flags: checked-in+
Another crash, with stack, under X11.

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1272534858.1272536148.32756.gz

This looks like a code problem. Release Engineering isn't the right component. Who should this be assigned to?
Reporter

Comment 29

9 years ago
Hidden on TraceMonkey.
Assignee

Comment 31

9 years ago
Assuming (unupdated) libxcb-1.4-1.fc12.i686.rpm, libX11-1.3-1.fc12.i686.rpm,
libXScrnSaver-1.2.0-1.fc12.i686.rpm,

Crash reason:  SIGSEGV
Crash address: 0x869

Thread 0 (crashed)
 0  linux-gate.so + 0x424
    eip = 0x008ac424   esp = 0xbf9edbe4   ebp = 0xbf9edc08   ebx = 0xbf9edc48
    esi = 0x00000000   edi = 0x00c58ff4   eax = 0xfffffffc   ecx = 0x00000001
    edx = 0xffffffff   efl = 0x00200293
 1  _xcb_conn_wait
    /usr/src/debug/libxcb-1.4/src/xcb_conn.c:306
 2  xcb_wait_for_reply
    /usr/src/debug/libxcb-1.4/src/xcb_in.c:390
 3  _XReply
    /usr/src/debug/libX11-1.3/src/xcb_io.c:454
 4  XScreenSaverQueryInfo
    /usr/src/debug/libXScrnSaver-1.2.0/src/XScrnSaver.c:224

IIUC this looks like SIGSEGV was received while in the poll system call.

I can't see why the call might generate SIGSEGV:
http://cgit.freedesktop.org/xcb/libxcb/tree/src/xcb_conn.c?id=96ff5fb635bc333e80ebce6cc03008a1b9992ff6

Could SIGSEGV be a kernel code for invalid fd?
Or could something else be sending the SIGSEGV.
Reporter

Comment 33

9 years ago
Although I am quite often wrong, nobody has challenged my claim in comment 0 that this isn't a crash per se, it's a "hang" as determined by the talos harness not seeing any output like it wants to within its timeout (hence "__FAILbrowser frozen__FAIL"), at which point it crashes the browser to give you a stack telling you where it was hung.
Attachment #442597 - Flags: review?(anodelman) → review+
Assignee

Comment 35

9 years ago
In the subset of logs at which I looked, all the "frozen__FAIL"s were after

NOISE: Cycle 1: loaded http://localhost/page_load_test/scroll/tiled-downscale.html (next: http://localhost/page_load_test/scroll/tiled-fixed-downscale.html)

Even in the successful runs there were some unusually long runs in
the middle for tiled-fixed.html, tiled-downscale.html, sometimes iframe.svg,
and most significantly tiled-fixed-downscale.html.  The long runs were all
equally long, even those that were 183 times longer than the short runs.

NOISE: |i|pagename|median|mean|min|max|runs|
NOISE: |0;tiled.html;10059;10059;10059;10064;10059;10059;10059;10064;10059
NOISE: |1;tiled-fixed.html;10066;10066;10064;12279;12279;10068;10068;10064;10064
NOISE: |2;tiled-downscale.html;10066;10064.75;10059;61377;61377;10068;10068;10059;10064
NOISE: |3;tiled-fixed-downscale.html;10061.5;10061.5;10059;1833051;1833051;10064;10059;10059;10064
NOISE: |4;iframe.svg;10441;10452.5;10441;10624;10624;10441;10487;10441;10441
NOISE: |5;reader.htm;10060;10061.25;10060;10065;10060;10065;10065;10060;10060

NOISE: |0;tiled.html;10064;10063.75;10059;10068;10064;10068;10059;10068;10064
NOISE: |1;tiled-fixed.html;10061.5;10061.5;10059;13074;13074;10064;10059;10059;10064
NOISE: |2;tiled-downscale.html;10064;10065;10064;61526;61526;10068;10064;10064;10064
NOISE: |3;tiled-fixed-downscale.html;10064;10062.75;10059;918308;918308;10064;10064;10064;10059
NOISE: |4;iframe.svg;10486.5;10486.5;10441;11718;11718;10441;10441;10532;10532
NOISE: |5;reader.htm;10065;10065.25;10060;10071;10065;10071;10071;10060;10065

Why are most of the numbers so close to 10 seconds?
Assignee

Comment 36

9 years ago
The 64-bit results seem much more successful, though their results seem to
have a bimodal distribution around 10 or 19 ms.  The larger result is due to
all the runs being equally slow. e.g.

NOISE: |i|pagename|median|mean|min|max|runs|
NOISE: |0;tiled.html;10122;10122.25;10064;10181;10064;10068;10181;10181;10176
NOISE: |1;tiled-fixed.html;10064;10063.75;10059;12613;12613;10064;10059;10064;10068
NOISE: |2;tiled-downscale.html;54752.5;54741.25;54635;55339;55339;54825;54721;54635;54784
NOISE: |3;tiled-fixed-downscale.html;5093274.5;5104638.5;5069121;5208579;5162884;5070574;5069121;5208579;5115975
NOISE: |4;iframe.svg;11809;11911.5;11581;12447;11581;11946;11672;12447;12447
NOISE: |5;reader.htm;10065;10063.75;10060;10065;10060;10065;10065;10065;10065
Assignee

Comment 37

9 years ago
Almost (but not quite) all of the stacks are in XScreenSaverQueryInfo,
which implies either:
a) the X server is taking an awfully long time to respond to that query, or
b) nsIdleService::GetIdleTime is being called awfully often.
Blocks: 563531
No longer depends on: 563531
Comment on attachment 442597 [details] [diff] [review]
Use SIGABRT to trigger breakpad in hangs to make these easier to distinguish from SIGSEGV crashes

Checking in ffprocess_linux.py;
/cvsroot/mozilla/testing/performance/talos/ffprocess_linux.py,v  <--  ffprocess_linux.py
new revision: 1.20; previous revision: 1.19
done
Attachment #442597 - Flags: checked-in+
Recent logs for linux32 have SIGABRT rather than SIGSEGV, so we're killing a hang?
Assignee

Comment 40

9 years ago
Yes, philor is also quite often right.
Reporter

Comment 41

9 years ago
Hidden on MozillaTry
Assignee

Updated

9 years ago
Depends on: 579277
Assignee

Comment 42

9 years ago
One possibility here is that scrolling was all just requests sent to the server with usually no responses required from the server.  This made the scrolling appear very fast but gave the server plenty of work to do.

When the browser eventually waits for a response from the server (XScreenSaverQueryInfo) it has to wait for the server to complete its processing, and the times get noticed.

I haven't seen this timeout since retained layers landed.
Retained layers now does scrolling asynchronously, so multiple scroll requests are coalesced, giving the server less work.
Assignee

Comment 43

9 years ago
This bug hasn't occurred since bug 564991 landed,
so tscroll can now be unhidden from tinderbox.
Assignee

Comment 44

9 years ago
However, I think we should ensure that the test waits for the server to complete the scrolling.  Among other things, this will give more consistent results that don't depend on whether sessionstore triggers XScreenSaverQueryInfo.
Assignee: jmuizelaar → karlt
Attachment #460192 - Flags: review?(jmuizelaar)
Assignee

Comment 45

9 years ago
(follow existing tab-indentation better)
Attachment #460192 - Attachment is obsolete: true
Attachment #460193 - Flags: review?(jmuizelaar)
Attachment #460192 - Flags: review?(jmuizelaar)
Attachment #460193 - Flags: review?(jmuizelaar) → review+
Assignee

Comment 46

9 years ago
Comment on attachment 460193 [details] [diff] [review]
wait for the server to complete the scrolling before recording the completion time

Alice, would you like to put this on staging first, or shall we include in releng-downtime?
Attachment #460193 - Flags: feedback?(anodelman)
Assignee

Comment 47

9 years ago
(In reply to comment #24)
> (In reply to comment #23)
> > I staged these tests on 1.9.0 primarily.  Runs great (always green) on linux
> > 1.9.0 builds.
> This suite is also green on linux on 1.9.1 (FF3.5) for both fedora and
> fedora64. This suite is orange/red for 1.9.2, mozilla-central, places and
> lorentz. 

The slow tests, tiled-downscale.html and tiled-fixed-downscale.html use -moz-background-size, which wasn't implemented until bug 189519 landed for 1.9.2.

Without -moz-background-size, these tests are the same as their non-downscale variants and run quickly.
Assignee

Comment 48

9 years ago
The tests no longer take so long as to timeout because of:
http://hg.mozilla.org/mozilla-central/rev/83728aa66312
In tiled-fixed-downscale.html, the tiled downscaled background is only painted once (after page load) instead of on each scroll step.
Assignee

Comment 49

9 years ago
I've unhidden these:
Rev3 Fedora 12 mozilla-central talos scroll
Rev3 Fedora 12x64 mozilla-central talos scroll
Rev3 Fedora 12 tryserver talos scroll
Rev3 Fedora 12x64 tryserver talos scroll
Rev3 Fedora 12 tracemonkey talos scroll
Rev3 Fedora 12x64 tracemonkey talos scroll
Attachment #460193 - Flags: feedback?(anodelman) → feedback+
Assignee

Updated

9 years ago
Whiteboard: [attachment 460193 needs check-in
Assignee

Updated

9 years ago
Blocks: 583819
Whiteboard: [attachment 460193 needs check-in → [attachment 460193 needs CVS check-in]
Comment on attachment 460193 [details] [diff] [review]
wait for the server to complete the scrolling before recording the completion time

Checking in page_load_test/scroll/reader-scroll.js;
/cvsroot/mozilla/testing/performance/talos/page_load_test/scroll/reader-scroll.js,v  <--  reader-scroll.js
new revision: 1.2; previous revision: 1.1
done
Checking in page_load_test/scroll/scroll-frame.js;
/cvsroot/mozilla/testing/performance/talos/page_load_test/scroll/scroll-frame.js,v  <--  scroll-frame.js
new revision: 1.2; previous revision: 1.1
done
Checking in page_load_test/scroll/scroll.js;
/cvsroot/mozilla/testing/performance/talos/page_load_test/scroll/scroll.js,v  <--  scroll.js
new revision: 1.2; previous revision: 1.1
done
Attachment #460193 - Flags: checked-in+
Whiteboard: [attachment 460193 needs CVS check-in]
Assignee

Comment 51

9 years ago
I'm not clear that the last patch made much difference, but it is a little reassuring to have it there.

32-bit 1.9.2 timeouts are still killing the process at points where it communicates with the X server (but I haven't seen this happen when fetching screenX at the completion of scrolling).  I saw the process killed at:

nsGlobalWindow::GetOuterWidth (from nsSessionStore I'm guessing),
nsIdleServiceGTK::GetIdleTime,
gdk_display_flush from nsViewManager::UpdateViewAfterScroll.

Some 32 and 64-bit 1.9.2 runs have still completed blazingly fast.  Most 64-bit 1.9.2 runs are just slow (normal).  Most 32-bit 1.9.2 runs have timed out since ~17-18 May.

There was also a change in *64-bit* m-c results ~17-18 May, after which there were only blazingly fast or timed-out runs.  (No slow runs completed.)

Maybe the blazingly fast runs are when the screen saver is on, though that doesn't really explain things.  I could speculate that maybe video memory is sometimes/usually too full to allocate memory for the tiled downscale in vram, but then there also must be plenty of other parameters affecting what the video driver does.

Since http://hg.mozilla.org/mozilla-central/rev/83728aa66312, this has been fixed on m-c anyway, so marking WFM.
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.