Green up and run talos-damp performance test in mozilla-central, as well as try, on Windows 10 aarch64
Categories: Testing :: Performance, task, P3
Tracking: Not tracked
People: Reporter: stephend, Unassigned
Attachments: 1 obsolete file
Green up talos-damp (https://searchfox.org/mozilla-central/rev/b2015fdd464f598d645342614593d4ebda922d95/taskcluster/ci/test/talos.yml#69-84) runs (currently relegated to try; see the trail through bug 1547044, bug 1546595, and bug 1531876) on Windows 10 aarch64.
Comment 1•6 years ago (Reporter)
Ran a smattering of tests via try, here: https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=8f2b3b0997df761ea064e23bf56b4e64485168ad
Comment 2•6 years ago (Reporter)
Comment 3•6 years ago (Reporter)
Apologies, I jumped the gun here (but thankfully didn't pull the trigger, pardon the analogy); clearly, I didn't run enough talos-damp test iterations on this specific aarch64 architecture in my original try pushes. Focusing on that alone for now, we're starting to see crashes [0] like bug 1526001 -- assuming I'm reading the stack trace correctly, and that it's accurate.
From a Bugzilla search [1] for NtWaitForAlertByThreadId, there look to be quite a few IPC/thread-related issues, which I'm having a hard time navigating.
@ochameau, how would you prefer we track this? Should we repurpose this bug into a meta/tracking bug, with the crashers, etc. set as dependencies blocking it?
[0] Try push in Treeherder: https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=263762613&author=sdonner%40mozilla.com
[1] https://bugzilla.mozilla.org/buglist.cgi?list_id=14869163&classification=Client%20Software&classification=Developer%20Infrastructure&classification=Components&classification=Server%20Software&classification=Other&query_format=advanced&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&longdesc=NtWaitForAlertByThreadId&longdesc_type=allwordssubstr
Comment 4•6 years ago (Reporter)
Edwin or Geoff: is my proposed approach in comment 3 on track?
Comment 5•5 years ago
I don't know talos well and I'm not sure what to do here.
In the try push, I notice lots of crashes in WriteMinidump; you might want to check in with :dmajor or :gsvelto to see what they think.
Comment 6•5 years ago
I'm not familiar with talos either, and I can't tell what is happening in the damp test other than that it is crashing unexpectedly, so that's probably something that needs attention first.
I don't recall whether windows10-aarch64 still has crashreporter issues, though; the related bug 1526001 hasn't happened in a long time and was limited in scope to GTest.
Comment 7•5 years ago (Reporter)
:dmajor, :gsvelto: would either of you be able to help me pin down the failure(s) and next steps? I'm reconfiguring my local Windows 10 aarch64 laptop so I can run the tests locally again.
Thanks!
Comment 8•5 years ago
Crashes that appear to be happening in WriteMinidump() aren't really crashes. They're content processes that are being killed because they failed to respond in time during shutdown. We grab a minidump of them just before we kill them; see the code here. You will usually find a stack that looks like this:
1 xul.dll!google_breakpad::ExceptionHandler::WriteMinidump(std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > const &,bool (*)(wchar_t const *,wchar_t const *,void *,_EXCEPTION_POINTERS *,MDRawAssertionInfo *,bool),void *,_MINIDUMP_TYPE)
2 xul.dll!CrashReporter::CreateMinidumpsAndPair(void *,unsigned long,nsTSubstring<char> const &,nsIFile *,nsIFile * *)
3 xul.dll!static bool mozilla::ipc::CrashReporterHost::GenerateMinidumpAndPair<mozilla::dom::ContentParent>(class mozilla::dom::ContentParent *, class nsIFile *, const class nsTSubstring<char> & const)
4 xul.dll!void mozilla::dom::ContentParent::GeneratePairedMinidump(const char *)
5 xul.dll!mozilla::dom::ContentParent::KillHard(char const *)
6 xul.dll!nsresult nsTimerEvent::Run()
7 xul.dll!nsThread::ProcessNextEvent(bool,bool *)
The kill timer for content process shutdown is here and is set to 5 seconds by default. You can try a longer interval and see if it greens out the tests, in which case it means that the machine is just being slow.
Note that if it's set to 0 then there will be no shutdown kill timer for content processes at all. reftests already do that.
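(For reference, a minimal, illustrative Python sketch of how that timer could be bumped for a local run by appending the pref to a test profile's user.js. This is an illustration only, not the change used in the try pushes below; PROFILE_DIR is a placeholder path, and the value 9 is just an example.)

    # Illustrative only: bump the content-process shutdown kill timer for a
    # local Firefox test profile by writing the pref into its user.js.
    import os

    PROFILE_DIR = "/path/to/test/profile"  # placeholder, not a real path
    # The default is 5 seconds; 0 disables the shutdown kill timer entirely.
    pref_line = 'user_pref("dom.ipc.tabs.shutdownTimeoutSecs", 9);\n'

    with open(os.path.join(PROFILE_DIR, "user.js"), "a") as user_js:
        user_js.write(pref_line)

(The try pushes in the comments below bump the same pref, dom.ipc.tabs.shutdownTimeoutSecs, via changes pushed to try rather than a local profile tweak.)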
Comment 9•5 years ago
Clearing my flag: gsvelto's explanation sums it up nicely. Let us know if you're still having trouble.
Comment 10•5 years ago (Reporter)
Thanks, everyone; new try push to bump the content-process timeout from 5 seconds to 9: https://hg.mozilla.org/try/rev/d6b3a6946dad35f8886573cc1d8b86ecf3f568ac, here: https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=0bd24cdede56e463d2a82d89e5dc651a2a570ea0
Comment 11•5 years ago (Reporter)
Hi Gabriele, all; I've pushed a try build bumping the timeout to a more-conservative 7, from 9 in the previous try push, and up from the default of 5: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b9882fda9b430f9cda3c4e986f227b4d1a87c1a3
I've also got a MozillaBuild environment set up on my Lenovo YOGA, so I have started to do local runs. What might your next recommended step(s) be, if any?
(Given the more relaxed priorities for aarch64 among a few teams, I want to be clear that I'm just trying to help, not pushing for any unnecessary work.)
Comment 12•5 years ago
Did the bump to 9 green the tests? If 7 works then it's fine, otherwise 9 is also acceptable. In my experience these ARM machines can be very sluggish.
Comment 13•5 years ago (Reporter)
(In reply to Gabriele Svelto [:gsvelto] from comment #12)
> Did the bump to 9 green the tests? If 7 works then it's fine, otherwise 9 is also acceptable. In my experience these ARM machines can be very sluggish.
Sadly, no - individual runs with values of 7 [0], 9 [1], and a generous 11 [2], to little or no avail. Compounding the difficulty of nailing down this issue is that builds can be very, very slow, both in Try/Taskcluster queuing and in execution, on top of the other issue you mention: the speed and responsiveness of the machines themselves. (Just for posterity, it can take ~5 hours for a windows10-aarch64 job to start, another 45 minutes or so for the job's build itself, and then the execution time -- assuming it doesn't crash hard early on, or bomb out after timing out ~90 minutes later.)
That's not a gripe, just notes for me and others when revisiting this and other similar job-environment/architecture/platform issues.
Values and try runs (various build dates, sorry) for dom.ipc.tabs.shutdownTimeoutSecs:
[0] 7: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b9882fda9b430f9cda3c4e986f227b4d1a87c1a3
[1] 9: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0bd24cdede56e463d2a82d89e5dc651a2a570ea0
[2] 11: https://treeherder.mozilla.org/#/jobs?repo=try&revision=621e27d13fb6bbb91effd0db34271da6bd8ec042
Comment 14•5 years ago
(In reply to Stephen Donner [:stephend] from comment #13)
> Sadly, no - individual runs with values of 7 [0], 9 [1], and a generous 11 [2], to little or no avail. Compounding the difficulty of nailing down this issue is that builds can be very, very slow, both in Try/Taskcluster queuing and in execution, on top of the other issue you mention: the speed and responsiveness of the machines themselves. (Just for posterity, it can take ~5 hours for a windows10-aarch64 job to start, another 45 minutes or so for the job's build itself, and then the execution time -- assuming it doesn't crash hard early on, or bomb out after timing out ~90 minutes later.)
Ouch, that's even worse than I thought. What scares me about these failures is that they might be actual hangs and not just timeouts. On machines this slow it's possible that we're hitting race conditions we couldn't possibly hit on faster machines. Did you make a try run with the timeout entirely disabled (i.e. set to 0)? It's a last resort, but it's worth a try.
Comment 15•5 years ago (Reporter)
(In reply to Gabriele Svelto [:gsvelto] from comment #14)
> <snip>
> Ouch, that's even worse than I thought. What scares me about these failures is that they might be actual hangs and not just timeouts. On machines this slow it's possible that we're hitting race conditions we couldn't possibly hit on faster machines. Did you make a try run with the timeout entirely disabled (i.e. set to 0)? It's a last resort, but it's worth a try.
New try push with dom.ipc.tabs.shutdownTimeoutSecs set to 0: https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=554ccadd53bd826d5e7ca454e9f63fc5fbc1d480
Comment 16•5 years ago
No luck, it's still timing out...
Comment 17•1 year ago
We're no longer running tests against this platform.