Closed Bug 1549272 Opened 5 years ago Closed 4 months ago

Green up and run talos-damp performance test in mozilla-central, as well as try, on Windows 10 aarch64

Categories

(Testing :: Performance, task, P3)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: stephend, Unassigned)

References

Details

Attachments

(1 obsolete file)

Summary: Green up talos-damp performance test on Windows 10 aarch64 → Green up and run talos-damp performance test in mozilla-central, as well as try, on Windows 10 aarch64

Apologies, I jumped the gun here (though thankfully didn't pull the trigger, pardon the analogy); clearly, I didn't run enough talos-damp test iterations on this specific aarch64 architecture in my original try pushes. Focusing on that alone for now, we're starting to see crashes [0] like bug 1526001 -- assuming I'm reading the stack trace correctly and it's accurate.

From a Bugzilla search[1] for NtWaitForAlertByThreadId, there look to be quite a few IPC/thread-related issues, which I'm having a hard time navigating.

@ochameau, how would you prefer we track this? Should we slightly repurpose this bug into a meta/tracking bug, with crashers, etc., set as blocking this bug, dependency-wise?

[0] Try push in Treeherder: https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=263762613&author=sdonner%40mozilla.com
[1] https://bugzilla.mozilla.org/buglist.cgi?list_id=14869163&classification=Client%20Software&classification=Developer%20Infrastructure&classification=Components&classification=Server%20Software&classification=Other&query_format=advanced&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&longdesc=NtWaitForAlertByThreadId&longdesc_type=allwordssubstr

Flags: needinfo?(poirot.alex)
Attachment #9088299 - Attachment is obsolete: true

Edwin or Geoff: is my proposed approach in comment 3 on track?

Flags: needinfo?(gbrown)
Flags: needinfo?(egao)

I don't know talos well and I'm not sure what to do here.

In the try push, I notice lots of crashes in WriteMinidump; you might want to check in with :dmajor or :gsvelto to see what they think.

Flags: needinfo?(gbrown)

I'm not familiar with talos either, and I can't tell what's happening in the damp test other than that it's crashing unexpectedly, so that's probably what needs attention first.

I don't recall whether windows10-aarch64 still has crashreporter issues, though; the related bug 1526001 hasn't occurred in a long time and was limited in scope to GTest.

Flags: needinfo?(egao)

:dmajor, :gsvelto: would either of you be able to help me chase up/pin down the failure(s) and next steps? In the meantime, I'm reconfiguring my local windows10 aarch64 laptop so I can run the tests locally again.

Thanks!

Flags: needinfo?(gsvelto)
Flags: needinfo?(dmajor)

Crashes that appear to be happening in WriteMinidump() aren't really crashes. They're content processes that are being killed because they failed to respond in time during shutdown. We grab a minidump of them just before we kill them; see the code here. You will usually find a stack that looks like this:

1  xul.dll!google_breakpad::ExceptionHandler::WriteMinidump(std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > const &,bool (*)(wchar_t const *,wchar_t const *,void *,_EXCEPTION_POINTERS *,MDRawAssertionInfo *,bool),void *,_MINIDUMP_TYPE)
2  xul.dll!CrashReporter::CreateMinidumpsAndPair(void *,unsigned long,nsTSubstring<char> const &,nsIFile *,nsIFile * *)
3  xul.dll!static bool mozilla::ipc::CrashReporterHost::GenerateMinidumpAndPair<mozilla::dom::ContentParent>(class mozilla::dom::ContentParent *, class nsIFile *, const class nsTSubstring<char> & const)
4  xul.dll!void mozilla::dom::ContentParent::GeneratePairedMinidump(const char *)
5  xul.dll!mozilla::dom::ContentParent::KillHard(char const *)
6  xul.dll!nsresult nsTimerEvent::Run()
7  xul.dll!nsThread::ProcessNextEvent(bool,bool *)

The kill timer for content process shutdown is here and is set to 5 seconds by default. You can try a longer interval and see if it greens out the tests, in which case it means that the machine is just being slow.

Note that if it's set to 0 then there will be no shutdown kill timer for content processes at all. reftests already do that.
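
(For illustration only -- a minimal sketch, assuming the pref is injected via the profile's user.js that the talos run picks up; the pref name is the one used in the try pushes below:)

  // Raise the content-process shutdown kill timer from the 5-second default.
  // Setting it to 0 disables the kill timer entirely, as reftests already do.
  user_pref("dom.ipc.tabs.shutdownTimeoutSecs", 9);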

Flags: needinfo?(gsvelto)

Clearing my flag: gsvelto's explanation sums it up nicely. Let us know if you're still having trouble.

Flags: needinfo?(dmajor)

Hi Gabriele, all: I've pushed a try build bumping the timeout to a more conservative 7 (down from 9 in the previous try push, and up from the default of 5): https://treeherder.mozilla.org/#/jobs?repo=try&revision=b9882fda9b430f9cda3c4e986f227b4d1a87c1a3

I've also got a MozillaBuild environment set up on my Lenovo YOGA, so I've started to do local runs. What might your next recommended step(s) be, if any?

(Given the more relaxed priorities for aarch64 among a few teams, I want to be clear that I'm just trying to help, not push for any unnecessary work.)

Flags: needinfo?(gsvelto)

Did the bump to 9 green the tests? If 7 works then it's fine otherwise 9 is also acceptable. In my experience these ARM machines can be very sluggish.

Flags: needinfo?(gsvelto)

(In reply to Gabriele Svelto [:gsvelto] from comment #12)

Did the bump to 9 green the tests? If 7 works then it's fine otherwise 9 is also acceptable. In my experience these ARM machines can be very sluggish.

Sadly, no - individual runs with values of 7[0], 9[1], and a generous 11[2] were to little or no avail. Compounding the effort to nail this down, builds can be very, very slow, both in queuing (Try/Taskcluster submission) and in execution, on top of the other issue you mention: the speed and responsiveness of the machines themselves. (Just for posterity: it can take ~5 hours for a windows10-aarch64 job to start, another 45 minutes or so for the job's build itself, and then ~90 minutes of execution time -- assuming it doesn't crash hard early or bomb out after timing out.)

That's not a gripe, just notes for me and others when revisiting this and similar job-environment/architecture/platform issues.

Values and try runs (various build dates, sorry) for dom.ipc.tabs.shutdownTimeoutSecs:
[0] 7: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b9882fda9b430f9cda3c4e986f227b4d1a87c1a3
[1] 9: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0bd24cdede56e463d2a82d89e5dc651a2a570ea0
[2] 11: https://treeherder.mozilla.org/#/jobs?repo=try&revision=621e27d13fb6bbb91effd0db34271da6bd8ec042
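
(As a sanity check that a given push actually picked up the pref -- an assumption about workflow, not something recorded in these pushes -- the effective value can be read back from the Browser Console on the test machine:)

  // Browser Console (chrome scope); returns the active value: 5 by default, 0 means no kill timer.
  Services.prefs.getIntPref("dom.ipc.tabs.shutdownTimeoutSecs");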

(In reply to Stephen Donner [:stephend] from comment #13)

Sadly, no - individual runs with values of 7[0], 9[1], and a generous 11[2] were to little or no avail. Compounding the effort to nail this down, builds can be very, very slow, both in queuing (Try/Taskcluster submission) and in execution, on top of the other issue you mention: the speed and responsiveness of the machines themselves. (Just for posterity: it can take ~5 hours for a windows10-aarch64 job to start, another 45 minutes or so for the job's build itself, and then ~90 minutes of execution time -- assuming it doesn't crash hard early or bomb out after timing out.)

Ouch, that's even worse than I thought. What scares me about these failures is that they might be actual hangs and not just timeouts. On machines this slow, it's possible that we're hitting race conditions we couldn't possibly hit on faster machines. Did you make a try run with the timeout entirely disabled (i.e. set to 0)? It's a last resort, but it's worth a try.

(In reply to Gabriele Svelto [:gsvelto] from comment #14)
<snip>

Ouch, that's even worse than I thought. What scares me about these failures is that they might be actual hangs and not just timeouts. On machines this slow, it's possible that we're hitting race conditions we couldn't possibly hit on faster machines. Did you make a try run with the timeout entirely disabled (i.e. set to 0)? It's a last resort, but it's worth a try.

New try push with dom.ipc.tabs.shutdownTimeoutSecs set to 0: https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=554ccadd53bd826d5e7ca454e9f63fc5fbc1d480

No luck, it's still timing out...

Assignee: stephen.donner → nobody
Status: ASSIGNED → NEW
Type: defect → task
Priority: P2 → P3
Flags: needinfo?(poirot.alex)
Severity: normal → S3
Version: Version 3 → unspecified

We're no longer running tests against this platform.

Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → WONTFIX