Closed Bug 1549272 Opened 5 years ago Closed 4 months ago

Green up and run talos-damp performance test in mozilla-central, as well as try, on Windows 10 aarch64

Categories

(Testing :: Performance, task, P3)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: stephend, Unassigned)

References

Details

Attachments

(1 obsolete file)

Summary: Green up talos-damp performance test on Windows 10 aarch64 → Green up and run talos-damp performance test in mozilla-central, as well as try, on Windows 10 aarch64

Apologies, I jumped the gun here (though thankfully didn't pull the trigger, pardon the analogy); clearly, I didn't run enough talos-damp test iterations on this specific aarch64 architecture in my original try pushes. Focusing on that alone for now, we're starting to see crashes [0] like bug 1526001 -- assuming I'm reading the stack trace correctly and it's accurate.

From a Bugzilla search[1] for NtWaitForAlertByThreadId, there look to be quite a few IPC/thread-related issues, which I'm having a hard time navigating.

@ochameau, how would you prefer we track this? Should we slightly repurpose this bug into a meta/tracking bug, with crashers, etc., set as blocking this bug, dependency-wise?

[0] Try push in Treeherder: https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=263762613&author=sdonner%40mozilla.com
[1] https://bugzilla.mozilla.org/buglist.cgi?list_id=14869163&classification=Client%20Software&classification=Developer%20Infrastructure&classification=Components&classification=Server%20Software&classification=Other&query_format=advanced&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&longdesc=NtWaitForAlertByThreadId&longdesc_type=allwordssubstr

Flags: needinfo?(poirot.alex)
Attachment #9088299 - Attachment is obsolete: true

Edwin or Geoff: is my proposed approach in comment 3 on track?

Flags: needinfo?(gbrown)
Flags: needinfo?(egao)

I don't know talos well and I'm not sure what to do here.

In the try push, I notice lots of crashes in WriteMinidump; you might want to check in with :dmajor or :gsvelto to see what they think.

Flags: needinfo?(gbrown)

I'm not familiar with talos either, and I can't tell what's happening in the damp test other than that it's crashing unexpectedly, so that's probably what needs attention first.

I don't recall whether windows10-aarch64 still has crashreporter issues, though; the related bug 1526001 hasn't occurred in a long time and was limited in scope to GTest.

Flags: needinfo?(egao)

:dmajor, :gsvelto: would either of you be able to help me chase up/pin down the failure(s) and next steps? In the meantime, I'm reconfiguring my local windows10 aarch64 laptop so I can run the tests locally again.

Thanks!

Flags: needinfo?(gsvelto)
Flags: needinfo?(dmajor)

Crashes that appear to be happening in WriteMinidump() aren't really crashes. They're content processes that are being killed because they failed to respond in time during shutdown. We grab a minidump of them just before we kill them; see the code here. You will usually find a stack that looks like this:

1  xul.dll!google_breakpad::ExceptionHandler::WriteMinidump(std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > const &,bool (*)(wchar_t const *,wchar_t const *,void *,_EXCEPTION_POINTERS *,MDRawAssertionInfo *,bool),void *,_MINIDUMP_TYPE)
2  xul.dll!CrashReporter::CreateMinidumpsAndPair(void *,unsigned long,nsTSubstring<char> const &,nsIFile *,nsIFile * *)
3  xul.dll!static bool mozilla::ipc::CrashReporterHost::GenerateMinidumpAndPair<mozilla::dom::ContentParent>(class mozilla::dom::ContentParent *, class nsIFile *, const class nsTSubstring<char> & const)
4  xul.dll!void mozilla::dom::ContentParent::GeneratePairedMinidump(const char *)
5  xul.dll!mozilla::dom::ContentParent::KillHard(char const *)
6  xul.dll!nsresult nsTimerEvent::Run()
7  xul.dll!nsThread::ProcessNextEvent(bool,bool *)

The kill timer for content process shutdown is here and is set to 5 seconds by default. You can try a longer interval and see if it greens out the tests, in which case it means that the machine is just being slow.

Note that if it's set to 0 then there will be no shutdown kill timer for content processes at all. reftests already do that.
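
(For illustration only -- a minimal sketch, assuming the pref is injected via the profile's user.js that the talos run picks up; the pref name is the one used in the try pushes below:)

  // Raise the content-process shutdown kill timer from the 5-second default.
  // Setting it to 0 disables the kill timer entirely, as reftests already do.
  user_pref("dom.ipc.tabs.shutdownTimeoutSecs", 9);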

Flags: needinfo?(gsvelto)

Clearing my flag: gsvelto's explanation sums it up nicely. Let us know if you're still having trouble.

Flags: needinfo?(dmajor)

Hi Gabriele, all: I've pushed a try build bumping the timeout to a more conservative 7 (down from 9 in the previous try push, and up from the default of 5): https://treeherder.mozilla.org/#/jobs?repo=try&revision=b9882fda9b430f9cda3c4e986f227b4d1a87c1a3

I've also got a MozillaBuild environment set up on my Lenovo YOGA, so I've started to do local runs. What might your next recommended step(s) be, if any?

(Given the more relaxed priorities for aarch64 among a few teams, I want to be clear that I'm just trying to help, not push for any unnecessary work.)

Flags: needinfo?(gsvelto)

Did the bump to 9 green the tests? If 7 works then it's fine otherwise 9 is also acceptable. In my experience these ARM machines can be very sluggish.

Flags: needinfo?(gsvelto)

(In reply to Gabriele Svelto [:gsvelto] from comment #12)

Did the bump to 9 green the tests? If 7 works then it's fine otherwise 9 is also acceptable. In my experience these ARM machines can be very sluggish.

Sadly, no - individual runs with values of 7[0], 9[1], and a generous 11[2] were to little or no avail. Compounding the effort to nail this down, builds can be very, very slow, both in queuing (Try/Taskcluster submission) and in execution, on top of the other issue you mention: the speed and responsiveness of the machines themselves. (Just for posterity: it can take ~5 hours for a windows10-aarch64 job to start, another 45 minutes or so for the job's build itself, and then ~90 minutes of execution time -- assuming it doesn't crash hard early or bomb out after timing out.)

That's not a gripe, just notes for me and others when revisiting this and similar job-environment/architecture/platform issues.

Values and try runs (various build dates, sorry) for dom.ipc.tabs.shutdownTimeoutSecs:
[0] 7: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b9882fda9b430f9cda3c4e986f227b4d1a87c1a3
[1] 9: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0bd24cdede56e463d2a82d89e5dc651a2a570ea0
[2] 11: https://treeherder.mozilla.org/#/jobs?repo=try&revision=621e27d13fb6bbb91effd0db34271da6bd8ec042
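
(As a sanity check that a given push actually picked up the pref -- an assumption about workflow, not something recorded in these pushes -- the effective value can be read back from the Browser Console on the test machine:)

  // Browser Console (chrome scope); returns the active value: 5 by default, 0 means no kill timer.
  Services.prefs.getIntPref("dom.ipc.tabs.shutdownTimeoutSecs");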

(In reply to Stephen Donner [:stephend] from comment #13)

Sadly, no - individual runs with values of 7[0], 9[1], and a generous 11[2] were to little or no avail. Compounding the effort to nail this down, builds can be very, very slow, both in queuing (Try/Taskcluster submission) and in execution, on top of the other issue you mention: the speed and responsiveness of the machines themselves. (Just for posterity: it can take ~5 hours for a windows10-aarch64 job to start, another 45 minutes or so for the job's build itself, and then ~90 minutes of execution time -- assuming it doesn't crash hard early or bomb out after timing out.)

Ouch, that's even worse than I thought. What scares me about these failures is that they might be actual hangs and not just timeouts. On machines this slow, it's possible that we're hitting race conditions we couldn't possibly hit on faster machines. Did you make a try run with the timeout entirely disabled (i.e. set to 0)? It's a last resort, but it's worth a try.

(In reply to Gabriele Svelto [:gsvelto] from comment #14)
<snip>

Ouch, that's even worse than I thought. What scares me about these failures is that they might be actual hangs and not just timeouts. On machines this slow, it's possible that we're hitting race conditions we couldn't possibly hit on faster machines. Did you make a try run with the timeout entirely disabled (i.e. set to 0)? It's a last resort, but it's worth a try.

New try push with dom.ipc.tabs.shutdownTimeoutSecs set to 0: https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=554ccadd53bd826d5e7ca454e9f63fc5fbc1d480

No luck, it's still timing out...

Assignee: stephen.donner → nobody
Status: ASSIGNED → NEW
Type: defect → task
Priority: P2 → P3
Flags: needinfo?(poirot.alex)
Severity: normal → S3
Version: Version 3 → unspecified

We're no longer running tests against this platform.

Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → WONTFIX