Closed Bug 1593465 Opened 2 years ago Closed 2 years ago

Crashes in PGO 'run' jobs upload no minidumps, nor stacks in the logs

Categories

(Firefox Build System :: Task Configuration, task)

task
Not set
normal

Tracking

(firefox72 fixed)

RESOLVED FIXED
mozilla72
Tracking Status
firefox72 --- fixed

People

(Reporter: Gijs, Assigned: mshal)

References

Details

Attachments

(4 files)

Cf:

https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=linux%2Crun&revision=b19afbec384a1e3cdee81102f4cd4e89992bab49&selectedJob=274156969

https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=linux%2Crun&revision=91313cceae8c54aedd2b177ae14b10269e8e1678&selectedJob=274216265

The log in the latter case indicates that the profile setup run quit with code 11, and some of the other log output suggests a crash of sorts, but no minidump artifacts are uploaded and no stack is printed in the log.

It'd be nice to make these things available so it'd be possible to know what was causing this issue. Not entirely sure this is the best component, please move as appropriate...

It's quite possible for these to be "special" crashes (ie not reproduced elsewhere), especially in the initial "profile setup" run, because the browser is only up a very short time, so if we shut down in the middle of a job that runs either on startup or shortly afterwards, that could cause crashes in ways that don't easily reproduce elsewhere...

This seems like it should be straightforward to add any crash artifacts to the task, but I'm having difficulty testing it. I tried using about:crashparent instead of about:blank in build/pgo/index.html, but Firefox never returns in the runner.wait() call from profileserver.py. Is there another way I can test this? And where would you expect the minidump artifacts to be?

Flags: needinfo?(gijskruitbosch+bugs)

(In reply to Michael Shal [:mshal] from comment #1)

This seems like it should be straightforward to add any crash artifacts to the task, but I'm having difficulty testing it. I tried using about:crashparent instead of about:blank in build/pgo/index.html, but Firefox never returns in the runner.wait() call from profileserver.py. Is there another way I can test this? And where would you expect the minidump artifacts to be?

I don't know (to both of these), sorry. A bit surprised about:crashparent didn't work... maybe it's to do with how we load it? It looks like loading it from the commandline doesn't work... Hopefully Gabriele knows.

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(gsvelto)

PGO runs have the crash reporter enabled but disable reporting crashes at runtime:

https://searchfox.org/mozilla-central/rev/8b7aa8af652f87d39349067a5bc9c0256bf6dedc/build/pgo/profileserver.py#87

In practice this means that minidumps should be generated but they won't end up in the usual folder ($HOME/.mozilla/firefox/Crash Reports/pending/) but they will be left under the user profiled instead ($<userprofile>/minidumps/).

These lines suggest that a minidump was generated, so it's possible that the task doesn't find it because it's not looking for it in the right place.

Flags: needinfo?(gsvelto)
Assignee: nobody → mshal

Since we may or may not have minidump files, it is not practical to
specifically name each file. We can just upload all files in the
directory instead.

The minidumps are stored in the temporary profile directory used by
profileserver.py. Before the temporary directory goes away, we copy the
files to the artifacts directory. Note that simply moving the directory
doesn't necessarily work on Windows if this process hasn't shutdown
cleanly and something still has a file handle open.

Depends on D52393

Gijs, does this have all the data you're looking for? https://treeherder.mozilla.org/#/jobs?repo=try&revision=104a4dca74e0e1257f9a6a60476a78cc2d458058

The run tasks were forced to crash, and a .dmp and .extra files get uploaded as artifacts.

Flags: needinfo?(gijskruitbosch+bugs)

(In reply to Michael Shal [:mshal] from comment #6)

Gijs, does this have all the data you're looking for? https://treeherder.mozilla.org/#/jobs?repo=try&revision=104a4dca74e0e1257f9a6a60476a78cc2d458058

The run tasks were forced to crash, and a .dmp and .extra files get uploaded as artifacts.

Uploading the files is definitely an improvement. However, when there are crashes during test runs, normally the stack also gets dumped into the test logs - would it be possible to do the same here?

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(mshal)

We already store the stdout+stderr from the Firefox process in the logs (profile-run-1.log and profile-run-2.log) - those are also dumped to the live_backing.log if the process quits with rc != 0. However, I don't see the stack trace in stdout/stderr when I forcefully crash the process. Is there an environment variable needed to trigger that? Or is there something in the test runner that opens up the minidump and produces a stack trace?

Flags: needinfo?(mshal)

(In reply to Michael Shal [:mshal] from comment #8)

We already store the stdout+stderr from the Firefox process in the logs (profile-run-1.log and profile-run-2.log) - those are also dumped to the live_backing.log if the process quits with rc != 0. However, I don't see the stack trace in stdout/stderr when I forcefully crash the process. Is there an environment variable needed to trigger that? Or is there something in the test runner that opens up the minidump and produces a stack trace?

I don't know, I'm afraid. However, I found a random browser-chrome intermittent that crashes through my browser history, https://bugzilla.mozilla.org/show_bug.cgi?id=1582953, and looked at a recent log, https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=271255289&repo=mozilla-central&lineNumber=11414 , and based on some of the strings there, it seems like the output comes from a python library we have in the tree, called mozcrash - https://searchfox.org/mozilla-central/rev/2a10f4812f3f7c7645d253a4fe52f26bf58e20e8/testing/mozbase/mozcrash/mozcrash/mozcrash.py#98-116 . It seems based on its docs that we should be able to import it and tell it where to find the dump files, and it can take care of the rest, or something? https://searchfox.org/mozilla-central/rev/2a10f4812f3f7c7645d253a4fe52f26bf58e20e8/testing/mozbase/mozcrash/mozcrash/mozcrash.py#50-82 may be more helpful than the docs. It looks like :ahal, :gbrown, :whimboo and :jmaher may know more about this code.

Yes, mozcrash is the way to go! Test harnesses like mochitest use mozcrash to report stacks when they find minidumps:

https://searchfox.org/mozilla-central/rev/6566d92dd46417a2f57e75c515135ebe84c9cef5/testing/mochitest/runtests.py#2353

As Geoff already mentioned we should clearly make use of mozcrash if it's not the case yet. It does not only print the stack in the logs but also copies the minidump and extra file to the upload folder. Which means doing it manually (one of the attached patches) is not necessary.

Thanks for the info gbrown/whimboo! I've hooked up mozcrash now. Gijs, does this look reasonable? https://treeherder.mozilla.org/#/jobs?repo=try&revision=7aa429b775383f6363a04d0b914e29dd5093e854

The top of the crash shows up in the treeherder preview, with the full trace in the log, and the minidump files are still uploaded as artifacts.

Flags: needinfo?(gijskruitbosch+bugs)

(In reply to Michael Shal [:mshal] from comment #12)

Thanks for the info gbrown/whimboo! I've hooked up mozcrash now. Gijs, does this look reasonable? https://treeherder.mozilla.org/#/jobs?repo=try&revision=7aa429b775383f6363a04d0b914e29dd5093e854

The top of the crash shows up in the treeherder preview, with the full trace in the log, and the minidump files are still uploaded as artifacts.

Sort of, but the symbols don't seem to work? Is that expected?

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(mshal)

I see minidump_stackwalk is being called with a symbols directory of /builds/worker/fetches/crashreporter-symbols. I also see:

[fetches 2019-11-15T23:30:58.318Z] Extracting /builds/worker/fetches/target.crashreporter-symbols-full.zip to /builds/worker/fetches
[fetches 2019-11-15T23:30:58.320Z] Archive:  /builds/worker/fetches/target.crashreporter-symbols-full.zip
[fetches 2019-11-15T23:30:58.321Z]  extracting: certutil/10F78F36AEBE340841159DE3986DBADB0/certutil.dbg.gz  
[fetches 2019-11-15T23:30:58.322Z]   inflating: certutil/10F78F36AEBE340841159DE3986DBADB0/certutil.sym  

which looks like it might not be extracting to .../crashreporter-symbols??

Yeah I just noticed that as well. Locally I had unzipped it into a crashreporter-symbols directory. I'll do another push...

Flags: needinfo?(mshal)

I think this looks better? https://treeherder.mozilla.org/#/jobs?repo=try&revision=f8e60a357a6f923f36c8439d2851274490703146

eg:
[task 2019-11-16T01:19:59.773Z] 0 libmozsqlite3.so!openDatabase [sqlite3.c:f8e60a357a6f923f36c8439d2851274490703146 : 159235 + 0x0]
[task 2019-11-16T01:19:59.773Z] eip = 0xf6541c0c esp = 0xcd690aa0 ebp = 0xcd690b08 ebx = 0xf672f000
[task 2019-11-16T01:19:59.773Z] esi = 0xcc87700c edi = 0x00040006 eax = 0x00000037 ecx = 0x00000037
[task 2019-11-16T01:19:59.773Z] edx = 0xf756a870 efl = 0x00010286
[task 2019-11-16T01:19:59.773Z] Found by: given as instruction pointer in context
[task 2019-11-16T01:19:59.773Z] 1 libmozsqlite3.so!sqlite3_open_v2 [sqlite3.c:f8e60a357a6f923f36c8439d2851274490703146 : 159618 + 0xa]
[task 2019-11-16T01:19:59.773Z] eip = 0xf6543055 esp = 0xcd690b10 ebp = 0xcd690b28 ebx = 0xf672f000
[task 2019-11-16T01:19:59.773Z] esi = 0xf1152cab edi = 0xcc876c00
[task 2019-11-16T01:19:59.773Z] Found by: call frame info

Flags: needinfo?(gijskruitbosch+bugs)

(In reply to Michael Shal [:mshal] from comment #16)

I think this looks better? https://treeherder.mozilla.org/#/jobs?repo=try&revision=f8e60a357a6f923f36c8439d2851274490703146

eg:
[task 2019-11-16T01:19:59.773Z] 0 libmozsqlite3.so!openDatabase [sqlite3.c:f8e60a357a6f923f36c8439d2851274490703146 : 159235 + 0x0]
[task 2019-11-16T01:19:59.773Z] eip = 0xf6541c0c esp = 0xcd690aa0 ebp = 0xcd690b08 ebx = 0xf672f000
[task 2019-11-16T01:19:59.773Z] esi = 0xcc87700c edi = 0x00040006 eax = 0x00000037 ecx = 0x00000037
[task 2019-11-16T01:19:59.773Z] edx = 0xf756a870 efl = 0x00010286
[task 2019-11-16T01:19:59.773Z] Found by: given as instruction pointer in context
[task 2019-11-16T01:19:59.773Z] 1 libmozsqlite3.so!sqlite3_open_v2 [sqlite3.c:f8e60a357a6f923f36c8439d2851274490703146 : 159618 + 0xa]
[task 2019-11-16T01:19:59.773Z] eip = 0xf6543055 esp = 0xcd690b10 ebp = 0xcd690b28 ebx = 0xf672f000
[task 2019-11-16T01:19:59.773Z] esi = 0xf1152cab edi = 0xcc876c00
[task 2019-11-16T01:19:59.773Z] Found by: call frame info

Perfect, thank you!

Flags: needinfo?(gijskruitbosch+bugs)

Where are those extra empty lines are coming from between each of the lines in the mozcrash output?

Flags: needinfo?(mshal)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #18)

Where are those extra empty lines are coming from between each of the lines in the mozcrash output?

It looks like minidump_stackwalk produces '0d 0a' line-endings. The output from mozcrash combines that output with other info (containing '0a' line endings), and the print(output) changes '0a' to '0d 0a', so the part that contains the minidump_stackwalk output becomes '0d 0d 0a'. Does this also show up in crashdumps from Windows tests? Or is the encoding set somewhere to avoid that problem?

Quick testing shows that adding universal_newlines=True to the minidump_stackwalk subprocess call fixes the issue, but I'm not sure if that would have other side effects if there's a workaround somewhere else for test crashes.

Flags: needinfo?(mshal) → needinfo?(hskupin)

Interesting. Here an example from a crash in mochitest media which doesn't show those lines:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=276795702&repo=autoland&lineNumber=2049

Would you mind updating the patches here on the bug so that we can see under which conditions the try build was running? The current versions don't make use of mozcrash at all. Thanks.

Flags: needinfo?(hskupin)

minidump_stackwalk is used to process any crash reports generated by the
PGO run task.

Depends on D52393

To get an accurate backtrace if the run task crashes, we need the
symbols from the instrumented build.

Depends on D53716

I cannot actually see something right now. To no longer hold-up these patches from being landed please file a follow-up bug where this could be investigated.

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #23)

I cannot actually see something right now. To no longer hold-up these patches from being landed please file a follow-up bug where this could be investigated.

Good idea, I filed bug 1598393 as a followup.

Pushed by mshal@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ed1d3a106e3e
Upload the whole 'artifacts' directory for PGO-run tasks; r=tomprince
https://hg.mozilla.org/integration/autoland/rev/307146c7af6a
Add minidump_stackwalk dependencies to generate-profile tasks; r=firefox-build-system-reviewers,chmanchester
https://hg.mozilla.org/integration/autoland/rev/ee2997888f54
Dump symbols during instrumented builds for crashlogs in the run task; r=firefox-build-system-reviewers,chmanchester
https://hg.mozilla.org/integration/autoland/rev/3fc9e18fc545
Upload minidumps from PGO-run task crashes; r=firefox-build-system-reviewers,chmanchester
Pushed by dvarga@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f494e587ee3a
Removing -full for causing build bustage from kind.yml and run_pgo_profile.py. On a CLOSED TREE
Pushed by dvarga@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/33ea535c231b
followup for busting gecko decision task. On a CLOSED TREE

Ahh, apparently this logic turns off full crashreporter syms on autoland, which caused the bustage there while my try push was green: https://searchfox.org/mozilla-central/rev/3483fb259b4edbe4594cfcc3911db97d5441b67d/taskcluster/taskgraph/transforms/build.py#160

Are ok with using the non-full crashreporter symbols for these backtraces?

Thanks :dvarga for the quick fixes!

Flags: needinfo?(gijskruitbosch+bugs)

minidump_stackwalk doesn't care about full crash symbols. Which means the enable-full-crashsymbols attribute can also be removed (it was not)

Flags: needinfo?(gijskruitbosch+bugs)
See Also: → 1599551
Regressions: 1600100
You need to log in before you can comment on or make changes to this bug.