Closed Bug 1264367 Opened 4 years ago Closed 3 years ago

Update Breakpad to pick up 6c8f80aa8b3ba8120c4158c069bb298c044dedf9

Categories

(Toolkit :: Crash Reporting, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox52 --- fixed

People

(Reporter: ted, Assigned: gsvelto)

References

Details

Attachments

(3 files, 7 obsolete files)

10.79 KB, patch
ted
: review+
Details | Diff | Splinter Review
3.42 KB, text/plain
Details
1.81 MB, patch
gsvelto
: review+
Details | Diff | Splinter Review
I made some changes upstream to write more useful debug info in Linux minidumps, including the full debug ID from the ELF headers instead of just stuffing as many bytes as would fit into a GUID:
https://chromium.googlesource.com/breakpad/breakpad/+/6c8f80aa8b3ba8120c4158c069bb298c044dedf9

Having full build ids available makes it easier to get system symbols (bug 951229), since Fedora has darkserver which you can query by build id:
https://darkserver.fedoraproject.org/

We need Socorro to update to a newer Breakpad snapshot (bug 1262135) so that it can handle these dumps server-side before we update Firefox to produce them.
I was trying to wait to get https://codereview.chromium.org/1903833002/ landed in linux-syscall-support (and updated in Breakpad) before I synced from upstream, but I can't seem to get someone to land that, so I'm just going to put it in breakpad-patches for now and move forward.
Assignee: nobody → ted
Looks like upstream broke the build on Mac, I'll have to fix that.
Since I haven't found time to fix this yet, I'll enumerate the issues I hit:
1) Some code that dump_syms uses started using more STL features, which doesn't build with the host compiler on OS X with the current set of flags we use. I think we need to start passing `-stdlib=libc++` in `HOST_CXXFLAGS`.
2) An upstream change I made broke things on Android, where we calculate debug IDs ahead of time. We just need to tweak that code to match the new upstream API.
Blocks: 1280470
I've got Firefox builds for both Android and MacOS X ready at hand so I can help fix upstream breakpad. Just a couple of questions:

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #4)
> Since I haven't found time to fix this yet, I'll enumerate the issues I hit:
> 1) Some code that dump_syms uses started using more STL features, which
> doesn't build with the host compiler on OS X with the current set of flags
> we use. I think we need to start passing `-stdlib=libc++` in `HOST_CXXFLAGS`.

I think I can handle this, we had a similar issue in B2G.

> 2) An upstream change I made broke things on Android, where we calculate
> debug IDs ahead of time. We just need to tweak that code to match the new
> upstream API.

How does the issue manifest itself? Can you point me to the affected code so I'll have a look?
I've tried sync'ing breakpad to the current upstream and the first (major) issue I encounter is your patch in change 6c8f80aa8b3ba8120c4158c069bb298c044dedf9 which causes the profiler to fail to build and it seems related to debug IDs. Is that the issue you mentioned above? I'll try to fix it.

BTW I've noticed that neither the docs nor the src/processor/testdata directory are present in our copy of breakpad. I assume this is a deliberate choice, correct?
(In reply to Gabriele Svelto [:gsvelto] from comment #6)
> I've tried sync'ing breakpad to the current upstream and the first (major)
> issue I encounter is your patch in change
> 6c8f80aa8b3ba8120c4158c069bb298c044dedf9 which causes the profiler to fail
> to build and it seems related to debug IDs. Is that the issue you mentioned
> above? I'll try to fix it.

I've got a patch for some of that, I will link it here as soon as my push to my hg.mo user repo finishes...

> BTW I've noticed that neither the docs nor the src/processor/testdata
> directory are present in our copy of breakpad. I assume this is a deliberate
> choice, correct?

Correct:
https://dxr.mozilla.org/mozilla-central/rev/01748a2b1a463f24efd9cd8abad9ccfd76b037b8/toolkit/crashreporter/update-breakpad.sh#26

(In reply to Gabriele Svelto [:gsvelto] from comment #5)
> > 2) An upstream change I made broke things on Android, where we calculate
> > debug IDs ahead of time. We just need to tweak that code to match the new
> > upstream API.
> 
> How does the issue manifest itself? Can you point me to the affected code so
> I'll have a look?

The logs from my try push in comment 2 have expired, but the error summary is still available:
/builds/slave/try-and-api-15-000000000000000/build/src/toolkit/crashreporter/nsExceptionHandler.cpp:1667:115: error: no matching function for call to 'google_breakpad::FileID::ElfFileIdentifierFromMappedFile(const void*, u_int8_t [16])'
/builds/slave/try-and-api-15-000000000000000/build/src/toolkit/crashreporter/nsExceptionHandler.cpp:4008:95: error: no matching function for call to 'google_breakpad::FileID::ElfFileIdentifierFromMappedFile(const void*, u_int8_t [16])' 

It's basically the same fixup I did elsewhere in my other patch.
(In reply to Gabriele Svelto [:gsvelto] from comment #5)
> (In reply to Ted Mielczarek [:ted.mielczarek] from comment #4)
> > Since I haven't found time to fix this yet, I'll enumerate the issues I hit:
> > 1) Some code that dump_syms uses started using more STL features, which
> > doesn't build with the host compiler on OS X with the current set of flags
> > we use. I think we need to start passing `-stdlib=libc++` in `HOST_CXXFLAGS`.
> 
> I think I can handle this, we had a similar issue in B2G.

BTW, I tried the simple patch for this and I don't think it fixed my build, but I can't exactly recall, and then I got tied up in other things:
https://pastebin.mozilla.org/8904798
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #9)
> Here's the WIP patch I mentioned in comment 7:
> https://hg.mozilla.org/users/tmielczarek_mozilla.com/mc/rev/561e83c58d06

Thanks for that! I've got the whole thing building on Linux right now. I'll test and fix Windows, Mac and Fennec next week. WIP patches coming.
This patch includes all the changes to breakpad since our last sync. A few tweaks to the helper script to remove some stuff we probably don't need.

I think there's more stuff that we don't need but I haven't looked too much into it yet. Additionally I updated the Makefile.am/Makefile.in files to simplify applying the patches from upstream. IIUC they're not used anyway so we probably don't care.
Assignee: ted → gsvelto
Status: NEW → ASSIGNED
And this is the various fixes needed to build Gecko (under Linux).
This patch contains all the changes to the breakpad sources up to revision a2196179cc626de67ee4e4121246bcc8002d4bad which is the current upstream version. I've skipped the patches we had already cherry-picked and ensured no conflicts were left at the end.

I've split it from the gecko changes for ease of review; I'll merge the two patches before landing.
Attachment #8785284 - Attachment is obsolete: true
Attachment #8785943 - Flags: review?(ted)
This patch contains the various fixes needed to make gecko build correctly with the new version of breakpad. Only on the Mac I had to add platform-specific bits and I've tried to keep them only within the moz.build files which required them to restrict their impact as much as possible.
Attachment #8785286 - Attachment is obsolete: true
Attachment #8785944 - Flags: review?(ted)
I've built and manually tested the patches above under Windows, Mac and Linux. I've taken some test crash dumps with and without the patches applied for comparison; the ones produced by the new version seem sane and I couldn't spot any differences between the two:

- Windows crash dump w/ patches applied
  https://crash-stats.mozilla.com/report/index/c9512a0a-91f9-4804-9560-cddb82160829

- Windows crash dump w/o patches applied
  https://crash-stats.mozilla.com/report/index/b22c041e-b46f-4f7a-8953-92c282160829

- Mac crash dump w/ patches applied
  https://crash-stats.mozilla.com/report/index/e9f587c9-6c3c-4365-8040-2c2ee2160829

- Mac crash dump w/o patches applied
  https://crash-stats.mozilla.com/report/index/65464089-a220-4f4b-b889-482062160829

- Linux crash dump w/ patches applied
  https://crash-stats.mozilla.com/report/index/5ce8f6c7-c732-47e2-b382-21e2e2160829

- Linux crash dump w/o patches applied
  https://crash-stats.mozilla.com/report/index/e2887ac6-3ac7-4bd3-b5f4-d56ca2160829

I've only tested debug builds; I'll test the non-debug ones and report on them tomorrow. I've also prepared a try run but I'm not sure what kind of test coverage I should choose for this so I just went with my regular set of options:

  https://treeherder.mozilla.org/#/jobs?repo=try&revision=7ed176f89e0e
Comment on attachment 8785943 [details] [diff] [review]
[PATCH 1/2] Upgrade breakpad to upstream revision a2196179cc626de67ee4e4121246bcc8002d4bad

The try run turned out a lot of breakage; clearing the review request while I address it.
Attachment #8785943 - Flags: review?(ted)
Comment on attachment 8785944 [details] [diff] [review]
[PATCH 2/2] Accomodate for the new version of breakpad

Likewise.
Attachment #8785944 - Flags: review?(ted)
Manual spot-test checking is good, but we do have a bunch of unit tests that should catch any major issues.
Refreshed patch with a couple more commits taken from upstream
Attachment #8785943 - Attachment is obsolete: true
Attachment #8787170 - Flags: review?(ted)
This is the patch that ensures that breakpad builds correctly. I've built and tested it on 32- and 64-bit Windows, 32- and 64-bit Linux, MacOS X and Fennec and it seems to work everywhere.

A few notes:

- To work on MacOS X I had to add -stdlib=libc++ to the CXXFLAGS where required (including for cross-compiling certain tools)

- Some files now use POSIX thread primitives so I also had to add -pthread in a couple of places

- There was a new warning introduced in upstream breakpad on Windows which I've quieted in the relevant moz.build file, hopefully we'll get rid of the workaround soon enough

- Last but not least on 32-bit Linux and on Android the syscall wrappers that breakpad uses from src/third_party/lss/linux_syscall_support.h stubbornly declare the sys_mmap() function with a single argument and then sys_mmap2() with six arguments. While the former did exist at some point in ancient times I've never seen it used so I'm not sure why this choice. This causes those platforms to break during build as later they explicitly call sys_mmap() (rather than #define'ing the proper sys_mmap() to mmap()). I've tried some moz.build magic to work around the issue but couldn't find a suitable solution so I've added three lines to linux_syscall_support.h to fix the problem. It's not pretty but it works and I don't think it's going to cause us problems when we'll rebase in the future as it's a minuscule change.
Attachment #8785944 - Attachment is obsolete: true
Attachment #8787171 - Flags: review?(ted)
Comment on attachment 8787170 [details] [diff] [review]
[PATCH 1/2 v2] Upgrade breakpad to upstream revision 704f41ec901c419f8c321742114b415e6f5ceacc

Wrong name
Attachment #8787170 - Attachment description: [PATCH 1/2 v2] Upgrade breakpad to upstream revision [PATCH 1/2] Upgrade breakpad to upstream revision 704f41ec901c419f8c321742114b415e6f5ceacc → [PATCH 1/2 v2] Upgrade breakpad to upstream revision 704f41ec901c419f8c321742114b415e6f5ceacc
This commit from upstream changed the version of linux-syscall-support in use:
https://chromium.googlesource.com/breakpad/breakpad/+/67f738b7adb47dc1e3b272fb99062f4192fa6651%5E!/

It looks like they removed sys_mmap2 and made sys_mmap work on all platforms. What was breaking that was trying to use sys_mmap2?
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #22)
> This commit from upstream changed the version of linux-syscall-support in
> use:
> https://chromium.googlesource.com/breakpad/breakpad/+/
> 67f738b7adb47dc1e3b272fb99062f4192fa6651%5E!/
> 
> It looks like they removed sys_mmap2 and made sys_mmap work on all
> platforms. What was breaking that was trying to use sys_mmap2?

It's the opposite problem, the code is using sys_mmap() but linux_syscall_support.h is declaring it as sys_mmap(void*) (i.e. the ancient, traditional mmap interface) and declaring sys_mmap2() with the correct (modern) mmap prototype. So the code has to use sys_mmap2() to get the correct call which is why I added that small definition at the bottom of linux_syscall_support.h. BTW this happens only on Linux-based 32-bit hosts which boils down to 32-bit Linux and Android.
Oh! When you updated Breakpad from upstream, did you just not pick up the newer version of linux-syscall-support? If you didn't use my update-breakpad script that's in-tree, you would have to manually run `gclient sync` in the Breakpad repo to sync the subrepos (like lss). Looking in the Breakpad repo I don't see sys_mmap2 used anywhere, and sys_mmap is only used in these two places:
https://github.com/google/breakpad/blob/67f738b7adb47dc1e3b272fb99062f4192fa6651/src/common/memory.h#L119
https://github.com/google/breakpad/blob/67f738b7adb47dc1e3b272fb99062f4192fa6651/src/common/linux/memory_mapped_file.cc#L90

Breakpad's DEPS file specifies rev 3f6478ac95edf86cd3da300c2c0d34a438f5dbeb of linux-syscall-support:
https://chromium.googlesource.com/breakpad/breakpad/+/master/DEPS#59

Which definitely has the code to define sys_mmap everywhere:
https://chromium.googlesource.com/linux-syscall-support/+/3f6478ac95edf86cd3da300c2c0d34a438f5dbeb/linux_syscall_support.h#4019
I also just remembered, looking at that, that we have a local patch in our linux-syscall-support that finally landed upstream:
https://chromium.googlesource.com/linux-syscall-support/+/1549d20f6d3e7d66bb4e687c0ab9da42c2bff2ac%5E%21/

I haven't updated the DEPS file in Breakpad yet though, so we'll need to apply that again locally until that happens. (It was causing some bad crash reports on Android.)
Comment on attachment 8787171 [details] [diff] [review]
[PATCH 2/2 v2] Accomodate for the new version of breakpad

Review of attachment 8787171 [details] [diff] [review]:
-----------------------------------------------------------------

Modulo the linux-syscall-support issue, I think this is otherwise fine.

::: testing/tools/fileid/linux_fileid.cpp
@@ -8,2 @@
>  
> -//TODO: move this somewhere common, this is copied from dump_symbols.cc

Glad I finally made it possible to fix this TODO. :)

::: toolkit/crashreporter/google-breakpad/src/common/dwarf/moz.build
@@ +28,5 @@
>  ]
>  
> +if CONFIG['OS_ARCH'] == 'Darwin':
> +    HOST_CXXFLAGS += [
> +        '-stdlib=libc++',

Could we just put this in HOST_CXXFLAGS globally for the build?

::: toolkit/crashreporter/nsExceptionHandler.cpp
@@ +1669,5 @@
>    for (unsigned int i = 0; i < library_mappings.size(); i++) {
> +    PageAllocator allocator;
> +    auto_wasteful_vector<uint8_t, sizeof(MDGUID)> guid(&allocator);
> +    FileID::ElfFileIdentifierFromMappedFile(
> +      (void const *)library_mappings[i].start_address, guid);

I wasn't confident that this code would work, but apparently I must have wired things up properly upstream. I'd love to get rid of this code at some point, glandium had some patches to make Breakpad use the DT_DEBUG info from the ELF headers to locate shared libraries, which would work right with our custom linker.

The downside is that we won't be putting full ELF Build IDs into the minidumps on Android.

::: toolkit/crashreporter/test/moz.build
@@ +34,5 @@
>  include('/toolkit/crashreporter/crashreporter.mozbuild')
>  
>  NO_PGO = True
> +
> +# Temporary workaround for an issue in upstream breakpad

Can you file a bug in Breakpad and mention it here?
Attachment #8787171 - Flags: review?(ted) → review-
Comment on attachment 8787170 [details] [diff] [review]
[PATCH 1/2 v2] Upgrade breakpad to upstream revision 704f41ec901c419f8c321742114b415e6f5ceacc

Review of attachment 8787170 [details] [diff] [review]:
-----------------------------------------------------------------

rs=me (obviously not going to thoroughly review that patch)
Attachment #8787170 - Flags: review?(ted) → review+
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #24)
> Oh! When you updated Breakpad from upstream, did you just not pick up the
> newer version of linux-syscall-support? If you didn't use my update-breakpad
> script that's in-tree, you would have to manually run `gclient sync` in the
> Breakpad repo to sync the subrepos (like lss).

I just git pull'd breakpad's tree; I forgot I originally cloned it with gclient sync. There were a bunch of conflicts due to the cherry picks so it involved a fair bit of manual intervention... and doing stuff manually is prone to errors :-/

Anyway I've checked with the updated version of linux_syscall_support.h and they removed the single-parameter sys_mmap implementation in revision 3f6478ac95edf86cd3da300c2c0d34a438f5dbeb so it's going to work now. I'll also make sure I pickup up version 1549d20f6d3e7d66bb4e687c0ab9da42c2bff2ac

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #26)
> Could we just put this in HOST_CXXFLAGS globally for the build?

Yes but I think it would be better if this was done by a build peer with more testing. I'm a bit nervous at changing a compile flag for the entire tree without extensive testing. We can always remove my local changes at a later stage when/if we enable it everywhere.

> ::: toolkit/crashreporter/test/moz.build
> @@ +34,5 @@
> >  include('/toolkit/crashreporter/crashreporter.mozbuild')
> >  
> >  NO_PGO = True
> > +
> > +# Temporary workaround for an issue in upstream breakpad
> 
> Can you file a bug in Breakpad and mention it here?

Sure!
Updated patch with lss using the proper revision from DEPS (3f6478ac95edf86cd3da300c2c0d34a438f5dbeb) and with 1549d20f6d3e7d66bb4e687c0ab9da42c2bff2ac cherry-picked on top of it. Carrying over the r+ flag.
Attachment #8787170 - Attachment is obsolete: true
Attachment #8791558 - Flags: review+
Updated patch with the mmap-related change to linux_syscall_support.h removed since it's not needed anymore.
Attachment #8787171 - Attachment is obsolete: true
Attachment #8791560 - Flags: review?(ted)
(In reply to Gabriele Svelto [:gsvelto] from comment #28)
> (In reply to Ted Mielczarek [:ted.mielczarek] from comment #24)
> > Oh! When you updated Breakpad from upstream, did you just not pick up the
> > newer version of linux-syscall-support? If you didn't use my update-breakpad
> > script that's in-tree, you would have to manually run `gclient sync` in the
> > Breakpad repo to sync the subrepos (like lss).
> 
> I just git pull'd breakpad's tree; I forgot I originally cloned it with
> gclient sync. There were a bunch of conflicts due to the cherry picks so it
> involved a fair bit of manual intervention... and doing stuff manually is
> prone to errors :-/

Yeah, I have been making some plans to make this process easier. We're down to a single non-upstreamed patch, which is good, but I think it'd be better if we had a Breakpad fork in which we maintained our local patches so we could just rebase them against upstream master.

> (In reply to Ted Mielczarek [:ted.mielczarek] from comment #26)
> > Could we just put this in HOST_CXXFLAGS globally for the build?
> 
> Yes but I think it would be better if this was done by a build peer with
> more testing. I'm a bit nervous at changing a compile flag for the entire
> tree without extensive testing. We can always remove my local changes at a
> later stage when/if we enable it everywhere.

Okay, it's not terribly important.
Comment on attachment 8791560 [details] [diff] [review]
[PATCH 2/2 v3] Accomodate for the new version of breakpad

Review of attachment 8791560 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!
Attachment #8791560 - Flags: review?(ted) → review+
Thanks, the try run looks OK-ish (minus the usual stuff that needs to be re-triggered) so it's time to land on inbound.
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/af84ff6bfd7e
Upgrade breakpad to upstream revision 704f41ec901c419f8c321742114b415e6f5ceacc r=ted
I've folded the patches into one before landing BTW. The first patch doesn't build correctly on its own but I've kept them separated for ease of review.
That makes sense, that's what I've done in the past. Thanks for driving this to completion!
https://hg.mozilla.org/mozilla-central/rev/af84ff6bfd7e
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
Backed out in https://hg.mozilla.org/mozilla-central/rev/4c5820c29259b5e295dc264a33d65566808a4fc2

At least on Android (I haven't yet found a crash on any other platform, but conveniently Android crashes every push), that caused us to not have symbols anymore, https://treeherder.mozilla.org/logviewer.html#?job_id=35922616&repo=mozilla-inbound
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla51 → ---
Not sure what to make of the non-Android state, since I thought I saw a few stacks with symbols, but mostly https://treeherder.mozilla.org/logviewer.html#?job_id=3594183&repo=autoland
(In reply to Phil Ringnalda (:philor) from comment #40)
> Not sure what to make of the non-Android state, since I thought I saw a few
> stacks with symbols, but mostly
> https://treeherder.mozilla.org/logviewer.html#?job_id=3594183&repo=autoland

You mean even on non-Android builds stacks are not symbolicated? I've manually tested all platforms - save for Fennec - by sending crash dumps to Socorro and they all had correct symbols.
Flags: needinfo?(philringnalda)
Right: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=d8a383370502b41992775dc4efc2501453c14170&filter-searchStr=crashtest was where it met something else that was crashing us in crashtests (and reftests) on Mac and Linux (don't know about Windows, since that same thing was keeping us from building there).
Flags: needinfo?(philringnalda)
OK, Breakpad comes with a handful of tools to deal with symbols we build for the host only; I suppose those are used by taskcluster and it's them that caused this behavior. If that's the case I'm not surprised it affects more than Fennec alone. Sorry for the breakage... this is going to take a while to figure out before I land again :-/
Ted, I'll need your help to figure out what's wrong here. If you look at the failed logs here:

https://treeherder.mozilla.org/logviewer.html#?job_id=35922616&repo=mozilla-inbound#L1777

You'll notice that mozcrash doesn't print out a symbolicated signature, relying instead on the module + offset output. I've looked into mozcrash's sources and it seems to use minidump-stackwalk to extract the signature from the minidump. I've got a few questions about that:

- Which version of minidump-stackwalk is being used? I've looked through the sources and it seems that the tool is built on the server itself but I might be mistaken on this one

- I've double-checked the output of both the 'fileid' and 'dump_syms' tools with my patches applied. 'fileid' provides identical IDs with and without my patch-set applied. 'dump_syms' output on the other hand is slightly different. First of all the second line of the output file contains the code ID in the following format: 'INFO CODE_ID <GUID>'. Secondly there are a lot more 'PUBLIC ...' lines with my patches applied. Apart from these changes though the output is identical; so my question is, could the small changes in 'dump_syms' output confuse our infrastructure (or minidump_stackwalk) in a way that causes it not to read symbols correctly?

- Last but not least, do you know how I can test this locally? Without a way to test I've spent my last two days of work doing code-reading which gave me hints but no definitive answer as to what could have caused this issue.
Flags: needinfo?(ted)
Oh, shoot. We use binaries built from http://hg.mozilla.org/users/tmielczarek_mozilla.com/stackwalk-http/

in the tooltool manifests here:
https://dxr.mozilla.org/mozilla-central/source/testing/config/tooltool-manifests/

I've got a set of scripts to rebuild the binaries, I'll do that tomorrow:
https://github.com/luser/breakpad-taskcluster

Note that there may still be OS X symbol issues, that's bug 1301751.
Flags: needinfo?(ted)
Thanks Ted. So is your modified version built against an older breakpad that would be confused by the 'INFO CODE_ID ...' line in the symbol file?
(In reply to Gabriele Svelto [:gsvelto] from comment #46)
> Thanks Ted. So is your modified version built against an older breakpad that
> would be confused by the 'INFO CODE_ID ...' line in the symbol file?

It's built off an older revision, but the problem is that it doesn't know how to handle the new ELF Build ID-based debug IDs that are written in the minidumps from Linux clients now. (We updated the binaries that Socorro uses a while ago.)
Depends on: 1303980
Since bug 1303980 has landed we can try re-landing this. I've launched a very extensive try-run here and will be waiting to see if it completes correctly before re-landing:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a11a991c00e7

Either way I won't reland the follow up bugs (breakpad processor integration, minidump analyzer and crash ping augmentation) until we're sure this sticks. Having landed all the patches together created quite a bit of confusion when this one was backed out.
The try run is pleasantly green, including Talos which seemed to have been affected during the previous attempt. Time to land again.
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ae50c61cb134
Upgrade breakpad to upstream revision 704f41ec901c419f8c321742114b415e6f5ceacc r=ted
OK, now I'm stumped. Ted, any ideas what could still be missing here?
Flags: needinfo?(ted)
Attached file minidump_dump excerpt
I looked at one of the dumps that philor noted and it looks like Breakpad is writing two module entries in the dump for some of our shared libraries. Attached is an excerpt from the minidump_dump output showing the two module entries for libxul, and the lines from /proc/self/maps. Because the module entry that covers the actual code has a debug_identifier that's all zeroes the stackwalker can't find the matching symbols.
Flags: needinfo?(ted)
Not being able to reproduce on my box I've started digging into the breakpad changes we've pulled in and I think I found the culprit. Hold tight.
Reading the list of changes in breakpad it seems like this might be causing the issue:

https://chromium.googlesource.com/breakpad/breakpad/+/17ad0c18b179c135fc5a3d2bba199c3fa4276035

Note the comment for the code that was removed:

-          // Also merge mappings that result from address ranges that the
-          // linker reserved but which a loaded library did not use. These
-          // appear as an anonymous private mapping with no access flags set
-          // and which directly follow an executable mapping.

That's our mappings alright, anonymous with no access flags.

I'll try reverting the patch and see if it fixes our issue. If that's the case I suppose I'll try to analyze this issue more and possibly provide an upstream fix; for now my priority is to get this landed though.
Here's another try-run with that breakpad commit backed-out:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=5dd5751e814d

I hope it crashes a lot so I can have a look at the minidumps :)
Backing out that commit didn't solve the issue. Fortunately I've managed to reproduce it locally so I'll try to bisect the breakpad changes until I find the root issue and possibly address it right away.
After mucking around a lot with my build I found out the following:

- For some reason running Firefox with ./mach run or invoking the executable directly doesn't seem to produce the multiple mappings that appear in automation. Installing the build and then running it via mozilla-run.sh does, so it probably has to do with the environment or something. Anyway this allowed me to look at the problem directly

- Breakpad's 17ad0c18b179c135fc5a3d2bba199c3fa4276035 was causing part of the problem but it wasn't the sole culprit. Commit 8915f7be39448d9257b6da3ad0233944d1d9a92a [1] was also involved. I've reverted both and the generated dumps now look sane (no duplicate libxul.so module).

I've now made yet another try run to see if we get proper signatures when the tests crash; hopefully this will be the last.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=2c4d5358a26a

[1] https://chromium.googlesource.com/breakpad/breakpad/+/8915f7be39448d9257b6da3ad0233944d1d9a92a
It looks like it's working:

https://treeherder.mozilla.org/logviewer.html#?job_id=27956305&repo=try#L2330

I'll wait to see the Android and MacOS X results and then I'll land the modified patch.
Unfortunately it doesn't seem to be working on Android:

https://treeherder.mozilla.org/logviewer.html#?job_id=27963586&repo=try#L2732

If you look at the dump it exhibits the same issue we've seen before with libxul.so appearing twice in the modules list.
It occurs to me that we should really stand up some sort of end-to-end test in automation that runs Firefox, intentionally crashes it, and then processes the minidump with our tooltool minidump_stackwalk and validates that we have useful symbols. (I think such a thing should be fairly doable in taskcluster.)
Good point; once this is behind us we should really improve the test coverage around this stuff. Anyway I suspect that the issue I'm seeing on Android is because of breakpad's cd744acecc4b1889f7452d67103b3fc5a015631d commit so I'm reverting that one too. The try run is here:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=310b8a85d55b

Fingers crossed.
I'm not sure, but you might be chasing an unreachable carrot with those particular "crashes" on Android - "exceeded maximum time" is the Android equivalent of "330 seconds without output" on desktop, where we just crash ourselves in the (almost always vain) hope that some part of the stack might say why we were hung. Quite often we're doing absolutely nothing, just spinning, and desktop or Android, new Breakpad or old, the stacks are crap because we're off in a system library we don't have symbols for while we sit and spin.

Given that your try links don't include &exclusion_profile=false I must not have mentioned how much handier the hidden Android mochitest-gl jobs are for finding crashes, though I see by https://treeherder.mozilla.org/#/jobs?repo=try&revision=310b8a85d55b&selectedJob=27988277&group_state=expanded&exclusion_profile=false that they might have chosen an inconvenient time to fix their crashiness. I retriggered a bunch in hopes of getting you a jemalloc_crash signature, which is where they *should* crash.
(In reply to Phil Ringnalda (:philor) from comment #63)
> I'm not sure, but you might be chasing an unreachable carrot with those
> particular "crashes" on Android - "exceeded maximum time" is the Android
> equivalent of "330 seconds without output" on desktop, where we just crash
> ourselves in the (almost always vain) hope that some part of the stack might
> say why we were hung. Quite often we're doing absolutely nothing, just
> spinning, and desktop or Android, new Breakpad or old, the stacks are crap
> because we're off in a system library we don't have symbols for while we sit
> and spin.

I've double-checked by downloading the minidump file and verifying that the entry for libxul.so in the module list is duplicated as Ted pointed out above. The problem's still there and I'll be spending today tracing it down :-/
Another update: I've done a lot of testing on Fennec and I've discovered something that's very confusing. I've checked the modules list for duplications of the libxul.so module as that seemed to be caused by the upstream version... but in fact some were already there. See this crash for example:

https://crash-stats.mozilla.com/report/index/caadd94e-263e-423e-bab3-1ab942160928

As you'll notice there's an entry in the modules list called libxul.so and one called libxul.so (deleted) with an empty debug identifier. Long story short this needs even more digging to figure out what's wrong with upstream.
This is the patch I mentioned (and tested) in comment 58. I've been sitting on it for a week because it still caused signatures to be missing and it also had the issue Ted mentioned in comment 53 (duplicate libxul.so entry in the module list)... After a week of testing it turns out that the former problem is probably not caused by the patch and the latter is not a problem, it was there already on some architectures (e.g. Fennec) but was harmless. The only case it's not is on Linux which is why I had to rollback a couple of the upstream patches to fix it.

Long story short I fear that the remaining issues are still in automation. Here's a couple of try runs that show what's happening. This is a try run on a plain mozilla-central tree with a patch that deliberately crashes a test. I've put the crashing method in an existing C++ object to get a very specific signature that could be identified easily:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c06744fe558f

And here's the crashes with proper signatures on Linux, Windows and Fennec:

https://treeherder.mozilla.org/logviewer.html#?job_id=28309807&repo=try#L4481
https://treeherder.mozilla.org/logviewer.html#?job_id=28313129&repo=try#L5367
https://treeherder.mozilla.org/logviewer.html#?job_id=28308809&repo=try#L1775

Here's another try run with my patches applied; most of the crashes show the proper signature:

https://treeherder.mozilla.org/logviewer.html#?job_id=28311079&repo=try#L4481
https://treeherder.mozilla.org/logviewer.html#?job_id=28311208&repo=try#L1775

But not all of them, see this other Fennec run that's part of a different job (tc-M) in the same run:

https://treeherder.mozilla.org/logviewer.html#?job_id=28312594&repo=try#L2062

As you can see the signature is missing and we get @ libxul.so + 0x3999d6 instead.

After digging further in the differences between job 28311208 and 28312594 I noticed something important: they're not using the same minidump_stackwalk binary! Job 28311208 is using the updated one which it downloads before running it, in the raw log you'll find the following lines:

23:43:49     INFO -  'download_minidump_stackwalk': True,
[...]
23:46:45     INFO -  INFO - File linux64-minidump_stackwalk retrieved from local cache /builds/tooltool_cache
[...]
23:46:45     INFO -  'MINIDUMP_STACKWALK': '/builds/slave/test/build/linux64-minidump_stackwalk',

https://archive.mozilla.org/pub/mobile/try-builds/gsvelto@mozilla.com-0e3a852229bc791464d591859f037eb244c30ec7/try-android-api-15/try_ubuntu64_vm_armv7_large_test-mochitest-3-bm121-tests1-linux64-build166.txt.gz

While in the other job (28312594) - which is not symbolicating properly - you'll find the following:

[task 2016-09-30T06:52:16.172713Z] 06:52:16     INFO -  'download_minidump_stackwalk': False,
[...]
[task 2016-09-30T06:52:16.177861Z] 06:52:16     INFO -  'minidump_stackwalk_path': '/usr/local/bin/linux64-minidump_stackwalk',
[...]
[task 2016-09-30T06:55:02.996085Z] 06:55:02     INFO -  'MINIDUMP_STACKWALK': '/usr/local/bin/linux64-minidump_stackwalk',

I'm not sure why the two jobs are different but they're obviously configured differently in automation and it seems like the latter is using an out-of-date version of minidump_stackwalk which is probably part of its VM image.

Ted, do you know how to fix this?
Attachment #8791558 - Attachment is obsolete: true
Flags: needinfo?(ted)
Attachment #8796437 - Flags: review+
I don't know offhand, but I will look into this. I didn't know this was how these worked! This sucks, these should use tooltool like everything else. I'll make sure this gets fixed so that we don't miss this the next time around.
I've found only two scripts which are doing it; one is this one:

http://hg.mozilla.org/mozilla-central/file/tip/taskcluster/ci/android-test/tests.yml#l23

... and this is the other:

http://hg.mozilla.org/mozilla-central/file/tip/taskcluster/ci/desktop-test/tests.yml#l25

If you look at that file you'll notice that it's explicitly singling out minidump_stackwalk :-/

http://hg.mozilla.org/mozilla-central/file/tip/testing/mozharness/configs/remove_executables.py

I can prepare a patch to get rid of this but I don't know how to test it since I'm not very familiar with our automation setup.
Depends on: 1306662
OK, I figured this out and I have a patch on bug 1306662. That patch should ensure that this is fixed for future updates too, as it'll make the desktop-test images use the tooltool manifest when they get built, which also means they will get rebuilt automatically  if the manifest changes. Sorry about that, I clearly did not remember that was how those worked.
Flags: needinfo?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #61)
> It occurs to me that we should really stand up some sort of end-to-end test
> in automation that runs Firefox, intentionally crashes it, and then
> processes the minidump with our tooltool minidump_stackwalk and validates
> that we have useful symbols. (I think such a thing should be fairly doable
> in taskcluster.)

I filed bug 1306678 on this.
I've send some try runs during the week-end and today it seems that your changes from bug 1306678 where finally picked up. This is today's try run with my crash-inducing patch and correct crash signatures:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=91c7a8dc78ca

Linux:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=91c7a8dc78ca&selectedJob=28409632
https://treeherder.mozilla.org/#/jobs?repo=try&revision=91c7a8dc78ca&selectedJob=28409654

Fennec:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=91c7a8dc78ca&selectedJob=28410458
https://treeherder.mozilla.org/#/jobs?repo=try&revision=91c7a8dc78ca&selectedJob=28409573

The latter was always failing to produce symbols before. I'll wait for the run to finish across all architectures and land again; thanks a lot for your help.
Excellent! Thanks for your persistence! For my own clarity, you wound up having to back out two upstream Breakpad changes:
17ad0c18b179c135fc5a3d2bba199c3fa4276035
8915f7be39448d9257b6da3ad0233944d1d9a92a

correct? If so we should get an upstream issue filed so we can get this fixed there for the next time we sync.
Blocks: 1280577
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #72)
> Excellent! Thanks for your persistence! For my own clarity, you wound up
> having to back out two upstream Breakpad changes:
> 17ad0c18b179c135fc5a3d2bba199c3fa4276035
> 8915f7be39448d9257b6da3ad0233944d1d9a92a
> 
> correct? If so we should get an upstream issue filed so we can get this
> fixed there for the next time we sync.

Yes, I'll be filing a bug for that as well as one for the compilation warning that shows up on Windows. Fixing the latter is going to be easy but the former is definitely tricky: one of the commits in breakpad is a complete back-out of a previous one which was said to have a problem. This is the commit that was backed out:

https://breakpad.appspot.com/7714003

And this is the back-out:

https://codereview.chromium.org/1923383002

Fixing it sounds tricky but we'll see.
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1adfbab318a6
Upgrade breakpad to upstream revision 704f41ec901c419f8c321742114b415e6f5ceacc r=ted
All the artificial crashes on my try run look fine, pushing again with fingers crossed.
It seems like we still have some old version of minidump_stackwalk laying around. Ted, I've done another quick grep of the testing directory and these look a little suspect to me:

http://hg.mozilla.org/mozilla-central/file/tip/testing/docker/tester/Dockerfile#l8
http://hg.mozilla.org/mozilla-central/file/tip/testing/docker/desktop1604-test/Dockerfile#l68

Could we still be pulling in an old version of minidump_stackwalk?
Flags: needinfo?(gsvelto) → needinfo?(ted)
*sigh*, I guess so. I didn't know desktop1604-test was a thing, but I guess it needs the same fix. I think the tester image isn't used for anything, it was for B2G tests (but I could be wrong).
Flags: needinfo?(ted)
s/xxx/1306662/
See Also: → 1307381
Thanks Ted, I'll try to re-land once you're done. BTW how long does it take for automation to pick up this kind of changes?
Since the docker images are built directly from the in-tree Dockerfiles as a taskcluster task, it's immediate!
Excellent, thanks!
Ted's change is in mozilla-central so I'll re-land one more time.
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/68fb465c2951
Upgrade breakpad to upstream revision 704f41ec901c419f8c321742114b415e6f5ceacc r=ted
https://hg.mozilla.org/mozilla-central/rev/68fb465c2951
Status: REOPENED → RESOLVED
Closed: 4 years ago3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Blocks: 1309172
Blocks: 1239362
You need to log in before you can comment on or make changes to this bug.