Closed Bug 884300 Opened 6 years ago Closed 6 years ago

CFI records broken on all non-Windows platforms

Categories

(Toolkit :: Crash Reporting, defect, P2, major)

x86
Linux
defect

Tracking

()

RESOLVED FIXED
mozilla24
Tracking Status
firefox21 --- wontfix
firefox22 + wontfix
firefox23 + fixed
firefox24 --- fixed

People

(Reporter: benjamin, Assigned: ted)

References

Details

Attachments

(1 file)

Stackwalking is very broken on x86-android. Symbols are present, but it looks like we're only stack-scanning. I haven't had time to investigate yet, but I'd love help.

In the meantime treat stacks on android-x86 as vague guides for issues.

Does anyone whom regularly looks at breakpad stuff have an android-x86 device?

See e.g. bp-6f72b2a0-ac35-440d-aafb-2a6b42130617 from bug 884200
Do we actually have the CFI info on the symbol server?
[tmielczarek@symbols1.dmz.phx1 symbols_mob]$ grep nsXPCWrappedJS::CallMethod libxul.so/200331396232091C8697FA6B30D2083A0/libxul.so.sym 
FUNC c24b3e 125 0 nsXPCWrappedJS::CallMethod(unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*)

[tmielczarek@symbols1.dmz.phx1 symbols_mob]$ grep "STACK CFI INIT c24b3e" libxul.so/200331396232091C8697FA6B30D2083A0/libxul.so.sym
STACK CFI INIT c24b3e 125 .cfa: 0x3bd5e430 4 + .ra: 0x15af3200 -4 + ^

The CFI records look pretty weird. There are...constants in there for some reason. The symbol files for this libxul are:
http://symbols.mozilla.org/fennec/libxul.so/200331396232091C8697FA6B30D2083A0/libxul.so.sym
http://symbols.mozilla.org/fennec/libxul.so/200331396232091C8697FA6B30D2083A0/libxul.so.dbg.gz

...although, looking at an ARM symbol file, I see the same weirdness:
[tmielczarek@symbols1.dmz.phx1 symbols_mob]$ grep nsXPCWrappedJS::CallMethod libxul.so/350963275103BEEF7243723E5CF7551A0/libxul.so.sym
FUNC 7a9160 a0 0 nsXPCWrappedJS::CallMethod(unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*)

[tmielczarek@symbols1.dmz.phx1 symbols_mob]$ grep "STACK CFI INIT 7a9160" libxul.so/350963275103BEEF7243723E5CF7551A0/libxul.so.sym
STACK CFI INIT 7a9160 a0 .cfa: 0x10fadbc0 0 + .ra: 0x10fae0d8 0 +
I know that if there's something to do on the product side, this has no real chance for 22, but I think we should track nevertheless in case all we need there is server-side anyhow.

Bascially, many crash reports on Android x86 are not very meaningful as long as this bug exists (some reports might at least tell us something, and the Java-exception-type crash reports should still be OK).
This is fallout from the prerequisite patches from bug 779291. It wasn't actually caught by the breakpad unit tests either. :-(
OS: Android → Linux
Summary: Stackwalking broken on x86-android → CFI records broken on Linux platforms
I added a new check in the Breakpad unit tests to catch this, and verified that it fails without the changes in module.cc, but passes with those changes.
Attachment #764185 - Flags: review?(mh+mozilla)
Assignee: nobody → ted
Comment on attachment 764185 [details] [diff] [review]
fix DWARF CFI record serialization

Review of attachment 764185 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/crashreporter/breakpad-patches/03-unique-string.patch
@@ +1681,5 @@
> + }
> + 
> + TEST(Write, OmitUnusedFiles) {
> +   Module m(MODULE_NAME, MODULE_OS, MODULE_ARCH, MODULE_ID);
> +@@ -229,21 +239,23 @@

That entire block moving in the patch looks weird. Was it outdated?
Attachment #764185 - Flags: review?(mh+mozilla) → review+
I think something might have confused diff. The actual patch looks fine.
We'll be landing this to all branches including mozilla-release, once ready. If we don't take this fix, it will hinder our ability to diagnose and resolve Android x86 crashes.
Alex: I tried to clarify in the summary, but I guess I didn't make it clear enough. this breaks stackwalking on all of our non-Windows platforms: Mac, Linux, Android. Some may be broken worse than others, so it may not have been as noticeable.
Summary: CFI records broken on Linux platforms → CFI records broken on all non-Windows platforms
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #10)
> Alex: I tried to clarify in the summary, but I guess I didn't make it clear
> enough. this breaks stackwalking on all of our non-Windows platforms: Mac,
> Linux, Android. Some may be broken worse than others, so it may not have
> been as noticeable.

If this was significantly impacting our crash investigations, we'd have noticed over the past couple of cycles though right? It's not that signature reports are being dropped or something critical in that way?

As long as we don't think this is hindering our investigations, we only want to respin Mobile to be super cautious around Android x86.
Note that the patch here is very tiny and only impacts symbol dumping, not anything else. (The fix itself was two lines, the rest is unit tests + updating a copy of the local patch against upstream.)
(In reply to Alex Keybl [:akeybl] from comment #11)
> If this was significantly impacting our crash investigations, we'd have
> noticed over the past couple of cycles though right?

Not necessarily. We sometimes have bogus stacks due to corruption, even if breakpad works fine, and our crash investigations are very much centered around Windows as that's where most of our crashes come from, and on Android a lot of it is Java exceptions which are not impacted.

Do we know if all stacks look corrupt because of this, and if all frames are bogus? Might the signatures or even some stacks be still good on some platforms?
The top frame should always be correct. This will only affect signatures when the top frame hits the skiplist. If that happens, the signatures could be really weird or totally bogus.
Keywords: helpwanted
Note that we shipped this broken in Firefox 21, if that changes your desire to take the fix on release. This is a very safe patch that I'd be personally fine with landing anywhere, but since we shipped it broken in 21 and didn't really notice, it's hard to argue that we desperately need it for 22.
Let's punt to FF23 given comment 14 and comment 15. I'd rather we not take any non-critical change, and this doesn't feel critical at this point.
https://hg.mozilla.org/mozilla-central/rev/24e9d0e5998d
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla24
I was going to check the symbols on the symbol server, but this didn't land in time for today's nightly builds. I settled for grabbing the symbol package from a m-c build:
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64/1371648262/firefox-24.0a1.en-US.linux-x86_64.crashreporter-symbols.zip

Spot checking that looks good:
STACK CFI INIT 34cc 9a .cfa: $rsp 8 + .ra: .cfa -8 + ^
Comment on attachment 764185 [details] [diff] [review]
fix DWARF CFI record serialization

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Originally landed as part of bug 779291 (landed prior to the final patches there, so it made Firefox 21)
User impact if declined: Crash reports on non-Windows platforms can have bad stacks.
Testing completed (on m-c, etc.): Landed on m-c, has unit tests in the Breakpad tree (we don't run these on Firefox builds).
Risk to taking this patch (and alternatives if risky): Very little risk, the fix is a two-line change that only impacts serializing symbols to disk, so only the symbol dumping process.
String or IDL/UUID changes made by this patch: None
Attachment #764185 - Flags: approval-mozilla-aurora?
Attachment #764185 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Duplicate of this bug: 859527
See Also: → 895470
You need to log in before you can comment on or make changes to this bug.