Closed Bug 1523276 Opened 1 year ago Closed 1 year ago

Implement PHC on Linux

Categories

(Core :: Memory Allocator, task, P2, major)

task

Tracking

()

RESOLVED FIXED
mozilla70
Tracking Status
firefox66 --- wontfix
firefox70 --- fixed

People

(Reporter: decoder, Assigned: njn)

References

(Depends on 1 open bug)

Details

Attachments

(3 files)

This bug is about porting the gwp-asan allocator parts to jemalloc (for a better overall description of gwp-asan, see the parent bug).

For now I'll assume that the client-side crash reporter integration also happens as part of that patch. If that isn't the case, please let me know and I'll file a separate bug to track that task. Thanks!

Depends on: 1523278

This is a work in progress. Plenty of work left to do.

  • Lots of "njn:" comments that identify things that need fixing.
  • There are link errors on Android.
  • It produces lots of logging output to stderr at the moment.
  • There are frequent crashes in FramePointerStackWalk(). (I wonder if this is
    related to the logging output.)
  • Crashes in Windows content processes aren't correctly handled, because I
    don't understand the IPC done in that case.
  • I don't know how to test it. Probably crash tests of some kind?
  • No performance measurements yet.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a270ba88e894c0d7865c4487aa7781ea2e942234
is a representative try run. At least it's working reasonably well on Linux64
and Mac.

Anyway, I think it would be useful to get some preliminary feedback. glandium,
can you look at the replace-malloc parts; gsvelto, can you look at the crash
reporter parts. Thanks!

Type: defect → task

A quick note that new and changed data collections are subject to Data Collection Review before they land.

There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:njn, could you have a look please?

Flags: needinfo?(n.nethercote)

The patch needs more work.

Flags: needinfo?(n.nethercote)

Do you just need more time to work on this or is this blocked on someone/something else? I'll be happy to help unblock if I can.

Flags: needinfo?(n.nethercote)

I wonder if it would appease the bots if you mark the review as "plan changes".

It's mostly a matter of time and attention. There is one technical challenge which is that on Windows I don't understand the way data is transferred from a content process to the parent process when a crash occurs. On Mac and Linux it was obvious and straightforward and I could make the necessary changes quite easily, but the Windows design is a real head-scratcher.

Flags: needinfo?(n.nethercote)

We discussed this offline but I forgot to add the dependency here: this also needs bug 1035892 to be fixed to work correctly on Mac. Unfortunately I'm stalled on that one due to an obscure race-condition triggered by the modernized exception handler.

Depends on: 1035892

Windows would be the highest priority for now, Mac and Linux are only P2 per project description and their support is not a requirement for this to land.

Jed, can you help :njn with comment 8 or can you forward the needinfo to someone else who can? Thanks!

Flags: needinfo?(jld)

I know almost nothing about the IPC situation in Breakpad, but I have general Breakpad-on-Windows experience. If that ends up making me the person who'd have to do the least ramp-up, I can give it a try.

I know how the crash reporter works on Linux, but that knowledge doesn't seem to be in short supply here. For Windows I could figure it out if I had to, but it would be better to start with someone who already has more of the background knowledge — so, probably gsvelto or dmajor, who are both already on this bug.

(Assuming this is about Breakpad proper; the mechanisms for sending crash annotations are more in IPC territory but I'd have to spend some time looking at the code.)

Flags: needinfo?(jld)

A while back I privately (via email) asked both gsvelto and dmajor about the Windows breakpad code. It appears nobody has much idea how that code works, unfortunately.

Here is a slightly longer explanation of the weirdness:

On a content process crash, I need to pass some extra data from the content process to the parent process. On Linux and Mac this is fairly straightforward and I have it working, but the way things work on Windows is really bizarre and I don't understand it. There are these two closely related types, ProtocolMessage and ClientInfo:
https://searchfox.org/mozilla-central/source/toolkit/crashreporter/breakpad-client/windows/common/ipc_protocol.h#140-161
https://searchfox.org/mozilla-central/source/toolkit/crashreporter/breakpad-client/windows/crash_generation/client_info.h#113-149

The latter has various comments saying "WARNING: Do not dereference these pointers as they are pointers in the address space of another process." I don't understand the point of these, and how they are supposed to work, and how data is passed from the child to the parent in general when a crash occurs.

Is the patch on phab the latest version of your changes? Maybe if I had in front of me the Linux source plus a live Windows build with GWP but no Breakpad integration, I might be able to fill in the gap.

Attachment #9053825 - Attachment description: Bug 1523276 - Implement GWP, a probabilistic, dynamic memory checking tool. r=glandium,gsvelto → Bug 1523276 - Implement GWP, a probabilistic, dynamic memory checking tool.

dmajor: I just pushed the latest code. It has a bunch of fprintf statements in the Windows breakpad code that I was using to understand it. They might be helpful, because another part of the weirdness is that things happen in an unexpected order: the IPC struct gets set up at start-up and doesn't seem to get filled in at crash time.

Can you walk me through how to trigger the relevant behavior, what is supposed to happen, and what actually happens? In the console all I see is a flood of "GWP[n,n]" so I don't really know what to be looking for.

dmajor: as written, the patch does the checking and prints a bunch of stuff about it, but you won't see any unusual behaviour.

To trigger a GWP-relevant crash at startup, change the "#if 0" in xpcom/base/nsMemoryReporter.com to "#if 1". The output will show "IS GWP ALLOC" and the subsequent output is relevant to the IPC stuff.

njn: I'm having a lot of trouble wrapping my head around the patch. Partly it's because it's a pretty large size of WIP to keep in mental working set when I don't have the automatic familiarity of having written it. Also, I feel uneasy around the fixme comments and general WIP-ness, not knowing whether what I'm looking at will be the real thing.

To reduce the mental load I'd like to suggest splitting up the patch. In my view there are three logical pieces to this: (1) The allocator itself, with no regard for crash reporting. (In theory, this could stand on its own: Socorro would still hear about the unmapped accesses, it's just that we wouldn't know they came from GWP -- do I understand this right?) (2) crash reporting for Linux, and (3) crash reporting for Windows.

Even if Windows reporting is "the" goal from a project management perspective, I don't think that means the other bits can't be completed first, especially if doing so makes the Windows part easier. If you could polish and land parts 1 and 2, that would make the remaining portion that I need to think about more tractable on my end. What do you think?

I don't think there's much point landing any of this without having the Windows crash handling hooked up.

In terms of understanding things, it's basically just the changes under toolkit/crashreporter/breakpad-client/windows/ (and the code within that directory in general) that's causing me problems. All the rest of it can be thought of as producing a special value on some crashes, and I need to pass that special value from the crashing content process to the parent process. And the existing code for passing values from the crashing content process to the parent process is really weird.

If it's too hard, that's fine, I'll get back to it soon.

(In reply to Nicholas Nethercote [:njn] from comment #19)

I don't think there's much point landing any of this without having the
Windows crash handling hooked up.

It's not like we'd land it without Windows and then walk away. The intent was to enable better collaboration. It's so much easier for me to look at a small piece and know the rest is already tree-quality and landed, than to maintain a large diff that may change in final form.

If you'd rather just work on it, that's fine by me too. But frankly, if I were the reviewer, I would ask for logical pieces to be in separate patches even if you had the whole thing already finished.

Any updates here?

Flags: needinfo?(n.nethercote)

I'm actively working on it. Plan is to ship on Linux64 first, to get something shipping, and then work through the Windows crash reporter weirdness immediately afterwards. (I realize that Windows is a much more desirable target due to the user population.) I'm still aiming for this to happen in Q2.

Flags: needinfo?(n.nethercote)
Attachment #9053825 - Attachment description: Bug 1523276 - Implement GWP, a probabilistic, dynamic memory checking tool. → Bug 1523276 - Implement GWP, a probabilistic, dynamic memory checking tool. r=glandium!,gsvelto!
Priority: -- → P2
Attachment #9053825 - Attachment description: Bug 1523276 - Implement GWP, a probabilistic, dynamic memory checking tool. r=glandium!,gsvelto! → Bug 1523276 - Implement GWP, a probabilistic, dynamic memory checking tool. r=gsvelto
Attachment #9053825 - Attachment description: Bug 1523276 - Implement GWP, a probabilistic, dynamic memory checking tool. r=gsvelto → Bug 1523276 - Implement PHC, a probabilistic heap checker. r=glandium,gsvelto

glandium, gsvelto: the latest version has everything in place, so please do a full review of your parts.

glandium: I'm seeing one problem both locally and on try pushes -- gtests are crashing part way through. I caught one crash while debugging and it looks like a PHC allocation is being given to mozjemalloc to deallocate, which leads to a MOZ_DIAGNOSTIC_ASSERT failure within mozjemalloc. I'm not sure if it's a problem with my code or some pre-existing flaw with the replace-malloc machinery. Oddly enough, the problem isn't manifesting anywhere else and other test suites such as mochitests are working fine.

The gtest failures are caused by the profiler's use of the jemalloc_replace_dynamic function. That function has multiple issues and is a big problem for PHC.

  • It's racy: multiple functions could call jemalloc_replace_dynamic() concurrently, which means replace_init_func() could be called multiple times concurrently, which means memory_hooks.cpp:gMallocTable could be overwritten concurrently.
  • It assumes we're swapping between the gReplaceMallocTableDefault and something else. If the starting table isn't the default one (e.g. it's the PHC table) then we can't switch back to the PHC one. Fixing this is hard; if we try to change the code to use the current table we get more races.
  • The init/uninit calls coming from the profiler aren't balanced; there can be multiple uninit calls for a single (or even zero) init calls.

I'm struggling to come up with a different design that is safe and sufficiently capable for both the profiler's current use case and PHC. Swapping out your allocator mid-execution in a thread-safe manner without locks is hard. I'm not sure how to move forward.

jesup: how important is the MOZ_PROFILER_MEMORY feature?

Flags: needinfo?(rjesup)

The MOZ_PROFILER_MEMORY stuff was added in bug 1480430.

It assumes we're swapping between the gReplaceMallocTableDefault and something else. If the starting table isn't the default one (e.g. it's the PHC table) then we can't switch back to the PHC one.

Putting this another way: at first glance the API makes it look like you can switch to a different replace-malloc and then switch back. But that's not true -- as soon as you call jemalloc_replace_dynamic() you've lost whatever replace-malloc functionality you might have in place, and you can only switch back to the default (i.e. no replacement). Depending on what the original replace-malloc functionality did, this could stop the accumulation of data (e.g. with DMD) or could cause crashes (with PHC).

IMO this is a fundamentally broken design and isn't acceptable. We need to come up with a way to fix it properly, or failing that, we should remove it.

MOZ_PROFILER_MEMORY is a very important feature; it's already been used extensively in performance work (and caught a multi-GB JS memory regression immediately on being enabled). (Someone on the profiler team accidentally made it not-default recently; you need to turn on Memory in options now - this should be fixed sometime soon.)

The current design was constrained by a requirement that the ABI for replacement libs not change. If we relax that requirement, several other things are options.

Agreed, low-overhead switching it tough. Earlier patches changed the ABI, but didn't have the race condition. However, some of your other criticisms might still be valid. And to an extent, I'm not sure we need to be able to run both profiler-with-memory and DMD/etc at the same time; I'm ok with having such limitations (though best enforced in the code in some manner).

Flags: needinfo?(rjesup)

I'm not sure we need to be able to run both profiler-with-memory and DMD/etc at the same time

The plan for PHC is to initially run it for all Nightly users, all the time. We may then possibly extend it to all Beta users. So if those users ever want to use the profiler -- and I am sure that's a hard requirement -- then the profiler can't disable the current replace-malloc functionality when it wants to start counting allocations. So PHC is currently totally blocked by jemalloc_replace_dynamic's flaws.

I've managed to fix the jemalloc_replace_dynamic problems enough for PHC to run (see bug 1557907).

Depends on: 1557907

Comment on attachment 9053825 [details]
Bug 1523276 - Implement PHC, a probabilistic heap checker. r=glandium,gsvelto

chutten: This patch adds five new annotations to crash reports. The vast majority of the time they won't be used. Once in a blue moon, PHC will detect a bad memory access and trigger a crash, in which case these annotations will be added. The annotations indicate the allocation's base address, one or two stack traces (allocation and free locations), the allocation's size, and what kind of allocation (NeverAllocated, Freed, or one of a couple of other possibilities).

This all seems pretty unobjectionable to me, and similar to other stuff we already have in crash reports. The only notable thing is that these annotations, when present, are likely to point directly at potential security vulnerabilities. So we should mark them as security-sensitive, like we already do with some other fields. But I think that happens on the Socorro side, is that right?

Attachment #9053825 - Flags: data-review?(chutten)

Nicholas, how are you going to retrieve and use that data? If the new fields are not placed in crash pings you will get a lot less of them. We usually get 20x to 30x more crash pings than crash reports so limiting yourself to crashreports will lead to a lot less samples. That being said if the annotations are security sensitive I believe that crash-pings are a no-no because IIRC the dataset is public. Chris certainly knows more though.

My current idea is that crash-stats would show those PHC fields to any user who has permissions to see security-sensitive fields.

Comment on attachment 9053825 [details]
Bug 1523276 - Implement PHC, a probabilistic heap checker. r=glandium,gsvelto

Load balancing to :tdsmith

Attachment #9053825 - Flags: data-review?(chutten) → data-review?(tdsmith)

Comment on attachment 9053825 [details]
Bug 1523276 - Implement PHC, a probabilistic heap checker. r=glandium,gsvelto

Thanks for the collection review request. Would you mind filling out the data review request form, attaching it separately to this bug, and raising a data-review? flag on that attachment?

The form is here: https://raw.githubusercontent.com/mozilla/data-review/master/review.md
The process is described here: https://wiki.mozilla.org/Firefox/Data_Collection#Step_1:_Submit_Request

Also, just to confirm wrt crash pings vs reports, crash pings aren't public per se but they are accessible to all employees. Of course, crash report access is metered out through crash-stats.

Flags: needinfo?(n.nethercote)
Attachment #9053825 - Flags: data-review?(tdsmith)
Attached file Data review request
Flags: needinfo?(n.nethercote)
Attachment #9071184 - Flags: data-review?(tdsmith)

If crash ping data isn't public then it would be worth considering it in addition to crash stats. As I mentioned in the comment above they tend to deliver over an order of magnitude more crashes than what we get on crash-stats.

Comment on attachment 9071184 [details]
Data review request

I should have pointed you to the request form at https://raw.githubusercontent.com/mozilla/data-review/master/request.md, not the review form, which is for me.

Since the error was mine (sorry), I'll take a guess at the review questions. Please let me know if you would have answered them differently.

---

Request questions:

1) What questions will you answer with this data?
This will help us detect memory use errors in Firefox.

2) Why does Mozilla need to answer these questions?  Are there benefits for users? Do we need this information to address product or business requirements?
This will help us build a more secure product by identifying instances of unsafe memory use.

3) What alternative methods did you consider to answer these questions? Why were they not sufficient?
We already run tests with ASan and [distribute ASan builds][asan-nightly] to shake out memory bugs in the wild, but the increased memory use means that we can't turn it on by default for users. GWP-Asan is a lightweight way to access some of the same benefits. CI testing and fuzzing are complementary approaches but we expect actual browsing to identify additional bugs.

[asan-nightly]: https://blog.mozilla.org/security/2018/07/19/introducing-the-asan-nightly-project/

4) Can current instrumentation answer these questions?
No; we need more context for the crashes identified this way in order to troubleshoot them.

5) List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox [data collection categories](https://wiki.mozilla.org/Firefox/Data_Collection) on the Mozilla wiki.   

The new annotations are:

```
PHCKind:
  description: >
    The allocation kind, if the crash involved a bad access of a special PHC
    allocation.
  type: string

PHCBaseAddress:
  description: >
    The allocation's base address, if the crash involved a bad access of a
    special PHC allocation. Encoded as a decimal address.
  type: string

PHCUsableSize:
  description: >
    The allocation's usable size, if the crash involved a bad access of a
    special PHC allocation.
  # A 32-bit integer is enough because the maximum usable size of a special PHC
  # allocation is far less than 2 GiB.
  type: integer

PHCAllocStack:
  description: >
    The allocation's allocation stack trace, if the crash involved a bad access
    of a special PHC allocation. Encoded as a comma-separated list of decimal
    addresses.
  type: string

PHCFreeStack:
  description: >
    The allocation's free stack trace, if the crash involved a bad access
    of a special PHC allocation. Encoded as a comma-separated list of decimal
    addresses.
  type: string
```

The new annotations are category 1 data.

6) How long will this data be collected?
These data will be collected permanently. :njn will be responsible for ensuring the data are useful.

7) What populations will you measure?
All users.

8) If this data collection is default on, what is the opt-out mechanism for users?
Crash reporting is opt-in.

9) Please provide a general description of how you will analyze this data.
These data will inform the analysis of crash reports.

10) Where do you intend to share the results of your analysis?
Sporadically, on Bugzilla.

11) Is there a third-party tool (i.e. not Telemetry) that you are proposing to use for this data collection?
No.

---

# Data Review Form (to be filled by Data Stewards)

1) Is there or will there be **documentation** that describes the schema for the ultimate data set in a public, complete, and accurate way?

Yes, in CrashAnnotations.yaml.

2) Is there a control mechanism that allows the user to turn the data collection on and off?

Yes, crash reports are opt-in. (If extended to crash pings, the telemetry opt-out will be the control mechanism.)

3) If the request is for permanent data collection, is there someone who will monitor the data over time?

:njn is responsible for ensuring the data are useful.

4) Using the **[category system of data types](https://wiki.mozilla.org/Firefox/Data_Collection)** on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1.

5) Is the data collection request for default-on or default-off?

Default-off. (Crash pings would be default-on.)

6) Does the instrumentation include the addition of **any *new* identifiers**?

No.

7) Is the data collection covered by the existing Firefox privacy notice?

Yes.

8) Does there need to be a check-in in the future to determine whether to renew the data? (Yes/No)

No, permanent collection.

9) Does the data collection use a third-party collection tool?

No.

---

Result: data-review+ for inclusion in crash reports and/or crash pings
Attachment #9071184 - Flags: data-review?(tdsmith) → data-review+

Thanks, tdsmith! You did a great job of interpolating my responses. I only see one thing worth correcting:

  1. What populations will you measure?
    All users.

Initially it will only be Nightly users on Linux. Soon afterward we will expand to Nightly users on Windows, and possibly Mac.

If things go well, we might expand to Beta users. Beyond that, it's conceivable we'd expand to Release users eventually, but it's far from certain.

Pushed by nnethercote@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3c6a53b1fb24
Implement PHC, a probabilistic heap checker. r=glandium,gsvelto
https://hg.mozilla.org/integration/mozilla-inbound/rev/161dc1283019
Disable PHC for the moment. r=glandium

Bug 1567065 is the follow-up bug to enable PHC on Linux64 Nightly builds.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70

(Retrospectively renaming this bug to account for the change in the tool's name, and to account for the fact that this bug eventually covered just the Linux implementation.)

Summary: Port gwp-asan allocator shim to jemalloc → Implement PHC on Linux
No longer blocks: PHC
No longer depends on: 1035892
You need to log in before you can comment on or make changes to this bug.