Closed Bug 1744320 Opened 5 months ago Closed 5 months ago

analyze performance difference between breakpad stackwalker and rust minidump-stackwalk

Categories

(Socorro :: Processor, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

We've done some mild ad hoc performance analysis between the old and new stackwalkers, but we should formalize that a bit and write it up somewhere.

This bug covers that.

A while back, I had done some rough timings on my local machine:

Breakpad stackwalker:

  • 200 crash reports from 20211020 (634f9624):
    • cold cache: 25m58.401s
    • hot cache: 12m15.277s

Rust minidump-stackwalk:

  • 200 crash reports from 20211020 (f8618066)
    • cold cache: 20m29.998s
    • hot cache: 7m48.840s
  • same 200 crash reports from 20211020 after fixing panics and landing cache fix (01af4524)
    • cold cache: 25m2.065s, 19m12.645s, 21m15.886s
    • hot cache: 7m4.285s, 7m27.719s

That suggested a large improvement when switching to Rust minidump-stackwalk.
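To put a rough number on that, here is a minimal sketch that converts the `time`-style values above into seconds and computes the ratio for the first pair of runs (634f9624 vs. f8618066). The helper names `to_seconds` and `speedup` are made up for illustration; only the timing strings come from the measurements above.

```python
import re


def to_seconds(timing: str) -> float:
    """Convert a string like '12m15.277s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", timing)
    if match is None:
        raise ValueError(f"unrecognized timing: {timing!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Local timings for the same 200 crash reports, copied from the lists above.
breakpad = {"cold": "25m58.401s", "hot": "12m15.277s"}
rust = {"cold": "20m29.998s", "hot": "7m48.840s"}


def speedup(cache: str) -> float:
    """Breakpad time divided by Rust time; above 1.0 means Rust was faster."""
    return to_seconds(breakpad[cache]) / to_seconds(rust[cache])


for cache in ("cold", "hot"):
    print(f"{cache} cache: {speedup(cache):.2f}x")
```

These are single runs on one machine, so the ratios are only indicative. The later cold-cache runs (01af4524) vary by several minutes from run to run.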

Today, I spent the afternoon extracting data from Grafana and working over it with a Jupyter notebook.

https://github.com/willkg/socorro-jupyter/blob/main/notebooks/bug_1744320_performance_minidump_stackwalk.ipynb

I'm a little puzzled that we're seeing radically different numbers between my local timings and the server timings. We do have a different version of Rust minidump-stackwalk running in stage than the one I tested locally. The server timings suggest the two stackwalkers are roughly equivalent.

I don't think there's a regression here. I think it's good enough and we should move forward.

I started wondering whether the ratio of crashes with a minidump to crashes with an empty minidump was the same between the two environments. Prod has a higher percentage of crash reports that have an empty minidump.

I had to look at the numbers for today because the field I'm using doesn't exist prior to today's index in prod.

environment   has minidump   has empty minidump   percent
prod          138,557        12,122               11%
stage         10,140         1,067                10%

That might bring the prod mean lower because crash reports with an empty minidump don't actually get any stackwalker processing.

If any other possibilities pop in my head, I'll write them down here.

Aria pointed out we should ultimately compare prod to prod to know whether Rust minidump-stackwalk performs better or not. I'll do that when we get to that point.

Jason pointed out that stage has far less load, relative to what it's provisioned for, than prod does. There are like 604k seconds in a week and stage processed 68k crash reports, so that's like 1 crash report every 10 seconds. Most crashes take 3-4 seconds to process, so stage spends the bulk of its time sitting around.

Because of that, threads running in the stage processor are less likely to be interrupted which will result in more consistent--and probably lower--timings.
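The back-of-the-envelope arithmetic can be sketched like this. It assumes crash reports arrive evenly spaced and are handled by a single worker, neither of which is claimed above; `busy_fraction` is a hypothetical helper, and "68k" is taken as a round 68,000.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800, the "604k" figure above

# "68k" crash reports processed by stage in a week (rounded figure from above).
crashes_per_week = 68_000

# Average gap between crash reports if they arrived evenly spaced.
seconds_between = SECONDS_PER_WEEK / crashes_per_week


def busy_fraction(processing_seconds: float, workers: int = 1) -> float:
    """Fraction of wall-clock time each worker spends processing.

    Assumption for illustration only: work is split evenly across
    `workers` and arrivals are evenly spaced.
    """
    return processing_seconds / (seconds_between * workers)


print(f"one crash report every {seconds_between:.1f}s")
for secs in (3.0, 4.0):
    print(f"{secs:.0f}s per crash -> busy {busy_fraction(secs):.0%} of the time")
```

With more than one processor thread, the per-thread busy fraction drops further, which fits the point that stage threads are rarely competing with each other.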

That's the opposite of what we're seeing in the numbers though, right? Which suggests either this effect is minimal or rust-minidump is doubly-worse.

I added the comment because I thought it was a relevant detail and I may do another pass on the conclusion in the Jupyter notebook.

The graphs suggest the timings in prod are more erratic than they are in stage. Maybe the difference in load explains that?

My gut feeling from my experiences with Rust minidump-stackwalk is that it could be a little better or a little worse, but probably not a lot worse. I still think we should compare prod to prod to know whether Rust minidump-stackwalk performs better or not.

While looking at changes in signatures, I noticed a crash report where rust-minidump minidump-stackwalk ran out of memory during processing. It looks like rust-minidump minidump-stackwalk occasionally spikes in memory use and gets killed off. That's different from what breakpad stackwalker was doing. (This is pretty hand-wavy since I'm trying to observe what's going on using Grafana.)

I wrote up an issue for it: https://github.com/luser/rust-minidump/issues/355

When it happens, the process is killed with SIGABRT (which shows up as a -6 exit code in Grafana). There aren't many on stage--maybe 1 or 2 every 6 hours.
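For context on where a -6 can come from: Python's `subprocess` module reports a child killed by signal N as return code -N on POSIX, and SIGABRT is signal 6. The sketch below assumes the stackwalker is launched through `subprocess`, which isn't stated above; `describe_exit` and `run_and_describe` are hypothetical helpers, not Socorro code.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a human-readable description."""
    if returncode < 0:
        # On POSIX, a negative return code -N means "killed by signal N".
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_and_describe(cmd: list[str]) -> str:
    """Run a command, discard its output, and describe how it ended."""
    proc = subprocess.run(cmd, capture_output=True)
    return describe_exit(proc.returncode)


if __name__ == "__main__":
    # Stand-in for a stackwalker that aborts: a child that calls os.abort().
    print(run_and_describe([sys.executable, "-c", "import os; os.abort()"]))
```

Decoding the code this way makes it easier to tell an abort from an ordinary non-zero exit when counting failures.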

I think that's fine for now and we can fix it as we go along. Aria has some ideas on reducing memory spikes already.

Stage and prod have different loads, but even so I think this doesn't block switching prod to use rust-minidump minidump-stackwalk. If it turns out to be a problem, we'll see it in the memory used graph and can switch back.

The mean timing for the new MinidumpStackwalkerRule is like 2s below the previous BreakpadStackwalkerRule2015, so I'm not sure what that says about our predictions, but it's good!

The new stackwalker is in production now. Any new issues that come up will be addressed in new bugs. Marking this as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED