Open
Bug 950710
Opened 11 years ago
Updated 2 years ago
stackwalker running amok using excessive virtual memory
Categories
(Toolkit :: Crash Reporting, defect)
Tracking
()
NEW
People
(Reporter: lars, Unassigned)
References
Details
Attachments
(1 file)
41.86 KB,
text/plain
We've had a second instance of stackwalker bringing a processor machine to its knees by using excessive amounts of virtual memory: 49G in today's case. This involved crash a2d85f9e-fd2c-4f5d-ae68-133122131216 at 7am on 2013-12-16 on processor 07. Here's the command that caused the trouble:

    /data/socorro/stackwalk/bin/stackwalker --raw-json /tmp/a2d85f9e-fd2c-4f5d-ae68-133122131216.Thread-24.TEMPORARY.json --pipe /tmp/a2d85f9e-fd2c-4f5d-ae68-133122131216.upload_file_minidump.TEMPORARY.dump "/mnt/socorro/symbols/symbols_ffx" "/mnt/socorro/symbols/symbols_sea" "/mnt/socorro/symbols/symbols_tbrd" "/mnt/socorro/symbols/symbols_mob" "/mnt/socorro/symbols/symbols_penelope" "/mnt/socorro/symbols/symbols_sbrd" "/mnt/socorro/symbols/symbols_camino" "/mnt/socorro/symbols/symbols_os" "/mnt/socorro/symbols/symbols_solaris" "/mnt/socorro/symbols/symbols_opensuse" "/mnt/socorro/symbols/symbols_ubuntu" "/mnt/socorro/symbols/symbols_fedora" "/mnt/socorro/symbols/symbols_adobe" "/mnt/socorro/symbols/symbols_b2g" "/mnt/socorro/symbols/symbols_geeksphone" "/mnt/socorro/symbols/symbols_tclpartner" "/mnt/socorro/symbols/symbols_zte" "/mnt/socorro/symbols/symbols_leo" 2> /dev/null
Comment 1•11 years ago (Reporter)
The third instance of this problem has now happened:

      PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
      640 socorro   20   0 35.1g  18g  828 R 100.0 80.0  9:25.07 stackwalker

See bug 958872 for more info. I've not been able to piece together which crash_id was involved.
Comment 2•11 years ago (Reporter)
it just hit another processor while I happened to be looking. offender: 222e6813-0e8c-434b-aed9-471fa2140111
Comment 3•11 years ago
Comment 5•11 years ago
I tracked down the specific issue here after lars gave me another example dump today. The stackwalker is essentially winding up in an infinite loop where it produces the same frame over and over (possibly due to bad unwind data). Since the fix for bug 894483 landed, this means it'll churn up to UINT32_MAX frames, which is rather large. We can band-aid this by simply commenting out this line for now: https://github.com/mozilla/socorro/blame/master/minidump-stackwalk/stackwalker.cc#L686 Alternately I'm poking at a patch to upstream Breakpad to short-circuit this failure mode.
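The failure mode described here, a walker that keeps yielding the same frame, can be illustrated with a small sketch. This is not Breakpad's code: `walk_stack`, `get_caller`, and the `(instruction pointer, stack pointer)` frame tuple are hypothetical names for illustration, and the 1024 cap is taken from the band-aid commit message. The sketch combines a hard frame limit with a check that each unwind step actually makes progress.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical frame representation: (instruction pointer, stack pointer).
Frame = Tuple[int, int]

# Frame cap; 1024 is the limit named in the band-aid commit message.
MAX_FRAMES = 1024


def walk_stack(
    first: Frame,
    get_caller: Callable[[Frame], Optional[Frame]],
    max_frames: int = MAX_FRAMES,
) -> Tuple[List[Frame], bool]:
    """Collect frames until the unwinder stops, repeats itself, or hits the cap.

    Returns (frames, truncated). `get_caller` stands in for the real unwinder
    and returns None when there is no caller frame.
    """
    frames = [first]
    truncated = False
    while True:
        caller = get_caller(frames[-1])
        if caller is None:
            break  # normal end of the stack
        if caller == frames[-1]:
            # The unwinder produced the same frame again: no progress,
            # so continuing would loop until memory runs out.
            truncated = True
            break
        if len(frames) >= max_frames:
            truncated = True  # hard cap as a last line of defence
            break
        frames.append(caller)
    return frames, truncated
```

The progress check catches the exact case from this bug after a single repeated frame, while the cap bounds memory use for other runaway patterns, such as a cycle through several distinct frames.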
Comment 6•11 years ago
sp-processor06.phx1:

      PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
    18409 socorro   30  10 30.2g  18g 1196 R  99.9 79.3  8:49.04 stackwalker

      PID TTY      STAT   TIME COMMAND
    18409 ?        DN     8:52 /data/socorro/stackwalk/bin/stackwalker --raw-json /tmp/29a8e14e-ce5b-4401-a471-30d812140120.Thread-6.TEMPORARY.json --pipe /tmp/29a8e14e-ce5b-4401-a471-30d812140120.upload_file_minidump.TEMPORARY.dump /mnt/socorro/symbols/symbols_ffx /mnt/socorro/symbols/symbols_sea /mnt/socorro/symbols/symbols_tbrd /mnt/socorro/symbols/symbols_mob /mnt/socorro/symbols/symbols_penelope /mnt/socorro/symbols/symbols_sbrd /mnt/socorro/symbols/symbols_camino /mnt/socorro/symbols/symbols_os /mnt/socorro/symbols/symbols_solaris /mnt/socorro/symbols/symbols_opensuse /mnt/socorro/symbols/symbols_ubuntu /mnt/socorro/symbols/symbols_fedora /mnt/socorro/symbols/symbols_adobe /mnt/socorro/symbols/symbols_b2g /mnt/socorro/symbols/symbols_geeksphone /mnt/socorro/symbols/symbols_tclpartner /mnt/socorro/symbols/symbols_zte /mnt/socorro/symbols/symbols_leo

Process killed.
Comment 7•11 years ago
Another one:

      PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
     2830 socorro   30  10 24.8g  18g 1188 D   2.3 77.9  6:47.30 stackwalker

    $ ps -www 2830
      PID TTY      STAT   TIME COMMAND
     2830 ?        DN     7:02 /data/socorro/stackwalk/bin/stackwalker --raw-json /tmp/df2d9399-f7a3-4d23-92c6-696cd2140120.Thread-10.TEMPORARY.json --pipe /tmp/df2d9399-f7a3-4d23-92c6-696cd2140120.upload_file_minidump.TEMPORARY.dump /mnt/socorro/symbols/symbols_ffx /mnt/socorro/symbols/symbols_sea /mnt/socorro/symbols/symbols_tbrd /mnt/socorro/symbols/symbols_mob /mnt/socorro/symbols/symbols_penelope /mnt/socorro/symbols/symbols_sbrd /mnt/socorro/symbols/symbols_camino /mnt/socorro/symbols/symbols_os /mnt/socorro/symbols/symbols_solaris /mnt/socorro/symbols/symbols_opensuse /mnt/socorro/symbols/symbols_ubuntu /mnt/socorro/symbols/symbols_fedora /mnt/socorro/symbols/symbols_adobe /mnt/socorro/symbols/symbols_b2g /mnt/socorro/symbols/symbols_geeksphone /mnt/socorro/symbols/symbols_tclpartner /mnt/socorro/symbols/symbols_zte /mnt/socorro/symbols/symbols_leo

Process killed.
Comment 8•11 years ago
sp-processor10.phx1:

      PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
    15490 socorro   30  10 27.3g  19g  692 D   3.2 82.9  6:46.83 stackwalker

    $ ps -awww 15490
      PID TTY      STAT   TIME COMMAND
    15490 ?        DN     6:47 /data/socorro/stackwalk/bin/stackwalker --raw-json /tmp/9ebb0146-bfad-4c0e-bd33-935742140122.Thread-3.TEMPORARY.json --pipe /tmp/9ebb0146-bfad-4c0e-bd33-935742140122.upload_file_minidump.TEMPORARY.dump /mnt/socorro/symbols/symbols_ffx /mnt/socorro/symbols/symbols_sea /mnt/socorro/symbols/symbols_tbrd /mnt/socorro/symbols/symbols_mob /mnt/socorro/symbols/symbols_penelope /mnt/socorro/symbols/symbols_sbrd /mnt/socorro/symbols/symbols_camino /mnt/socorro/symbols/symbols_os /mnt/socorro/symbols/symbols_solaris /mnt/socorro/symbols/symbols_opensuse /mnt/socorro/symbols/symbols_ubuntu /mnt/socorro/symbols/symbols_fedora /mnt/socorro/symbols/symbols_adobe /mnt/socorro/symbols/symbols_b2g /mnt/socorro/symbols/symbols_geeksphone /mnt/socorro/symbols/symbols_tclpartner /mnt/socorro/symbols/symbols_zte /mnt/socorro/symbols/symbols_leo
Comment 10•11 years ago (Reporter)
:ted how soon do you think we can get this problem resolved? The occurrence of this problem has gone from once every couple of weeks to several times in the last couple of days. Because this essentially knocks a processor offline and annoys on-call, it is developing into a critical problem. Meanwhile, I'm gonna work on a way to get the processor itself to detect the problem and kill the stackwalker.
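One possible shape for such a processor-side guard, as a minimal sketch rather than Socorro's actual implementation: run the external command with a wall-clock timeout and a cap on its virtual address space, so a runaway child is killed or fails to allocate instead of taking down the machine. The function name `run_limited`, its parameters, and the status strings are hypothetical; `subprocess.run` and `resource.setrlimit` are standard-library calls, and `RLIMIT_AS` assumes a Linux-like host.

```python
import resource
import subprocess
import sys


def run_limited(cmd, timeout_s, max_virtual_bytes):
    """Run cmd with a time limit and a virtual-memory cap.

    Returns (stdout_bytes_or_None, status), where status is one of
    "ok", "failed", or "timeout". All names here are illustrative.
    """

    def limit_memory():
        # Runs in the child just before exec. Only the soft limit is lowered,
        # which never requires extra privileges.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        if hard == resource.RLIM_INFINITY:
            soft = max_virtual_bytes
        else:
            soft = min(max_virtual_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # on expiry the child is killed and reaped
            preexec_fn=limit_memory,
        )
    except subprocess.TimeoutExpired:
        return None, "timeout"
    return proc.stdout, ("ok" if proc.returncode == 0 else "failed")


if __name__ == "__main__":
    # Usage example with a harmless stand-in command.
    print(run_limited([sys.executable, "-c", "print('hi')"], 30, 4 * 1024**3))
```

With a memory cap in place, allocations beyond the limit fail inside the child, which typically makes it exit with an error well before the host starts swapping; the timeout covers the case where the child spins without growing.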
Flags: needinfo?(ted)
Comment 11•11 years ago
Let's just take the safe band-aid patch for now, we can remove it when I get this fixed upstream: https://github.com/mozilla/socorro/pull/1813
Flags: needinfo?(ted)
Comment 12•11 years ago
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/52ae9fb14c5b00cf8fc3637178192597227d34c1
bug 950710 - put the 1024 frame limit back in place until we fix the root cause of runaway stackwalking in upstream Breakpad

https://github.com/mozilla/socorro/commit/7b12930e93ceda589652a50701ca836f146f645b
Merge pull request #1813 from luser/truncate-stacks
bug 950710 - put the 1024 frame limit back in place until we fix the root cause of runaway stackwalking in upstream Breakpad
Updated•5 years ago
Assignee: ted → nobody
Updated•2 years ago
Severity: normal → S3