Open Bug 950710 Opened 11 years ago Updated 2 years ago

stackwalker running amok using excessive virtual memory

Categories

(Toolkit :: Crash Reporting, defect)

x86_64
Linux
defect

People

(Reporter: lars, Unassigned)

Attachments

(1 file)

We've had a second instance of stackwalker bringing a processor machine to its knees by using an excessive amount of virtual memory: 49G in today's case. This involved crash a2d85f9e-fd2c-4f5d-ae68-133122131216 at 7am on 2013-12-16 on processor 07. Here's the command that caused the trouble:

/data/socorro/stackwalk/bin/stackwalker --raw-json /tmp/a2d85f9e-fd2c-4f5d-ae68-133122131216.Thread-24.TEMPORARY.json --pipe /tmp/a2d85f9e-fd2c-4f5d-ae68-133122131216.upload_file_minidump.TEMPORARY.dump "/mnt/socorro/symbols/symbols_ffx" "/mnt/socorro/symbols/symbols_sea" "/mnt/socorro/symbols/symbols_tbrd" "/mnt/socorro/symbols/symbols_mob" "/mnt/socorro/symbols/symbols_penelope" "/mnt/socorro/symbols/symbols_sbrd" "/mnt/socorro/symbols/symbols_camino" "/mnt/socorro/symbols/symbols_os" "/mnt/socorro/symbols/symbols_solaris" "/mnt/socorro/symbols/symbols_opensuse" "/mnt/socorro/symbols/symbols_ubuntu" "/mnt/socorro/symbols/symbols_fedora" "/mnt/socorro/symbols/symbols_adobe" "/mnt/socorro/symbols/symbols_b2g" "/mnt/socorro/symbols/symbols_geeksphone" "/mnt/socorro/symbols/symbols_tclpartner" "/mnt/socorro/symbols/symbols_zte" "/mnt/socorro/symbols/symbols_leo" 2> /dev/null
A third instance of this problem has now happened:

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM TIME+ COMMAND                                                                   
640 socorro   20   0 35.1g  18g  828 R 100.0 80.0   9:25.07 stackwalker

See bug 958872 for more info. I've not been able to piece together which crash_id was involved.
It just hit another processor while I happened to be looking.
Offender: 222e6813-0e8c-434b-aed9-471fa2140111
I tracked down the specific issue here after lars gave me another example dump today. The stackwalker is essentially winding up in an infinite loop where it produces the same frame over and over (possibly due to bad unwind data). Since the fix for bug 894483 landed, nothing stops the walk early, so it will churn through up to UINT32_MAX (4,294,967,295) frames, which is rather large. We can band-aid this by simply commenting out this line for now:
https://github.com/mozilla/socorro/blame/master/minidump-stackwalk/stackwalker.cc#L686

Alternatively, I'm poking at a patch to upstream Breakpad to short-circuit this failure mode.
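
To make the failure mode above concrete, here is a minimal sketch in Python. It is not Breakpad's or Socorro's actual code. It shows a stack walk loop that stops at a frame cap and also bails out when the unwinder hands back the same frame twice in a row. The names `walk_stack` and `get_caller` and the `(instruction pointer, stack pointer)` frame tuples are hypothetical; only the 1024 limit comes from this bug.

```python
def walk_stack(first_frame, get_caller, max_frames=1024):
    """Collect frames starting at first_frame.

    Hypothetical sketch: a frame is an (instruction_pointer, stack_pointer)
    tuple and get_caller(frame) returns the calling frame, or None at the
    bottom of the stack. Returns (frames, reason).
    """
    frames = []
    frame = first_frame
    while frame is not None:
        if len(frames) >= max_frames:
            # Backstop: never keep more than max_frames frames.
            return frames, "truncated"
        frames.append(frame)
        caller = get_caller(frame)
        if caller == frame:
            # The unwinder made no progress; walking further would only
            # repeat this frame until the cap (or memory) runs out.
            return frames, "repeated_frame"
        frame = caller
    return frames, "complete"
```

In this sketch, an unwinder that keeps returning the frame it was given is stopped after a single frame, while a longer cycle (frame A unwinds to B, B back to A) is only caught by the cap, so the cap remains useful as a backstop even with the repeated-frame check in place.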
sp-processor06.phx1 ;
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18409 socorro   30  10 30.2g  18g 1196 R 99.9 79.3   8:49.04 stackwalker

  PID TTY      STAT   TIME COMMAND
18409 ?        DN     8:52 /data/socorro/stackwalk/bin/stackwalker --raw-json /tmp/29a8e14e-ce5b-4401-a471-30d812140120.Thread-6.TEMPORARY.json --pipe /tmp/29a8e14e-ce5b-4401-a471-30d812140120.upload_file_minidump.TEMPORARY.dump /mnt/socorro/symbols/symbols_ffx /mnt/socorro/symbols/symbols_sea /mnt/socorro/symbols/symbols_tbrd /mnt/socorro/symbols/symbols_mob /mnt/socorro/symbols/symbols_penelope /mnt/socorro/symbols/symbols_sbrd /mnt/socorro/symbols/symbols_camino /mnt/socorro/symbols/symbols_os /mnt/socorro/symbols/symbols_solaris /mnt/socorro/symbols/symbols_opensuse /mnt/socorro/symbols/symbols_ubuntu /mnt/socorro/symbols/symbols_fedora /mnt/socorro/symbols/symbols_adobe /mnt/socorro/symbols/symbols_b2g /mnt/socorro/symbols/symbols_geeksphone /mnt/socorro/symbols/symbols_tclpartner /mnt/socorro/symbols/symbols_zte /mnt/socorro/symbols/symbols_leo

process killed
Another one;

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                   
 2830 socorro   30  10 24.8g  18g 1188 D  2.3 77.9   6:47.30 stackwalker


$ ps -www 2830
  PID TTY      STAT   TIME COMMAND
 2830 ?        DN     7:02 /data/socorro/stackwalk/bin/stackwalker --raw-json /tmp/df2d9399-f7a3-4d23-92c6-696cd2140120.Thread-10.TEMPORARY.json --pipe /tmp/df2d9399-f7a3-4d23-92c6-696cd2140120.upload_file_minidump.TEMPORARY.dump /mnt/socorro/symbols/symbols_ffx /mnt/socorro/symbols/symbols_sea /mnt/socorro/symbols/symbols_tbrd /mnt/socorro/symbols/symbols_mob /mnt/socorro/symbols/symbols_penelope /mnt/socorro/symbols/symbols_sbrd /mnt/socorro/symbols/symbols_camino /mnt/socorro/symbols/symbols_os /mnt/socorro/symbols/symbols_solaris /mnt/socorro/symbols/symbols_opensuse /mnt/socorro/symbols/symbols_ubuntu /mnt/socorro/symbols/symbols_fedora /mnt/socorro/symbols/symbols_adobe /mnt/socorro/symbols/symbols_b2g /mnt/socorro/symbols/symbols_geeksphone /mnt/socorro/symbols/symbols_tclpartner /mnt/socorro/symbols/symbols_zte /mnt/socorro/symbols/symbols_leo

process killed
sp-processor10.phx1 ;

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                    
15490 socorro   30  10 27.3g  19g  692 D  3.2 82.9   6:46.83 stackwalker

$ ps -awww 15490
  PID TTY      STAT   TIME COMMAND
15490 ?        DN     6:47 /data/socorro/stackwalk/bin/stackwalker --raw-json /tmp/9ebb0146-bfad-4c0e-bd33-935742140122.Thread-3.TEMPORARY.json --pipe /tmp/9ebb0146-bfad-4c0e-bd33-935742140122.upload_file_minidump.TEMPORARY.dump /mnt/socorro/symbols/symbols_ffx /mnt/socorro/symbols/symbols_sea /mnt/socorro/symbols/symbols_tbrd /mnt/socorro/symbols/symbols_mob /mnt/socorro/symbols/symbols_penelope /mnt/socorro/symbols/symbols_sbrd /mnt/socorro/symbols/symbols_camino /mnt/socorro/symbols/symbols_os /mnt/socorro/symbols/symbols_solaris /mnt/socorro/symbols/symbols_opensuse /mnt/socorro/symbols/symbols_ubuntu /mnt/socorro/symbols/symbols_fedora /mnt/socorro/symbols/symbols_adobe /mnt/socorro/symbols/symbols_b2g /mnt/socorro/symbols/symbols_geeksphone /mnt/socorro/symbols/symbols_tclpartner /mnt/socorro/symbols/symbols_zte /mnt/socorro/symbols/symbols_leo
:ted how soon do you think we can get this problem resolved? The occurrence of this problem has gone from once every couple of weeks to several times in the last couple of days. Because this essentially knocks a processor offline and annoys on-call, it is developing into a critical problem.

Meanwhile, I'm gonna work on a way to get the processor itself to detect the problem and kill the stackwalker.
Flags: needinfo?(ted)
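
One way the processor could guard against a runaway child is sketched below in Python, under stated assumptions. It is not the Socorro processor's actual code. The function name `run_with_limits` and both limit parameters are hypothetical. The sketch caps the child's address space with `resource.setrlimit(RLIMIT_AS, ...)` and kills the child if it runs past a wall-clock timeout. Note that `preexec_fn` is POSIX-only and the Python documentation warns it is not safe when the parent has multiple threads, so a threaded processor might prefer to wrap the command in a shell `ulimit -v` instead.

```python
import resource
import subprocess


def run_with_limits(argv, max_address_space_bytes, timeout_seconds):
    """Run argv with a virtual memory cap and a wall-clock timeout.

    Returns (returncode, stdout), or (None, b"") if the child had to be
    killed because it exceeded timeout_seconds.
    """

    def limit_memory():
        # Runs in the child between fork and exec. Only the soft limit is
        # lowered, and never above the existing hard limit.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = max_address_space_bytes
        if hard != resource.RLIM_INFINITY:
            soft = min(soft, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        preexec_fn=limit_memory,
    )
    try:
        stdout, _ = proc.communicate(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()          # sends SIGKILL on POSIX
        proc.communicate()   # reap the child so it does not linger
        return None, b""
    return proc.returncode, stdout
```

With a cap in place, a child that tries to grow past the limit should see its allocations fail and exit on its own instead of pushing the machine into swap; the timeout covers the case where it spins without growing.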
Let's just take the safe band-aid patch for now; we can remove it when I get this fixed upstream:
https://github.com/mozilla/socorro/pull/1813
Flags: needinfo?(ted)
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/52ae9fb14c5b00cf8fc3637178192597227d34c1
bug 950710 - put the 1024 frame limit back in place until we fix the root cause of runaway stackwalking in upstream Breakpad

https://github.com/mozilla/socorro/commit/7b12930e93ceda589652a50701ca836f146f645b
Merge pull request #1813 from luser/truncate-stacks

bug 950710 - put the 1024 frame limit back in place until we fix the root cause of runaway stackwalking in upstream Breakpad
Assignee: ted → nobody
Severity: normal → S3