Closed Bug 459397 Opened 16 years ago Closed 16 years ago

crash-stats can't process Thunderbird 3 Alpha 3 OS X i386 crashes - stackwalk.sh failed with return code 1

Categories

(Socorro :: General, task)

PowerPC
macOS
task
Not set
major

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: gozer, Unassigned)

References

()

Details

(Whiteboard: get a build newer than 2008-12-24)

Attachments

(4 files)

Attached file Example dump
For some reason, it looks like i386 crash-dumps can't get successfully processed. For an example, see:

http://crash-stats.mozilla.com/report/index/27fe10b8-6a9b-4c5b-9617-000ed55daadc

I've tried running minidump_stackwalk manually against it, and it seemed to work.
aravind: do you think you could try manually running minidump_stackwalk against this dump on the production machine to see what's happening?
Any luck having a look at what's happening on the server side of things?
Hardware: PC → Macintosh
Summary: OS X i386 crash-stats can't process Thunderbird 3 Aplha 3 crashes → crash-stats can't process Thunderbird 3 Aplha 3 OS X i386 crashes
Bumping priority on this one, we'd like to be able to view crashes before Beta 1 is out.
Severity: normal → major
Can you submit a new crash report, and give me the id?  I can then run stuff on the server side and see what I find.
Assignee: server-ops → aravind
Easy, here it is:

53425101-CB77-4CB1-A693-9FEDAF6161A3
FYI, when I ran:

./src/processor/minidump_stackwalk 27FE10B8-6A9B-4C5B-9617-000ED55DAADC.dmp 20081006093807

It worked like a charm.
I don't have that crash report (dump files) in the system.  Are you sure that you are submitting those correctly?
And another one from just now:

9d02e889-9a76-4a8f-9f5b-e0ea8ea0d7fd
Hrm, for some reason, I was looking at the crash ids in the pending/ directory while the client was running. I guess submitted reports get their ids changed?

Looking at the submitted/ directory, I can see the last 2 I am talking about:

Crash ID: bp-e9d68e5e-9ff6-11dd-b4da-001cc45a2c28
You can view details of this crash at http://crash-stats.mozilla.com/report/index/e9d68e5e-9ff6-11dd-b4da-001cc45a2c28?date=2008-10-22-05
Crash ID: bp-6e01ce7d-a0c3-11dd-be94-001cc45a2ce4
You can view details of this crash at http://crash-stats.mozilla.com/report/index/6e01ce7d-a0c3-11dd-be94-001cc45a2ce4?date=2008-10-23-05
The GUIDs in pending/ are generated client-side just to avoid name collisions. They're actually completely unrelated to the GUIDs that the server returns when you submit a report.
I am guessing that everything is good now?  Please re-open with the specific problem and information on what's not working if things are still broken.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → INVALID
Today, I used Ted's crash me now extension to crash Thunderbird version:

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b2pre) Gecko/20081028 Lightning/1.0pre Shredder/3.0b1pre

Using about:crashes, I then visited the crash:

bp-3ba268ca-a510-11dd-a5cf-001321b13766
bp-8e6f7d35-a515-11dd-9d61-001321b13766
bp-b489df6e-a376-11dd-8233-001cc45a2ce4

The last one was one Thunderbird generated a couple of days ago.

For all of these the results give no data apart from the time & version and crash comment.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
In each of those reports, the raw dumps are empty as well.  Are you sure you are actually submitting data in those dumps?  Ted may have more insight into what you could look for on the client side to make sure that stuff is working correctly on that end.
crash-reporter happily said it was submitting data. I may have also been getting the os x crash report dialog as well, but I can't be certain.

I can probably try again on that system tomorrow.
Look at the pending directory to see if there really is anything in the dmp file.  According to bug 427446, crash reporter happily tries to send when there really isn't anything to send.  I don't have my #breakpad irc conversation from this week - there may be another bug involved as well.
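A minimal sketch of the check suggested above, assuming a dump file copied out of the pending directory. The helper name describe_dump is hypothetical. The only format fact it relies on is that a well-formed minidump begins with the four bytes "MDMP"; it does not validate anything beyond that signature.

```python
import os

MINIDUMP_MAGIC = b"MDMP"  # first four bytes of a well-formed minidump file


def describe_dump(path):
    """Return a short status string for a .dmp file (hypothetical helper)."""
    size = os.path.getsize(path)
    if size == 0:
        # Nothing was written: the reporter would be submitting an empty dump.
        return "empty"
    with open(path, "rb") as f:
        magic = f.read(4)
    if magic != MINIDUMP_MAGIC:
        return "not a minidump (bad signature)"
    return "looks like a minidump (%d bytes)" % size
```

An "empty" result would point at the client side; a dump that passes this check still needs minidump_stackwalk to confirm that it is readable.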
Attached file Dump File
I just tried this again via the crash me now extension. The id is:

http://crash-stats.mozilla.com/report/index/7d972cc2-a79d-11dd-9fe3-001cc45a2c28

What really surprised me is that when I got to about:crashes and clicked the link (within about 2 minutes) it went straight to the "blank" page. Normally I have to wait a few minutes first.

Anyway, before submitting the report I took a copy of the dump and extra file, and I'm attaching them here.
Attached file Extra file
Per comment 0 and comment 8, gozer has already consulted with me, and the minidump files look fine. He's able to run the stackwalk tool on them locally with no problems, even with the corresponding symbols present. Clearly something is going wrong on the production server. If I had to guess, I'd say minidump_stackwalk is crashing there. This would lead to the empty raw dump in comment 15, since there would be no output from minidump_stackwalk.
Summary: crash-stats can't process Thunderbird 3 Aplha 3 OS X i386 crashes → crash-stats can't process Thunderbird 3 Alpha 3 OS X i386 crashes
Here is all the output from processing 7d972cc2-a79d-11dd-9fe3-001cc45a2c28

2008-10-31 15:45:56,310 DEBUG - MainThread - incomingJobStream yielding standard from job list: 7d972cc2-a79d-11dd-9fe3-001cc45a2c28
2008-10-31 15:45:56,310 DEBUG - MainThread - start got: 7d972cc2-a79d-11dd-9fe3-001cc45a2c28
2008-10-31 15:45:56,315 INFO - Thread-4 - analyzeHeader
2008-10-31 15:45:56,316 INFO - MainThread - queuing job 18160839, /mnt/socorro/crash_dumps/2008/10/31/22/bp_40/7d972cc2-a79d-11dd-9fe3-001cc45a2c28.json, 7d972cc2-a79d-11dd-9fe3-001cc45a2c28
2008-10-31 15:46:04,315 INFO - Thread-4 - starting job: 18160839, 7d972cc2-a79d-11dd-9fe3-001cc45a2c28
2008-10-31 15:46:04,430 INFO - Thread-4 - invoking: /home/processor/stackwalk/bin/stackwalk.sh -m /mnt/socorro/crash_dumps/2008/10/31/22/bp_40/7d972cc2-a79d-11dd-9fe3-001cc45a2c28.dump "/mnt/socorro/symbols/symbols_ffx" "/mnt/socorro/symbols/symbols_sea" "/mnt/socorro/symbols/symbols_tbrd" "/mnt/socorro/symbols/symbols_sbrd" "/mnt/socorro/symbols/symbols_os" 2>/dev/null
2008-10-31 15:46:04,474 INFO - Thread-3 - analyzeHeader
2008-10-31 15:46:04,509 INFO - Thread-4 - analyzeHeader
2008-10-31 15:46:04,521 INFO - Thread-4 - analyzeFrames
2008-10-31 15:46:04,532 INFO - Thread-4 - succeeded and committed: 18160839, 7d972cc2-a79d-11dd-9fe3-001cc45a2c28

There is no output on stdout/stderr of the processor for this uuid.  What other information can I provide to help troubleshoot this?
From the time stamps above it does look like stackwalk on the server seems to be crashing on these dumps.  We are not doing anything that unique to these crash dumps, so I am not sure how this would be a server side problem.  I suggest going through the staging and testing environments and trying to troubleshoot this problem there first.

Please get in touch with morgamic, lars, etc. for info on running this stuff in the dev/staging environments.
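The timestamp reading above can be made concrete with a small sketch that measures the gap between two processor log lines. It assumes every line starts with a 23-character "YYYY-MM-DD HH:MM:SS,mmm" stamp, as in the log excerpt; the function names are hypothetical and are not part of Socorro.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], LOG_TIME_FORMAT)


def elapsed_seconds(start_line, end_line):
    """Seconds between the timestamps of two log lines."""
    delta = parse_log_time(end_line) - parse_log_time(start_line)
    return delta.total_seconds()


# Lines taken from the log above (the command line is truncated here).
invoked = "2008-10-31 15:46:04,430 INFO - Thread-4 - invoking: /home/processor/stackwalk/bin/stackwalk.sh"
committed = "2008-10-31 15:46:04,532 INFO - Thread-4 - succeeded and committed: 18160839, 7d972cc2-a79d-11dd-9fe3-001cc45a2c28"
print(elapsed_seconds(invoked, committed))
```

For the two lines shown, the gap is about a tenth of a second, which is the interval the comment above is reading as stackwalk exiting early.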
Over the last few days I've been trying to test this again especially as we correct something in the comm-central build process. However the following crash-reports all continue to report "Please wait..." and go no further. The crash urls are:

http://crash-stats.mozilla.com/report/index/d216f5aa-94c8-443d-9887-6fb920081119
http://crash-stats.mozilla.com/report/index/f950b879-25b2-43d9-bef9-d66820081119
http://crash-stats.mozilla.com/report/index/a170bd2c-54d4-4278-833f-d3a112081122
http://crash-stats.mozilla.com/report/index/e8ff2181-ac24-42ba-afa0-b05082081123

The first two were submitted on the 19th, the third on the 22nd, the last one today.

Note that bug 463957 may be a similar thing seen on FF.
I believe there are at least two separate problems here.  In comments #1 - #22, we're talking about some sort of failure with breakpad-stackwalk.  Comment #23 sounds like a completely different problem.  

Focusing on the latter problem, there are at least a couple conditions that can cause this type of behavior.  

First, Socorro's file system or database queue may have been cleared after a successful submission.  To the best of my knowledge this has not happened recently, so we can ignore that for now.

Second, the crash may have been submitted with a defective form missing the "ProductName", "Version" or "BuildID" fields.  If any of these case-sensitive fields are missing or broken, then the crash never even gets a record in the database. This is a condition that I've been aware of for a while, but left at low priority because of a perceived low likelihood.  If this were to happen, the WebApp would wait forever for a report that never arrives.  I think that failing of the WebApp should also be addressed.
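The second condition can be sketched as a pre-submission check. This is an illustration only, not Socorro's collector code: the function name missing_fields is hypothetical, and treating an empty value as "broken" is an assumption.

```python
# Field names as given above; the lookup is case sensitive.
REQUIRED_FIELDS = ("ProductName", "Version", "BuildID")


def missing_fields(form):
    """Return the required field names that are absent or empty in form.

    A key spelled "productname" does not satisfy "ProductName".
    Treating an empty value as missing is an assumption of this sketch.
    """
    return [name for name in REQUIRED_FIELDS if not form.get(name)]
```

A submission for which this returns a non-empty list would, per the description above, never get a database record, so the report page would keep waiting.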

Aravind: could you look up those uuids from comment #23 to see if they are somewhere in the file system: standard, deferred or failed storage?

Do all crashes from Thunderbird have this same behavior?  I'd like to try this for myself:  I've installed Shredder and the crashmenow extension. How do I get the crash uuid that was returned by Socorro to Shredder so that I may look it up?
(In reply to comment #24)
> Do all crashes from Thunderbird have this same behavior?  I'd like to try this
> for myself:  I've installed Shredder and the crashmenow extension. How do I get
> the crash uuid that was returned by Socorro to Shredder so that I may look it
> up?

Go into preferences and on the general tab set about:crashes as the start page. Then exit preferences and select Mail Start Page under the Go menu.

FWIW I forgot to say I was having problems on Mac, someone else on Windows had no problems at all.
I'm also running on a Mac OS/X 10.5.5.  I've now crashed Shredder 3.0a3 six times and each time I got my report within 60 seconds.

f99aaf29-96c8-4ba2-a04d-7a70a2081123
1d5684ff-5c25-486a-9f13-9c6cf2081123
daaa454e-8b32-4cba-973d-2b23e2081123
db5971d8-0caa-45ea-b953-bb9862081123
38dcf60f-b84b-4bda-975e-6001c2081123
249cc20d-5d97-4af1-b190-73f642081123

Have you tried this since your initial report at 2008-11-23 10:26:59 PST?
I've just tried this again today, with Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b2pre) Gecko/20081125 Lightning/1.0pre Shredder/3.0b1pre.

It's been cycling on processing report for a few minutes. I did see this message for a while; it seems to have gone now (although it's still cycling). The fact that stackwalk failed doesn't sound good:

ID
    49ddb7fc-22c4-40f2-8779-cfb352081125
Time Queued
    2008-11-25 09:55:40.138695
Time Started
    2008-11-25 09:57:28.435164
Message
    <type 'instance'>:/home/processor/stackwalk/bin/stackwalk.sh failed with return code 1 when processing dump 49ddb7fc-22c4-40f2-8779-cfb352081125
Ted, could you take a look? Seems like this is a problem with stackwalk, not our infrastructure.  We could also escalate this to the breakpad project if necessary:
http://code.google.com/p/google-breakpad/issues/list

Moving to Socorro product, aravind can't do anything here.
Assignee: aravind → ted.mielczarek
Component: Server Operations → Socorro
Product: mozilla.org → Webtools
QA Contact: mrz → socorro
Per comment 0, gozer has tried running minidump_stackwalk locally on these minidumps, and it works fine there. It's possible that something was fixed in the Breakpad code since we last updated the production copy. We could try having Aravind pull the latest code from SVN and recompiling minidump_stackwalk.
Aravind, what is the current version of stackwalk in production?
Breakpad doesn't have actual version numbers, but if he can tell me what SVN revision it's built from, we can see how old it is.
(In reply to comment #30)
> Aravind, what is the current version of stackwalk in production?

(In reply to comment #31)
> Breakpad doesn't have actual version numbers, but if he can tell me what SVN
> revision it's built from, we can see how old it is.

Aravind: ping.
Sorry guys.  We are using

[processor@dm-breakpad01 google-breakpad]$ svn info
Path: .
URL: http://google-breakpad.googlecode.com/svn/trunk
Repository Root: http://google-breakpad.googlecode.com/svn
Repository UUID: 4c0a9323-5329-0410-9bdc-e9ce6186880e
Revision: 234
The only relevant checkin I can see after r234 is:
http://code.google.com/p/google-breakpad/source/detail?r=254

But I'd expect that those dumps would be broken anyway. Regardless, it should not be harmful to update to SVN tip of Breakpad and rebuild, and then use that minidump_stackwalk in production.
Aravind, can you try what Ted said in comment 34 please?
Assignee: ted.mielczarek → aravind
Component: Socorro → Server Operations
Product: Webtools → mozilla.org
QA Contact: socorro → mrz
sure, the latest version of stackwalk is now in production.

[processor@dm-breakpad02 google-breakpad]$ svn info
Path: .
URL: http://google-breakpad.googlecode.com/svn/trunk
Repository Root: http://google-breakpad.googlecode.com/svn
Repository UUID: 4c0a9323-5329-0410-9bdc-e9ce6186880e
Revision: 304
Node Kind: directory
Schedule: normal
Last Changed Author: nealsid
Last Changed Rev: 304
Last Changed Date: 2008-12-10 19:25:39 -0800 (Wed, 10 Dec 2008)
Status: REOPENED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Mark, gozer do you see any change?

http://crash-stats.mozilla.com/report/pending/27fe10b8-6a9b-4c5b-9617-000ed55daadc from comment 0 still has no response, but it may be failing for other reasons.


(In reply to comment #36)
> sure, the latest version of stackwalk is now in production

Aravind, starting as of your bug comment?  Or earlier, at "Last Changed Date: 2008-12-10"?
(In reply to comment #37)
> Mark, gozer do you see any change?
> 
> http://crash-stats.mozilla.com/report/pending/27fe10b8-6a9b-4c5b-9617-000ed55daadc
> from comment 0 still has no response, but it may be failing for other reasons.

Note that this will not have any effect on previously submitted reports. Once a report has been submitted, whether or not it was successfully processed, it's done. You should submit new reports to see if this fixes the issue.
I've just done a crash me now crash and submitted it. I've just seen the same error as per comment 27:

<type 'instance'>:/home/processor/stackwalk/bin/stackwalk.sh failed with return code 1 when processing dump 79b3fe2e-ef2d-4e21-b978-4a2422081213

Not sure where we go now, moving back to webtools for the time being.
Assignee: aravind → nobody
Status: RESOLVED → REOPENED
Component: Server Operations → Socorro
Product: mozilla.org → Webtools
QA Contact: mrz → socorro
Resolution: FIXED → ---
Aravind, Lars: is it possible to get the processor output from when it fails like this? Seems like it's probably crashing, but it's hard to say without seeing the output.
the stdout stream is captured and becomes the 'data' field in the dumps table.

the stderr stream is not captured (redirected to /dev/null) because of the volume of data that breakpad_stackwalk emits.  I could implement a way to capture and filter stderr.  How about if I capture any line that includes the word "ERROR" and save it in the reports 'processor_notes' column ('messages' in the current database schema).  If you'd like me to make this enhancement, let's start a separate bug for it.

alternatively, if we had a copy of the json and dump files, we could manually feed it to a test instance of Socorro (on khan, perhaps) to see if we get the same problem.  Since the invocation of minidump_stackwalk is actually a runtime configurable parameter, we could redirect the stderr output to a real file.  This would probably be a lot faster than the aforementioned enhancement idea.

BTW, I assume that it's only crashing in situ and seems to work fine when invoked manually.
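A rough sketch of the capture-and-filter idea described above, written against a current Python standard library rather than the processor's actual code. The function name run_and_filter_stderr and its keyword parameter are hypothetical; how the returned lines would be stored in 'processor_notes' is not shown.

```python
import subprocess


def run_and_filter_stderr(command, keyword="ERROR"):
    """Run command; return (return code, stdout text, matching stderr lines).

    stdout is kept whole, as it is today for the 'data' field. stderr is
    read in full and only the lines containing keyword are kept, so the
    bulk of the diagnostic output is discarded.
    """
    proc = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    notes = [line for line in proc.stderr.splitlines() if keyword in line]
    return proc.returncode, proc.stdout, notes
```

With something like this in place, a failure with return code 1 would come back with whatever ERROR lines the tool printed instead of nothing. If the process is killed by a signal, the return code is negative and the notes may be empty, which is itself informative.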
With the change to dwarf2 symbols for the builds, we seem to no longer be suffering from these problems - I've successfully submitted several crashes to crash stats, they have been processed successfully and we are getting other crashes reported that are starting to look good.

Therefore closing bug as this seems to have resolved itself.
Status: REOPENED → RESOLVED
Closed: 16 years ago
Resolution: --- → WORKSFORME
Summary: crash-stats can't process Thunderbird 3 Alpha 3 OS X i386 crashes → crash-stats can't process Thunderbird 3 Alpha 3 OS X i386 crashes - stackwalk.sh failed with return code 1
Whiteboard: get a build newer than 2008-12-24
See also bug 472775. I'm not REOPENING (yet?) because this is OSX and that other is W32.
(In reply to comment #43)
> See also bug 472775. I'm not REOPENING (yet?) because this is OSX and that
> other is W32.

Please do not reopen this for a non-Mac OS X issue (and even if that bug was on OS X I'm not sure it would be appropriate to reopen). The way the symbol sets are generated (and probably processed) is completely different, hence they are two distinct issues.
Component: Socorro → General
Product: Webtools → Socorro