Closed Bug 687888 Opened 13 years ago Closed 13 years ago

Missing symbols in crash reports for recent nightly builds

Categories

(mozilla.org Graveyard :: Server Operations, task)

Platform: x86, All
Type: task
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: alice0775, Assigned: nmaul)

Details

Build Identifier:
http://hg.mozilla.org/mozilla-central/rev/648d084ca28e
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0a1) Gecko/20110920 Firefox/9.0a1 ID:20110920030905

Crash Reporter always sends a signature of the form xul.dll@xxxxx or mozjs.dll@xxxx, as it would for an hourly build, instead of the exact signature.

Reproducible: Always

Steps to Reproduce:

1. Start browser
2. Crash browser 

Actual Results: 
  Always sends a signature of the form xul.dll@xxxxx or mozjs.dll@xxxx.

Expected Results: 
  Should send the exact signature.
Symbols were successfully uploaded for this build, so I suspect it's a problem on the Socorro side.
Component: Release Engineering → Socorro
Product: mozilla.org → Webtools
QA Contact: release → socorro
This has been happening since
http://hg.mozilla.org/mozilla-central/rev/5319b0100025
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0a1) Gecko/20110919 Firefox/9.0a1 ID:20110919030912
Can you paste a link to one of your crash reports?
Wrong signature on build ID 20110919030912:
bp-b9949de2-d944-4b48-b7b9-37c572110920
Wrong signature on build ID 20110920030905:
bp-e9e04be9-9837-4aaf-92b4-9800d2110920

The signatures above should match the following one.
Correct signature on build ID 20110918030911:
bp-94216f13-5f95-44f8-9abb-6bba62110920
Okay, I can reproduce the same issue using "Crash me now":
https://crash-stats.mozilla.com/report/index/bp-eaee3815-ce8d-4ff9-b066-601562110920

I've looked on the symbol store in SJC (dm-symbolpush01) and the symbols appear to be present, so perhaps there's an issue with the syncing to the PHX datastore.
The sync job is on dm-symbolpush01 in SJC, and it syncs to ip-admin01.phx in PHX1.

I can confirm that this appears to be working properly and hasn't emailed us any errors. Ted gave me a sample file to check:

SJC:
[root@dm-symbolpush01 ~]# md5sum /mnt/netapp/breakpad/symbols_ffx/mozjs.pdb/471CD365868F4341ACB60D2C33D9D9992/mozjs.sym 
57832686a69bf0d7b82f6b7e5929c6c0  /mnt/netapp/breakpad/symbols_ffx/mozjs.pdb/471CD365868F4341ACB60D2C33D9D9992/mozjs.sym

PHX:
[root@ip-admin01 ~]# md5sum /mnt/pio_symbols/symbols_ffx/mozjs.pdb/471CD365868F4341ACB60D2C33D9D9992/mozjs.sym
57832686a69bf0d7b82f6b7e5929c6c0  /mnt/pio_symbols/symbols_ffx/mozjs.pdb/471CD365868F4341ACB60D2C33D9D9992/mozjs.sym

So it exists on both sides, and the copies are identical.
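The manual md5sum comparison above could be scripted along these lines. This is only a sketch: `md5_of` and `compare_copy` are hypothetical helper names, and it assumes both symbol trees are readable from one machine. In this bug they live on different hosts, so in practice `md5_of` would be run on each side and the digests compared by hand.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copy(src_root: Path, dst_root: Path, relative: str) -> str:
    """Classify one file as 'missing-src', 'missing-dst', 'differs' or 'identical'."""
    src = src_root / relative
    dst = dst_root / relative
    if not src.is_file():
        return "missing-src"
    if not dst.is_file():
        return "missing-dst"
    return "identical" if md5_of(src) == md5_of(dst) else "differs"
```

For the sample file, the relative path would be mozjs.pdb/471CD365868F4341ACB60D2C33D9D9992/mozjs.sym, with /mnt/netapp/breakpad/symbols_ffx and /mnt/pio_symbols/symbols_ffx as the two roots.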
So either things aren't being synced quickly enough (which seems unlikely, since the script runs every 5 minutes and the files appear to be there), or something has gone wrong with our processor config and it can't see the symbols.
Severity: normal → blocker
I crashed my almost-month-out-of-date Linux build and it had symbols:
https://crash-stats.mozilla.com/report/index/bp-656a50ec-716e-4180-8fed-1e8322110920

I updated to today's nightly build and crashed again and no symbols:
https://crash-stats.mozilla.com/report/index/bp-106685f2-08b2-4892-bd13-84e1d2110920

It looks like the syncing script isn't working quickly enough, or something like that.
Assignee: nobody → server-ops
Component: Socorro → Server Operations
Product: Webtools → mozilla.org
QA Contact: socorro → cshields
Assignee: server-ops → nmaul
Assignee: nmaul → server-ops
It syncs from SJC to PHX every 5 minutes. Could you give more examples of symbol files that seem to be missing? The one from comment 6 is fine; are there others we can check?

Why do we believe this is a sync issue from SJC to PHX?

Dropping prio to avoid paging on-call so quickly.
Assignee: server-ops → nmaul
Severity: blocker → critical
Per comment 8, an older Linux nightly worked fine (which indicates that symbols aren't completely broken), but today's nightly had no symbols (which indicates that it's not just Windows nightlies). In that crash report, I checked that one of the symbol files was present on dm-symbolpush01 immediately after I viewed my crash report:
symbols_ffx/libxul.so/42037E50B4400C0C59F9F9F08F81174C0/libxul.so.sym

so the only thing I can imagine is that it's not getting synced to PHX properly. I just tried crashing that same build again and the symbols are still not showing up:
https://crash-stats.mozilla.com/report/index/16ad86be-9c20-4791-a3cb-2476c2110920

Is that symbol file above present in PHX?
Summary: Nightly9.0a1 ID:20110920030905 , Something wrong in crash report Signature → Missing symbols in crash reports for recent nightly builds
The sync process doesn't appear to be picking everything up. For example, the symbols for Ted's crash:

In SJC:
in /mnt/pio_symbols/symbols_ffx/mozjs.pdb
-rw-r--r-- 1 ffxbld users 5119934 Sep 20 11:37 mozjs.pdb/9C6EB556CE72441BAAAED9531D1FDFCA2/mozjs.sym

In PHX:
This file still doesn't exist 24 hours later. I looked in
/mnt/socorro/symbols/symbols_ffx/mozjs.pdb
How can we verify the processor config? Here is what is in /etc/socorro/common.conf :

export processorSymbolsPathnameList="/mnt/socorro/symbols/symbols_ffx,/mnt/socorro/symbols/symbols_sea,/mnt/socorro/symbols/symbols_tbrd,/mnt/socorro/symbols/symbols_mob,/mnt/socorro/symbols/symbols_penelope,/mnt/socorro/symbols/symbols_sbrd,/mnt/socorro/symbols/symbols_camino,/mnt/socorro/symbols/symbols_os,/mnt/socorro/symbols/symbols_solaris,/mnt/socorro/symbols/symbols_opensuse,/mnt/socorro/symbols/symbols_ubuntu,/mnt/socorro/symbols/symbols_fedora"

Can someone verify if that is correct, or if any other config values should be verified?
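One way to sanity-check a path list like the one above is to look for a known symbol file under each configured root. The sketch below is illustrative only: the function names are hypothetical, the `<debug file>/<debug id>/<name>.sym` layout is inferred from the paths quoted in this bug, and whether the Socorro processor searches the roots in exactly this order is an assumption.

```python
import os
from typing import List, Optional


def sym_relative_path(debug_file: str, debug_id: str) -> str:
    """Build '<debug_file>/<debug_id>/<name>.sym', dropping a trailing '.pdb'."""
    if debug_file.lower().endswith(".pdb"):
        name = debug_file[:-4]
    else:
        name = debug_file
    return f"{debug_file}/{debug_id}/{name}.sym"


def parse_path_list(value: str) -> List[str]:
    """Split a comma-separated list such as processorSymbolsPathnameList."""
    return [part.strip() for part in value.split(",") if part.strip()]


def find_symbol(roots: List[str], debug_file: str, debug_id: str) -> Optional[str]:
    """Return the first existing symbol file under the given roots, or None."""
    parts = sym_relative_path(debug_file, debug_id).split("/")
    for root in roots:
        candidate = os.path.join(root, *parts)
        if os.path.isfile(candidate):
            return candidate
    return None
```

Calling `find_symbol(parse_path_list(value), "mozjs.pdb", "9C6EB556CE72441BAAAED9531D1FDFCA2")` on a processor host would return None if the file never reached the store that host reads from, even when the config itself is fine.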
With fresh eyes, the problem is obvious. It looks like the box that SJC syncs to in PHX has had the NFS share unmounted. This happened on Sunday, just minutes before the commit in comment 2. The server was rebooted, and this mount point was not set in /etc/fstab or puppet.

I've added it to puppet, and it's mounted up properly now.
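A check like the following could flag this kind of gap before a reboot exposes it. It is a minimal sketch, not what was deployed here: `mount_points` and `audit_mounts` are hypothetical names, and it only compares the mount-point column of fstab-style text against the currently mounted filesystems (for example the contents of /proc/mounts on Linux).

```python
from typing import Dict, Iterable, Set


def mount_points(table_text: str) -> Set[str]:
    """Collect the second field (mount point) from fstab- or /proc/mounts-style text."""
    points = set()
    for line in table_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 2:
            points.add(fields[1])
    return points


def audit_mounts(required: Iterable[str], fstab_text: str,
                 mounts_text: str) -> Dict[str, Dict[str, bool]]:
    """For each required mount point, report whether it is persisted and mounted."""
    configured = mount_points(fstab_text)
    active = mount_points(mounts_text)
    return {
        path: {"in_fstab": path in configured, "mounted": path in active}
        for path in required
    }
```

A mount that shows `mounted` but not `in_fstab` works today and disappears at the next reboot, which matches what happened to the symbols share in this bug.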

I don't know whether the incremental job will be sufficient to fill in the gaps; we may need to re-run the complete job. That usually runs Sunday at 3am and takes 5 hours. I suspect it will take a bit longer during the day, and it will prevent the incremental sync from running while it's in progress.

The incremental sync is running now, just in case that is actually sufficient. If it seems not, I'll start the full sync right away.
The incremental sync doesn't seem to be getting the job done. I've started a full sync job, which should take 4-6 hours to complete. Once that's done, I believe things will be back to normal.
And for the record, bug 688186 covers replacing dm-symbolpush01 (which lives in SJC) with an equivalent upload server in PHX so that we can stop this syncing process.
The full sync completed, and the incrementals have been running since. The link in comment 10 still doesn't show symbols, but is that normal? I don't know if it pulls symbols on every page hit, or just during the actual crash.

Would someone be able to verify if this is working properly again?
I think this is working again:

https://crash-stats.mozilla.com/report/index/0f3361e6-f566-4605-9c3e-fd4132110922

I'm guessing it's expected behavior for crash-stats to not go back and fill in the data on reports generated during the problem interval. In any case I don't think there's anything more I can do here. Closing this out.

If anyone knows how we might easily populate the symbols for crashes that have already been reported, please let me know.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
I think it is possible to re-process those crashes. I'd guess we'd need to redo all crashes between 20:20 Sunday and now. I think in the past lars and jberkus have cooked up a way to tell postgres to re-queue these for processing. CC'ing them.
Note that it's Nightly (9.0a1) and Aurora (8.0a2) that need reprocessing for this timeframe. Reprocessing up to somewhere between midnight and morning Pacific today should be enough (I'm not sure exactly when it was fixed for new reports).
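As a rough aid for picking candidates, the crash IDs quoted in this bug all end in six digits that look like a YYMMDD date (for example ...110920 and ...110922). Assuming that pattern holds, a list of IDs could be narrowed to the affected days as sketched below. The function names and the example dates are illustrative assumptions; the actual re-queue mechanism in postgres is not shown in this bug, and the product and version filter would still need to be applied separately.

```python
from datetime import date, datetime
from typing import Iterable, List


def crash_id_date(crash_id: str) -> date:
    """Read the trailing six digits of a crash ID as a YYMMDD date (assumed format)."""
    return datetime.strptime(crash_id[-6:], "%y%m%d").date()


def ids_in_window(crash_ids: Iterable[str], first: date, last: date) -> List[str]:
    """Keep the crash IDs whose date suffix falls within [first, last], inclusive."""
    return [cid for cid in crash_ids if first <= crash_id_date(cid) <= last]
```

The date suffix only has day granularity, so a window that starts at 20:20 on a given day would still need a time-based filter on top of this.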
(In reply to Jake Maul [:jakem] from comment #17)
> I'm guessing it's expected behavior for crash-stats to not go back and fill
> in the data on reports generated during the problem interval. In any case I
> don't think there's anything more I can do here. Closing this out.

This is correct, FWIW. We only use the symbols at the time of processing. We can re-process crashes to pick up new symbols, but it doesn't happen automatically.
Product: mozilla.org → mozilla.org Graveyard