Closed Bug 561754 Opened 14 years ago Closed 11 years ago

Don't download symbols for test runs, pass symbol zip URL as symbols path

Categories

(Release Engineering :: General, defect, P5)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: catlee)

References

Details

(Whiteboard: [unittest][buildfaster:p2][waiting-for-deps])

Attachments

(7 files, 1 obsolete file)

Rather than downloading symbols for each test run, or even downloading symbols only when we crash, we should send minidumps and other useful information from test run crashes to a server in the same location as the symbols to avoid downloading them across any large distance.
Specifically, I think that we should have a symbol-resolver.mozilla.org server, ideally somewhere in a colo that's very close to the build machines.  It should have scripts which encapsulate the two things we do with that symbol data today:

minidump-to-stack-trace
symbolicate-assertion-stack

When the test machines hit a crash and produce a minidump, they would do

ssh symbol-resolver minidump-to-stacktrace < minidump >> log-file

similarly with an assertion stack and symbolicate-assertion-stack.  This would reduce the per-crash bandwidth overhead to less than a meg, and the normal (no crash or assertion) bandwidth overhead to zero.

The logic for those scripts already exists, so it's mostly a refactoring of them and their tests to handle being standalone, plus making sure that the symbols are where symbol-resolver.mozilla.org can find them (eagerly or on demand, I dunno if it matters).
The "process crashes on the slave" setup was a quick hack to get stacks on unittests/Talos working at all. Note that we can process crashes from any OS on a single server (like we do on the production Socorro setup). We will have to do this synchronously though, so that the stack trace can be in the build log.

The minidump processing itself is pretty trivial, as you probably know, it just involves running minidump_stackwalk /path/to/dump /path/to/symbols.

If we were to send the minidump + the URL to the symbols to a server, you could write a dead-simple script that fetched the symbols zip, unpacked it, and ran minidump_stackwalk on the dump. If that server were in the same colo as stage, I think it'd be pretty quick. (Plus, as you note, we would not have to download symbols for test runs where we don't crash.)
Could we just do this on the existing breakpad server? (dm-symbolpush01)
(In reply to comment #1)
> symbolicate-assertion-stack

This is more of a pain, as the existing scripts are platform-specific, so you'd have to have one server per platform to make it work properly. An alternative would be to simply write a new script that knows how to read Breakpad .sym files (whose file format is pretty trivial), and make it able to symbolize assertion stack traces.

http://code.google.com/p/google-breakpad/wiki/SymbolFiles
(In reply to comment #3)
> Could we just do this on the existing breakpad server? (dm-symbolpush01)

Maybe, but let's not mix symbols from non-nightly builds in with the nightly symbols, since they're two completely separate things. That server exists purely to provide a place to upload symbols to the NFS mount that Socorro uses, so it might make sense to use something different so we don't mix up two independent systems.
If all the symbols were in a flat symbol-store type arrangement, it would be relatively simple to have a crash server where you PUT a minidump and get back the minidump-stackwalk output.

We could do this today with the nightlies/releases, although I wouldn't want use dm-symbolpush01 because I worry about processor load.

The question is how to maintain the depend build symbols so that this server can see them, but they don't eat up huge amounts of disk space until we prune the "old" hourlies. Certainly solvable, but not trivial.
I don't think it'd be too bad. We could:
1) Have post-upload.py unpack the symbols somewhere (either by copying them to another server, or just unpack them to a NFS mount)
2) Ensure that cleanup-breakpad-symbols.py will handle hourly builds properly:
http://hg.mozilla.org/build/tools/file/1078c5659590/buildfarm/breakpad/cleanup-breakpad-symbols.py
3) Run cleanup-breakpad-symbols.py aggressively to keep only a small number of symbols available (< 24 hours per branch?)
If we're only storing the symbols in one place, such as on symbol-resolver, I bet we can afford to spend a fair bit of disk in exchange for easier maintenance.
Yeah, it just adds up quickly when you look at our build configurations x supported branches. There's not much value in saving the symbols for "hourly" builds since the tests are run on them immediately after they're built anyway. The cleanup script is already in use on the Socorro symbol store, so I don't think there's really much work there to make that all do what I said.
This would make it a lot easier to have symbols for all the system libraries present on our Talos machines as well, since we'd just need to have them in one place. (bug 528231)
Blocks: 528231
I think this needs:

- A machine (VM would likely be fine) set up with ssh access from all our build/tests machines

- A script that, given a symbols URL and a minidump (via stdin?), outputs the stack trace

Bonus points for an HTTP service that could be used for machines that don't have ssh.

Any volunteers?
Priority: -- → P3
Whiteboard: [talos][unittest]
I can pretty easily write a trivial CGI that takes a minidump (via POST?) and a symbols URL and invokes minidump_stackwalk to return a stack. We could then simply invoke it via curl from the commandline.
Awesome!  File a dep bug and name that tune!
(In reply to comment #12)
> I can pretty easily write a trivial CGI that takes a minidump (via POST?) and a
> symbols URL and invokes minidump_stackwalk to return a stack. We could then
> simply invoke it via curl from the commandline.

Then you'll win all the bonus points!
Where are my bonus points?
Blocks: 561235
No longer blocks: 561235
Depends on: 561235
Assignee: nobody → catlee
Priority: P3 → P2
Whiteboard: [talos][unittest] → [talos][unittest][buildfaster:p1]
Depends on: 669947
(In reply to comment #4)
> (In reply to comment #1)
> > symbolicate-assertion-stack
> 
> This is more of a pain, as the existing scripts are platform-specific, so
> you'd have to have one server per platform to make it work properly. An
> alternative would be to simply write a new script that knows how to read
> Breakpad .sym files (whose file format is pretty trivial), and make it able
> to symbolize assertion stack traces.

We have tools/rb/fix_stack_using_bpsyms.py now :)

We're using it for Mac and Linux, and we'll use it for Windows too as soon as bug 575188 is fixed.
It sounds like the server will only be reachable from the build farm (bug 669947), so I'd like the symbols to remain available on FTP.
(In reply to comment #17)
> It sounds like the server will only be reachable from the build farm (bug
> 669947), so I'd like the symbols to remain available on FTP.

That won't change. The build slaves will stop downloading/unpacking the symbols unconditionally from FTP and instead submit the symbols url to the minidump CGI for remote processing.
To make this work for unittest jobs, you'll need to do the following:
1. Set MINIDUMP_STACKWALK_CGI to the URL to the minidump-stackwalk-cgi script
2. Unset MINIDUMP_STACKWALK (so the test harness knows to use the CGI)
3. Change calls to unittest harnesses to pass the URL to the symbols.zip in --symbols-path instead of a local path to the unzipped symbols
Attachment #551858 - Flags: review?(bhearsum)
Attached patch stackwalk cgi settings (obsolete) — Splinter Review
Attachment #551870 - Flags: review?(bhearsum)
Comment on attachment 551870 [details] [diff] [review]
stackwalk cgi settings

Review of attachment 551870 [details] [diff] [review]:
-----------------------------------------------------------------

Shouldn't we be using this in staging too, for parity?
Attachment #551858 - Flags: review?(bhearsum) → review+
yes!
Attachment #551870 - Attachment is obsolete: true
Attachment #552196 - Flags: review?(bhearsum)
Attachment #551870 - Flags: review?(bhearsum)
Attachment #552196 - Flags: review?(bhearsum) → review+
Attachment #551858 - Flags: checked-in+
Attachment #552196 - Flags: checked-in+
select avg(e) from (select builds.id, sum(unix_timestamp(steps.endtime) - unix_timestamp(steps.starttime)) as e from builds, steps, builders where builds.builder_id = builders.id and steps.build_id = builds.id and builds.starttime >= '2011-08-08' and builds.starttime <= '2011-08-09' and steps.name in ('download_symbols', 'unpack_symbols') group by builds.id) as s;
+---------+
| avg(e)  |
+---------+
| 63.8668 | 
+---------+

so we're averaging 63 seconds per test job downloading and unpacking symbols.
Not bad. Plus the reduction in network traffic might help things out as well.
I busted win64 tests because they don't currently download symbols, so symbols_url never gets set.
Attachment #552371 - Flags: review?(bhearsum)
Attachment #552371 - Flags: review?(bhearsum) → review+
Attachment #552371 - Flags: checked-in+
turn this off since we're killing build.mozilla.org
Attachment #552699 - Flags: review?(bhearsum)
Attachment #552699 - Flags: review?(bhearsum) → review+
Attachment #552699 - Flags: checked-in+
We're going to take a different approach here: bug 679759. We're going to take the CGI out of the mix, and just have the test harnesses download the symbols when they need to process a crash. That way in the normal case (no crashes), we should get the benefit of not having to download symbols, and when we have a crash, they just get processed locally like they currently do.
Depends on: 679759
Whiteboard: [talos][unittest][buildfaster:p1] → [talos][unittest][buildfaster:p?]
waiting for deps
OS: Mac OS X → All
Priority: P2 → P5
Whiteboard: [talos][unittest][buildfaster:p?] → [talos][unittest][buildfaster:p2]
bug 679759 has been implemented for the unittest suites. We should be able to stop download symbols for those runs, and just pass the URL to the symbol zip as --symbols-path. We'll still need to implement that same logic for Talos.

Also, we'll still need to download symbols for debug tests, since we rely on them to fix stack traces from assertions.
No longer blocks: 528231
Summary: send crash minidumps from test runs to somewhere to be processed → Don't download symbols for test runs, pass symbol zip URL as symbols path
Assignee: catlee → nobody
going to start actively working on this again, hoping to reduce i/o load on the netapp by reducing web hits.
Assignee: nobody → catlee
Priority: P5 → P2
Latest patches seem to work well on linux at least. Killing firefox with -SEGV results in this output:

<snip />

TEST-UNEXPECTED-FAIL | /tests/MochiKit-1.4.2/tests/test_MochiKit-Style.html | Exited with code 1 during test run
INFO | automation.py | Application ran for: 0:00:14.403975
INFO | automation.py | Reading PID log: /tmp/tmpNistyMpidlog
Downloading symbols from: http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux/1330551909/firefox-13.0a1.en-US.linux-i686.crashreporter-symbols.zip
PROCESS-CRASH | /tests/MochiKit-1.4.2/tests/test_MochiKit-Style.html | application crashed (minidump found)
Crash dump filename: /tmp/tmpqR7JTt/minidumps/49d2146b-2db5-a562-18529d74-25e71e45.dmp
Operating system: Linux
                  0.0.0 Linux 2.6.31.5-127.fc12.i686.PAE #1 SMP Sat Nov 7 21:25:57 EST 2009 i686
CPU: x86
     GenuineIntel family 6 model 23 stepping 10
     2 CPUs

Crash reason:  SIGSEGV
Crash address: 0x818

Thread 0 (crashed)
 0  libnspr4.so + 0x7ee4
    eip = 0x004caee4   esp = 0xbfe91dbc   ebp = 0xbfe91dd8   ebx = 0x004f352c
    esi = 0x00000002   edi = 0xbfe92008   eax = 0x00000001   ecx = 0xb7626600
    edx = 0xbfe92008   efl = 0x00200202
    Found by: given as instruction pointer in context
<snip />
This makes the default behaviour of desktop unittest to pass the symbols url to the test harness rather than downloading/unpacking the symbols.

It leaves remote unittest (aka tegras) alone, since I don't think they support this yet.
Attachment #601808 - Flags: review?(nrthomas)
mozilla-1.9.2 doesn't support downloading symbols on demand, so this patch preserves the existing behaviour of downloading/unpacking the symbols for just the mozilla-1.9.2 branch.
Attachment #601809 - Flags: review?(nrthomas)
Attachment #601808 - Flags: review?(nrthomas) → review+
Attachment #601809 - Flags: review?(nrthomas) → review+
Attachment #601809 - Flags: checked-in+
Attachment #601808 - Flags: checked-in+
Your recent landed changes went live around 7:30 AM PDT.
given bug 729852, and the fact that jmaher thinks this can work, I think we should go ahead and do this for tegras.
Attachment #601992 - Flags: review?(bear)
Attachment #601992 - Flags: review?(bear) → review+
Attachment #601992 - Flags: checked-in+
All done here for unittests. Leaving open for enabling this for talos once the harness supports it.
Assignee: catlee → nobody
Component: Release Engineering → Release Engineering: Automation
Flags: checked-in+
QA Contact: release → catlee
Comment on attachment 601992 [details] [diff] [review]
enable symbol downloading on demand for tegras

Restoring checked-in flags after component move ....
Attachment #601992 - Flags: checked-in+
Attachment #601808 - Flags: checked-in+
Attachment #601809 - Flags: checked-in+
Attachment #552699 - Flags: checked-in+
Attachment #552371 - Flags: checked-in+
Attachment #552196 - Flags: checked-in+
Attachment #551858 - Flags: checked-in+
This made it to production today.
Whiteboard: [talos][unittest][buildfaster:p2] → [talos][unittest][buildfaster:p2][waiting-for-deps]
(In reply to Chris AtLee [:catlee] from comment #32)
> Latest patches seem to work well on linux at least. Killing firefox with
> -SEGV results in this output:
> 
> <snip />
 
> Thread 0 (crashed)
>  0  libnspr4.so + 0x7ee4
>     eip = 0x004caee4   esp = 0xbfe91dbc   ebp = 0xbfe91dd8   ebx = 0x004f352c
>     esi = 0x00000002   edi = 0xbfe92008   eax = 0x00000001   ecx = 0xb7626600
>     edx = 0xbfe92008   efl = 0x00200202
>     Found by: given as instruction pointer in context
> <snip />

This doesn't actually look like you have symbols. You should have a function name there in the " 0  libnspr4.so + 0x7ee4" line, not just the library name + offset.
(In reply to Ted Mielczarek [:ted] (away until ~March 7) from comment #40)
> (In reply to Chris AtLee [:catlee] from comment #32)
> > Latest patches seem to work well on linux at least. Killing firefox with
> > -SEGV results in this output:
> > 
> > <snip />
>  
> > Thread 0 (crashed)
> >  0  libnspr4.so + 0x7ee4
> >     eip = 0x004caee4   esp = 0xbfe91dbc   ebp = 0xbfe91dd8   ebx = 0x004f352c
> >     esi = 0x00000002   edi = 0xbfe92008   eax = 0x00000001   ecx = 0xb7626600
> >     edx = 0xbfe92008   efl = 0x00200202
> >     Found by: given as instruction pointer in context
> > <snip />
> 
> This doesn't actually look like you have symbols. You should have a function
> name there in the " 0  libnspr4.so + 0x7ee4" line, not just the library name
> + offset.

That would be a problem with the test harness then? It looks like it's downloading the symbols properly.
Priority: P2 → P5
Depends on: 755781
(In reply to Chris AtLee [:catlee] from comment #37)
> All done here for unittests. Leaving open for enabling this for talos once
> the harness supports it.

The Talos harness gained support for this in bug 824984, so this is fixable now.
Depends on: 851276
Product: mozilla.org → Release Engineering
How is this different than bug 851276? Can we dupe in one direction or the other here?
We should just mark this one RESOLVED FIXED since it fixed the unittest case, and deal with the Talos case over there. (I hate bugs that fix part of the issue but stay open.)
Assignee: nobody → catlee
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [talos][unittest][buildfaster:p2][waiting-for-deps] → [unittest][buildfaster:p2][waiting-for-deps]
Blocks: 851276
No longer depends on: 851276
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: