Closed
Bug 561754
Opened 15 years ago
Closed 11 years ago
Don't download symbols for test runs, pass symbol zip URL as symbols path
Categories
(Release Engineering :: General, defect, P5)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: catlee)
References
Details
(Whiteboard: [unittest][buildfaster:p2][waiting-for-deps])
Attachments
(7 files, 1 obsolete file)
16.37 KB,
patch
|
bhearsum
:
review+
nthomas
:
checked-in+
|
Details | Diff | Splinter Review |
2.20 KB,
patch
|
bhearsum
:
review+
nthomas
:
checked-in+
|
Details | Diff | Splinter Review |
1.72 KB,
patch
|
bhearsum
:
review+
nthomas
:
checked-in+
|
Details | Diff | Splinter Review |
1.59 KB,
patch
|
bhearsum
:
review+
nthomas
:
checked-in+
|
Details | Diff | Splinter Review |
12.58 KB,
patch
|
nthomas
:
review+
nthomas
:
checked-in+
|
Details | Diff | Splinter Review |
822 bytes,
patch
|
nthomas
:
review+
nthomas
:
checked-in+
|
Details | Diff | Splinter Review |
1.08 KB,
patch
|
bear
:
review+
nthomas
:
checked-in+
|
Details | Diff | Splinter Review |
Rather than downloading symbols for each test run, or even downloading symbols only when we crash, we should send minidumps and other useful information from test run crashes to a server in the same location as the symbols to avoid downloading them across any large distance.
Comment 1•15 years ago
|
||
Specifically, I think that we should have a symbol-resolver.mozilla.org server, ideally somewhere in a colo that's very close to the build machines. It should have scripts which encapsulate the two things we do with that symbol data today:
minidump-to-stack-trace
symbolicate-assertion-stack
When the test machines hit a crash and produce a minidump, they would do
ssh symbol-resolver minidump-to-stacktrace < minidump >> log-file
similarly with an assertion stack and symbolicate-assertion-stack. This would reduce the per-crash bandwidth overhead to less than a meg, and the normal (no crash or assertion) bandwidth overhead to zero.
The logic for those scripts already exists, so it's mostly a refactoring of them and their tests to handle being standalone, plus making sure that the symbols are where symbol-resolver.mozilla.org can find them (eagerly or on demand, I dunno if it matters).
Comment 2•15 years ago
|
||
The "process crashes on the slave" setup was a quick hack to get stacks on unittests/Talos working at all. Note that we can process crashes from any OS on a single server (like we do on the production Socorro setup). We will have to do this synchronously though, so that the stack trace can be in the build log.
The minidump processing itself is pretty trivial, as you probably know, it just involves running minidump_stackwalk /path/to/dump /path/to/symbols.
If we were to send the minidump + the URL to the symbols to a server, you could write a dead-simple script that fetched the symbols zip, unpacked it, and ran minidump_stackwalk on the dump. If that server were in the same colo as stage, I think it'd be pretty quick. (Plus, as you note, we would not have to download symbols for test runs where we don't crash.)
Reporter | ||
Comment 3•15 years ago
|
||
Could we just do this on the existing breakpad server? (dm-symbolpush01)
Comment 4•15 years ago
|
||
(In reply to comment #1)
> symbolicate-assertion-stack
This is more of a pain, as the existing scripts are platform-specific, so you'd have to have one server per platform to make it work properly. An alternative would be to simply write a new script that knows how to read Breakpad .sym files (whose file format is pretty trivial), and make it able to symbolize assertion stack traces.
http://code.google.com/p/google-breakpad/wiki/SymbolFiles
Comment 5•15 years ago
|
||
(In reply to comment #3)
> Could we just do this on the existing breakpad server? (dm-symbolpush01)
Maybe, but let's not mix symbols from non-nightly builds in with the nightly symbols, since they're two completely separate things. That server exists purely to provide a place to upload symbols to the NFS mount that Socorro uses, so it might make sense to use something different so we don't mix up two independent systems.
Comment 6•15 years ago
|
||
If all the symbols were in a flat symbol-store type arrangement, it would be relatively simple to have a crash server where you PUT a minidump and get back the minidump-stackwalk output.
We could do this today with the nightlies/releases, although I wouldn't want use dm-symbolpush01 because I worry about processor load.
The question is how to maintain the depend build symbols so that this server can see them, but they don't eat up huge amounts of disk space until we prune the "old" hourlies. Certainly solvable, but not trivial.
Comment 7•15 years ago
|
||
I don't think it'd be too bad. We could:
1) Have post-upload.py unpack the symbols somewhere (either by copying them to another server, or just unpack them to a NFS mount)
2) Ensure that cleanup-breakpad-symbols.py will handle hourly builds properly:
http://hg.mozilla.org/build/tools/file/1078c5659590/buildfarm/breakpad/cleanup-breakpad-symbols.py
3) Run cleanup-breakpad-symbols.py aggressively to keep only a small number of symbols available (< 24 hours per branch?)
Comment 8•15 years ago
|
||
If we're only storing the symbols in one place, such as on symbol-resolver, I bet we can afford to spend a fair bit of disk in exchange for easier maintenance.
Comment 9•15 years ago
|
||
Yeah, it just adds up quickly when you look at our build configurations x supported branches. There's not much value in saving the symbols for "hourly" builds since the tests are run on them immediately after they're built anyway. The cleanup script is already in use on the Socorro symbol store, so I don't think there's really much work there to make that all do what I said.
Comment 10•15 years ago
|
||
This would make it a lot easier to have symbols for all the system libraries present on our Talos machines as well, since we'd just need to have them in one place. (bug 528231)
Blocks: 528231
Assignee | ||
Comment 11•15 years ago
|
||
I think this needs:
- A machine (VM would likely be fine) set up with ssh access from all our build/tests machines
- A script that, given a symbols URL and a minidump (via stdin?), outputs the stack trace
Bonus points for an HTTP service that could be used for machines that don't have ssh.
Any volunteers?
Priority: -- → P3
Whiteboard: [talos][unittest]
Comment 12•15 years ago
|
||
I can pretty easily write a trivial CGI that takes a minidump (via POST?) and a symbols URL and invokes minidump_stackwalk to return a stack. We could then simply invoke it via curl from the commandline.
Comment 13•15 years ago
|
||
Awesome! File a dep bug and name that tune!
Assignee | ||
Comment 14•15 years ago
|
||
(In reply to comment #12)
> I can pretty easily write a trivial CGI that takes a minidump (via POST?) and a
> symbols URL and invokes minidump_stackwalk to return a stack. We could then
> simply invoke it via curl from the commandline.
Then you'll win all the bonus points!
Comment 15•15 years ago
|
||
Where are my bonus points?
Assignee | ||
Updated•15 years ago
|
Assignee | ||
Updated•13 years ago
|
Assignee: nobody → catlee
Priority: P3 → P2
Whiteboard: [talos][unittest] → [talos][unittest][buildfaster:p1]
Comment 16•13 years ago
|
||
(In reply to comment #4)
> (In reply to comment #1)
> > symbolicate-assertion-stack
>
> This is more of a pain, as the existing scripts are platform-specific, so
> you'd have to have one server per platform to make it work properly. An
> alternative would be to simply write a new script that knows how to read
> Breakpad .sym files (whose file format is pretty trivial), and make it able
> to symbolize assertion stack traces.
We have tools/rb/fix_stack_using_bpsyms.py now :)
We're using it for Mac and Linux, and we'll use it for Windows too as soon as bug 575188 is fixed.
Comment 17•13 years ago
|
||
It sounds like the server will only be reachable from the build farm (bug 669947), so I'd like the symbols to remain available on FTP.
Assignee | ||
Comment 18•13 years ago
|
||
(In reply to comment #17)
> It sounds like the server will only be reachable from the build farm (bug
> 669947), so I'd like the symbols to remain available on FTP.
That won't change. The build slaves will stop downloading/unpacking the symbols unconditionally from FTP and instead submit the symbols url to the minidump CGI for remote processing.
Comment 19•13 years ago
|
||
To make this work for unittest jobs, you'll need to do the following:
1. Set MINIDUMP_STACKWALK_CGI to the URL to the minidump-stackwalk-cgi script
2. Unset MINIDUMP_STACKWALK (so the test harness knows to use the CGI)
3. Change calls to unittest harnesses to pass the URL to the symbols.zip in --symbols-path instead of a local path to the unzipped symbols
Assignee | ||
Comment 20•13 years ago
|
||
Attachment #551858 -
Flags: review?(bhearsum)
Assignee | ||
Comment 21•13 years ago
|
||
Attachment #551870 -
Flags: review?(bhearsum)
Reporter | ||
Comment 22•13 years ago
|
||
Comment on attachment 551870 [details] [diff] [review]
stackwalk cgi settings
Review of attachment 551870 [details] [diff] [review]:
-----------------------------------------------------------------
Shouldn't we be using this in staging too, for parity?
Reporter | ||
Updated•13 years ago
|
Attachment #551858 -
Flags: review?(bhearsum) → review+
Assignee | ||
Comment 23•13 years ago
|
||
yes!
Attachment #551870 -
Attachment is obsolete: true
Attachment #552196 -
Flags: review?(bhearsum)
Attachment #551870 -
Flags: review?(bhearsum)
Reporter | ||
Updated•13 years ago
|
Attachment #552196 -
Flags: review?(bhearsum) → review+
Assignee | ||
Updated•13 years ago
|
Attachment #551858 -
Flags: checked-in+
Assignee | ||
Updated•13 years ago
|
Attachment #552196 -
Flags: checked-in+
Assignee | ||
Comment 24•13 years ago
|
||
select avg(e) from (select builds.id, sum(unix_timestamp(steps.endtime) - unix_timestamp(steps.starttime)) as e from builds, steps, builders where builds.builder_id = builders.id and steps.build_id = builds.id and builds.starttime >= '2011-08-08' and builds.starttime <= '2011-08-09' and steps.name in ('download_symbols', 'unpack_symbols') group by builds.id) as s;
+---------+
| avg(e) |
+---------+
| 63.8668 |
+---------+
so we're averaging 63 seconds per test job downloading and unpacking symbols.
Comment 25•13 years ago
|
||
Not bad. Plus the reduction in network traffic might help things out as well.
Assignee | ||
Comment 26•13 years ago
|
||
I busted win64 tests because they don't currently download symbols, so symbols_url never gets set.
Attachment #552371 -
Flags: review?(bhearsum)
Reporter | ||
Updated•13 years ago
|
Attachment #552371 -
Flags: review?(bhearsum) → review+
Assignee | ||
Updated•13 years ago
|
Attachment #552371 -
Flags: checked-in+
Assignee | ||
Comment 27•13 years ago
|
||
turn this off since we're killing build.mozilla.org
Attachment #552699 -
Flags: review?(bhearsum)
Reporter | ||
Updated•13 years ago
|
Attachment #552699 -
Flags: review?(bhearsum) → review+
Assignee | ||
Updated•13 years ago
|
Attachment #552699 -
Flags: checked-in+
Comment 28•13 years ago
|
||
We're going to take a different approach here: bug 679759. We're going to take the CGI out of the mix, and just have the test harnesses download the symbols when they need to process a crash. That way in the normal case (no crashes), we should get the benefit of not having to download symbols, and when we have a crash, they just get processed locally like they currently do.
Depends on: 679759
Updated•13 years ago
|
Whiteboard: [talos][unittest][buildfaster:p1] → [talos][unittest][buildfaster:p?]
Assignee | ||
Comment 29•13 years ago
|
||
waiting for deps
OS: Mac OS X → All
Priority: P2 → P5
Whiteboard: [talos][unittest][buildfaster:p?] → [talos][unittest][buildfaster:p2]
Comment 30•13 years ago
|
||
bug 679759 has been implemented for the unittest suites. We should be able to stop download symbols for those runs, and just pass the URL to the symbol zip as --symbols-path. We'll still need to implement that same logic for Talos.
Also, we'll still need to download symbols for debug tests, since we rely on them to fix stack traces from assertions.
No longer blocks: 528231
Summary: send crash minidumps from test runs to somewhere to be processed → Don't download symbols for test runs, pass symbol zip URL as symbols path
Assignee | ||
Updated•13 years ago
|
Assignee: catlee → nobody
Assignee | ||
Comment 31•13 years ago
|
||
going to start actively working on this again, hoping to reduce i/o load on the netapp by reducing web hits.
Assignee: nobody → catlee
Priority: P5 → P2
Assignee | ||
Comment 32•13 years ago
|
||
Latest patches seem to work well on linux at least. Killing firefox with -SEGV results in this output:
<snip />
TEST-UNEXPECTED-FAIL | /tests/MochiKit-1.4.2/tests/test_MochiKit-Style.html | Exited with code 1 during test run
INFO | automation.py | Application ran for: 0:00:14.403975
INFO | automation.py | Reading PID log: /tmp/tmpNistyMpidlog
Downloading symbols from: http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux/1330551909/firefox-13.0a1.en-US.linux-i686.crashreporter-symbols.zip
PROCESS-CRASH | /tests/MochiKit-1.4.2/tests/test_MochiKit-Style.html | application crashed (minidump found)
Crash dump filename: /tmp/tmpqR7JTt/minidumps/49d2146b-2db5-a562-18529d74-25e71e45.dmp
Operating system: Linux
0.0.0 Linux 2.6.31.5-127.fc12.i686.PAE #1 SMP Sat Nov 7 21:25:57 EST 2009 i686
CPU: x86
GenuineIntel family 6 model 23 stepping 10
2 CPUs
Crash reason: SIGSEGV
Crash address: 0x818
Thread 0 (crashed)
0 libnspr4.so + 0x7ee4
eip = 0x004caee4 esp = 0xbfe91dbc ebp = 0xbfe91dd8 ebx = 0x004f352c
esi = 0x00000002 edi = 0xbfe92008 eax = 0x00000001 ecx = 0xb7626600
edx = 0xbfe92008 efl = 0x00200202
Found by: given as instruction pointer in context
<snip />
Assignee | ||
Comment 33•13 years ago
|
||
This makes the default behaviour of desktop unittest to pass the symbols url to the test harness rather than downloading/unpacking the symbols.
It leaves remote unittest (aka tegras) alone, since I don't think they support this yet.
Attachment #601808 -
Flags: review?(nrthomas)
Assignee | ||
Comment 34•13 years ago
|
||
mozilla-1.9.2 doesn't support downloading symbols on demand, so this patch preserves the existing behaviour of downloading/unpacking the symbols for just the mozilla-1.9.2 branch.
Attachment #601809 -
Flags: review?(nrthomas)
Updated•13 years ago
|
Attachment #601808 -
Flags: review?(nrthomas) → review+
Updated•13 years ago
|
Attachment #601809 -
Flags: review?(nrthomas) → review+
Assignee | ||
Updated•13 years ago
|
Attachment #601809 -
Flags: checked-in+
Assignee | ||
Updated•13 years ago
|
Attachment #601808 -
Flags: checked-in+
Comment 35•13 years ago
|
||
Your recent landed changes went live around 7:30 AM PDT.
Assignee | ||
Comment 36•13 years ago
|
||
given bug 729852, and the fact that jmaher thinks this can work, I think we should go ahead and do this for tegras.
Attachment #601992 -
Flags: review?(bear)
Updated•13 years ago
|
Attachment #601992 -
Flags: review?(bear) → review+
Assignee | ||
Updated•13 years ago
|
Attachment #601992 -
Flags: checked-in+
Assignee | ||
Comment 37•13 years ago
|
||
All done here for unittests. Leaving open for enabling this for talos once the harness supports it.
Assignee: catlee → nobody
Component: Release Engineering → Release Engineering: Automation
Flags: checked-in+
QA Contact: release → catlee
Comment 38•13 years ago
|
||
Comment on attachment 601992 [details] [diff] [review]
enable symbol downloading on demand for tegras
Restoring checked-in flags after component move ....
Attachment #601992 -
Flags: checked-in+
Updated•13 years ago
|
Attachment #601808 -
Flags: checked-in+
Updated•13 years ago
|
Attachment #601809 -
Flags: checked-in+
Updated•13 years ago
|
Attachment #552699 -
Flags: checked-in+
Updated•13 years ago
|
Attachment #552371 -
Flags: checked-in+
Updated•13 years ago
|
Attachment #552196 -
Flags: checked-in+
Updated•13 years ago
|
Attachment #551858 -
Flags: checked-in+
Reporter | ||
Comment 39•13 years ago
|
||
This made it to production today.
Assignee | ||
Updated•13 years ago
|
Whiteboard: [talos][unittest][buildfaster:p2] → [talos][unittest][buildfaster:p2][waiting-for-deps]
Comment 40•13 years ago
|
||
(In reply to Chris AtLee [:catlee] from comment #32)
> Latest patches seem to work well on linux at least. Killing firefox with
> -SEGV results in this output:
>
> <snip />
> Thread 0 (crashed)
> 0 libnspr4.so + 0x7ee4
> eip = 0x004caee4 esp = 0xbfe91dbc ebp = 0xbfe91dd8 ebx = 0x004f352c
> esi = 0x00000002 edi = 0xbfe92008 eax = 0x00000001 ecx = 0xb7626600
> edx = 0xbfe92008 efl = 0x00200202
> Found by: given as instruction pointer in context
> <snip />
This doesn't actually look like you have symbols. You should have a function name there in the " 0 libnspr4.so + 0x7ee4" line, not just the library name + offset.
Assignee | ||
Comment 41•13 years ago
|
||
(In reply to Ted Mielczarek [:ted] (away until ~March 7) from comment #40)
> (In reply to Chris AtLee [:catlee] from comment #32)
> > Latest patches seem to work well on linux at least. Killing firefox with
> > -SEGV results in this output:
> >
> > <snip />
>
> > Thread 0 (crashed)
> > 0 libnspr4.so + 0x7ee4
> > eip = 0x004caee4 esp = 0xbfe91dbc ebp = 0xbfe91dd8 ebx = 0x004f352c
> > esi = 0x00000002 edi = 0xbfe92008 eax = 0x00000001 ecx = 0xb7626600
> > edx = 0xbfe92008 efl = 0x00200202
> > Found by: given as instruction pointer in context
> > <snip />
>
> This doesn't actually look like you have symbols. You should have a function
> name there in the " 0 libnspr4.so + 0x7ee4" line, not just the library name
> + offset.
That would be a problem with the test harness then? It looks like it's downloading the symbols properly.
Assignee | ||
Updated•13 years ago
|
Priority: P2 → P5
Comment 42•12 years ago
|
||
(In reply to Chris AtLee [:catlee] from comment #37)
> All done here for unittests. Leaving open for enabling this for talos once
> the harness supports it.
The Talos harness gained support for this in bug 824984, so this is fixable now.
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
Comment 43•11 years ago
|
||
How is this different than bug 851276? Can we dupe in one direction or the other here?
Comment 44•11 years ago
|
||
We should just mark this one RESOLVED FIXED since it fixed the unittest case, and deal with the Talos case over there. (I hate bugs that fix part of the issue but stay open.)
Assignee: nobody → catlee
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [talos][unittest][buildfaster:p2][waiting-for-deps] → [unittest][buildfaster:p2][waiting-for-deps]
Updated•11 years ago
|
Updated•7 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•