Closed Bug 679759 Opened 13 years ago Closed 13 years ago

Drop MINIDUMP_STACKWALK_CGI support, let harness download symbols as needed

Categories

(Testing :: Mochitest, defect)

defect
Not set
normal

Tracking

(firefox-esr1012+ fixed)

RESOLVED FIXED
mozilla11
Tracking Status
firefox-esr10 12+ fixed

People

(Reporter: ted, Assigned: wlach)

References

Details

(Whiteboard: [buildfaster:p1][qa-])

Attachments

(1 file, 5 obsolete files)

In bug 563745 and bug 563678 I implemented a CGI script that would run minidump_stackwalk, with the goal of making it so unit test machines never had to download symbol files to process crashes. We tried running this in production, and the web server running the CGI got swamped and we had to turn it off.

catlee tossed out an alternate idea, which is to simply make the test harnesses download the symbols as needed, only if they need them to process a crash.

This sounds reasonably easy to do, and should get us most of the benefits of the other approach with fewer moving parts.
Blocks: 561754
Do crashes that occur but are expected still require processing? That could negate the benefit of this.
They shouldn't, since we fixed bug 642175.
Whiteboard: [buildfaster:p1]
Assignee: nobody → wlachance
For some reason I volunteered to take this bug on. :)
I'm not using automation.py, but I took the checkForCrashes method for my own harness (which uses mozbase) and stripped out the CGI script bits:

https://github.com/ahal/peptest/blob/4ed9090aaee9cb80dda0cc09d2d5fed061c80de7/runpeptests.py#L276
This patch does what it says, or tries to. :)
Attachment #568145 - Flags: review?(ted.mielczarek)
Comment on attachment 568145 [details] [diff] [review]
Drop MINIDUMP_STACKWALK_CGI support, let harness download symbols as needed

Review of attachment 568145 [details] [diff] [review]:
-----------------------------------------------------------------

This mostly looks good, but we don't require Python 2.6 yet, so I have to r- it.

::: build/automationutils.py
@@ +109,5 @@
>      except:
>        testName = "unknown"
>  
>    foundCrash = False
> +  eraseSymbolsPath = False

I would probably call this "removeSymbolsPath", but that's pretty nitpicky.

@@ +124,5 @@
> +        symbolsFile = tempfile.TemporaryFile()
> +        symbolsFile.write(data.read())
> +        # extract symbols to a temporary directory (which we'll delete after
> +        # processing all crashes)
> +        # can't use extractall because of python 2.4 compat requirements for talos

Uh, you say this, but then you go ahead and use it below! extractall is a 2.6-ism, unfortunately. I'd be fine with just shelling out to "unzip" here for simplicity.
Attachment #568145 - Flags: review?(ted.mielczarek) → review-
This should address both elements of the review  (erase->remove, remove py2.6ism).

Sorry about the python 2.6ism, I thought we did depend on py2.6 for mozilla-central at the moment (I decided I'd simplify things a bit, without removing a comment that was specific to Talos).
Attachment #568145 - Attachment is obsolete: true
Attachment #569508 - Flags: review?(ted.mielczarek)
This looks like the exact same patch. Did you attach the wrong patch?
(this is the actual patch I intended to attach last time)
Attachment #569508 - Attachment is obsolete: true
Attachment #569508 - Flags: review?(ted.mielczarek)
Attachment #569654 - Flags: review?(ted.mielczarek)
Comment on attachment 569654 [details] [diff] [review]
Drop MINIDUMP_STACKWALK_CGI support, let harness download symbols as needed

Review of attachment 569654 [details] [diff] [review]:
-----------------------------------------------------------------

I don't think this is quite going to work, but it's pretty close. Sorry to make you go through multiple revisions here!

::: build/automationutils.py
@@ +124,5 @@
> +        symbolsFile = tempfile.TemporaryFile()
> +        symbolsFile.write(data.read())
> +        # extract symbols to a temporary directory (which we'll delete after
> +        # processing all crashes)
> +        # can't use extractall because of python 2.4 compat requirements

Might want to say 2.5 here, since extractall is 2.6. (That way when we do require 2.6 everywhere someone might notice and clean this up.)

@@ +129,5 @@
> +        symbolsPath = tempfile.mkdtemp()
> +        zfile = zipfile.ZipFile(symbolsFile)
> +        for name in zfile.namelist():
> +          (dirname, filename) = map(lambda p: os.path.join(symbolsPath, p),
> +                                    os.path.split(name))

I always use list comprehensions instead of map, they feel more Pythonic.

Also, I'm not sure this is really doing what you want it to do. If one of the zip entries is "foo/bar.txt", and symbolsPath is "/tmp/xyz", then you're going to wind up with:
dirname = "/tmp/xyz/foo", filename = "/tmp/xyz/bar.txt"

You probably want to write this whole thing more like:
for name in zfile.namelist():
  filename = os.path.join(symbolsPath, name)
  dirname = os.path.dirname(filename)
  ...

@@ +132,5 @@
> +          (dirname, filename) = map(lambda p: os.path.join(symbolsPath, p),
> +                                    os.path.split(name))
> +          if dirname and not os.path.exists(dirname):
> +            os.makedirs(dirname)
> +          f = open(filename, "w")

with open(filename, "w") as f: (although if you're going to copy-paste this code for Talos with Python 2.4 I guess you can leave it alone for now).

@@ +169,5 @@
>        os.remove(extra)
>      foundCrash = True
>  
> +  if removeSymbolsPath:
> +    shutil.rmtree(symbolsPath)

I wonder if you shouldn't wrap this entire block in try, and put this in a finally block?
Attachment #569654 - Flags: review?(ted.mielczarek) → review-
Ok, I hope this patch should address your issues. :)

1. I added the try..finally block. 
2. Not in reaction to your review, I moved the logic of downloading the symbols from a URL outside the crash processing loop. Seems to make a lot more sense that way.
3. For the rest of your comments, I just realized that we already subclassed ZipFileReader for this exact reason inside automation.py.in. I thus moved this class over and started using it. No point in having two slightly-different copies of this kind of (complicated) code.
Attachment #569654 - Attachment is obsolete: true
Attachment #571378 - Flags: review?(ted.mielczarek)
Comment on attachment 571378 [details] [diff] [review]
MINIDUMP_STACKWALK_CGI support, let harness download symbols as needed

Review of attachment 571378 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, thanks! I should have remembered that ZipFile stuff in automation.py, since I reviewed it not that long ago...
Attachment #571378 - Flags: review?(ted.mielczarek) → review+
Can someone check this in for me?
Keywords: checkin-needed
This is causing orange on every single platform, with failures like the following:

{
python reftest/runreftest.py --appname=firefox/firefox-bin --utility-path=bin --extra-profile-file=bin/plugins --symbols-path=symbols reftest/tests/testing/crashtest/crashtests.list
 in dir /home/cltbld/talos-slave/test/build (timeout 1200 secs) (maxTime 7200 secs)
 watching logfiles {}
 argv: ['python', 'reftest/runreftest.py', '--appname=firefox/firefox-bin', '--utility-path=bin', '--extra-profile-file=bin/plugins', '--symbols-path=symbols', 'reftest/tests/testing/crashtest/crashtests.list']
 environment:
[...]
 using PTY: False
/home/cltbld/talos-slave/test/build/reftest/automationutils.py:109: Warning: 'with' will become a reserved keyword in Python 2.6
Traceback (most recent call last):
  File "reftest/runreftest.py", line 48, in <module>
    from automation import Automation
  File "/home/cltbld/talos-slave/test/build/reftest/automation.py", line 20, in <module>
    import automationutils
  File "/home/cltbld/talos-slave/test/build/reftest/automationutils.py", line 109
    with open(filename, "wb") as dest:
            ^
SyntaxError: invalid syntax
program finished with exit code 1
elapsedTime=0.111890
TinderboxPrint: crashtest<br/><em class="testfail">T-FAIL</em>
Unknown Error: command finished with exit code: 1
}

Backing out.  Please fix and re-land only after this has been TryServer-tested.
actually mbrubeck beat me to it. Backout was:
 https://hg.mozilla.org/integration/mozilla-inbound/rev/72592670532d
Version: unspecified → Trunk
Try run for 97eb160935b0 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=97eb160935b0
Results (out of 169 total builds):
    success: 149
    warnings: 19
    failure: 1
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/wlachance@mozilla.com-97eb160935b0
Ok, breaking down this try server run:

Failures:

1. A timeout in jetpack. Unrelated to this change. http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/wlachance@mozilla.com-97eb160935b0/try-linux64/try_fedora64_test-jetpack-build112.txt.gz

Warnings:

Most of these are problems in Lion, which we don't support yet. Of the remainder, I looked at them to see what was going on:

1. Another warning in a build on Linux. Unrelated to this change. http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/wlachance@mozilla.com-97eb160935b0/try-linux/try-linux-build3283.txt.gz

2. Warnings during a mochitest run on mac snow leopard. Unrelated to this change. http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/wlachance@mozilla.com-97eb160935b0/try-macosx64-debug/try_snowleopard-debug_test-mochitest-other-build48.txt.gz

3. Warning during a reftest run on android. Unrelated to this change. http://ftp.mozilla.org/pub/mozilla.org/mobile/try-builds/wlachance@mozilla.com-97eb160935b0/try-android/try_tegra_android_test-reftest-1-build3040.txt.gz

4. Warning during a reftest run on android. Unrelated to this change. http://ftp.mozilla.org/pub/mozilla.org/mobile/try-builds/wlachance@mozilla.com-97eb160935b0/try-android/try_tegra_android_test-reftest-2-build3094.txt.gz

5. Warning during a mochitest run on Linux. Unrelated to this change. http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/wlachance@mozilla.com-97eb160935b0/try-linux64-debug/try_fedora64-debug_test-mochitest-other-build204.txt.gz
Ok, here's a hopefully final take. It fixes the bug with the use of "with" that we noticed after pushing the first time (oops). I also noticed at the last minute that we're still shipping poster.zip as part of the tree, even though it's no longer needed. So I deleted that. Doing one last try server run, can we push assuming that no new oranges occur there?
Attachment #571378 - Attachment is obsolete: true
Attachment #573004 - Flags: review?(ted.mielczarek)
Try run for 1dbb678f75fc is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=1dbb678f75fc
Results (out of 18 total builds):
    failure: 18
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/wlachance@mozilla.com-1dbb678f75fc
Try run for fc5ce15dbe5a is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=fc5ce15dbe5a
Results (out of 18 total builds):
    exception: 18
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/wlachance@mozilla.com-fc5ce15dbe5a
Try run for a56ddb64ab16 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=a56ddb64ab16
Results (out of 214 total builds):
    success: 194
    warnings: 19
    failure: 1
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/wlachance@mozilla.com-a56ddb64ab16
(In reply to Mozilla RelEng Bot from comment #23)
> Try run for a56ddb64ab16 is complete.
> Detailed breakdown of the results available here:
>     https://tbpl.mozilla.org/?tree=Try&rev=a56ddb64ab16
> Results (out of 214 total builds):
>     success: 194
>     warnings: 19
>     failure: 1
> Builds available at
> http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/wlachance@mozilla.
> com-a56ddb64ab16

Ok, there don't seem to be any oranges in this build that could be caused by my change AFAICT.
Comment on attachment 573004 [details] [diff] [review]
Drop MINIDUMP_STACKWALK_CGI support, let harness download symbols as needed

Review of attachment 573004 [details] [diff] [review]:
-----------------------------------------------------------------

::: build/automationutils.py
@@ +105,5 @@
> +    else:
> +      path = os.path.split(filename)[0]
> +      if not os.path.isdir(path):
> +        os.makedirs(path)
> +      dest = open(filename, "wb")

Don't go removing with blocks. You just need to import them from the future up top, like automation.py does:
http://mxr.mozilla.org/mozilla-central/source/build/automation.py.in#40
Attachment #573004 - Flags: review?(ted.mielczarek) → review+
Try run for 6da8df028cc7 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=6da8df028cc7
Results (out of 209 total builds):
    success: 184
    warnings: 25
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/wlachance@mozilla.com-6da8df028cc7
This imports with from future, as desired.
Attachment #573004 - Attachment is obsolete: true
The latest patch passes try and gets a positive review. Please commit when ready.
Keywords: checkin-needed
https://hg.mozilla.org/integration/mozilla-inbound/rev/6c596531c794
Status: NEW → ASSIGNED
Keywords: checkin-needed
Target Milestone: --- → mozilla11
https://hg.mozilla.org/mozilla-central/rev/6c596531c794
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Depends on: 703768
Comment on attachment 573874 [details] [diff] [review]
Drop MINIDUMP_STACKWALK_CGI support, let harness download symbols as needed

This is decidedly inconvenient.

This patch landed just before 10 merged from m-c, but was backed out, and then landed after, for 11. But then bug 561754 maybe sort of forgot that, and treated esr-10 as though it was able to download symbols on demand. That means that now buildbot doesn't download symbols while downloading a build for tests, and doesn't pass a local path, but automation doesn't know what to do with a remote symbols path, and so we get useless stacks from crashes like https://tbpl.mozilla.org/php/getParsedLog.php?id=10831561&tree=Mozilla-Esr10 which might be some known crash, or might be a new crash introduced by that push, and there's no way of telling.

We have two choices: land this patch on esr-10 (it applies cleanly, I have no idea whether it actually works or how to tell other than by landing it), or, give esr-10 the same "don't download on demand" buildbot treatment that bug 561754 gave 1.9.2 (which we have absolutely no reason to believe works, we haven't actually seen a crash on 1.9.2 since March 1st, and the other thing that wasn't downloading symbols on demand, Android, now is because when it wasn't, it wasn't getting stacks for crashes).
Attachment #573874 - Flags: approval-mozilla-esr10?
I think that since we have this working on trunk, the simplest solution is to land this on esr. This is a test-automation-only change, and won't affect the bits we're actually shipping (except that we'll actually be able to get stacks for test failures).
Comment on attachment 573874 [details] [diff] [review]
Drop MINIDUMP_STACKWALK_CGI support, let harness download symbols as needed

[Triage Comment]
approving for landing, please go ahead as per https://wiki.mozilla.org/Release_Management/ESR_Landing_Process
Attachment #573874 - Flags: approval-mozilla-esr10? → approval-mozilla-esr10+
I don't think this is anything QA needs to explicitly verify, flagging qa-. Please correct me if I am wrong.
Whiteboard: [buildfaster:p1] → [buildfaster:p1][qa-]
Depends on: 789275
You need to log in before you can comment on or make changes to this bug.