Bug 795360 (closed) - opened 12 years ago, closed 11 years ago

run dump_syms in parallel in symbolstore.py on multi-core machines

Component: Toolkit :: Crash Reporting (defect, severity normal)
Status: RESOLVED FIXED
Target Milestone: mozilla20
Reporter: ted
Assignee: singerb.dev
Whiteboard: [mentor=ted][lang=python][buildfaster:?]
Attachments: 1 file, 4 obsolete files

jhford wrote a blog post a while ago about benchmarking faster mac minis for building:
http://blog.johnford.org/new-mac-builders-ssds-j-settings/

One thing he mentioned offhand was: "You’ll notice that the symbols generation is slower on the Quad-i7 than on either Dual-i7 machine.  Symbol generation is not a parallelized process, which explains why higher clock rate dual core machines are faster than the quad core."

This is a good point, and we could fix this. The biggest win would be to run dump_syms in parallel when we have >1 core available. Dumping the libxul symbols is still going to take the most time, but at least we could be dumping a bunch of other smaller files while that happens, so the time would be bounded by how long it takes to dump libxul.

I don't have time to work on this at the moment, but I'd be happy to mentor someone. The only code you should have to change to fix this is symbolstore.py:
http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/tools/symbolstore.py
It has some basic unit tests:
http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/tools/unit-symbolstore.py

You invoke this script after you've done a build by executing "make buildsymbols" in your objdir.

Currently the script takes arbitrary files/directories as input and calls Dumper.Process on each one in a loop:
http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/tools/symbolstore.py#786

Dumper.Process calls down into Dumper.ProcessDir or Dumper.ProcessFile depending on whether the input is a directory or a file, respectively.

Dumper.ProcessFile does all the real work of running dump_syms:
http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/tools/symbolstore.py#472

One tricky bit is that on Mac, there's a Dumper_Mac.ProcessFile that overrides that method to first run dsymutil and then call Dumper.ProcessFile on the resulting .dSYM file:
http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/tools/symbolstore.py#702

Ideally we'd want to be running dsymutil in parallel as well since it can take a while.

I'm not sure of the best way to architect a parallel solution here, but I suspect it'd be something like:
1) In Dumper.__init__, create a work Queue to put work items in (the Queue module and the multiprocessing library both provide a Queue class) and a queue for output from finished items (probably just the filenames that we currently print to stdout would need to go into the output queue).
2) Also in __init__, detect the number of cores and start that many worker threads or processes.
3) In ProcessFile, queue a work item instead of doing the actual work.
4) Add some sort of Dumper.finish() method that collects the items from the output queue (and prints them to stdout to preserve the current behavior) and cleans up the workers.

Handling the mac case gracefully is probably slightly tricky because we need to run dsymutil once per-binary, but then run dump_syms one per-cpu-architecture on the resulting dSYM (so twice for our universal builds). In the worst case we could just make the background task do both dump_syms, but that would mean we couldn't run the two longest tasks in parallel, which would be unfortunate. There's probably a clever way to do this right (perhaps the task to run dsymutil could put the two dump_syms tasks in the queue after it finishes?).
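To make this concrete, here's a rough, untested sketch of the shape I have in mind; run_dsymutil() and run_dump_syms() are hypothetical stand-ins for the real subprocess invocations, and the accounting is deliberately simplified:

    import multiprocessing
    import threading

    def run_dsymutil(binary):
        return binary + ".dSYM"            # would shell out to dsymutil

    def run_dump_syms(dsym, arch):
        return "%s.%s.sym" % (dsym, arch)  # would shell out to dump_syms

    class Dumper(object):
        def __init__(self, archs=("i386", "x86_64")):
            self.pool = multiprocessing.Pool(multiprocessing.cpu_count())
            self.archs = archs
            self.pending = 0
            self.cond = threading.Condition()
            self.results = []

        def _submit(self, func, args, callback):
            with self.cond:
                self.pending += 1
            self.pool.apply_async(func, args, callback=callback)

        def _done(self):
            with self.cond:
                self.pending -= 1
                self.cond.notify()

        def ProcessFile(self, binary):
            self._submit(run_dsymutil, (binary,), self._dsymutil_done)

        def _dsymutil_done(self, dsym):
            # callbacks run in the parent process, so the finished dsymutil
            # job can queue one dump_syms job per architecture before it
            # retires itself; the two long dump_syms runs then overlap
            for arch in self.archs:
                self._submit(run_dump_syms, (dsym, arch), self._dump_syms_done)
            self._done()

        def _dump_syms_done(self, sym_path):
            self.results.append(sym_path)
            self._done()

        def Finish(self):
            with self.cond:
                while self.pending:
                    self.cond.wait()
            self.pool.close()
            self.pool.join()
            for path in self.results:
                print(path)  # preserve the current stdout behavior

    if __name__ == '__main__':
        d = Dumper()
        d.ProcessFile("XUL")
        d.Finish()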
I'm working on a patch for this; it's most of the way there, but I had to take some time to do a small proof of concept, since the multiprocessing module has quite a few quirks. Now that that's sorted, I hope to be able to finish the patch in the next few days. multiprocessing.Pool does most of the work, as long as you use it correctly.
Cool! I'm going to assign the bug to you. Let me know if you need any more info. You can find me on irc.mozilla.org as ted as well.
Assignee: nobody → singerb.dev
Status: NEW → ASSIGNED
Some preliminary performance numbers from my laptop (2 cores) with a non-universal Mac build:

1 worker: 884 sec
2 workers: 747 sec

So it certainly helps (roughly a 1.2x speedup on two cores), and should help even more with more cores and/or a universal build.
From my testing, this patch is fully working, and parallelizes everything as much as possible. It's been run on a universal Mac OS X build and worked fine, but has not been tested on Linux or Windows (Windows being the more important case, since multiprocessing has some strange quirks there). It's implemented using multiprocessing, mostly Pool. The unit tests have also been updated and pass.
Attachment #680937 - Flags: review?(ted)
I'd be happy to push this to try for you, but it looks like I probably bitrotted you with my patches from bug 774780. If you can rebase your patch on top of the latest code in mozilla-central (at least revision caca2267f82b) I can push it to try for you so we can see what happens on Linux/Win32.
Attached an updated patch merged with a more recent m-c, including the change you mentioned above. Unit tests all still work, re-tested on a universal build and still works.
Attachment #680937 - Attachment is obsolete: true
Attachment #680937 - Flags: review?(ted)
Attachment #681502 - Flags: review?(ted)
Linux, Win, and OS X logs from the try server all show symbolstore.py runs that look good, and show that they're taking advantage of 2-4 core machines. I think this is ready for review.
Just a ping to make sure there's nothing further needed from me at the moment.
No, sorry! Just a little lax on reviews at the moment and then a holiday. I'll review this in the next day or two. Thanks for your patience!
Comment on attachment 681502 [details] [diff] [review]
Patch updated to more recent m-c.

Review of attachment 681502 [details] [diff] [review]:
-----------------------------------------------------------------

I apologize profusely for the review delay. This is a rather involved patch, and I needed to block off a significant chunk of time to make sure I gave it a thorough review. Thanks so much for writing this patch! I think you're very much on the right track, but there are some things I'd like to see addressed before we land this. The fact that you got a green Try run is very encouraging!

As a note, if you disagree with any of my comments, please do respond with your reasoning. You've obviously spent a lot of time working on this and there may be assumptions you've made that I don't have knowledge of.

::: toolkit/crashreporter/tools/symbolstore.py
@@ +350,5 @@
>  
> +        # book-keeping to keep track of our jobs and the cleanup work per file tuple
> +        self.expected_jobs = 0
> +        self.submitted_jobs = 0
> +        self.files_record = dict()

{} instead of dict() is more Pythonic.

@@ +445,2 @@
>      def Process(self, file_or_dir):
>          "Process a file or all the (valid) files in a directory."

You should note in the docstrings for Process, ProcessFile and ProcessDir that processing is queued up to be performed asynchronously, and the caller needs to call Dumper.Finish to collate the results.

You might want to add that usage info to the Dumper class docstring as well.

@@ +449,2 @@
>          elif os.path.isfile(file_or_dir):
> +            self.expected_jobs += self.JobsPerFile() # update our expected job count

I'm really not wild about this accounting being handled outside of the ProcessFile method. I thought about this for quite a while (I stepped away from this review for some time to think), and I understand why you implemented it this way, since the Mac dumper will append additional jobs after the dsymutil job.

I could only come up with one solution that felt cleaner to me, which would be to maintain a dict of jobs remaining, keyed by input filename.  Something like self.files_record, but keyed by the original input filename, like "/path/to/libxul.so". I envision doing something very similar to what you're doing now, but instead of starting the counter at zero and incrementing it in ProcessFileFinished, you'd increment it for every SubmitJob for this file (once for dsymutil, once per-arch-ProcessFileWork), and decrement it in the ProcessFileFinished callback. When the counter got back to zero, you could remove the entry from self.files_record. This way Dumper.Finish could simply wait until self.files_record was empty to know when all jobs had been finished. (This does change the logic there slightly, to waiting after all jobs have *finished* instead of *submitted*, but I don't think that matters.) You'd have to be careful about the ordering of the "decrement counter" vs. "SubmitJob" after the dsymutil job runs, but that doesn't seem like an incredibly hard problem.

What do you think?
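As a rough, untested sketch of that accounting (using the thread-backed multiprocessing.dummy Pool here just to keep the example self-contained; the real patch would use the process pool):

    import collections
    import threading
    from multiprocessing.dummy import Pool  # thread-backed, same API as Pool

    class Dumper(object):
        def __init__(self):
            self.pool = Pool(4)
            self.cond = threading.Condition()
            self.jobs_record = collections.defaultdict(int)

        def SubmitJob(self, file_key, func, args=()):
            # one more job in flight for the original input file
            with self.cond:
                self.jobs_record[file_key] += 1
            self.pool.apply_async(func, args,
                                  callback=lambda _res: self.JobFinished(file_key))

        def JobFinished(self, file_key):
            with self.cond:
                self.jobs_record[file_key] -= 1
                if self.jobs_record[file_key] == 0:
                    del self.jobs_record[file_key]
                self.cond.notify()

        def Finish(self):
            # everything is done once no file has jobs outstanding
            with self.cond:
                while self.jobs_record:
                    self.cond.wait()
            self.pool.close()
            self.pool.join()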

@@ +459,5 @@
>                      dirs.remove(d)
>              for f in files:
>                  fullpath = os.path.join(root, f)
>                  if self.ShouldProcess(fullpath):
> +                    self.expected_jobs += self.JobsPerFile() # update our expected job count

The code smell that bothers me the most about this is that this code gets repeated in more than one place. That feels fragile.

@@ +479,5 @@
> +            del self.files_record[res['files']]
> +            if res['after']:
> +                res['after'](res['status'], res['after_arg'])
> +
> +    def ProcessFile(self, files, after=None, after_arg=None):

This should probably get renamed to ProcessFile*s* now that it handles more than one file, right?

@@ +554,5 @@
>                      f.close()
>                      proc.wait()
>                      # we output relative paths so callers can get a list of what
>                      # was generated
> +                    output(lock, sys.stdout, rel_path)

I wonder if it wouldn't make more sense to just use a Queue here, push all output from workers into the Queue, and have Dumper.Finish pull items from the Queue and print them?
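Something like this untested sketch (a Manager Queue, since proxy queues can be handed to pool workers as arguments; the filenames are illustrative):

    import multiprocessing

    def worker(queue, rel_path):
        # the real worker would run dump_syms here, then:
        queue.put(rel_path)

    if __name__ == '__main__':
        manager = multiprocessing.Manager()
        queue = manager.Queue()
        pool = multiprocessing.Pool(2)
        for path in ("xul.sym", "plugin-container.sym"):
            pool.apply_async(worker, (queue, path))
        pool.close()
        pool.join()
        while not queue.empty():
            print(queue.get())  # Dumper.Finish would print these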

@@ +572,5 @@
> +                break
> +            elif len(files) > 1:
> +                # fallback for DWARF-less binaries
> +                output(lock, sys.stderr, "Couldn't read symbols in: %s, (most likely couldn't read DWARF symbols)" % file)
> +                continue

This is a little awkward, I'd probably just leave out the error message here honestly.

@@ +793,5 @@
>                                    stdout=open("/dev/null","w"), stderr=subprocess.STDOUT)
>          if success == 0 and os.path.exists(full_path):
> +            output(lock, sys.stdout, rel_path)
> +
> +def output(lock, dest, output_str):

Can you wrap these up in a class so that you don't have to pass the lock all over the place just to use it for output? Alternately, you could switch the logging in this file to use the logging module, which might just solve this for you. (If that's too much work we can address it in a followup, but passing the lock all over the place is kind of inelegant.)
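e.g. something like this untested sketch; note the stream itself isn't picklable, so each process would wrap the shared lock proxy around its own stdout rather than receiving the wrapper as an argument:

    import multiprocessing
    import sys

    class LockedOutput(object):
        def __init__(self, lock, dest):
            self.lock = lock  # a multiprocessing.Manager().RLock() proxy
            self.dest = dest

        def write_line(self, line):
            with self.lock:
                self.dest.write(line + "\n")
                self.dest.flush()

    # each process constructs its own wrapper around the shared lock:
    # out = LockedOutput(shared_lock, sys.stdout)
    # out.write_line(rel_path)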

@@ +870,5 @@
> +    # also note that these need to be in the __main__ guard, or else Windows will choke
> +    manager = multiprocessing.Manager()
> +    jobs_condition = manager.Condition()
> +    output_lock = manager.RLock()
> +    pool = multiprocessing.Pool(num_cpus)

This is pretty ugly, using globals isn't really great. Can you create this all in the Dumper __init__ method instead? (You could make them statics on Dumper if you want them to be process-global, like Dumper.pool etc.)

::: toolkit/crashreporter/tools/unit-symbolstore.py
@@ +37,5 @@
>      return [f + extension for f in files]
>  
> +# since we replace module functions/classes with our own, we can't let multiprocessing spawn new processes
> +# the new processes wouldn't get our replacements, and the tests would try to run for real
> +# since symbolstore uses a global pool, we don't have to change the multiprocessing module

This is a little distressing, since it means that you won't be testing actual async behavior, but I understand the difficulty of making mocking work with multiprocessing. There's actually an (undocumented) ThreadPool in multiprocessing:
from multiprocessing.pool import ThreadPool

Can you see if you can use that for the tests? (Actually it'd probably be fine to use for the real implementation too, since all the hard work is done via subprocess anyway.)
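For reference, it has the same interface as multiprocessing.Pool:

    from multiprocessing.pool import ThreadPool

    pool = ThreadPool(4)  # threads instead of processes, identical API
    result = pool.apply_async(sum, ([1, 2, 3],))
    print(result.get())   # -> 6
    pool.close()
    pool.join()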
Attachment #681502 - Flags: review?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #11)
> Comment on attachment 681502 [details] [diff] [review]
> Patch updated to more recent m-c.
> 
> Review of attachment 681502 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I apologize profusely for the review delay. This is a rather involved patch,
> and I needed to block off a significant chunk of time to make sure I gave it
> a thorough review. Thanks so much for writing this patch! I think you're
> very much on the right track, but there are some things I'd like to see
> addressed before we land this. The fact that you got a green Try run is very
> encouraging!
> 
> As a note, if you disagree with any of my comments, please do respond with
> your reasoning. You've obviously spent a lot of time working on this and
> there may be assumptions you've made that I don't have knowledge of.

No problem on the delay; this did end up being quite involved!

> ::: toolkit/crashreporter/tools/symbolstore.py
> @@ +350,5 @@
> >  
> > +        # book-keeping to keep track of our jobs and the cleanup work per file tuple
> > +        self.expected_jobs = 0
> > +        self.submitted_jobs = 0
> > +        self.files_record = dict()
> 
> {} instead of dict() is more Pythonic.

True. Will change.

> @@ +445,2 @@
> >      def Process(self, file_or_dir):
> >          "Process a file or all the (valid) files in a directory."
> 
> You should note in the docstrings for Process, ProcessFile and ProcessDir
> that processing is queued up to be performed asynchronously, and the caller
> needs to call Dumper.Finish to collate the results.
> 
> You might want to add that usage info to the Dumper class docstring as well.

Good point.

> @@ +449,2 @@
> >          elif os.path.isfile(file_or_dir):
> > +            self.expected_jobs += self.JobsPerFile() # update our expected job count
> 
> I'm really not wild about this accounting being handled outside of the
> ProcessFile method. I thought about this for quite a while (I stepped away
> from this review for some time to think), and I understand why you
> implemented it this way, since the Mac dumper will append additional jobs
> after the dsymutil job.

Yup, exactly.

> I could only come up with one solution that felt cleaner to me, which would
> be to maintain a dict of jobs remaining, keyed by input filename.  Something
> like self.files_record, but keyed by the original input filename, like
> "/path/to/libxul.so". I envision doing something very similar to what you're
> doing now, but instead of starting the counter at zero and incrementing it
> in ProcessFileFinished, you'd increment it for every SubmitJob for this file
> (once for dsymutil, once per-arch-ProcessFileWork), and decrement it in the
> ProcessFileFinished callback. When the counter got back to zero, you could
> remove the entry from self.files_record. This way Dumper.Finish could simply
> wait until self.files_record was empty to know when all jobs had been
> finished. (This does change the logic there slightly, to waiting after all
> jobs have *finished* instead of *submitted*, but I don't think that
> matters.) You'd have to be careful about the ordering of the "decrement
> counter" vs. "SubmitJob" after the dsymutil job runs, but that doesn't seem
> like an incredibly hard problem.
> 
> What do you think?

Sounds like it should be possible; I'll take a look at it. I agree with both this comment and the one below; I didn't like duplicating this around outside ProcessFile, but hadn't come up with any other approach.

> @@ +459,5 @@
> >                      dirs.remove(d)
> >              for f in files:
> >                  fullpath = os.path.join(root, f)
> >                  if self.ShouldProcess(fullpath):
> > +                    self.expected_jobs += self.JobsPerFile() # update our expected job count
> 
> The code smell that bothers me the most about this is that this code gets
> repeated in more than one place. That feels fragile.
> 
> @@ +479,5 @@
> > +            del self.files_record[res['files']]
> > +            if res['after']:
> > +                res['after'](res['status'], res['after_arg'])
> > +
> > +    def ProcessFile(self, files, after=None, after_arg=None):
> 
> This should probably get renamed to ProcessFile*s* now that it handles more
> than one file, right?

Yes, good point.

> @@ +554,5 @@
> >                      f.close()
> >                      proc.wait()
> >                      # we output relative paths so callers can get a list of what
> >                      # was generated
> > +                    output(lock, sys.stdout, rel_path)
> 
> I wonder if it wouldn't make more sense to just use a Queue here, push all
> output from workers into the Queue, and have Dumper.Finish pull items from
> the Queue and print them?

That could certainly be done, easily. And actually, given that Finish gets called while jobs are still completing, you wouldn't lose much of the "real-time" output of paths. I'll see if it makes sense to make this change along with the other output changes.

> @@ +572,5 @@
> > +                break
> > +            elif len(files) > 1:
> > +                # fallback for DWARF-less binaries
> > +                output(lock, sys.stderr, "Couldn't read symbols in: %s, (most likely couldn't read DWARF symbols)" % file)
> > +                continue
> 
> This is a little awkward, I'd probably just leave out the error message here
> honestly.

I can leave it out, certainly. It does happen, and I was just keeping it in from the old code (it's moved here due to the way we fall back in the Mac case now).

> @@ +793,5 @@
> >                                    stdout=open("/dev/null","w"), stderr=subprocess.STDOUT)
> >          if success == 0 and os.path.exists(full_path):
> > +            output(lock, sys.stdout, rel_path)
> > +
> > +def output(lock, dest, output_str):
> 
> Can you wrap these up in a class so that you don't have to pass the lock all
> over the place just to use it for output? Alternately, you could switch the
> logging in this file to use the logging module, which might just solve this
> for you. (If that's too much work we can address it in a followup, but
> passing the lock all over the place is kind of inelegant.)

I'll have to refresh myself on whether I can wrap up the lock somehow; I agree it would be nice to, but integrating multiprocessing with classes/objects in general can get a little tricky. It might be possible, though. I'm not sure about the multiprocess-safety of logging, but I'll investigate; I agree it'd be nice to be able to use it.

> @@ +870,5 @@
> > +    # also note that these need to be in the __main__ guard, or else Windows will choke
> > +    manager = multiprocessing.Manager()
> > +    jobs_condition = manager.Condition()
> > +    output_lock = manager.RLock()
> > +    pool = multiprocessing.Pool(num_cpus)
> 
> This is pretty ugly, using globals isn't really great. Can you create this
> all in the Dumper __init__ method instead? (You could make them statics on
> Dumper if you want them to be process-global, like Dumper.pool etc.)

Again, I'll have to refresh myself on this w/r/t multiprocessing and the restrictions therein.

> ::: toolkit/crashreporter/tools/unit-symbolstore.py
> @@ +37,5 @@
> >      return [f + extension for f in files]
> >  
> > +# since we replace module functions/classes with our own, we can't let multiprocessing spawn new processes
> > +# the new processes wouldn't get our replacements, and the tests would try to run for real
> > +# since symbolstore uses a global pool, we don't have to change the multiprocessing module
> 
> This is a little distressing, since it means that you won't be testing
> actual async behavior, but I understand the difficulty of making mocking
> work with multiprocessing. There's actually an (undocumented) ThreadPool in
> multiprocessing:
> from multiprocessing.pool import ThreadPool
> 
> Can you see if you can use that for the tests? (Actually it'd probably be
> fine to use for the real implementation too, since all the hard work is done
> via subprocess anyway.)

It should be possible to use some sort of thread pool for the tests, yes; I'll look into that. Hadn't thought about that as an option, and it didn't bother me too much that the tests weren't actually running asynchronously, since all the logic was being tested anyway.
Attached patch Patch take 2. (obsolete) — Splinter Review
Updated patch with:

- nits/docs from review fixed
- output and general multiprocessing cleaned up
- new job tracking mechanism based on review suggestion
- unit test updated to use multiprocessing.dummy

It's working on OS X, and passing the unit tests, but it could use a push to try for Linux/Windows.
Attachment #681502 - Attachment is obsolete: true
Attachment #690017 - Flags: review?(ted)
Comment on attachment 690017 [details] [diff] [review]
Patch take 2.

Review of attachment 690017 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with a few minor changes. If you make these changes you can attach a new version of the patch, mark it r+ (include r=ted in the changeset message) and set the checkin-needed keyword.

::: toolkit/crashreporter/tools/symbolstore.py
@@ +324,5 @@
>  
>      You don't want to use this directly if you intend to call
>      ProcessDir.  Instead, call GetPlatformSpecificDumper to
> +    get an instance of a subclass.
> +    

nit: whitespace

@@ +356,5 @@
> +        # book-keeping to keep track of our jobs and the cleanup work per file tuple
> +        self.files_record = {}
> +        self.jobs_record = collections.defaultdict(int)
> +
> +    def IncrementJobs(self, file_key):

I might call these "JobStarted" and "JobFinished", which I feel have a little more semantic meaning.

@@ +360,5 @@
> +    def IncrementJobs(self, file_key):
> +        """Increments the number of submitted jobs for the specified key file,
> +        defined as the original file we processed; note that a single key file
> +        can generate up to 1 + len(self.archs) jobs in the Mac case."""
> +        with self.jobs_condition:

It'd be clearer to refer to this (and the other class member vars) with the Dumper. prefix instead of self, since they actually reside on the class, not the instance.

@@ +887,5 @@
> +    # note that these need to be in the __main__ guard, or else Windows will choke
> +    Dumper.manager = multiprocessing.Manager()
> +    Dumper.jobs_condition = Dumper.manager.Condition()
> +    Dumper.lock = Dumper.manager.RLock()
> +    Dumper.pool = multiprocessing.Pool(num_cpus)

Can you wrap all these in a class method on Dumper and call that here instead? http://docs.python.org/2/library/functions.html#classmethod
Maybe Dumper.global_init() or something.
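Something like this sketch (GlobalInit is an illustrative name; the optional pool argument anticipates the test comment below):

    import multiprocessing

    class Dumper(object):
        @classmethod
        def GlobalInit(cls, pool=None):
            # call once from the __main__ guard; tests can pass a dummy pool
            cls.manager = multiprocessing.Manager()
            cls.jobs_condition = cls.manager.Condition()
            cls.lock = cls.manager.RLock()
            cls.pool = pool or multiprocessing.Pool(multiprocessing.cpu_count())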

::: toolkit/crashreporter/tools/unit-symbolstore.py
@@ +73,5 @@
>          self.add_test_files(add_extension(["foo", "bar", "abc/xyz", "abc/fooxyz", "def/asdf", "def/xyzfoo"]))
>          d = symbolstore.GetPlatformSpecificDumper(dump_syms="dump_syms",
>                                                    symbol_path="symbol_path",
>                                                    exclude=["*foo*"])
> +        d.ProcessFiles = mock_process_file

We should probably switch to actually using mock here, but you don't have to do that in this patch. (I imported the mock module after I wrote a bunch of these tests.)

@@ +247,5 @@
> +    # that our mocking/module-patching works
> +    symbolstore.Dumper.manager = multiprocessing.dummy.Manager()
> +    symbolstore.Dumper.jobs_condition = symbolstore.Dumper.manager.Condition()
> +    symbolstore.Dumper.lock = symbolstore.Dumper.manager.RLock()
> +    symbolstore.Dumper.pool = multiprocessing.dummy.Pool(2)

If you moved this into a class method you could just call it here instead of having to repeat yourself. You do want to provide a different pool, so perhaps you can have an optional pool argument to the global_init method to pass that in?
Attachment #690017 - Flags: review?(ted) → review+
Attached patch Final patch. (obsolete) — Splinter Review
Updated patch with nits from final review fixed.
Attachment #690017 - Attachment is obsolete: true
Attachment #690862 - Flags: review+
Keywords: checkin-needed
This appears to have been causing hangs on the Windows builders. Backed out.
https://hg.mozilla.org/integration/mozilla-inbound/rev/4b1328bd377c

https://tbpl.mozilla.org/php/getParsedLog.php?id=18003626&tree=Mozilla-Inbound

2208: Submitting jobs for files: ('.\\xpcom\\tests\\testtimers.pdb',)
2208: Submitting jobs for files: ('.\\xpcom\\tests\\testunicodearguments.pdb',)
2208: Submitting jobs for files: ('.\\xpcom\\tests\\bug656331_component\\test656331.pdb',)
2208: Submitting jobs for files: ('.\\xpcom\\tests\\component\\testcomponent.pdb',)
2208: Submitting jobs for files: ('.\\xpcom\\tests\\component_no_aslr\\testcompnoaslr.pdb',)
2208: Submitting jobs for files: ('.\\xpcom\\tests\\external\\testminstringapi.pdb',)
2208: Submitting jobs for files: ('.\\xpcom\\tests\\windows\\testcom.pdb',)
2208: Submitting jobs for files: ('.\\xpcom\\typelib\\xpt\\tests\\primitivetest.pdb',)
2208: Submitting jobs for files: ('.\\xpcom\\typelib\\xpt\\tests\\simpletypelib.pdb',)

command timed out: 3600 seconds without output, attempting to kill
SIGKILL failed to kill process
using fake rc=-1
program finished with exit code -1
I think the last version of the patch didn't get pushed to try first. Whoops. I'll try and debug this soon. It looks like the worker processes aren't getting started for some reason. No surprise that Windows was the issue though; the multiprocessing stuff is trickier there.
To that end, access to try so I can test Windows builds would be very useful.
Now with a fully green try build: https://tbpl.mozilla.org/?tree=Try&rev=5155805ec1cb

The issue was that since Windows can't use fork() to implement multiprocessing, the worker processes didn't have access to the lock object, and were thus erroring out and not running any jobs. Fixed by explicitly setting the lock object in the child processes with an initializer.
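The shape of the fix, as a minimal standalone sketch (not the patch itself):

    import multiprocessing

    def WorkerInit(lock):
        # runs once in each child process; Windows children don't inherit
        # parent globals, so stash the manager proxy in a module global
        global output_lock
        output_lock = lock

    if __name__ == '__main__':
        manager = multiprocessing.Manager()
        lock = manager.RLock()
        pool = multiprocessing.Pool(multiprocessing.cpu_count(),
                                    initializer=WorkerInit,
                                    initargs=(lock,))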
Attachment #690862 - Attachment is obsolete: true
Attachment #696199 - Flags: review?(ted)
Comment on attachment 696199 [details] [diff] [review]
Fixed final patch.

Review of attachment 696199 [details] [diff] [review]:
-----------------------------------------------------------------

Nice catch, thanks for seeing this through!
Attachment #696199 - Flags: review?(ted) → review+
Thanks for the review!
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/26954234c778
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20