Closed Bug 883209 Opened 7 years ago Closed 6 years ago

Record resource usage of builds

Categories

(Firefox Build System :: General, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED
mozilla26

People

(Reporter: gps, Assigned: gps)

References

(Blocks 2 open bugs)

Details

Attachments

(3 files, 6 obsolete files)

I've long wanted to record system resource usage during builds. Now that we have a class in mozbase for easily capturing system resource usage, I'd like to hook it up to the build system.

The end goal is to facilitate easier "monitoring" of the efficiency of the build system. We'd like to optimize for 100% CPU usage while minimizing CPU time, I/O, and memory. Having these recorded in a machine readable manner will allow us to better get a handle on historical trends. It can also help developers get an understanding on why builds are slow. e.g. "oh, my machine is spending a lot of time in I/O wait, perhaps I should get an SSD." Or, "my machine is using all available CPU, maybe I should get a faster processor."
Fun Friday patch!

I hooked the system resource monitor into the BuildMonitor class. We
drop markers into the resource monitor during tier traversal. When a
build has finished, we write out some structured log events with
high-level resource usage. These are printed to the user. e.g.

---
 3:58.32 nsBrowserApp.cpp
 4:01.30 390 compiler warnings present.
 4:02.55 Overall system resources - Wall time: 241s; CPU: 50%; Read bytes: 3854320640; Write bytes: 4454077952; Read time: 147867; Write time: 88086
 4:02.55 base tier - Wall time: 3s; CPU: 29%; Read bytes: 4902912; Write bytes: 6599168; Read time: 161; Write time: 135
 4:02.55 nspr tier - Wall time: 2s; CPU: 15%; Read bytes: 7102464; Write bytes: 12748288; Read time: 164; Write time: 169
 4:02.55 nss tier - Wall time: 5s; CPU: 56%; Read bytes: 74002944; Write bytes: 109625856; Read time: 2254; Write time: 1238
 4:02.55 js tier - Wall time: 11s; CPU: 31%; Read bytes: 442766336; Write bytes: 920877056; Read time: 11961; Write time: 14635
 4:02.56 platform tier - Wall time: 183s; CPU: 58%; Read bytes: 3206482944; Write bytes: 3268277760; Read time: 130150; Write time: 64596
 4:02.56 app tier - Wall time: 10s; CPU: 32%; Read bytes: 16626176; Write bytes: 50634752; Read time: 899; Write time: 5919
Your build was successful!
To take your build for a test drive, run:
/Users/gps/src/firefox/obj-firefox.noindex/dist/Nightly.app/Contents/MacOS/firefox
For more information on what to do now, see
https://developer.mozilla.org/docs/Developer_Guide/So_You_Just_Built_Firefox
---

In addition, we also write out a JSON file into the object directory
containing the raw data. I'd like to provide a graph view of this data
so people can easily see what systems are doing during builds.
Eventually, I'd like to have this file uploaded to ftp. But that
requires releng using mach, which is out of scope for this bug.

There are currently pieces of data missing from the JSON file, notably
info on all the phases. I may upload a v2 of this patch. Or, we can
delegate it to a followup. I figured I might as well upload something
that works as soon as I have it.
Attachment #763026 - Flags: review?(ted)
Assignee: nobody → gps
I quickly added phase times to the JSON.
Attachment #763030 - Flags: review?(ted)
Attachment #763026 - Attachment is obsolete: true
Attachment #763026 - Flags: review?(ted)
I made modifications necessary to support an HTML viewer.
Attachment #763030 - Attachment is obsolete: true
Attachment #763030 - Flags: review?(ted)
Attachment #763202 - Flags: review?(ted)
I implemented a (very ugly looking) HTML interface to visualize the
produced resource usage JSON file. It is far from feature complete and
it is very ugly. But it's a start.

There is a new mach command (mach build-resource-usage) which fires up a
Python HTTP server and launches Firefox against said server. If all goes
well, you should see some visualizations within a second or two.

Again, this is very ugly. Web development is not my strong suit.
I hope once it gets committed a more talented web developer than
myself will feel compelled to fix it after it causes their eyes
to bleed.
Attachment #763203 - Flags: review?(ted)
Attached image screenshot
This screenshot shows what things should look like. Tufte would not approve.
I added a legend to the graph.
Attachment #763692 - Flags: review?(ted)
Attachment #763203 - Attachment is obsolete: true
Attachment #763203 - Flags: review?(ted)
Comment on attachment 763202 [details] [diff] [review]
Record system resource usage of builds

I found some issues on Windows. Cancelling review until I have them sorted out.
Attachment #763202 - Flags: review?(ted)
Just FWIW, the output in comment 2 is way too verbose and I don't think it'll be useful to most people. I would shorten it to just the "Overall system resources" line, and maybe print the mach command that gives you the HTML view.
Comment on attachment 763692 [details] [diff] [review]
Part 2: Add HTML interface to render build resource usage

Review of attachment 763692 [details] [diff] [review]:
-----------------------------------------------------------------

::: python/mozbuild/mozbuild/html_build_viewer.py
@@ +21,5 @@
> +            self.send_header('Content-Type', 'application/json; charset=utf-8')
> +            self.end_headers()
> +
> +            keys = sorted(s.json_files.keys())
> +            json.dump(keys, self.wfile)

nit: application/json is supposed to have an object as the root, not a list.

@@ +76,5 @@
> +        self.send_header('Content-Type', ct)
> +        self.end_headers()
> +
> +        with open(local_path, 'rb') as fh:
> +            self.wfile.write(fh.read())

Can you not punt this to some stdlib class?

@@ +89,5 @@
> +
> +        self.doc_root = doc_root
> +        self.json_files = {}
> +
> +        self.server = BaseHTTPServer.HTTPServer((address, port), HTTPHandler)

Pass 0 as the port to get an arbitrary free port number. You can then use httpd.server_address to get the (real_address, port) tuple back.

@@ +101,5 @@
> +        self.json_files[key] = path
> +
> +    def run(self):
> +        while not self.do_shutdown:
> +            self.server.handle_request()

You don't seem to try to handle Ctrl+C anywhere, won't users wind up with a stack trace when they follow instructions? That doesn't seem very friendly.

::: python/mozbuild/mozbuild/mach_commands.py
@@ +379,5 @@
> +        description='Show information about system resource usage from a build.')
> +    @CommandArgument('--address', default='localhost',
> +        help='Address the HTTP server should listen on.')
> +    @CommandArgument('--port', type=int, default=12000,
> +        help='Port number the HTTP server should listen on.')

These feel unnecessary. Just use localhost and default to an open port.

::: python/mozbuild/mozbuild/resources/html-build-viewer/index.html
@@ +1,4 @@
> +<!-- This Source Code Form is subject to the terms of the Mozilla Public
> +   - License, v. 2.0. If a copy of the MPL was not distributed with this
> +   - file, You can obtain one at http://mozilla.org/MPL/2.0/. -->
> +<!DOCTYPE html>

I am not going to review this file. It's totally NPOTB anyway, so I don't think it matters.
Attachment #763692 - Flags: review?(ted) → review+
Depends on: 901601
I rebased the first patch against a recentish trunk (trivial, fortunately), and 30 seconds into the build, I get:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 808, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 761, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/mh/mozilla-central/testing/mozbase/mozprocess/mozprocess/processhandler.py", line 698, in _processOutput
    self.processOutputLine(line.rstrip())
  File "/home/mh/mozilla-central/testing/mozbase/mozprocess/mozprocess/processhandler.py", line 663, in processOutputLine
    handler(line)
  File "/home/mh/mozilla-central/python/mach/mach/mixin/process.py", line 86, in handleLine
    line_handler(line)
  File "/home/mh/mozilla-central/python/mozbuild/mozbuild/mach_commands.py", line 235, in on_line
    warning, state_changed, relevant = self.monitor.on_line(line)
  File "/home/mh/mozilla-central/python/mozbuild/mozbuild/controller/building.py", line 175, in on_line
    duration = self.resources.finish_phase(phase)
  File "/home/mh/mozilla-central/testing/mozbase/mozsystemmonitor/mozsystemmonitor/resourcemonitor.py", line 294, in finish_phase
    assert name in self._active_phases
AssertionError
(In reply to Gregory Szorc [:gps] from comment #11)
> Try
> https://hg.mozilla.org/users/gszorc_mozilla.com/gecko-patches/raw-file/
> default/build-resource-monitor

It works. I added the following to part 2 because my build machine doesn't have a browser:

diff --git a/python/mozbuild/mozbuild/mach_commands.py b/python/mozbuild/mozbuild/mach_commands.py
--- a/python/mozbuild/mozbuild/mach_commands.py
+++ b/python/mozbuild/mozbuild/mach_commands.py
@@ -407,17 +407,20 @@ class Build(MachCommandBase):
         if not os.path.exists(last):
             print('Build resources not available. If you have performed a '
                 'build and receive this message, the psutil Python package '
                 'likely failed to initialize properly.')
             return 1
 
         server = BuildViewerServer(address, port)
         server.add_resource_json_file('last', last)
-        webbrowser.get(browser).open('http://%s:%d/' % (address, port))
+        try:
+            webbrowser.get(browser).open('http://%s:%d/' % (address, port))
+        except Exception:
+            print('Point a browser at http://%s:%d/' % (address, port))
         print('Hit CTRL+c to stop server.')
         server.run()
 
     @Command('clobber', category='build',
         description='Clobber the tree (delete the object directory).')
     def clobber(self):
         try:
             self.remove_objdir()
Rebased to work with concurrent subtiers and directories. This involved
moving the tier tracking into its own class. I think the resulting code
is much cleaner.

As part of this, I also changed the format of the BUILDSTATUS messages
so downstream consumers have to keep less context.

https://tbpl.mozilla.org/?tree=Try&rev=019b7a75fba7
Attachment #763202 - Attachment is obsolete: true
Comment on attachment 790638 [details] [diff] [review]
Part 1: Record system resource usage of builds;

I haven't performed a polish pass on this patch yet. Throwing it up for feedback to get an idea of what you'd like to see before review/checkin.
Attachment #790638 - Flags: feedback?(mh+mozilla)
Comment on attachment 790638 [details] [diff] [review]
Part 1: Record system resource usage of builds;

Review of attachment 790638 [details] [diff] [review]:
-----------------------------------------------------------------

::: config/rules.mk
@@ +718,5 @@
> +# Tiers are traditionally composed of directories that are invoked either
> +# once (so-called "static" directories) or 3 times with the export, libs, and
> +# tools sub-tiers.
> +#
> +# If the TIER_$(tier)_CUSTOM variable is defined, traditional these traditional

"traditional these traditional"?

::: python/mozbuild/mozbuild/controller/building.py
@@ +47,5 @@
> +
> +    Tiers can have subtiers. Subtiers can execute sequentially or concurrently.
> +
> +    Subtiers can have directories. Directories can execute sequentially or in
> +    parallel.

Not sure why you used "concurrently" for subtiers and "in parallel" for directories; is there a distinction I'm missing?

@@ +124,5 @@
> +        if self.active_subtiers:
> +            st['concurrent'] = True
> +
> +        try:
> +            del self.active_dirs[subtier]

I'm confused to find this here

@@ +194,5 @@
> +        o = []
> +
> +        for tier, state in self.tiers.items():
> +
> +            t_entry = dict(

Nit: no newline before this
Comment on attachment 790638 [details] [diff] [review]
Part 1: Record system resource usage of builds;

Review of attachment 790638 [details] [diff] [review]:
-----------------------------------------------------------------

I haven't looked at the python code yet, but i figured i'd already send my first comments.

::: config/rules.mk
@@ +443,5 @@
>  
>  endef # The extra line is important here! don't delete it
>  
>  define TIER_DIR_SUBMAKE
> +@echo "BUILDSTATUS TIERDIR_START  $(1) $(2) $(3)"

I bitrotted you, but even after updating the patch, it doesn't work on current m-i :-/

@@ +726,5 @@
>  tier_$(1)::
> +ifdef TIER_$(1)_CUSTOM
> +	$$(foreach dir,$$($$@_dirs),$$(call SUBMAKE,,$$(dir)))
> +else
> +	@printf "BUILDSTATUS TIER_START $(1)"

I'd rather see the tier munging in a separate patch, with a cover letter, because i'm not sure what the goal is here, and what it has to do with resource usage monitoring.
Depends on: 905879
I've isolated the resource monitoring code. Let's see what happens with
code review :)

Requires patch in bug 905879 or it won't apply cleanly.

https://tbpl.mozilla.org/?tree=Try&rev=a0650f17b7bc
Attachment #791067 - Flags: review?(mh+mozilla)
Attachment #790638 - Attachment is obsolete: true
Attachment #790638 - Flags: feedback?(mh+mozilla)
Adding latest part 2. Ted was fine with not reviewing the HTML bits.
Attachment #791074 - Flags: review?(mh+mozilla)
Attachment #763692 - Attachment is obsolete: true
Comment on attachment 791067 [details] [diff] [review]
Part 1: Record system resource usage of builds;

Review of attachment 791067 [details] [diff] [review]:
-----------------------------------------------------------------

::: python/mozbuild/mozbuild/controller/building.py
@@ +208,5 @@
> +                duration=state['duration'],
> +                subtiers=[],
> +            )
> +
> +            add_resources_to_entry(t_entry, *resources(self._phase(tier)))

You're doing this three times, and I don't see why resources() and add_resources_to_entry() have to be two separate functions.

@@ +448,5 @@
> +            entry['cpu_percent'] = cpu_percent
> +            entry['cpu_times'] = list(cpu_times)
> +            entry['io'] = list(io)
> +
> +        cpu_percent, cpu_times, io = resources(None)

You're using resources() and add_resources_to_entry() only once here, I think you can inline them.
Attachment #791067 - Flags: review?(mh+mozilla) → review+
Comment on attachment 791074 [details] [diff] [review]
Part 2: Add HTML interface to render build resource usage

Review of attachment 791074 [details] [diff] [review]:
-----------------------------------------------------------------

Haven't looked at the HTML. Note it would be nice to be able to get useful information when hovering the graph instead of the timeline below it.

::: python/mozbuild/mozbuild/html_build_viewer.py
@@ +58,5 @@
> +        local_path = os.path.join(root, path)
> +
> +        # We don't normalize paths so there is technically a security
> +        # vulnerability for path traversal here. But, we should only run
> +        # locally, so we don't care.

... on the other hand, doing it takes about as many lines as the comment saying why you're not doing it...

::: python/mozbuild/mozbuild/mach_commands.py
@@ +436,5 @@
> +            return 1
> +
> +        server = BuildViewerServer(address, port)
> +        server.add_resource_json_file('last', last)
> +        webbrowser.get(browser).open_new_tab(server.url)

try:
    webbrowser.get(browser).open_new_tab(server.url)
except Exception:
    print('Point a browser at %s.' % server.url)
Attachment #791074 - Flags: review?(mh+mozilla) → review+
Okay, there's a big problem with part 1:

$ time ./mach build foo
 0:00.10 /usr/bin/make -j8 -s foo
 0:00.13 make: *** No rule to make target `foo'.  Stop.

real	0m1.301s
user	0m0.184s
sys	0m0.056s

$ time make -j8 -s foo -C ../build/obj-x86_64-unknown-linux-gnu/
make: *** No rule to make target `foo'.  Stop.

real	0m0.038s
user	0m0.032s
sys	0m0.004s

Without part 1 applied:
$ time ./mach build foo
 0:00.09 /usr/bin/make -j8 -s foo
 0:00.12 make: *** No rule to make target `foo'.  Stop.

real	0m0.296s
user	0m0.188s
sys	0m0.044s

This is concerning because it makes a < 1s rebuild an almost 2s one.
That 1s comes from the way mozsystemmonitor is implemented. Currently, there is a background process performing a sleep(poll_interval=1.0). When it wakes up, it checks a socket to see whether it needs to shut down. That is silly. It should poll the socket for poll_interval and detect the shutdown message immediately. I'm addressing this in bug 892342.

Perhaps it would be better to only enable resource monitoring on full tree builds until that improvement lands.
(In reply to Gregory Szorc [:gps] from comment #22)
> Perhaps it would be better to only enable resource monitoring on full tree
> builds until that improvement lands.

Agreed.
Blocks: 907297
Blocks: 901601
Blocks: 907331
https://hg.mozilla.org/mozilla-central/rev/cb14e3a5a7c8
https://hg.mozilla.org/mozilla-central/rev/cc7686d07acf
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla26
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.