Closed Bug 619561 Opened 14 years ago Closed 13 years ago

Add GC stats to Test Pilot

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: paul.biggar, Assigned: cdleary)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(2 files, 4 obsolete files)

We want to track simple stuff via Test Pilot or whatever the metrics stuff is. We don't know how all this works, so we should figure it out.

We want to track at least:

- timestamp before GC
- timestamp after marking
- timestamp after GC
I've been chatting to jono about this, and created a proposal with his help:

https://wiki.mozilla.org/Labs/Test_Pilot/Test_Proposals/GC

Comments welcome.

One particular need is to figure out exactly what data we want to collect. So far, I'm at:

 - the timestamp when GC starts
 - the timestamp when GC finishes its first phase (marking)
 - the timestamp when GC finishes its second phase (sweeping)
 - the timestamp when GC finishes completely
 - the size of the heap before
 - the size of the heap after

But I'd like others to suggest what is useful here. (Note that if we want to impact performance, we'll need to warn users, which will probably significantly decrease participation).
Depends on: 622638
It would be nice to collect all the data that we do when MOZ_GCTIMER is set (see the bottom of jsgcstats.cpp).

It would also be nice to know why the GC was triggered:
- per-compartment GC? (after bug 605662 lands)
- full GC?
- cycle collection?
- triggered via MaybeGC?

None of this data should be difficult to collect.
I think it would really help if we started with what questions we want to answer or what decisions we want to make based on the data. Off the top of my head, getting parameters that could feed into performance models would be really great, such as:

- Stats on conservative mark/conservative stack-scan time. The basic question is "How long does a typical stack scan take?" I don't know if we can do this easily, though.

- Allocation rates. E.g., how many strings do we allocate per second? More generally, we want to know the allocation count and the byte count for every allocation of interest since last GC (along with the time since last GC): objects, strings, functions, slot arrays, string character data, etc. That is probably too many numbers to do at once. I think "total headers" and "total arrays" would be good summary stats. This tells us how often we need to collect with various generation sizes.

- Total heap size, allocated heap size, mark time, and sweep time, most of which were previously mentioned, could also be useful. It would be interesting to correlate allocated heap size with mark time and total heap size with sweep time.
I had a chat with Jono and Jinghua about this.

In terms of how much data we can collect, we're constrained by an 8MB limit (with is the limit of POST requests). So far, they haven't collected any data that is remotely that large.

We may or may not be able to collect URLs. Although the privacy policy states that URLs can be gathered, there was a promise that they wouldn't be during the FF4 beta period. However, if we really need URLs, we may be able to ask the user if they'd like to send the URLs too (or a selection, perhaps for the ones with really long GC time, or something). We can certainly get hashes of the URLs, and it may be possible to find out just the domains (Jono will get back to me).

Test Pilot currently has about a million users, but that may drop once because it's not in FF4. However, there'll still be plenty enough users for us (more than 100K definitely().
Jono sent some resources for writing the test pilot study:

- Tutorial for developing an experiment: https://wiki.mozilla.org/Labs/Test_Pilot/Experiment_Tutorial

- API reference: https://wiki.mozilla.org/Labs/Test_Pilot/ExperimentAPI

- Source code for all existing studies can be found in hg: http://hg.mozilla.org/labs/testpilotweb/file/tip/testcases
e.g. the beta UI study code is at http://hg.mozilla.org/labs/testpilotweb/file/ef201f4a6572/testcases/beta/combined.js

FYI: I'll be on the type-inference side of things rather than GC, so I think this is as far as I'm going to go here. Whoever takes over please ask me anything that's unclear.
Component: JavaScript Engine → jemalloc
Target Milestone: --- → flash10
Component: jemalloc → JavaScript Engine
Target Milestone: flash10 → ---
Taking assignment.
Assignee: general → cdleary
Status: NEW → ASSIGNED
I have an initial prototype MQ repo at http://hg.mozilla.org/users/cleary_mozilla.com/gc-testpilot/ -- just have to make it GC-safe now. When we drop info from the circular GC-stats buffer we can just mark those database entries as tainted so that we can ignore them in post-processing. Ideally the circular buffer will be sufficiently large (or our test pilot timeout interval will be sufficiently small) that we very rarely drop any info.
Start off with some basic information for the first iteration of the study.
Attached file WIP: TestPilot study. (obsolete) —
Needs plotting functionality and testing on Windows.
I'm not really proud of the fact I poked a "give me the JS runtime" hole in CTypes here but it seemed like the most straightforward route to getting something out there. The GC trace method changes are to avoid cross-compartment mismatches (and assertions) when collecting ctypes objects under debug.
Attachment #520837 - Attachment is obsolete: true
Attachment #525894 - Flags: review?(wmccloskey)
Attached patch JS GC TestPilot study. (obsolete) — Splinter Review
No system metadata yet, will have to create an XPCOM component for that as a followup. We'll want to iterate on this study pretty rapidly (on the order of weeks) as we figure out what additional GC information would be useful.
Attachment #520838 - Attachment is obsolete: true
Attachment #525895 - Flags: review?
Attachment #525895 - Flags: review? → review?(jdicarlo)
Comment on attachment 525894 [details] [diff] [review]
JS engine hooks to support tespilot GC study.

Great! Just a few things.

I think you need more #ifdef JSGC_TESTPILOTs (in JSRuntime::init). Maybe also try compiling with MOZ_GCTIMER enabled (with --enable-gctimer) and JSGC_TESTPILOT disabled?

It would be nice if GCTimer had a more normal constructor. As far as I can tell, the memset is unnecessary. The remaining fields could be initialized using the ": field(xxx)" form.

It seems like when JS_PopGCInfo gets called, since it immediately increments gcInfoStart, it's possible for the info to get overwritten before the TestPilot code is done with it. Maybe have two calls: one to get it and one to increment?

The GetRuntime thing is a little ugly. Rather than passing in a random object to get the runtime, can't you just return cx->runtime? Also, it would be cool to put the new JSRuntime fields into their own object. The circular buffer code could be encapsulated inside that.

Finally, as long as you're there, could you fix the bracing in jsgcstats.cpp?
Comment on attachment 525895 [details] [diff] [review]
JS GC TestPilot study.

First of all, thanks a lot for writing this study!

Now on to the nit-picking:

In makeCTypeProps(), is var record supposed to be declared inside the loop?  that looks a little weird.  Are you trying to have an array of objects where each object only has a single property?  (And isn't var scoped to the function anyway, even if you use it inside a loop?)

Please change the name of ExampleStudyGlobalObserver to something more descriptive -- this is not the example study anymore, it's a real study now!

Dump() statements should be removed (or replaced with console.debug) before we push this to production, but keep them in for now as we test the code.

What operating systems is this intended to run on?  (It is possible to distribute a study only to certain OSes, if we ever want to do that.)  We need to make sure to test for crashes on all supported OSes.

Should GCObserver.uninstall() set alreadyInstalled back to false?

If createCTypes() fails, then this.setGCInfoEnabled  and this.getGCInfoEnabled will not be functions, which will cause errors why you try to call setGCInfoEnabled in GCObserver.uninstall().

I see you're trying to restore GCInfoEnabled to its original state when the study is done, which is good.  But if Firefox exits without GCObserver.uninstall() having been called (it can happen on a crash) then GCInfoEnabled won't be restored, and then what happens when you start back up again and GCObserver.install() calls getGCInfoEnabled again?  I don't know exactly how this works, but won't you lose the original setting in that case?  You might need to store the user's original setting in a pref the first time the study runs, just to make sure you don't lose it.

You should also write a doExperimentCleanup() method on your global observer object.  This gets called when the study is finally over (either it's finished or it's been canceled) - as opposed to uninstall() which gets called on Firefox shutdown even if the study is still running.  Your doExperimentCleanup() method might be the best place to restore the original value of GCInfoEnabled from the pref.

Why does your selfPingTimer callback call pingSelf() again?  You're passing a type code of 1, which is TYPE_REPEATING_SLACK according to https://developer.mozilla.org/en/nsITimer, to initWithCallback(), so that hould be enough to ensure that dumpGCInfos() is called approx. every 30 seconds; calling pingSelf() again should be unneccessary, and in fact might cause extra redundant timers to be created.

That's all I have to say from reading the code; I may have more comments after I start testing it.
Blocks: 505308
Revised per review comments. Builds with all gctimer/jsgctestpilot configure flag combos.
Attachment #525894 - Attachment is obsolete: true
Attachment #533839 - Flags: review?(wmccloskey)
Attachment #525894 - Flags: review?(wmccloskey)
Attached file JS GC TestPilot study.
Updated per review comments. (Some comments were added at the top address some of the questions!)
Attachment #525895 - Attachment is obsolete: true
Attachment #533841 - Flags: review?(jdicarlo)
Attachment #525895 - Flags: review?(jdicarlo)
Last time I checked GCTimer didn't compile on ARM. I don't know if this is fixed by now.
(In reply to comment #17)
> Last time I checked GCTimer didn't compile on ARM. I don't know if this is
> fixed by now.

Thanks for the heads up... I'll check it on the tegra 2 (and try) before I check it in.
Comment on attachment 533839 [details] [diff] [review]
JS engine hooks to support tespilot GC study.

Review of attachment 533839 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me.

::: js/src/ctypes/CTypes.cpp
@@ +5690,5 @@
> +
> +  jsval* argv = JS_ARGV(cx, vp);
> +  if (JSVAL_IS_PRIMITIVE(argv[0]) ||
> +      !CType::IsCType(cx, JSVAL_TO_OBJECT(argv[0]))) {
> +    JS_ReportError(cx, "second argument must be a CType");

I think you mean "first argument ..."

@@ +5697,5 @@
> +
> +  JSObject* targetType = JSVAL_TO_OBJECT(argv[0]);
> +  size_t targetSize;
> +  if (!CType::GetSafeSize(cx, targetType, &targetSize) ||
> +      targetSize > sizeof(void*)) {

Why isn't this a != test?

::: js/src/ctypes/CTypes.h
@@ +503,5 @@
>    bool IsCData(JSContext* cx, JSObject* obj);
>  
>    // Attached by JSAPI as the function 'ctypes.cast'
>    JSBool Cast(JSContext* cx, uintN argc, jsval* vp);
> +  // Attached by JSAPI as the function 'ctypes.runtime'

Isn't it getRuntime?
Attachment #533839 - Flags: review?(wmccloskey) → review+
Android and friends looked good on the try run, so I pushed to TM:

http://hg.mozilla.org/try/rev/e678548b2f5a
http://hg.mozilla.org/tracemonkey/rev/00ed83d09564

Marking as fixed-in-tracemonkey. I'll make a new bug to track the TestPilot study after Jono's comments on the new version of the study.
Whiteboard: fixed-in-tracemonkey
I've added the study, provisionally, to the index-dev channel, so that I
and the other internal users of this channel can try it out. It's in the
hg repo at:
http://hg.mozilla.org/labs/testpilotweb/file/c2448c803f6f/testcases/garbage-collection/gcstudy.js 

A couple more things, after reviewing the newest code.

experimentInfo should define a runOrNotFunc() that returns true if the study can be run (i.e. the instrumentation is available), false if not.  This is so the study doesn't show up in the UI for people who can't run it.

Also, before we can release, we need a thumbnail image (90x90px) to represent the study, and a URL for a landing page the user gets to when they click the appropriate link.

You have:

GCGlobalObserver.prototype.getStudyMetadata = function() {
    return [{droppedGCInfoCount: this.gcObserver.dropCount}];
};

but onExperimentShutdown will be called before getStudyMetadata, which means that this.gcObserver will already have been uninstalled when this is called.  You may have to work around this by storing dropCount in a preference or something.

(It's kind of a design flaw in the test pilot extension, to be honest.  Ed Lee discovered this problem when working on his study.  In a future version I'd like to make the extension call getStudyMetadata just before shutting down the experiment for the last time and cache that value, which will make the workaround unnecessary.)

Finally, debug() should switch over to using console.info() or console.warn() instead of dump().
When the Javascript Performance study finishes for me, and I click "more info", to open the study detail view window...
I get a blank window.

This is on Aurora, 6.0a2, on Mac OS X 10.6.7.

The error message in the Javascript console is:
Error: [Exception... "Could not convert JavaScript argument arg 0 [nsIDOMWindow.getComputedStyle]"  nsresult: "0x80570009 (NS_ERROR_XPC_BAD_CONVERT_JS)"  location: "JS frame :: chrome://testpilot/content/flot/jquery.js :: <TOP_LEVEL> :: line 808"  data: no]
Source File: chrome://testpilot/content/flot/jquery.js
Line: 808

Looking manually at the data that was collected for me, the events array is empty.

There's also an exception that occurs in getStudyMetadata:

TypeError: this.gcObserver is undefined

It seems like 1. this study should not be running on Mac at all, if the gcObserver is designed to work on Windows;
and 2. the code for WebContent.onPageLoad() needs to correctly handle the case where the data is empty.
The patches for this bug have landed on m-c, should it be resolved fixed?  (Nb: I've read comment 22 and comment 23, maybe they should be spun off into another bug?)
I'll file new bugs under Test Pilot.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Attachment #533841 - Flags: review?(jdicarlo)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: