Closed Bug 1135076 Opened 10 years ago Closed 10 years ago

Missing histograms in childPayloads

Categories

(Toolkit :: Telemetry, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla39
Tracking Status
firefox39 --- fixed

People

(Reporter: rvitillo, Assigned: rvitillo)

References

Details

Attachments

(1 file, 3 obsolete files)

[1] is a list of all histogram names present in childPayloads[0].histograms in the submissions received for Nightly on the 19/02. I am surprised that the content process has few histograms. Furthermore, when calling TelemetrySession.getPayload() from the console, sometimes I can see the childPayloads field and some other times I can't. But even when the field is not present, childPayloads is stored on disk on shutdown. [1]: u'A11Y_IATABLE_USAGE_FLAG', u'A11Y_INSTANTIATED_FLAG', u'A11Y_ISIMPLEDOM_USAGE_FLAG', u'CHARSET_OVERRIDE_USED', u'COMPONENTS_SHIM_ACCESSED_BY_CONTENT', u'CYCLE_COLLECTOR_OOM', u'CYCLE_COLLECTOR_WORKER_OOM', u'DECODER_INSTANTIATED_ARMSCII8', u'DECODER_INSTANTIATED_HZ', u'DECODER_INSTANTIATED_IBM866', u'DECODER_INSTANTIATED_ISO2022CN', u'DECODER_INSTANTIATED_ISO2022JP', u'DECODER_INSTANTIATED_ISO2022KR', u'DECODER_INSTANTIATED_ISO_8859_5', u'DECODER_INSTANTIATED_KOI8R', u'DECODER_INSTANTIATED_KOI8U', u'DECODER_INSTANTIATED_MACARABIC', u'DECODER_INSTANTIATED_MACCE', u'DECODER_INSTANTIATED_MACCROATIAN', u'DECODER_INSTANTIATED_MACCYRILLIC', u'DECODER_INSTANTIATED_MACDEVANAGARI', u'DECODER_INSTANTIATED_MACFARSI', u'DECODER_INSTANTIATED_MACGREEK', u'DECODER_INSTANTIATED_MACGUJARATI', u'DECODER_INSTANTIATED_MACGURMUKHI', u'DECODER_INSTANTIATED_MACHEBREW', u'DECODER_INSTANTIATED_MACICELANDIC', u'DECODER_INSTANTIATED_MACROMANIAN', u'DECODER_INSTANTIATED_MACTURKISH', u'DECODER_INSTANTIATED_VIETTCVN5712', u'DECODER_INSTANTIATED_VIETVPS', u'DECODER_INSTANTIATED_VISCII', u'DEVTOOLS_ANIMATIONINSPECTOR_OPENED_PER_USER_FLAG', u'DEVTOOLS_BROWSERCONSOLE_OPENED_PER_USER_FLAG', u'DEVTOOLS_CANVASDEBUGGER_OPENED_PER_USER_FLAG', u'DEVTOOLS_COMPUTEDVIEW_OPENED_PER_USER_FLAG', u'DEVTOOLS_CUSTOM_OPENED_PER_USER_FLAG', u'DEVTOOLS_DEVELOPERTOOLBAR_OPENED_PER_USER_FLAG', u'DEVTOOLS_EYEDROPPER_OPENED_PER_USER_FLAG', u'DEVTOOLS_FONTINSPECTOR_OPENED_PER_USER_FLAG', u'DEVTOOLS_INSPECTOR_OPENED_PER_USER_FLAG', u'DEVTOOLS_JSBROWSERDEBUGGER_OPENED_PER_USER_FLAG', u'DEVTOOLS_JSDEBUGGER_OPENED_PER_USER_FLAG', u'DEVTOOLS_JSPROFILER_OPENED_PER_USER_FLAG', u'DEVTOOLS_LAYOUTVIEW_OPENED_PER_USER_FLAG', u'DEVTOOLS_MENU_EYEDROPPER_OPENED_PER_USER_FLAG', u'DEVTOOLS_NETMONITOR_OPENED_PER_USER_FLAG', u'DEVTOOLS_OPTIONS_OPENED_PER_USER_FLAG', u'DEVTOOLS_PAINTFLASHING_OPENED_PER_USER_FLAG', u'DEVTOOLS_PICKER_EYEDROPPER_OPENED_PER_USER_FLAG', u'DEVTOOLS_RESPONSIVE_OPENED_PER_USER_FLAG', u'DEVTOOLS_RULEVIEW_OPENED_PER_USER_FLAG', u'DEVTOOLS_SCRATCHPAD_OPENED_PER_USER_FLAG', u'DEVTOOLS_SHADEREDITOR_OPENED_PER_USER_FLAG', u'DEVTOOLS_STORAGE_OPENED_PER_USER_FLAG', u'DEVTOOLS_STYLEEDITOR_OPENED_PER_USER_FLAG', u'DEVTOOLS_TILT_OPENED_PER_USER_FLAG', u'DEVTOOLS_TOOLBOX_OPENED_PER_USER_FLAG', u'DEVTOOLS_WEBAUDIOEDITOR_OPENED_PER_USER_FLAG', u'DEVTOOLS_WEBCONSOLE_OPENED_PER_USER_FLAG', u'DEVTOOLS_WEBIDE_OPENED_PER_USER_FLAG', u'ENABLE_PRIVILEGE_EVER_CALLED', u'GEOLOCATION_ERROR', u'IMAGE_DECODE_COUNT', u'IMAGE_MAX_DECODE_COUNT', u'MASTER_PASSWORD_ENABLED', u'PLUGIN_CALLED_DIRECTLY', u'SEARCH_SERVICE_US_COUNTRY_MISMATCHED_TIMEZONE', u'SEARCH_SERVICE_US_TIMEZONE_MISMATCHED_COUNTRY', u'SOCIAL_ENABLED_ON_SESSION', u'STARTUP_CACHE_INVALID', u'STARTUP_CRASH_DETECTED', u'STARTUP_STARTUP_CACHE_INVALID', u'STARTUP_XUL_CACHE_DISABLED', u'TELEMETRY_TEST_FLAG', u'TELEMETRY_TEST_RELEASE_OPTIN', u'TELEMETRY_TEST_RELEASE_OPTOUT', u'URL_PATH_CONTAINS_EXCLAMATION_DOUBLE_SLASH', u'URL_PATH_CONTAINS_EXCLAMATION_SLASH', u'URL_PATH_ENDS_IN_EXCLAMATION', u'WEBRTC_CALL_COUNT', u'XUL_CACHE_DISABLED'
Flags: needinfo?(gfritzsche)
Assignee: nobody → rvitillo
Looks like those histograms were the only ones present just after introducing childPayloads in Telemetry so it's not a regression.
Flags: needinfo?(gfritzsche)
Attached patch 1135076 (obsolete) — Splinter Review
Note that: 1) I replaced the weak reference to the message manager with a strong one as it was garbage collected and it caused to add a new child payload for the same child. 2) As discussed with Georg on IRC, I added a new method called getPayloadAndChilds which will be used in a future patch to show the child histograms in about:telemetry. It's also useful for debugging and testing purposes.
Attachment #8567973 - Flags: review?(gfritzsche)
Regarding 1), instead of holding a reference to a message manager, what's the best way to identify a child process?
Comment on attachment 8567973 [details] [diff] [review] 1135076 Review of attachment 8567973 [details] [diff] [review]: ----------------------------------------------------------------- How can e.g. about:telemetry retrieve the child payloads now? I think for that we will need also need a way to identify which specific request the received child payloads belong to (or whether they are just triggered by content shutdowns). Otherwise we will not be able to tell the payloads from each other in case we have multiple collections triggered. ::: toolkit/components/telemetry/TelemetrySession.jsm @@ +43,5 @@ > const PREF_FHR_UPLOAD_ENABLED = "datareporting.healthreport.uploadEnabled"; > const PREF_ASYNC_PLUGIN_INIT = "dom.ipc.plugins.asyncInit"; > > const MESSAGE_TELEMETRY_PAYLOAD = "Telemetry:Payload"; > +const MESSAGE_TELEMETRY_GET_PAYLOAD = "Telemetry:Payload:Get"; "Telemetry:Payload" seems a little generic now that we send more messages. How about something like: * "Telemetry:CollectPayload" * "Telemetry:RequestPayload" @@ +1053,5 @@ > case MESSAGE_TELEMETRY_PAYLOAD: > { > let target = message.target; > for (let child of this._childTelemetry) { > + if (child.source === target) { What is this change for? @@ +1061,5 @@ > } > } > // Did not find existing child in this._childTelemetry. > this._childTelemetry.push({ > + source: target, What is this change for? @@ +1123,5 @@ > this.gatherMemory(); > return this.getSessionPayload("gather-payload"); > }, > > + getPayloadAndChilds: function getPayloadAndChilds() { We historically used "foo: function foo() ..." due to debugger limitations. Now we can just use "foo: function() ..."
Attachment #8567973 - Flags: review?(gfritzsche)
(In reply to rvitillo@mozilla.com from comment #2) > 1) I replaced the weak reference to the message manager with a strong one as > it was garbage collected and it caused to add a new child payload for the > same child. (In reply to rvitillo@mozilla.com from comment #3) > Regarding 1), instead of holding a reference to a message manager, what's > the best way to identify a child process? Sorry, i missed that. What do you need to identify the specific process for?
(In reply to Georg Fritzsche [:gfritzsche] from comment #5) > (In reply to rvitillo@mozilla.com from comment #2) > > 1) I replaced the weak reference to the message manager with a strong one as > > it was garbage collected and it caused to add a new child payload for the > > same child. > > (In reply to rvitillo@mozilla.com from comment #3) > > Regarding 1), instead of holding a reference to a message manager, what's > > the best way to identify a child process? > > Sorry, i missed that. What do you need to identify the specific process for? Each child has its own payload. When the parent process receives a payload from a child it overwrites its former copy, if it exists.
(In reply to rvitillo@mozilla.com from comment #6) > (In reply to Georg Fritzsche [:gfritzsche] from comment #5) > > (In reply to rvitillo@mozilla.com from comment #2) > > > 1) I replaced the weak reference to the message manager with a strong one as > > > it was garbage collected and it caused to add a new child payload for the > > > same child. > > > > (In reply to rvitillo@mozilla.com from comment #3) > > > Regarding 1), instead of holding a reference to a message manager, what's > > > the best way to identify a child process? > > > > Sorry, i missed that. What do you need to identify the specific process for? > > Each child has its own payload. When the parent process receives a payload > from a child it overwrites its former copy, if it exists. Ah, i was a little quick. Yes, we definitely should not hold on to the message manager longer. mconley, do we have any process id or something similar that we could use to identify content processes (instead of message manager objects)?
Flags: needinfo?(mconley)
So I don't fully understand the problem space here - it's possible there might be a way of accomplishing what you're trying to do with a process-level script (see 1133594), so that there's only ever one thing responsible for sending up Telemetry information from a child, so that it reduces the need to disambiguate..maybe? Alternatively, if you're stuck comparing browsers, you could have them each pass up their child process message managers as CPOWs, and compare the two of them, but that's pretty hacky.
Flags: needinfo?(mconley)
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #8) > Alternatively, if you're stuck comparing browsers, you could have them each > pass up their child process message managers as CPOWs, and compare the two > of them, but that's pretty hacky. You could also just compare msg.target (which is a per-process message manager) for messages from different child process message managers. However, it seems wrong to me to store the histograms for different content processes separately. Is there a way we could combine them? When we have one process per tab, we could end up with hundreds of "payloads" in a single session.
(In reply to Bill McCloskey (:billm) from comment #9) > (In reply to Mike Conley (:mconley) - Needinfo me! from comment #8) > > Alternatively, if you're stuck comparing browsers, you could have them each > > pass up their child process message managers as CPOWs, and compare the two > > of them, but that's pretty hacky. > > You could also just compare msg.target (which is a per-process message > manager) for messages from different child process message managers. That's what we are doing here but it seems hat a child message manager can be garbage collected while the child process is still running and I would like to avoid to hold a strong reference to it. > However, it seems wrong to me to store the histograms for different content > processes separately. Is there a way we could combine them? When we have one > process per tab, we could end up with hundreds of "payloads" in a single > session. Agreed; that's something we are going to address in the future, after the Telemetry/FHR unification.
Attached patch 1135076 (obsolete) — Splinter Review
Attachment #8567973 - Attachment is obsolete: true
Bill, regarding Comment 10, would it be acceptable to hold a strong reference to the child message managers? We are going to change this code in the future to support multiple content processes but that will probably require some rather heavy changes to Telemetry. This is a temporary fix that allows us to collect and analyze Telemetry data for e10s.
Flags: needinfo?(wmccloskey)
(In reply to rvitillo@mozilla.com from comment #12) > Bill, regarding Comment 10, would it be acceptable to hold a strong > reference to the child message managers? We are going to change this code in > the future to support multiple content processes but that will probably > require some rather heavy changes to Telemetry. This is a temporary fix that > allows us to collect and analyze Telemetry data for e10s. Unfortunately, it's not really okay. Although we have only one content process at a time, we can have many content processes over time. One way to see this is to load about:newtab (a non-remote tab that runs in the parent process) in one tab and then repeatedly open and close a tab for mozilla.org. Each mozilla.org tab will get a new process since the process it uses will exit when the tab closes and a new one will start the next time you load mozilla.org. In that situation we would get many telemetry payloads, one for each process created. That seems pretty bad, since each payload is probably pretty big. We could try to keep around content processes longer, but it might be kind of complicated. This code is shared with b2g and we definitely don't want to keep processes around on b2g. I'd rather we just fix this the right way here and figure out a way to merge the histograms together or something. If you're looking for a really simple alternative, perhaps we could just keep the most recent payload?
Flags: needinfo?(wmccloskey)
Don't we have any unique identifier per process that survives it (e.g. process counters etc.)?
Yes, you could use the process ID, but the more fundamental problem is that we don't want to have a ton of child payloads clogging up the telemetry ping.
Apparently the process id could get reused? AFAIK the plan is to fix this quickly now and improve later. We are not expecting many child payloads at this point yet, right?
Please read comment 13. It's very easy to accumulate a lot of child payloads over time, even with only one content process. If we need a quick fix, we could throw away all but the most recent payload.
Sorry, running a little late here. If we discard all but the most recent we would probably lose a lot of data when we go through multiple content processes.
(In reply to Georg Fritzsche [:gfritzsche] [away Feb 27 - March 15] from comment #18) > Sorry, running a little late here. > If we discard all but the most recent we would probably lose a lot of data > when we go through multiple content processes. Agreed, that doesn't seem to be a good option.
Given comment 13 I feel like we should merge the payloads in the parent process as they arrive; simple workarounds don't seem to be satisfactory. Vladan, do you agree with the proposal or do you see another option?
Flags: needinfo?(vdjeric)
(In reply to rvitillo@mozilla.com from comment #20) > Given comment 13 I feel like we should merge the payloads in the parent > process as they arrive; simple workarounds don't seem to be satisfactory. > Vladan, do you agree with the proposal or do you see another option? The content process Telemetry is not a separate Telemetry ping, instead content process telemetry just gets appended to the ping.childPayloads array in the parent's Telemetry process ping. The JSON for a minimal content-process Telemetry payload is 13KB (almost all histograms), the JS object itself is about 4x the size (~50KB). Bill, would it be very rare for a session to have more than 20 short-lived processes? (is there a probe for this? - If it's very rare, I'd prefer we accept the memory penalty in the short term. In the long-term, I think the right solution would be to have the child process stick around for longer. - If we think dozens of content processes per session are not going to be uncommon, then I'd prefer we report only child payloads with non-trivial uptimes (e.g uptime > 2 mins) and just report the existence of the other missing child processes, e.g. childPayloads: [ {}, {}, {<child #3 telemetry}, {}, {<child #5 telemetry} ]
Flags: needinfo?(vdjeric) → needinfo?(wmccloskey)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #21) > Bill, would it be very rare for a session to have more than 20 short-lived > processes? (is there a probe for this? I don't know. I don't think we have a probe for this. It shouldn't be too hard to add. > - If it's very rare, I'd prefer we accept the memory penalty in the short > term. In the long-term, I think the right solution would be to have the > child process stick around for longer. > - If we think dozens of content processes per session are not going to be > uncommon, then I'd prefer we report only child payloads with non-trivial > uptimes (e.g uptime > 2 mins) and just report the existence of the other > missing child processes, e.g. childPayloads: [ {}, {}, {<child #3 > telemetry}, {}, {<child #5 telemetry} ] What would we do with this data even if we had it? It's hard for me to imagine asking a question where it would be useful to have the children separated out like this (aside from a few questions like "how many children were there and how long did they live for", which can be answered more simply in the parent). So it seems that we're almost certainly just going to merge all the data on the server anyway. If so, why not just do it on the client and save space?
Flags: needinfo?(wmccloskey)
I'm not opposed to merging content payloads (especially since there's only ever one content process right now) but I think the merging is going to be messy. Histograms, keyed histograms, simple measurements, logs, BHR and chrome hang sections would have to be merged.
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #23) > I'm not opposed to merging content payloads (especially since there's only > ever one content process right now) but I think the merging is going to be > messy. Histograms, keyed histograms, simple measurements, logs, BHR and > chrome hang sections would have to be merged. I guess my point is that, one way or another, we're going to have to write that code before this data is useful to us. Perhaps we could just start with regular histograms and people could do more as needed.
FWIW merging the histograms and keyedHistograms sections shouldn't be too painful. In fact our histogram implementation already supports merging, see https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/histogram.cc?from=histogram.cc&case=true#767.
The content process telemetry is *currently* part of the main ping, because that was expedient. But I don't think that's necessarily the best decision now that we have the ability to send different ping types. Is there a reason we can't send a content-process telemetry ping for each content process separate from the main session ping?
The first thing I would do for most analyses is to merge the main ping and the content pings back in one logical entity, as I don't see them as independent and it's convenient to have them together. On the other hand if content processes might be associated with some unique traits in the future, then I can see the benefit of not merging the payloads. I would still like to have the parent and child payloads in one single entity for analysis purposes though. Also, once we have one process per tab the amount of pings generated might become prohibitive, maybe Mark can comment on that.
Flags: needinfo?(mreid)
That really depends on the analysis. Certainly for many analyses we're going to need to treat content and chrome separately: most jank/startup/shutdown metrics need to analyze them separately. So I don't think that merging them should be an obvious default.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #28) > That really depends on the analysis. Certainly for many analyses we're going > to need to treat content and chrome separately: most jank/startup/shutdown > metrics need to analyze them separately. So I don't think that merging them > should be an obvious default. What I mean with merging in my first paragraph is to rebuild a single entity that contains both the parent payload and the merged child payloads. There is no argument about the benefits of having separate parent and child payloads.
I don't really like the idea of putting the burden of assembling possibly many content payloads on the server. It will make analysis more expensive and complex. I may be biased by my own usage pattern - I typically have hundreds of tabs open, which would mean a lot of server-side assembly work (at least, in the future if/when there is a process per tab). Keeping the content and chrome info logically separate sounds fine, but it seems you'd need both to get a full picture of the browser's behaviour when doing analysis, so if possible we should keep them together. Are there cases where separate submissions per content process would be advantageous?
Flags: needinfo?(mreid)
Flags: needinfo?(vdjeric)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #21) > Bill, would it be very rare for a session to have more than 20 short-lived > processes? (is there a probe for this? I had a look at the length of childPayloads in our nightly data: - the 99th percentile is 3 - about 3 every 10000 submissions has more than 20 childs - the maximum number of childs I saw in a ping was about 400
From a simplicity perspective, I really don't like having nested data structures in the one main ping. In terms of server aggregation, I don't see why it would matter: either we're assembling subsessions by clientID, in which case separate pings for content processes wouldn't make it any worse, or we're analyzing on the fly, where reassembling isn't necessary anyway.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #32) > In terms of server aggregation, I don't see > why it would matter: either we're assembling subsessions by clientID, in > which case separate pings for content processes wouldn't make it any worse, I am still not sure if our infrastructure could scale to support individual submissions per content process once we have one process per tab. Can you confirm that this is not an issue?
Flags: needinfo?(benjamin)
A. Wrt deciding whether we store & submit content payloads separately or together with the parent payload, I think we should continue to nest the content payload inside the parent payload: - When we sandbox the content process, the content process won't be able to access important environment information like changed prefs, gfx settings, installed addons, etc. We can work around this by sending this info down from the parent process, but it just makes things harder - I don't get the aversion to having nested data structures in the main ping. I guess if some batch job or streaming job is only interested in the parent process data, it'll have to do extra work to parse the content data inside the joint ping, but it doesn't seem like a big deal right now. - We don't have to decide on this question right now, so I think it's easiest to continue to report content+parent process data in a single payload, and we can revisit it after we gain experience with the new backend B. With regards to reporting Telemetry from multiple content processes in a single session, I think we should report each content payload separately for now (inside parentPing.childPayloads[]) but cap it to some reasonable maximum (e.g. max 10 most-recent content payloads) - With this approach, we don't lose any data or any *precision* 99% of the time, but memory & storage overheads are capped - Once we have more experience analyzing e10s content pings, we can decide whether it's ok to merge the content Telemetry or not - Given that only Nightly has e10s enabled, and given how few users are on Nightly, the bigger pings won't cause extra strain on the backend - The future backend analyses that need to merge the content paylaods can do it themselves, they'll only need to do it for one or two measurements and we can add code for merging histograms etc in the Spark Telemetry library - The only drawback I can see with this approach is that we'll still have to merge flag histograms from the content processes when aggregating for dashboards With that said, I'm not strongly opposed to merging on the client-side, so if other people feel strongly about it, I'm ok with merging instead. The merging techniques could be reused for merging subsession pings into full-session pings on the server-side.
Flags: needinfo?(vdjeric)
Attached patch 1135076 (obsolete) — Splinter Review
I agree on postponing the decision of merging the content payloads to after we gained some insights analyzing e10s data.
Attachment #8569226 - Attachment is obsolete: true
Attachment #8572593 - Flags: review?(vdjeric)
Flags: needinfo?(benjamin)
Comment on attachment 8572593 [details] [diff] [review] 1135076 Review of attachment 8572593 [details] [diff] [review]: ----------------------------------------------------------------- ::: toolkit/components/telemetry/Telemetry.cpp @@ +1804,5 @@ > } > > TelemetryImpl::TelemetryImpl(): > mHistogramMap(Telemetry::HistogramCount), > +mCanRecord(true), Shouldn't mRecord initialize to true only for GeckoProcessType_Default and GeckoProcessType_Content? The other process types won't run TelemetryPing.jsm so canRecord will never be set to false in those processes (e.g. plugin-container process) http://hg.mozilla.org/mozilla-central/annotate/a27dd304348d/xpcom/build/nsXULAppAPI.h#l352 ::: toolkit/components/telemetry/TelemetrySession.jsm @@ +138,5 @@ > // strip {} > return str.substring(1, str.length - 1); > } > > +let getProcessUUID = function(){ this seems a little funky, why not just add "_processGUID: generateUUID()," to Impl? @@ +373,5 @@ > /** > + * Asks the content processes to send their payloads. > + * @returns Object > + */ > + requestChildPayloads: function() { File a bug for adding calls to requestChildPayloads, CC Alessio + mark it blocker of 1069869 @@ +1251,5 @@ > payload: message.data, > }); > + > + if (this._childTelemetry.length == MAX_NUM_CONTENT_PAYLOADS + 1) { > + this._childTelemetry.shift(); report the # of payloads we dropped
Attachment #8572593 - Flags: review?(vdjeric) → review+
Blocks: 1069869
Attached patch 1135076Splinter Review
Attachment #8572593 - Attachment is obsolete: true
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Whiteboard: [fixed-in-fx-team]
Target Milestone: --- → mozilla39
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: