Closed Bug 1399229 Opened 8 years ago Closed 8 years ago

able to hang firefox nightly while looking at nightly taskcluster graphs

Categories

(Firefox :: Untriaged, defect)

57 Branch
defect
Not set
normal

Tracking

()

RESOLVED INCOMPLETE
Performance Impact ?
Tracking Status
firefox57 --- affected

People

(Reporter: mozilla, Unassigned)

Details

(Keywords: perf, Whiteboard: [photon-perf][triage])

Attachments

(2 files)

I've seen this a number of times using Fx 57 Nightly. I filed a tools.tc.net issue [1], but ideally Nightly would be able to gracefully handle this without hanging. Because of the plea for bug filing, I'm filing now.

I see this while doing releaseduty, viewing live release graphs; I believe it's also reproducible with live nightly graphs. Finished nightly and release graphs are much less resource-intensive on the tools.tc.net page, so they don't hang Fx.

Fuzzy STR:

- Open one, preferably two live nightly or release taskcluster graphs.

  For nightly: I believe these trigger at 10:00 and 22:00 UTC (3am/3pm PDT). These will show up on the mozilla-central treeherder page [2] - click on the "Nd", and look for the line that looks like "Task: YobvQyJrR_OzWjGw3HTm_Q" in the lower left. Click on the link; that will bring you to the task view.

  For beta: we tend to run beta promotion graphs on Mondays and Thursdays. Join #releaseduty on IRC for more precise scheduling times. The person on releaseduty should be able to point you at the live graph link... these will partially show up on the mozilla-beta treeherder view.

- Jump between the "Task Group" view (click on the Task Group tab) and "Task Details" views (click on a task in the list). Clicking the checkbox for "notify me on task failures" may help trigger this.

For me, CPU usage for one of the content processes goes to over 100% and doesn't lower. I would think that with 4 content processes, some other tabs would be able to perform, but I've seen the entire browser hang. I've saved it by killing the content processes via commandline; I've also forced quit.

Expected behavior: ideally tools.tc.net wouldn't be able to hang a single content process, let alone the entire browser.

[1] https://github.com/taskcluster/taskcluster-tools/issues/271
[2] https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=nightly%20decision
Any chance you could get a profile with the gecko profiler? ( https://perf-html.io/ )
Component: General → Untriaged
Flags: needinfo?(aki)
Whiteboard: [qf][photon-perf][triage]
Here's a single task group view: https://perfht.ml/2wtpiLb Does that have enough info?
Flags: needinfo?(aki)
Hello Mike, can you please take a look at the Cleopatra profile provided in comment 2 and see if it helps find out what causes this issue?
Flags: needinfo?(mconley)
There's a few things that stand out to me in this profile:

- there's an add-on you're using that is causing regular hangs because it's observing all mutations. It looks like this interacts poorly with taskcluster. I *think* this is the "metacert check this" extension, is that possible? ( https://perfht.ml/2vY0ATX )
- at one point one of the content processes ends up doing CC for 660ms ( https://perfht.ml/2vYE65c )
- the taskcluster react app does some kind of recursive rebuild/update that hangs for a good 400ms (plus change for associated gc/cc). The stacks might mean more to you/taskcluster or JS folks than to me/other people. ( https://perfht.ml/2vYjREt )
Solid analysis, Gijs. Interestingly, those CC's that you're noticing are from a content process that's not looking at Taskcluster. It has some Pocket and Google (Calendar?) stuff running in it, but it's mostly idle; that's why it's CC'ing. The extension that's monitoring mutations is certainly contributing jank. I'm not seeing any major major hangs in here though. I _do_ see some kind of animation running on the main thread of the Taskcluster content process though, which might explain the high CPU usage.
Flags: needinfo?(mconley)
I see you're on macOS. Next time you hit this hang, would you be able to get a Process Sample from the hung content process? You can do this from the Activity Monitor.
Flags: needinfo?(aki)
(In reply to :Gijs from comment #4)
> There's a few things that stand out to me in this profile:
>
> - there's an add-on you're using that is causing regular hangs because it's
> observing all mutations. It looks like this interacts poorly with
> taskcluster. I *think* this is the "metacert check this" extension, is that
> possible? ( https://perfht.ml/2vY0ATX )

Yes. I have that addon installed. I can remove it, or keep it if we want to see if it's the culprit.

> - at one point one of the content processes ends up doing CC for 660ms (
> https://perfht.ml/2vYE65c )
> - the taskcluster react app does some kind of recursive rebuild/update that
> hangs for a good 400ms (plus change for associated gc/cc). The stacks might
> mean more to you/taskcluster or JS folks than to me/other people. (
> https://perfht.ml/2vYjREt )

(In reply to Mike Conley (:mconley) (:⚙️) from comment #5)
> Solid analysis, Gijs.
>
> Interestingly, those CC's that you're noticing are from a content process
> that's not looking at Taskcluster. It has some Pocket and Google (Calendar?)
> stuff running in it, but it's mostly idle; that's why it's CC'ing.

I had getpocket.com open in a tab, and 2 gmail and 1 gcal pinned tabs open.

> The extension that's monitoring mutations is certainly contributing jank.
>
> I'm not seeing any major major hangs in here though. I _do_ see some kind of
> animation running on the main thread of the Taskcluster content process
> though, which might explain the high CPU usage.

I *think* this is because taskcluster can only load 100 tasks per api call, and the page moves the colored bar proportions based on how many tasks are not-scheduled/pending/running/completed/failed/exception til it hits the full >3k.

(In reply to Mike Conley (:mconley) (:⚙️) from comment #6)
> I see you're on macOS. Next time you hit this hang, would you be able to get
> a Process Sample from the hung content process? You can do this from the
> Activity Monitor.

Ok. I'll try to trigger this and process sample.
(In reply to Aki Sasaki [:aki] from comment #7)
> (In reply to Mike Conley (:mconley) (:⚙️) from comment #5)
> > I'm not seeing any major major hangs in here though. I _do_ see some kind of
> > animation running on the main thread of the Taskcluster content process
> > though, which might explain the high CPU usage.
>
> I *think* this is because taskcluster can only load 100 tasks per api call,
> and the page moves the colored bar proportions based on how many tasks are
> not-scheduled/pending/running/completed/failed/exception til it hits the
> full >3k.

Might also have to do with the live log view.
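The loading pattern guessed at in the quoted comment can be sketched roughly as follows. This is an illustration under stated assumptions, not the taskcluster-tools code: `fetch_page`, the continuation-token scheme, and `fake_fetcher` are hypothetical stand-ins, and only the 100-tasks-per-call limit and the state names come from the comment.

```python
from collections import Counter
from typing import Callable, Dict, Iterator, List, Optional, Tuple

PAGE_SIZE = 100  # from the comment: at most 100 tasks per API call

# Task states as listed in the comment.
STATES = ("not-scheduled", "pending", "running", "completed", "failed", "exception")

# Hypothetical fetcher: takes a continuation token (None for the first page)
# and returns (states of the tasks on that page, next token or None).
FetchPage = Callable[[Optional[str]], Tuple[List[str], Optional[str]]]


def bar_proportions(fetch_page: FetchPage) -> Iterator[Dict[str, float]]:
    """Yield the status-bar proportions once per loaded page."""
    counts: Counter = Counter()
    token: Optional[str] = None
    while True:
        states, token = fetch_page(token)
        counts.update(states)
        total = sum(counts.values())
        # One update per page: this is the step that would redraw the bar.
        yield {s: (counts[s] / total if total else 0.0) for s in STATES}
        if token is None:
            break


def fake_fetcher(all_states: List[str]) -> FetchPage:
    """Stand-in for the real API (assumption): serves all_states in PAGE_SIZE chunks."""
    def fetch(token: Optional[str]) -> Tuple[List[str], Optional[str]]:
        start = int(token) if token else 0
        end = start + PAGE_SIZE
        next_token = str(end) if end < len(all_states) else None
        return all_states[start:end], next_token
    return fetch
```

With a graph of more than 3,000 tasks, a loop like this produces more than 30 successive bar updates, which would be consistent with the animation seen in the profile; nothing in this bug confirms that this is the cause.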
So far, I haven't been able to intentionally replicate this. This could be an improved tools.tc.net or an improved Nightly or both, or maybe I did something previously that I'm not doing now. I can keep an eye out for future hangs.
I still haven't been able to hang Firefox, or even get a content process at high cpu usage (yay!). I'll close and open new if/when I can replicate.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(aki)
Resolution: --- → WORKSFORME
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Attached file process_sample.txt
Process sample from content process that spikes over 100%. I haven't completely hung Fx nightly yet, but it's definitely slowed everything down.
Looks like a lot of garbage collection. Next time this starts to occur, if the browser is still responsive, can you get an about:memory report?
Flags: needinfo?(aki)
Attached file memory-report.json.gz
Here's a memory report from when I temporarily hung a content process today, on OSX 58.0a1 (2017-10-01) (64-bit). This happened on treeherder autoland, with some small taskcluster tabs loaded. Treeherder was often in my tab list when taskcluster hung Fx nightly, so it's possible it's related. The content process went to >100% cpu. Scrolling worked in the tab, but no content appeared above or below the current screen, just a white page. Clicking on the back/refresh buttons, force-refreshing, and link-clicking all did nothing in that tab. It appears to have self-corrected.
Flags: needinfo?(aki)
Hey erahm, anything from the memory report in comment 13 stand out to you? Anything that'd explain a lot of GC?
Flags: needinfo?(erahm)
(In reply to Mike Conley (:mconley) (:⚙️) - Backlogged on reviews from comment #14)
> Hey erahm, anything from the memory report in comment 13 stand out to you?
> Anything that'd explain a lot of GC?

Memory-wise everything looks reasonable, I did notice this:

> jetpack-extension@dashlane.com

Which is maybe just a naming thing, but we shouldn't support jetpack anymore right? So if this is somehow force-enabled that would be bad.

Also there's a detached window associated with what I think is the treestyle tabs extension:

> │ ├──0.75 MB (00.16%) ++ window(chrome://browser/content/webext-panels.xul?panel=moz-extension%3A%2F%2F151cff75-cd1e-9847-8a14-70189e2caf12%2Fsidebar%2Fsidebar.html&browser-style=1)

That might be fine, I'm not sure how that's supposed to show up.
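For anyone wanting to search a saved report such as memory-report.json.gz for entries like the ones quoted above, a minimal sketch follows. It assumes the file is gzipped JSON with a top-level "reports" list whose entries carry "process", "path" and "amount" fields; that layout is an assumption about the about:memory export format and is not stated in this bug. The helper names are hypothetical.

```python
import gzip
import json
from collections import defaultdict
from typing import Dict, List, Tuple


def load_reports(path: str) -> List[dict]:
    """Read a gzipped about:memory JSON export and return its report entries."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        data = json.load(fh)
    # Assumed layout: {"reports": [{"process": ..., "path": ..., "amount": ...}, ...]}
    return data.get("reports", [])


def find_paths(reports: List[dict], needle: str) -> Dict[str, List[Tuple[str, int]]]:
    """Group (path, amount) pairs whose path contains `needle`, keyed by process."""
    hits: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for entry in reports:
        entry_path = entry.get("path", "")
        if needle in entry_path:
            hits[entry.get("process", "?")].append((entry_path, entry.get("amount", 0)))
    return dict(hits)
```

For example, `find_paths(load_reports("memory-report.json.gz"), "webext-panels.xul")` would list any matching window entries per process, if the file follows the assumed layout.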
Flags: needinfo?(erahm)
Keywords: perf
My current symptoms may be related to bug 1369274. I seem to be hanging on treeherder the most; when I disable the dashlane extension it speeds back up. I initially hit these symptoms without Dashlane installed at all, but I'm not seeing them now. This may be due to fixes; it may also be because I haven't been dealing with loading as many large taskgraphs these days.
Okay, I'm going to close this one out again. Feel free to re-open if you see it again and we'll try to dig in more.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → INCOMPLETE
Performance Impact: --- → ?
Whiteboard: [qf][photon-perf][triage] → [photon-perf][triage]