Closed Bug 1124728 Opened 10 years ago Closed 10 years ago

All talos on OSX fails with running in e10s mode

Categories

(Testing :: Talos, defect)

x86_64
macOS
defect
Not set
normal

Tracking

(e10sm7+, firefox38 wontfix, firefox38.0.5 wontfix, firefox39 wontfix, firefox40 wontfix, firefox41 fixed, firefox-esr31 unaffected, firefox-esr38 wontfix)

RESOLVED FIXED
mozilla41
Tracking Status
e10s m7+ ---
firefox38 --- wontfix
firefox38.0.5 --- wontfix
firefox39 --- wontfix
firefox40 --- wontfix
firefox41 --- fixed
firefox-esr31 --- unaffected
firefox-esr38 --- wontfix

People

(Reporter: RyanVM, Assigned: mconley)

References

(Depends on 1 open bug, Blocks 1 open bug, )

Details

(Keywords: intermittent-failure)

Attachments

(5 files, 5 obsolete files)

21.92 KB, text/plain
Details
8.54 KB, patch
jmaher
: review+
Details | Diff | Splinter Review
3.61 KB, patch
jmaher
: review+
Details | Diff | Splinter Review
4.35 KB, patch
wlach
: review+
Details | Diff | Splinter Review
12.95 KB, application/zip
Details
Per Joel's request on IRC, I'm hiding them for now.
Also, talos-other on e10s is still being seen as a non-e10s job, so I had to hide T(o) on 10.6 as well :(
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #4) > Also, talos-other on e10s is still being seen as a non-e10s job, so I had to > hide T(o) on 10.6 as well :( It wasn't listed in the set of jobs in bug 1121003.
:billm, do you know why talos runs on osx 10.6 would be failing when we try to run in e10s mode? We can run talos with the same code in e10s mode for linux/windows, and mochitests for e10s run on 10.6.
Flags: needinfo?(wmccloskey)
I'm not sure I understand what's going on here. Is this about running Talos with multiple processes, or is it about using the message manager more in Talos code while still running in a single process? Either way I'm afraid I can't think of any reason why we'd time out. It looks like Firefox isn't even starting up?
Flags: needinfo?(wmccloskey)
this is about running with browser.tabs.remote = true as we do for linux and windows. Right now this isn't working at all for talos on osx. I suspect we are starting up, but not loading the webpage or having issues printing info and quitting the browser.
Requesting re-triage, releng would like someone to look at why osx is having problems getting tests launched and running with e10s.
Assignee: nobody → davidp99
Joel, In looking over the build log, gcc is missing from the build machine, causing the build of psutil to fail. jimm suggested that you might have an idea what happened there. Do you know why gcc isn't being found on the OS 10.6 machines and if its needed? Do we need to change the scripts to use llvm or something?
Flags: needinfo?(jmaher)
these are run on test slaves (not build machines) so we don't have compilers installed. In fact this psutil error is present on all flavors (look at passing linux machines). We install psutil as a python module which happens to include some native library components. In appears this is and has been known for a while in bug 917346, and bug 893254. Are you looking into why these don't run at all? I know :avih tried running Talos locally on osx 10.10 and it failed to run with using the --e10s switch (hung at the warmup window, just like in automation). Here are some instructions for running it locally: https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code
Flags: needinfo?(jmaher)
NIing myself to update this
Flags: needinfo?(davidp99)
This looks to be related to the teardown of nsAppService and friends -- the destruction process is very involved and has some mac-specific (#ifdef) behavior. An early look seems to suggest that the teardown process is confusing the nsAppService instance in the main process and the one in the content process -- the Quit procedure operates as a state machine transition (it is re-entrant and called multiple times) but the evolution of the states must be done for both main and content. The process involving Quit is conflating the states of the two instances, leading to a place where they don't make the expected transitions and are therefore never actually destroyed.
Flags: needinfo?(davidp99)
oh great, thanks for looking into this. Do feel free to ask any questions about talos!
David, any idea how we can fix this? Either by modifying firefox, or talos, or both, or using prefs to overcome this, etc? What would be a viable approach here, and who has the knowledge to handle it? The fact is that we have talos mostly running on e10s firefox everywhere except OS X, and the lack of performance data on this platform might be hiding performance issues which no one is able to detect or track due to this.
Flags: needinfo?(davidp99)
We're going to try to get to this between m5 and m6.
Nominating for e10s triage as per jimm's comment. We'll address the milestone this week.
Flags: needinfo?(davidp99)
Assignee: davidp99 → mconley
Blocks: 1050706
Blocks: 1144120
Ping?
Flags: needinfo?(mconley)
Hey Avi, This bug is milestoned for M7. We're about halfway through M6, and then we'll start to address the M7's.
Flags: needinfo?(mconley)
Would it be possible for the robot to not post so frequently? This is perma-fail right now (and always has been), and we know it. This bug was opened to handle the issue, but the frequent posts from the robot will make people un-CC themselves from this bug.
They aren't even stars for the correct bug. This bug is specifically about Talos e10s on 10.6, not the widespread timeouts we're hitting on the non-e10s jobs. But yes, we can hide the jobs and stop starring them.
Flags: needinfo?(cbook)
I have an OSX loaner and ran talos on it. Talos appears to work correctly except that the process doesn't fully terminate (despite the window closing). Oddly enough a 'kill -9 <pid>', results in continuing the test and the test finishes (closes down itself and the process just fine). could it be that we are closing the browser too fast?
hmm, waiting to close the browser doesn't necessarily help, in fact clicking the 'x' doesn't close the browser. is there something about an uninitialized profile that could be causing this behaviour on osx and e10s?
The browser not closing if you click the X - that can happen if the content process is hung. Does the content process list as Not Responding in the Activity Monitor when in that state?
in activity monitor, I don't see any indication of 'not responding' or other stuff. The cpu activity seems to be changing, so it is doing something.
on a non e10s run, I see a javascript error for resource:///modules/WebappManager.jsm, line 48 NS_ERROR_FAILURE...nsIObserverService.removeObserver
A stack of the busy content process main thread after trying to close the window would be useful.
any tips for how to do that?
Assuming you can get into the state reliably, open up OS X's Activity Manager, find the Nightly Web Content process, and click "Sample Process". In the window that comes up, copy and paste it to a text file, and attach to this bug. Thanks!
and we have a sample! Thanks for helping guide me through some of this.
Attachment #8592404 - Flags: feedback?(mconley)
Looks like the process is garbage / cycle collecting. What happens if you take a sample after a minute or so?
Attachment #8592404 - Attachment is obsolete: true
Attachment #8592404 - Flags: feedback?(mconley)
Comment on attachment 8592406 [details] talos + e10s + osx 10.6 + activity monitor + a few minutes idle time So this is an interesting one. The content process appears to just be sitting there spinning the event loop. I suspect this is the one we need to solve.
Attachment #8592406 - Flags: feedback?(mconley)
Depends on: 1154434
Can I assume that, since we're not running 10.6 Talos jobs anymore, this bug is WONTFIX?
Flags: needinfo?(jmaher)
maybe, I believe the same failure happens on osx 10.10. I need to get the tests scheduled- if you have time to work on it soon, I can get the tests scheduled.
Flags: needinfo?(jmaher)
Flags: needinfo?(cbook)
(In reply to Joel Maher (:jmaher) from comment #243) > maybe, I believe the same failure happens on osx 10.10. I need to get the > tests scheduled- if you have time to work on it soon, I can get the tests > scheduled. Yeah, we're just coming around to M7 now, which is where this bug is milestoned at. I can look at it on Monday.
As far as I understand/know, the title should probably be changed to "All talos e10s tests always hang on all OS X versions" Joel, Is that correct?
Flags: needinfo?(jmaher)
that is correct
Flags: needinfo?(jmaher)
Summary: Frequent OSX 10.6 T-e10s "timed out after 3600 seconds of no output" → All talos on OSX fails with running in e10s mode
Ok, I can reproduce this hang on my local 10.6 machine. Investigating...
I think I've figured this out. We have this getInfo.html page that loads before some (all?) tests are run in order to get the screen dimensions and browser dimensions, which it then dumps into a log that (I assume) Talos consumes to make decisions on how tests should operate under those conditions. One of the last things that getInfo.html does is call goQuitApplication(); and window.close();. goQuitApplication() is defined in talos/page_load_test/quit.js. goQuitApplication() uses special privileges and calls nsIAppStartup::Quit passing in nsIAppStartup.eForceQuit. That getInfo.html page is being loaded in the content process when the --e10s flag is set. When the content process calls goQuitApplication(), the content process successfully shuts down, which effectively closes the window. On Windows and Linux, this closing of the last window results in the parent process also shutting down. On OS X, however, a process still runs even if all of the windows are closed. You can reproduce this in Firefox today - just close all of your windows by pressing the little red button in the top-left corner on OS X. The Firefox process will keep running, and Firefox will be available in the dock. When you click on the dock icon, a new Firefox window opens. So that's why this is happening - forceQuit is happening in the content process and not the parent, and the parent can keep running without any windows. This also means that when talos tests do run (when I manually quit the getInfo.html process), that Talos is killing the parent process after each test because it hits the time out, since (again) the parent process is just hanging out since the quit didn't occur properly. What I suggest is that we have getInfo.html (and any other content that tries to make use of quit) send a message up to the parent to cause the quit. jmaher - where would be a good place to put a message handler for that quit message? Ideally it'd be in an add-on somewhere... is pageloader the right place to put that kind of general Talos utility for quitting?
Flags: needinfo?(jmaher)
great question and thanks for taking a look at this! Ok, so we have a slight difference in the OS. this is probably why when we actually run the tests after manually killing the process running getinfo.html we continue on. pageloader is not present in all tests, for example tpaint, ts_paint, sessionrestore*, tresize, and media_tests. we could consider adding pageloader to them, or creating a really small addon that handles quit functionality. Alternatively we could just have the addon for the getinfo as well if we don't want to affect the other test numbers. Maybe we can chat sometime in the next 2 days about it since I will be in your home office. This might be easier to brainstorm that way.
Flags: needinfo?(jmaher)
So, having talked with jmaher in person, the plan at this time is to create a very small default add-on to be installed in the browser before any talos test that can add the listeners for a quit message from content. It'll have to be very small and light in order to minimize noise in the talos results.
Comment on attachment 8602292 [details] [diff] [review] [WIP] Add talos-powers default add-on to allow talos tests to request force quit. r=? So this gets things moving for me, locally. Joel, can you test this to see how much noise, if any, this would add to our talos numbers?
Attachment #8602292 - Flags: feedback?(jmaher)
Comment on attachment 8602292 [details] [diff] [review] [WIP] Add talos-powers default add-on to allow talos tests to request force quit. r=? Review of attachment 8602292 [details] [diff] [review]: ----------------------------------------------------------------- thanks for this: a try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=beb62e8bce0d compare view: https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=98d3519ab098&newProject=try&newRevision=beb62e8bce0d ^ here I looked at the stddev and specifically the % with it, then looked in details on anything that matches new_stdev% >= 2*old_stddev% a few issues: session restore no auto - win8 <- the graph view seems to have data not in any range session restore* <- data is showing way out of the normal range. code wise, this patch is good!
Attachment #8602292 - Flags: feedback?(jmaher) → feedback+
in addition, this patch needs to add entries to talos/mtio-whitelist.json; TEST-UNEXPECTED-FAIL | mainthreadio | File '{talos}-powers@mozilla.org\plugins' was accessed and we were not expecting it: {'Count': 1, 'Duration': 0.029184, 'RunCount': 1} TEST-UNEXPECTED-FAIL | mainthreadio | File '{talos}-powers@mozilla.org' was accessed and we were not expecting it: {'Count': 3, 'Duration': 0.10636800000000002, 'RunCount': 3} TEST-UNEXPECTED-FAIL : xperf: File '{profile}\extensions\talos-powers@mozilla.org\chrome\talos-powers-content.js' was accessed and we were not expecting it. DiskReadCount: 2, DiskWriteCount: 0, DiskReadBytes: 184, DiskWriteBytes: 0 TEST-UNEXPECTED-FAIL : xperf: File '{profile}\extensions\talos-powers@mozilla.org\chrome.manifest' was accessed and we were not expecting it. DiskReadCount: 2, DiskWriteCount: 0, DiskReadBytes: 248, DiskWriteBytes: 0 we just need to add these specific files to the mtio-whitelist.json file, nothing too crazy.
Attachment #8602292 - Attachment is obsolete: true
Attachment #8602910 - Attachment is obsolete: true
Comment on attachment 8602911 [details] [diff] [review] [WIP] Add talos-powers default add-on to allow talos tests to request force quit. r=? Can we try this instead to see if the Session Store test numbers settle? (I've tried to move all of the injection work to after the windows have been restored). Also, instead of me handing this off to you every time (since we might iterate on this a bit), can you let me know how to test this?
Attachment #8602911 - Flags: feedback?(jmaher)
awesome here, I will have some results for this in the next 30 minutes!
oh, the session restore stuff is awesome with this latest fix. Lets add the xperf bits and I can review the code one more time- then the landings will start.
Attachment #8602911 - Flags: feedback?(jmaher) → feedback+
Attachment #8602911 - Attachment is obsolete: true
Attachment #8602998 - Attachment is obsolete: true
Attachment #8603000 - Flags: review?(jmaher)
Comment on attachment 8603000 [details] [diff] [review] Add talos-powers default add-on to allow talos tests to request force quit. r=? Review of attachment 8603000 [details] [diff] [review]: ----------------------------------------------------------------- thanks, this is great. ::: talos/talos-powers/components/TalosPowersService.js @@ +9,5 @@ > + "resource://gre/modules/Services.jsm"); > + > +const FRAME_SCRIPT = "chrome://talos-powers/content/talos-powers-content.js"; > + > +function TalosPowersService() {} I know ; are not required, all but 2 lines here have it- we should use it for consistency.
Attachment #8603000 - Flags: review?(jmaher) → review+
please land on talos- I will get it landed on trunk, there is an android update I need to do as well.
Thanks, semi-colons added! remote: https://hg.mozilla.org/build/talos/rev/ae8a6f232e7b Do we close this now, or when we get this deployed?
Flags: needinfo?(jmaher)
bug 1162753 for deployment, try push for final testing.
hmm, this is frustrating- this latest patch is causing failures on the 's' group: https://treeherder.mozilla.org/#/jobs?repo=try&revision=672324392d7c we appear to run svgx, tsvg_opacity, but when running 'cart', we fail to get going. I can look into this a bit, should be easy to reproduce.
Flags: needinfo?(jmaher)
on linux we appear to collect all the data, but we are not sending information up to the server. I wonder if the quit is causing us to not properly close the handle to our logging file 'browser_output.txt' which would make us experience a lack of data to post.
Hrm, that's frustrating. :( Let me know if you want me to dig at it too.
Hey jmaher - did you ever find out what was going on here?
Flags: needinfo?(jmaher)
I did not, nor did I get a chance to hack on it this weekend or Monday. Given that this only happens on the 's' job, it probably means that the tart/cart addon is blocking access to our output file browser_output.txt. I really need an hour or so to focus on this and might be able to figure it out. If you have a few cycles, I would appreciate any thoughts.
Flags: needinfo?(jmaher)
reproduced this locally, tart seems to not have data: 16:22:32 INFO - INFO : Could not find report in browser output: [('tsformat', ('__start_report', '__end_report')), ('tpformat', ('__start_tp_report', '__end_tp_report'))] [browser_output.txt] will narrow down and verify on try.
switching about:newtab -> about:blank in tart.js on lines 527 and 539 allows this to work: http://hg.mozilla.org/build/talos/file/29df8fbf1358/talos/page_load_test/tart/addon/content/tart.js#l526 for some reason the about:newtab is causing talos-powers to see a shutdown message.
So this is a sad story. With my first patch, getInfo.html now gets these sweet powers to kill the whole browser process as soon as it loads. That's great, except that about:newtab shows thumbnails of all of the most frecent visits. For the talos profile running the tart test, the only entry in that group is getInfo.html. And so to gather the thumbnail, we open a remote browser, and load the content in order to capture the thumbnail. Ta-da, whole browser goes down because the background thumbnail process just told the whole app to quit. I have a workaround solution, but we should probably come up with something longer term, like removing getInfo.html from the session history.
Attachment #8603000 - Attachment is obsolete: true
Comment on attachment 8604974 [details] [diff] [review] Follow-up: Do not allow getInfo.html to quit parent application when loaded in thumbnailer process. r=? We can either do this, or try to remove getInfo.html from the history - though I'm not sure exactly how best to do that, or if anything actually relies on that being in the history.
Attachment #8604974 - Flags: review?(jmaher)
Comment on attachment 8604974 [details] [diff] [review] Follow-up: Do not allow getInfo.html to quit parent application when loaded in thumbnailer process. r=? Review of attachment 8604974 [details] [diff] [review]: ----------------------------------------------------------------- ok, this is hacky, but I don't know a better form of hacky. We don't depend on the thumbnail at all, but I suspect it will change our tart/cart numbers. How about we put some comments in here so when thumbnails change in the future we can reduce our hair pulling time.
Attachment #8604974 - Flags: review?(jmaher) → review+
(In reply to Joel Maher (:jmaher) from comment #276) > ok, this is hacky, but I don't know a better form of hacky. We don't depend > on the thumbnail at all, but I suspect it will change our tart/cart numbers. > How about we put some comments in here so when thumbnails change in the > future we can reduce our hair pulling time. CART doesn't care about thumbnails, but TART does. Well, TART itself doesn't care, but we know that the numbers depend on whatever shows up at the newtab page because it can affect the animation. Last time we've had a related discussion, it was when tiles made it into Firefox and IIRC they were eventually kept enabled such that TART measure newtab animation with tiles. IIRC the behavior was such that due to the new profile, no page had high enough frecency to show up at the newtab page during the tests. So I wonder what changed that now it shows the getinfo page. Tim, thoughts?
Depends on: 1164459
Attachment #8603000 - Attachment is obsolete: false
oh, this was a fun one, but a good thing to find. 2 things of interest: 1) We had originally added talos-powers to the mainthreadio whitelist (which was needed). There is a problem with that in the script was looking for 'talos' and converting .*talos -> {talos}, so profile/extensions/talos-powers/... -> {talos}-powers/...; we don't use {talos} in mainthreadio, so I removed the talos substitutions. 2) etlparser looks for disk access in general (not just mainthread io counts). this is the same job and has a slightly different error message. We rarely have to modify this file, but today we got lucky and have a few things to modify. 3) and for a bonus we seem to try to access a /searchplugins folder in an extension subdir of the profile by default. That is added as well!
Attachment #8605899 - Flags: review?(wlachance)
Comment on attachment 8605899 [details] [diff] [review] add talos-powers to mainthreadio and etlparser manifests (1.0) lgtm
Attachment #8605899 - Flags: review?(wlachance) → review+
Hey Joel - with your patch in bug 1166906 landed, are we good to close this, or are we still blocked on something?
Flags: needinfo?(jmaher)
thanks for the reminder
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(jmaher)
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: