Closed Bug 536473 Opened 15 years ago Closed 14 years ago

Electrolysis-windows-talos: Seemingly random Talos "browser frozen"

Categories

(Release Engineering :: General, defect)

x86
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: benjamin, Assigned: smaug)

Details

Electrolysis has been having a problem since late last week: for some seemingly random percentage of pushes, all of the Talos runs on Windows are orange with "browser frozen" (at startup... the test doesn't seem to produce any results at all). This is tdhtml, tsspider, tgfx, tsvg, and tp4, but not the Ts tests.

Sample logs:
http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1261522502.1261523822.3152.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1261514366.1261516401.16815.gz

I've downloaded the same exact hourly builds that Talos is downloading (from stage) and run StandaloneTalos on them without incident.

For about 25% of builds, all the tests pass. But it's always all-or-nothing: for a particular build/set-of-talos, they either all fail or all pass. For example, the Talos runs for cset afc656f387fe, pushed Monday, December 21, 2009 9:24:14 AM -0800 all failed, and the runs for 3d5dcaeba50f, pushed Monday, December 21, 2009 9:37:55 AM -0800 all passed. The code in between those two pushes is not run by Talos at all, and can't have affected them.

The last reliably-green push was 0e3ed118aedd, pushed Thursday, December 17, 2009 1:39:39 PM -0800. The orangeness started with 07c66d63ecb7, pushed Thursday, December 17, 2009 4:12:27 PM -0800 (which also is IPC-only code that Talos can't hit).

I mentioned it to lsblakk and nthomas today on IRC, and they suggested perhaps clobbering the builders, but that didn't help. I don't know what to do next. Maybe somebody from releng can catch the "browser frozen" on one of the actual machines and see what it's doing (e.g. it's displaying a dialog box of some sort which prevents the tests from running), or if there's a spare Talos slave I can try to reproduce on it over VNC.
I've just rerun "WINNT 5.1 electrolysis talos" using http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/electrolysis-win32/1261517623/firefox-3.7a1pre.en-US.win32.zip (the current latest build) on an xp talos machine. Minefield has launched enough that Windows drew a titlebar and the window frame, but everything inside it is white (no chrome or content drawn). It's sitting idle, no processes using any cpu time.
I've tidied up talos-rev1-xp02 for you to play with.  It is located at the MV
office so you will need to use the MV office vpn to access it through VNC. 
Please ping me on irc for user/passwd.
Error console messages:
failed to load XPCOM component: C:\.....\firefox\components\tp-cmdline.js
Failed to load XPCOM component: C:\.....\firefox\components\nsProgresDialog.js
Warning: unrecognized command line flag -tp
Warning: unrecognized command line flag -tpchrome
Warning: unrecognized command line flag -tpformat
Warning: unrecognized command line flag -tpcycles
This appears to be because, at least on the Talos slave I have for testing, talos\page_load_test\components doesn't exist.
The script generate-tpcomponent.py is meant to populate the page_load_test\components directory:

http://hg.mozilla.org/build/tools/file/default/buildfarm/utils/generate-tpcomponent.py

Has e10s moved any of the files that that script is trying to copy? Maybe we're getting a silent copy failure early on.
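One way to rule out a silent copy failure is to check the directory before starting a run. This is a minimal sketch, assuming the talos\page_load_test\components layout mentioned above. The function name `missing_components` is hypothetical, and the default expected file (`tp-cmdline.js`, taken from the error console output) is an assumption about what the script should have copied.

```python
import os


def missing_components(talos_root, expected=("tp-cmdline.js",)):
    """Return the expected component files that are absent.

    Looks in <talos_root>/page_load_test/components. If that directory
    does not exist, every expected file is reported as missing.
    The default file list is an assumption, not taken from
    generate-tpcomponent.py itself.
    """
    components_dir = os.path.join(talos_root, "page_load_test", "components")
    if not os.path.isdir(components_dir):
        return list(expected)
    present = set(os.listdir(components_dir))
    return [name for name in expected if name not in present]
```

Calling `missing_components` on the Talos root before `run_tests.py` and failing loudly on a non-empty result would turn a blank-window hang into an explicit setup error.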
Hrm, maybe I'm not using the slave correctly (I was trying to use it like a standalone talos setup, running run_tests.py my.config), so perhaps I'm not running that script correctly. I don't know of any relevant changes e10s has made to the files in question.
Sorry, that was overly aggressive slave cleanup on my part.  I'll install the pageloader in the correct location for you.
(In reply to comment #7)
> Sorry, that was overly aggressive slave cleanup on my part.  I'll install the
> pageloader in the correct location for you.

pushing over to Alice.
Assignee: nobody → anodelman
Fix already in place, this bug should not be assigned to me.
Assignee: anodelman → nobody
(In reply to comment #9)
> Fix already in place, this bug should not be assigned to me.

Per IRC, I didn't know Alice already put her fix in place, so we think her work here is done.

bsmedberg, can you see if this is still a problem and if so, is it an electrolysis problem?
Assignee: nobody → benjamin
Sorry, fix as in talos-rev1-xp02 is now correctly configured for testing - not as in a fix for the actual browser freezing issue.
In the hanging case, I get a JS exception:
Error: uncaught exception: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIWebNavigation.loadURI]"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: chrome://global/content/bindings/browser.xml :: loadURIWithFlags :: line 187"  data: no]

When I set up a try/catch, the error occurs trying to load http://localhost/page_load_test/tp4/www.youtube.com/www.youtube.com/index.html, which is a perfectly normal URL. I then get a couple of errors:

Error: evt.originalTarget.defaultView is undefined
Source File: chrome://pageloader/content/pageloader.js
Line: 293

There is no mozilla-runtime process, so I'm pretty sure we're not running into any issues with accidentally trying to do remote tabs. That leaves the set of content changes for e10s which aren't plugin-related somehow causing the initial pageload to fail. bz/smaug, any clues about that (and why it would only show up on the Talos machines and not locally)?
Hmm.  LoadURI can return NS_ERROR_FAILURE in the following cases (if we ignore the history-entry cases, which I assume we're not hitting here):

1) Empty (or whitespace-only) URI string
2) CreateFixupURI failed
3) GetService for the security manager fails with that error code
4) IsSystemPrincipal fails with that error code
5) SchemeIs on the given URI fails, or it's a wyciwyg URI
6) It's a targeted load and window.open fails with that error code
7) mIsBeingDestroyed is true
8) CheckLoadingPermissions returns this error code
9) NS_DispatchToCurrentThread returns this error code
10) Load is external and CreateAboutBlankContentViewer fails
11) It's an anchor scroll and session history is not working right
12) Stop() fails with that error code
13) DoURILoad fails with that error code

My money is on #7... but that's checked twice in nsDocShell::InternalLoad and the first check doesn't actually return NS_ERROR_FAILURE.  It's worth double-checking by adding some code to log things around the second check, I guess, and in general to see whether we make it into the DoURILoad call here.
Oh, and talos vs locally is almost certainly a timing issue of some sort.
->smaug for more investigation. This seems to be specific to the tab/frameloader changes in the Electrolysis branch which aren't in m-c (or is just really freaky).
Assignee: benjamin → Olli.Pettay
Per IRC with bsmedberg, this has been working for about a month now. It is unclear what changed or what fixed it. Reopen if this recurs.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → Release Engineering