Closed Bug 940690 Opened 11 years ago Closed 10 years ago

Get existing metrofx talos tests running on release/project branches

Categories: Release Engineering :: General (defect)
Hardware: x86, Windows 8
Priority: Not set
Severity: normal

Tracking: Not tracked

RESOLVED WONTFIX

People

(Reporter: jlund, Unassigned)

References

Details

(Whiteboard: [feature] [leave open])

Attachments

(7 files, 6 obsolete files)

13.25 KB, patch; jlund: review-
3.02 KB, patch; jmaher: review+
43.52 KB, patch; jmaher: review+
699 bytes, patch; jmaher: review+
542 bytes, patch; jmaher: review-
1.33 KB, patch; armenzg: review+, jlund: checked-in+
1.33 KB, patch; armenzg: review+, jlund: checked-in+

Currently we have 4 metro talos suites: other-metro, svgr-metro, tp5o-metro, and dromaeojs-metro, which are disabled by default and only enabled on Cedar.

We should enable these by default so they run on release/project branches (eg: m-c, aurora, etc).
Depends on: 940699
before we roll this out, we need to have a unique platform name so our win8 and metro results don't get mixed up on graph server and datazilla.
(In reply to Joel Maher (:jmaher) from comment #1)
> before we roll this out, we need to have a unique platform name so our win8
> and metro results don't get mixed up on graph server and datazilla.

So the bug this depends on was resolved as of today, so I am starting to take a look at this.

The last time there was talk of whether to do a new platform for metro or not, I believe the sheriffs and jimm said to leave it inside win 8 like we do with metro chrome unittests. I guess this is from the buildbot and tbpl point of view. As for graph server and datazilla (I'm not too familiar with these), I am trying to figure out how to specify an additional platform. I am seeing some control flow like this:

http://mxr.mozilla.org/build/source/talos/talos/output.py#580
  def test_machine(self):

http://mxr.mozilla.org/build/source/talos/talos/output.py#436
  machine = self.test_machine()

http://mxr.mozilla.org/build/source/talos/talos/output.py#473
  collection = DatazillaResultsCollection(machine_name=machine['name'],


Which at some point is used here?:
http://mxr.mozilla.org/build/source/talos/talos/compare.py#391
    datazilla, pgodatazilla, xperfdata = getDatazillaCSET(options.revision, branch)

http://mxr.mozilla.org/build/source/talos/talos/compare.py#139
    platform = getDatazillaPlatform(testrun['test_machine']['os'], 

http://mxr.mozilla.org/build/source/talos/talos/compare.py#105
    def getDatazillaPlatform(os, platform, osversion, product):
    ...
    ...
        elif osversion == '6.2.9200':
            platform = 'Win8'


So from the above it seems we use mozinfo to determine platform information:
https://github.com/mozilla/mozbase/blob/master/mozinfo/mozinfo/mozinfo.py

Which itself uses the built-in os, sys, and platform modules. I am not sure at which point to change the platform that is given. I suppose before we init DatazillaResultsCollection()?:
https://github.com/mozilla/datazilla_client/blob/master/dzclient/client.py#L86

But to change it at any point we would have to let mozinfo decide on the platform and then, if osversion is '6.2.9200' or platform is 'Win8', change it to some known metro value. In which case, talos would have to be aware that it is running the metro browser (right now we don't have a --immersive-mode or whatever, but we do know the --binary has metrotestharness.exe in it).

I am also not sure if datazilla has a set of predefined platforms you push results to, or if it creates new ones as it gets posts/pushes?

Joel, would you be able to help me with this or direct me to someone who would be able to assist?
Flags: needinfo?(jmaher)
also, I don't think we ever use cached datazilla responses as I think there may be a typo on this line:
http://mxr.mozilla.org/build/source/talos/talos/compare.py#126
The cached issue is only for an out-of-band tool; I have it fixed in a local patch, although I really should get it checked in.

We had to solve this problem for android and tegras, and we changed the title to have a .n so graph server would work.  This was prior to datazilla:
http://hg.mozilla.org/build/talos/file/0ec5e19808f6/talos/remotePerfConfigurator.py#l44  (this won't work anymore, but it shows we have done it before)

Let's key off the executable name, and if it is metroharness.exe we can adjust the mozinfo block to support win8.m. We will need to update the graph database:
http://hg.mozilla.org/graphs/file/15609490c297/sql/data.sql

and make sure that the values going there and to datazilla are what we want.
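
As a rough sketch of the keying idea (a hypothetical helper, not the actual talos/mozinfo change; names are assumptions):

```python
import os

def metro_platform(browser_path, base_platform):
    """Return a metro-suffixed platform name when the configured binary is the
    metro test harness; otherwise return the base platform unchanged."""
    exe = os.path.basename(browser_path).lower()
    if exe.startswith("metrotestharness"):
        return base_platform + ".m"   # e.g. "win8" -> "win8.m"
    return base_platform

print(metro_platform(
    r"C:\slave\test\build\application\firefox\metrotestharness.exe", "win8"))
# -> win8.m
```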
Flags: needinfo?(jmaher)
First time looking at the graphs repo. I would guess we have this data.sql file to insert more entries as we need them? I'm guessing this because I do not see any t-w864-ix-{0-110} rows in the machines table in this file, nor a windows 6.2 (id 31?) entry in the os_list table. Yet, both must be there if lines like:
http://graphs.mozilla.org/graph.html#tests=[[230,26,31]] are working.

I added the '.m' to the machine names and I'll submit the patch I have for talos next.
Attachment #8348552 - Flags: feedback?(jmaher)
So again, I'm not too familiar with all of this, so bear with me as I spill my logic verbosely so you can find the faults in it :)

So first I am appending '.m' to the title, if metrotestharness is in browser_config['browser_path']. This I think will do the trick once this dict makes its way through post():
http://mxr.mozilla.org/build/source/talos/talos/output.py#170

I am then also changing the value of osversion here:
http://mxr.mozilla.org/build/source/talos/talos/output.py#601

I am changing this because when we use the datazilla client:
http://mxr.mozilla.org/build/source/talos/talos/output.py#473
https://github.com/mozilla/datazilla_client/blob/master/dzclient/client.py#L86

it looks like currently for win8, datazilla uses a combination of the os ('win') and os_version ('6.2.9200') to make up its list of platforms:
https://datazilla.mozilla.org/?project=talos

Finally, I am also adjusting compare.py. I am not 100% on how getDatazillaPlatform() is used, but it seems like it uses test_machine() as well and will need to know about the additional osversion I added?
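
In rough pseudo-form, the two adjustments described above amount to something like this (an approximation of the idea, not the actual diff; names are assumptions):

```python
def adjust_for_metro(title, os_version, browser_path):
    """Suffix the graph-server title and the datazilla os_version with '.m'
    when the run is driven by metrotestharness (approximation of the idea,
    not the actual talos patch)."""
    if "metrotestharness" in browser_path:
        title += ".m"        # e.g. t-w864-ix-001 -> t-w864-ix-001.m
        os_version += ".m"   # e.g. 6.2.9200 -> 6.2.9200.m
    return title, os_version

print(adjust_for_metro(
    "t-w864-ix-001", "6.2.9200",
    r"C:\slave\test\build\application\firefox\metrotestharness.exe"))
# -> ('t-w864-ix-001.m', '6.2.9200.m')
```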
Attachment #8348564 - Flags: feedback?(jmaher)
Comment on attachment 8348552 [details] [diff] [review]
bug_940690_metro_talos_major_releases-graphs.diff

Review of attachment 8348552 [details] [diff] [review]:
-----------------------------------------------------------------

why do we have w64-ix-slave107 vs t-w864-ix-107.m?

I would assume it would be the old name with a .m!  Do machines such as  w64-ix-slave107 even exist?
Attachment #8348552 - Flags: feedback?(jmaher) → feedback+
Comment on attachment 8348564 [details] [diff] [review]
bug_940690_metro_talos_major_releases-talos.diff

Review of attachment 8348564 [details] [diff] [review]:
-----------------------------------------------------------------

::: talos/output.py
@@ +603,5 @@
> +            #    datazilla platform
> +            # DatazillaResultsCollection takes an os and os_version.
> +            # Differentiate between win8 non-metro and win8 metro
> +            # by again appending an '.m' suffix to the os_version
> +            if self.results.title.endswith(".m"):

and not version.endswith('.m')
Attachment #8348564 - Flags: feedback?(jmaher) → feedback+
(In reply to Joel Maher (:jmaher) from comment #7)
> Comment on attachment 8348552 [details] [diff] [review]
> bug_940690_metro_talos_major_releases-graphs.diff
> 
> Review of attachment 8348552 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> why do we have w64-ix-slave107 vs t-w864-ix-107.m?
> 
> I would assume it would be the old name with a .m!  Do machines such as 
> w64-ix-slave107 even exist?

So according to slave health and slavealloc these machines exist and are rev2 6.1 NT machines. I believe all the win8 machines have the t-w864-ix-{0..} format. Again I am not sure why they are not in this data.sql file but I'd guess they were entered another way? On second look, it seems we have 130 win8 machines. I'll update the patch and get someone from releng to confirm all is well.
Hal, I've seen you've worked on this repo a few times. I am trying to add some metro talos results to graph server and datazilla. The problem is I need to differentiate between win8 metro/non-metro. My understanding is that the title in the talos run is a 1:1 mapping to a machine name in the graph sql DB; for android we did something similar in the past (Bug 708793 - provide way to override platform name in POST).

So I am adding more machines with a '.m' suffix, and then in Talos, when we run the metro browser, I add the same '.m' suffix to the title (see the talos patch I have in this bug for reference).

What I am not sure about is why other win8 machines are not in this data.sql file? I see w64-ix-slave{01..} machines but I think those are 6.1 NT machines. I'd assume the win8 machines were entered another way? I determined that the os_list table id we want is 31. This was based off poking in:
http://hg.mozilla.org/graphs/file/15609490c297/sql/schema.sql#l3
http://hg.mozilla.org/graphs/file/15609490c297/server/pyfomatic/collect.py#l274
and then finding the log line:
    "http://graphs.mozilla.org/graph.html#tests=[[287,94,31]]"
from:
    https://tbpl.mozilla.org/php/getParsedLog.php?id=32050416&full=1&branch=mozilla-central

1) Is this all I need to add the machine names to the database?
2) Will I have to run this file somehow once it's landed? 
3) Most importantly, am I going about this the right way?

Feel free to defer me to someone else if I should be asking someone else :)
Attachment #8348552 - Attachment is obsolete: true
Attachment #8348971 - Flags: review?(hwine)
> > +            if self.results.title.endswith(".m"):
> 
> and not version.endswith('.m')

Thanks, will test this on my dev master once graph patch lands
Comment on attachment 8348971 [details] [diff] [review]
bug_940690_metro_talos_major_releases-graphs-171213.diff

Whoops, this is wrong. I'd basically be achieving the same thing: I'd grab the machine name but it would point to the same os in os_list. I'll need to add another row to the os_list table and point the '.m' win8 machines to that id.

How did we know that the os added (Android 2.2 (Native)) would have id 23 here in this patch?:
https://bug708793.bugzilla.mozilla.org/attachment.cgi?id=582029
Attachment #8348971 - Flags: review?(hwine) → review-
This patch will add a new OS to os_list. It will also add 130 win8 machines with the '.m' suffix for metro. These 130 will be a mirror of our actual 130 machines in our build system.

Note: the WINNT 6.2 and t-w864-ix-{001, 130} entries are not in data.sql as I am only using this file to document/review what I will be manually entering on the staging + production sql instance.
Attachment #8350203 - Flags: review?(jmaher)
Attachment #8350203 - Flags: review?(jmaher) → review+
Hi sheeri,

So as per our convo this morning, I was about to insert my prev patch from this bug into production when I realized that staging is not only missing one row (WINNT 5.1 (ix)), but its highest auto-incremented id is off by more than just that one-row difference. It turns out there is one os_list row with id = 39.

So to clarify:
staging has: 30 rows in os_list with highest auto-inc ID of 36
production has: 31 rows in os_list with highest auto-inc ID of 39
https://pastebin.mozilla.org/3811391    :(

I only realized this after I tested everything out in staging and moved on to production. So now staging has "WINNT 5.1 (ix)" with id = 37. This is what we talked about doing before and it syncs with production. But it also has "WINNT 6.2 x64 (metro)" with id = 38, along with win8 metro machines with an os_list foreign key of 38.

My plan is to just bump the auto-increment on os_list in production. This I believe would set the win8 metro os_list row to 40. Then set all the FKs in machines to match.

I guess this would again bring staging and production further out of sync but it already differs with the id of 39 for mac 10.9 in production. I did not want to assume without querying the DB gods first.

I'll do what you recommend. :)
Attachment #8348971 - Attachment is obsolete: true
Attachment #8350203 - Attachment is obsolete: true
Attachment #8350435 - Flags: review?(scabral)
Hi Jordan,

Your statements in Comment 14 appear to be a little different than expected at this point in time.

In Production I have:

mysql> select * from os_list order by id;
+----+-----------------------------------+
| id | name                              |
+----+-----------------------------------+
|  1 | WINNT 5.1                         |
|  2 | WINNT 6.0                         |
|  3 | MacOSX Darwin 8.8.1               |
|  4 | MacOSX Darwin 9.2.2               |
|  5 | Ubuntu 7.10                       |
|  6 | CentOS release 5 (Final)          |
|  7 | MacOSX 10.5.2                     |
|  8 | WINNT 5.2                         |
|  9 | Nokia n810                        |
| 10 | Ubuntu 9.04 (x64)                 |
| 12 | WINNT 6.1                         |
| 13 | MacOSX 10.5.8                     |
| 14 | Fedora 12 - Constantine           |
| 15 | Fedora 12 x64 - Constantine       |
| 16 | Nokia n900                        |
| 17 | MacOSX 10.6.2 (rev3)              |
| 18 | CentOS (x86_64) release 5 (Final) |
| 19 | WINNT 6.1 x64                     |
| 20 | Android 2.2 (Native)              |
| 21 | MacOSX 10.6 (rev4)                |
| 22 | MacOSX 10.7                       |
| 23 | Android 2.2                       |
| 24 | MacOSX 10.8                       |
| 25 | WINNT 6.1 (ix)                    |
| 27 | Windows Server 2008 R2 (x64)      |
| 29 | Android 4.0.4                     |
| 31 | WINNT 6.2 x64                     |
| 33 | Ubuntu HW 12.04                   |
| 35 | Ubuntu HW 12.04 x64               |
| 37 | WINNT 5.1 (ix)                    |
| 39 | MacOSX 10.9                       |
+----+-----------------------------------+
31 rows in set (0.00 sec)


In Stage I have:

mysql> select * from os_list order by id;
+----+-----------------------------------+
| id | name                              |
+----+-----------------------------------+
|  1 | WINNT 5.1                         |
|  2 | WINNT 6.0                         |
|  3 | MacOSX Darwin 8.8.1               |
|  4 | MacOSX Darwin 9.2.2               |
|  5 | Ubuntu 7.10                       |
|  6 | CentOS release 5 (Final)          |
|  7 | MacOSX 10.5.2                     |
|  8 | WINNT 5.2                         |
|  9 | Nokia n810                        |
| 10 | Ubuntu 9.04 (x64)                 |
| 12 | WINNT 6.1                         |
| 13 | MacOSX 10.5.8                     |
| 14 | Fedora 12 - Constantine           |
| 15 | Fedora 12 x64 - Constantine       |
| 16 | Nokia n900                        |
| 17 | MacOSX 10.6.2 (rev3)              |
| 18 | CentOS (x86_64) release 5 (Final) |
| 19 | WINNT 6.1 x64                     |
| 20 | Android 2.2 (Native)              |
| 21 | MacOSX 10.6 (rev4)                |
| 22 | MacOSX 10.7                       |
| 23 | Android 2.2                       |
| 24 | MacOSX 10.8                       |
| 25 | WINNT 6.1 (ix)                    |
| 27 | Windows Server 2008 R2 (x64)      |
| 29 | Android 4.0.4                     |
| 31 | WINNT 6.2 x64                     |
| 33 | Ubuntu HW 12.04                   |
| 35 | Ubuntu HW 12.04 x64               |
| 36 | MacOSX 10.9                       |
| 37 | WINNT 5.1 (ix)                    |
| 38 | WINNT 6.2 x64 (metro)             |
+----+-----------------------------------+
32 rows in set (0.01 sec)

The differences between them:

- In Prod, MacOSX 10.9 has ID 39. In stage it's 39.
- In Stage, we have an extra value: ID 38 of  WINNT 6.2 x64 (metro).
- Auto Increment values differ between prod/stage.

As for the extra value and auto_increment problems, I've solved them both with the below sql:

insert into os_list values (38, 'WINNT 6.2 x64 (metro)'); -- prod only (this already existed on stage)
ALTER TABLE os_list AUTO_INCREMENT=40; -- run on both servers.

You're now safe to run the rest of your inserts on both prod and stage.

This will leave everything working and put you in a good spot. IDs will line up and the only difference becomes the difference in ID for MacOSX 10.9.

We can fix that if you're concerned about it, but we'd of course have to identify all the dependent areas that lookup that ID and adjust them.
(In reply to Brandon Johnson [:cyborgshadow] from comment #15)

> - In Prod, MacOSX 10.9 has ID 39. In stage it's 39.


Correct that to be: In stage it's 36.
> You're now safe to run the rest of your inserts on both prod and stage.
> 
> This will leave everything working and put you in a good spot. IDs will line
> up and the only difference becomes the difference in ID for MacOSX 10.9.

Thank you very much for taking time away to do this :)
Comment on attachment 8350435 [details] [diff] [review]
bug_940690_metro_talos_major_releases-graphs-191213-3.diff

thanks for help on this! :)
Attachment #8350435 - Flags: review?(scabral) → review-
Depends on: 956384
Blocks: metrobacklog
Whiteboard: [feature] p=0
Were your inserts completed after the update? Just checking in.
(In reply to Brandon Johnson [:cyborgshadow] from comment #19)
> Were your inserts completed after the update? Just checking in.

perfect. Graph server accepted the "new" platform. Thx for checking in.
status update --

Both Graph Server and Datazilla differentiate between metro and non-metro firefox. For datazilla, results were not showing up because talos did not recognize the appInfo ID for metro: {99bceaaa-e3c6-48c1-b981-ef9b46b67d60}

Results for test runs can be seen here:
Datazilla -- https://datazilla.mozilla.org/?start=1388459083&stop=1389063883&product=Firefox&repository=Cedar&os=win&os_version=6.2.9200.m&test=tsvgr_opacity&project=talos

Graph Server -- http://graphs.mozilla.org/graph.html#tests=[[72,26,38]]&sel=1388342640120,1388947440120&displayrange=7&datatype=running

Patches incoming.
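
For context, the fix boils down to teaching talos one more appInfo-ID-to-product entry; a hypothetical illustration (the dict, function, and product names are assumptions, not the actual talos code):

```python
APP_IDS = {
    "{ec8030f7-c20a-464f-9b0e-13a3a9e97384}": "firefox",       # desktop Firefox
    "{99bceaaa-e3c6-48c1-b981-ef9b46b67d60}": "metrofirefox",  # metro Firefox (this bug)
}

def product_for_app_id(app_id):
    # fail loudly on an unknown ID so a missing entry is obvious in the log
    if app_id not in APP_IDS:
        raise ValueError("unrecognized appInfo ID: %s" % app_id)
    return APP_IDS[app_id]

print(product_for_app_id("{99bceaaa-e3c6-48c1-b981-ef9b46b67d60}"))  # metrofirefox
```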
Same patch as before, with the feedback addressed and the extra appInfo ID for firefox.
Attachment #8348564 - Attachment is obsolete: true
Attachment #8356421 - Flags: review?(jmaher)
Comment on attachment 8356421 [details] [diff] [review]
940690_metro_talos_major_releases-talos-060114.diff

Review of attachment 8356421 [details] [diff] [review]:
-----------------------------------------------------------------

thanks for the good comments.  please land this on the talos repo, I have some other changes I am looking to land today/tomorrow.
Attachment #8356421 - Flags: review?(jmaher) → review+
this has been pushed to default: https://hg.mozilla.org/build/talos/rev/8c5f2725fbdd
No longer depends on: 956384
The addon blocker is fixed. Datazilla/graph server patches have landed in talos repo.

I think now all we need to do is update talos.json to at least 8c5f2725fbdd (my talos patches). This would enable graphserver/datazilla results for metro and update metrotestharness.exe with the timeout patch. These are two things we should add to cedar first before releasing on all branches.

Jmaher, is it safe to update talos repo or are we still testing/holding off for results? Is there something I should do here?
Flags: needinfo?(jmaher)
I have updated talos.json as of this morning.   I needed to make sure I had 6 hours of time to be around in case there were issues with the patch (i.e. land and disconnect is not my idea of responsible).

This is on inbound now, we can see it roll through the repositories in a short amount of time.
Flags: needinfo?(jmaher)
> This is on inbound now, we can see it roll through the repositories in a
> short amount of time.

Perfect! I'll keep an eye on cedar for results.
Armen, I think we are good to enable our metro talos tests. This patch will enable the suites across all project/release branches by default. 

I have some concerns here:

1) These suites can be temperamental and not pass every single time. Will that have consequences we do not want? I am not sure if we have branches that expect 100% green? Are there branches this patch will enable that we don't want enabled?

2) Metro talos suites require a very recent talos.json from m-c (http://hg.mozilla.org/mozilla-central/diff/c55b3d5314eb/testing/talos/talos.json). I think we will need to add a patch to the release branches that includes this talos.json before enabling the buildbot patches? Are there other branches that will need a patch for talos.json besides 'mozilla-aurora', 'mozilla-beta', and 'mozilla-release'?

Thanks for patience.
Attachment #8360870 - Flags: feedback?(armenzg)
> Armen, I think we are good to enable our metro talos tests. This patch will
> enable the suites across all project/release branches by default. 

These need to roll out with the browser, which is currently sitting on aurora and will roll out with fx28. Not sure what happens if you enable on beta and release without the browser present in the tree?
we would need to ensure all talos updates are on the branches we turn this on for.  Do we need to update talos on Aurora, or will we stick with m-c based branches and follow the trains?
(In reply to Joel Maher (:jmaher) from comment #31)
> we would need to ensure all talos updates are on the branches we turn this
> on for.  Do we need to update talos on Aurora, or will we stick with m-c
> based branches and follow the trains?

Today mc and aurora, add beta after the feb 4th merge, add release after 3-17.
once these are live on m-c, we should work on getting the talos.json changes into aurora as much as we can.
Attachment #8360870 - Flags: feedback?(armenzg)
We have gecko_version to control on which branches a suite will be enabled for (see config.py).
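
Conceptually, that gating looks something like the sketch below (a hypothetical illustration, not the real buildbot-configs config.py schema; key names are assumptions):

```python
SUITES = {
    # hypothetical entry: only enable on branches at or above gecko 29
    "dromaeojs-metro": {"enable_by_default": True, "gecko_version": 29},
}

def suite_enabled(suite, branch_gecko_version):
    cfg = SUITES[suite]
    if not cfg.get("enable_by_default"):
        return False
    min_gecko = cfg.get("gecko_version")
    return min_gecko is None or branch_gecko_version >= min_gecko

print(suite_enabled("dromaeojs-metro", 29))  # True  (m-c based branches)
print(suite_enabled("dromaeojs-metro", 27))  # False (older release branches)
```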
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #34)
> We have gecko_version to control on which branches a suite will be enabled
> for (see config.py).

Ah OK, thanks Armen. So I see it as we have two options.

1) enable only branches that use gecko_version 29 (m-c)
   patch: https://pastebin.mozilla.org/4030868
   branches that would be affected/enabled: https://pastebin.mozilla.org/4030823

2) enable only on the m-c branch.
   patch: https://pastebin.mozilla.org/4030857
   branches that would be affected/enabled: m-c

Jimm, Jmaher: just to be crystal clear, we can enable everything that uses m-c (gecko 29), but there are a lot. Is this what we want, or do we literally only want m-c (option 2) for now?
Flags: needinfo?(jmathies)
Flags: needinfo?(jmaher)
we can do this on any gecko 29 based branch right now.

I would like to see that baked in for a week before rolling down to Aurora prior to the uplift on the 4th.
Flags: needinfo?(jmaher)
Flags: needinfo?(jmathies)
Thanks jmaher, jimm.

Here is a patch to enable all m-c based branches.

Armen, there were a few ways I thought about doing this but explicitly adding them like I did seemed best?
Attachment #8360870 - Attachment is obsolete: true
Attachment #8361856 - Flags: review?(armenzg)
Comment on attachment 8361856 [details] [diff] [review]
940690_metro_talos_bbotcfgs_geck29_160114-3.diff

Review of attachment 8361856 [details] [diff] [review]:
-----------------------------------------------------------------

This seems right.

Did you use builder_list.py from braindump to compare that it only adds the builders you care about? Or similar approach to check it?
Attachment #8361856 - Flags: review?(armenzg) → review+
> This seems right.
> 
> Did you use builder_list.py from braindump to compare that it only adds the
> builders you care about? Or similar approach to check it?

Cool thanks! I ran builder_list.py off a clean bbot-cfgs and then with my patch. The vimdiff: https://pastebin.mozilla.org/4030823 (see comment 35) shows the branches that would add these metro talos builders. There are a lot.

Note: There were other parts of the vimdiff that I did not include in the pastebin. These showed new totals of builders for a platform. eg: 

- t-w864-ix-046 has 1361 builders; limit is 2036; 66 percent of max
+ t-w864-ix-046 has 1493 builders; limit is 2036; 73 percent of max

I'd assume all these new builders are a result of creating 4 new talos builders for every branch based off gecko 29.
After discussion and review of https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy, it seems this won't meet the requirements to enable on m-c based branches. We will need <5% intermittent failures. Right now, looking at https://tbpl.mozilla.org/?tree=Cedar (since Jan 9th), we are nowhere close to that. Granted, there are only a few data points.

It seems the majority of failures continue to revolve around timing out after 3600s while trying to launch the metro browser. I recall we made improvements to this while changing display/sleep/screensaver timeouts, but we haven't truly found the core culprit. I don't know if the metrotestharness.exe timeout update affected anything. I am trying to find the differences between the way mochitest and talos launch the metro browser (they both use ProcessHandler), but I will need aid from those who work on the talos, mochi, and mozinfo scripts.

I will coordinate a discussion with ateam/windev and maybe even a brief vidyo session to hammer this out when I am back from PTO this thurs/friday. Jmaher said he would help out after he clears up some work next week.
Are these messages relevant?
ProcessManager NOT managing child processes
ProcessManager UNABLE to use job objects to manage child processes
09:33:23     INFO -  INFO : Initialization of new profile failed

It might be worth triggering jobs on Cedar, seeing which fail, disable the slaves that had a failure from slavealloc and look at the Windows logs to see if anything shows up around the time that a failure happened.

It might also be worth forcing a screenshot before the talos job starts and another after it ends.
We might be able to see Windows error prompts.
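
If we go the screenshot route, something as simple as the sketch below might work on the Windows slaves (assuming Pillow/PIL is available there; the output path is illustrative):

```python
import time
from PIL import ImageGrab  # assumes Pillow is installed on the test slave

def capture(tag):
    # grab the whole screen and stash it next to the other talos artifacts
    path = r"C:\slave\test\talos-screenshot-%s-%d.png" % (tag, int(time.time()))
    ImageGrab.grab().save(path)
    return path

capture("before")
# ... run the talos job ...
capture("after")
```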
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #41)
> Are these messages relevant?
> ProcessManager NOT managing child processes
> ProcessManager UNABLE to use job objects to manage child processes
> 09:33:23     INFO -  INFO : Initialization of new profile failed
> 
> It might be worth triggering jobs on Cedar, seeing which fail, disable the
> slaves that had a failure from slavealloc and look at the Windows logs to
> see if anything shows up around the time that a failure happened.
> 
> It might also be worth forcing a screenshot before the talos job starts and
> another after it ends.
> We might be able to see Windows error prompts.

We could also try to force a crash using crashinject here like we do in mochitests. The mystery here is what's going wrong at the end of these runs -

https://tbpl.mozilla.org/php/getParsedLog.php?id=33337832&tree=Cedar&full=1#error0

Is this the machine? The browser hanging? Test harness problems? How can we get at one of these devices after this occurs?
(In reply to Jim Mathies [:jimm] from comment #42)
> https://tbpl.mozilla.org/php/getParsedLog.
> php?id=33337832&tree=Cedar&full=1#error0
> 
> Is this the machine? The browser hanging? Test harness problems? How can we
> get at one of these devices after this occurs?

gbrown: do you think we need to try the unbuffered mode in here?

I think what you saw affecting Android jobs might be the same thing we're seeing at times on desktop unit test jobs.
Flags: needinfo?(gbrown)
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #43)
> gbrown: do you think we need to try the unbuffered mode in here?

It looks like unbuffered mode is already being used:

> 'c:/mozilla-build/python27/python' '-u' 'scripts/scripts/talos_script.py' ...

But if I am reading that wrong, or there is some other place where python is being called without -u, and that process is generating output that is ultimately being collected for the log, then certainly I would recommend using -u -- it has helped get more accurate logs for Android hangs, and did not introduce any new problems.
Flags: needinfo?(gbrown)
> > Are these messages relevant?
> > ProcessManager NOT managing child processes
> > ProcessManager UNABLE to use job objects to manage child processes
From what I gather, these mozprocess messages are normal and occur in other builds. I've been told they have not been proven to be the cause of any issues (yet)

> > 09:33:23     INFO -  INFO : Initialization of new profile failed
Yes, that is not good. We rarely see these now. From what I can tell, it's not the main issue.
 
> > It might be worth triggering jobs on Cedar, seeing which fail, disable the
> > slaves that had a failure from slavealloc and look at the Windows logs to
> > see if anything shows up around the time that a failure happened.

I can reach timeouts on browser launch on my dev master, so I don't see this as a 'bad machines' problem as much as an issue with one of the scripts.

> > It might also be worth forcing a screenshot before the talos job starts and
> > another after it ends.
> > We might be able to see Windows error prompts.

cool. I am also going to watch via vnc throughout a number of test builds today

> We could also try to force a crash using crashinject here like we do in
> mochitests. This mystery here is what's going wrong at the end of these runs

ahh. ya, looking at mochitests, something like this would be helpful:
http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#699

> 
> https://tbpl.mozilla.org/php/getParsedLog.
> php?id=33337832&tree=Cedar&full=1#error0
> 
> Is this the machine? The browser hanging? Test harness problems? How can we
> get at one of these devices after this occurs?

I am not sure about that specific suite failing. That one seems rare as it happens after the metro browser launches during the test output. I *think* these ones are the predominant failures:
https://tbpl.mozilla.org/php/getParsedLog.php?id=33150302&full=1&branch=cedar
https://tbpl.mozilla.org/php/getParsedLog.php?id=33289400&full=1&branch=cedar
https://tbpl.mozilla.org/php/getParsedLog.php?id=33339900&full=1&branch=cedar
and maybe:
https://tbpl.mozilla.org/php/getParsedLog.php?id=33290265&full=1&branch=cedar

but ya, I am not sure if it's the machine, browser, or harness.  My guess is that since we are doing this successfully with mochitests and we call the talos script from mozharness the same way, I don't think it's the metroharness.exe or the machine. I feel it is talos itself or the way it uses mozinfo.

I am going to hack my testing machine so that it will not reboot after a hang and then try to grab some relevant state. will also look into trying to implement something like mochi's killAndGetStack() crashinject.

Will post findings later today.
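
As a rough illustration of the crashinject idea from above (the command-line usage and path are assumptions based on how mochitest is described as driving it, not verified here):

```python
import subprocess

def force_crash_for_stack(pid, crashinject=r"C:\slave\test\build\tests\bin\crashinject.exe"):
    """Ask crashinject to crash the hung browser so the crash reporter gives
    us a minidump/stack, instead of silently killing the process."""
    status = subprocess.call([crashinject, str(pid)])
    if status != 0:
        print("crashinject failed with status %d; falling back to a plain kill" % status)
    return status
```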
> I can reach timeouts on browser launch on my dev master so i don't see this
> as a 'bad machines' as much as an issue with one of the scripts.
> 

I may have been wrong about my own dev master results

tl;dr:
   1) I am not sure about last week's Cedar builds being the best for evaluating intermittent failures
   2) my own dev master seems to fail rarely (consistent passes for every suite bar talos-other). WRT talos-other, it seems to be a process holding a path permission/lock, where I need expertise in talos/mozinfo



So I'm actually having a really hard time hitting timeouts on my dev master. Going to try some older revisions to see if something was causing regressions in m-c from last week.

On a side note, I am basing my intermittent-failure numbers on results in Cedar. I am not sure if this is the best place for accuracy. Looking at the last few days on cedar (the days I based my metro results on), the builds are riddled with orange and red for many builds/tests. For example, metro mochitest chrome failed 9 times out of 13 runs since Jan 10th. There are many other consistent failures.

Going from my own dev machine, my metro talos builds seem to be passing quite consistently with talos-other ("a11yr", "ts_paint") being the only one failing a few times. So maybe we should be going by that?

These few talos-other failures are demonstrated by this uploaded log.
http://people.mozilla.org/~jlund/metro-talos-other-fail-processLock-220114.txt

I vnc'd and watched such a failure and noticed the browser open/close frequently and in quick succession. That is normal for this suite, but given the way metroharness.exe and the browser behave, I think a process hasn't been killed by the time another process requires the profile path or browser_output.txt.

I believe these are related to findings way back:
https://bugzilla.mozilla.org/show_bug.cgi?id=897420#c16
https://bugzilla.mozilla.org/show_bug.cgi?id=897420#c62

So maybe 1) we use output from metrotestharness like mochi does, or 2) we figure out a way to make sure things like the profile path/browser_output.txt file never get locked (rough sketch at the end of this comment).

Joel, you mentioned a while back, with regard to permission/locks, that this is something you would look into. I know your plate is full, but maybe I could get a hand with this when time allows, or else pass me along to someone else who knows talos/mozinfo well?
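
For option 2 above, a generic best-effort guard against the lock race might look like this (a sketch only; not something talos does today, and Windows sharing modes mean it is only an approximation):

```python
import time

def wait_until_writable(path, timeout=30, poll=0.5):
    """Poll until we can open the file for writing, i.e. until the previous
    browser/harness process has (probably) let go of it."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with open(path, "a"):
                return True
        except IOError:
            time.sleep(poll)
    return False

# e.g. before reusing the output file for the next cycle:
# wait_until_writable(r"C:\slave\test\build\browser_output.txt")
```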
> These few talos-other failures are demonstrated by this uploaded log.
> http://people.mozilla.org/~jlund/metro-talos-other-fail-processLock-220114.
> txt

This specific failure looks like a combination of things:

1) a browser startup failure -

bad startup:

16:50:38     INFO -  MetroAppShell::Run
16:50:38     INFO -  mozilla::widget::winrt::FrameworkView::ActivateView
16:50:38     INFO -  Resize: 0.000000 0.000000 1600.000000 1200.000000
(nothing here)
16:50:43     INFO -  mozilla::widget::winrt::ProcessManager UNABLE to use job objects to manage child processes
16:50:43     INFO -  DEBUG : Terminating: metrotestharness, plugin-container, crashreporter, dwwim

good startup from higher up in the log:

16:50:38     INFO -  MetroAppShell::Run
16:50:38     INFO -  mozilla::widget::winrt::FrameworkView::ActivateView
16:50:38     INFO -  Resize: 0.000000 0.000000 1600.000000 1200.000000
16:50:38     INFO -  mozilla::widget::winrt::MetroInput::MetroInput
16:50:38     INFO -  mozilla::widget::winrt::FrameworkView::AddEventHandlers hr=0
16:50:38     INFO -  mozilla::widget::winrt::FrameworkView::SetupContracts
16:50:38     INFO -  __start_report707__end_report
16:50:38     INFO -  * delay load started...
16:50:38     INFO -  * delay load complete.
16:50:38     INFO -  __startTimestamp1390438231916__endTimestamp
16:50:38     INFO -  [22F4240] MetroWidget::Destroy mWnd=160234 type=0
16:50:38     INFO -  mozilla::widget::winrt::MetroApp::ShutdownXPCOM: IsMainThread:1 ThreadId:6F4
16:50:38     INFO -  mozilla::widget::winrt::FrameworkView::ShutdownXPCOM

2) a failure by talos to kill the browser process when the browser hangs. It only kills the test harness, which leaves the browser around.

DEBUG : Terminating: metrotestharness, plugin-container, crashreporter, dwwim

from there on out, every launch fails because the browser is still running.
(In reply to Jim Mathies [:jimm] from comment #47)
> DEBUG : Terminating: metrotestharness, plugin-container, crashreporter, dwwim
> 
> from there on out, every launch fails because the browser is still running.

Can we add firefox to this list somehow?
Let's figure out how to kill firefox on metro. What is the main process name we need to kill? Right now we are trying to kill metrotestharness.exe, not firefox.exe.
(In reply to Joel Maher (:jmaher) from comment #49)
> lets figure out how to kill firefox on metro.  What is the main process name
> we need to kill?  Right now we are trying to kill metrotestharness.exe, not
> firefox.exe.

It's still firefox.exe. The test harness also reports this - 

Log(L"METRO_BROWSER_PROCESS=%d", processID);

Although I don't see this in jlund's log. Maybe an out of date harness there? We added this in mochitest bug 905628.
The fix is to kill firefox, not metrotestharness. Here is where we terminate the process:
http://hg.mozilla.org/build/talos/file/tip/talos/ttest.py#l129
which then calls:
 - http://hg.mozilla.org/build/talos/file/tip/talos/ffprocess.py#l55

We reference browser_config['process'] which in this case is 'metrotestharness'. We set browser_config['process'] here:
http://hg.mozilla.org/build/talos/file/tip/talos/run_tests.py#l249
browser_config['process'] = os.path.basename(browser_config['browser_path'])

browser_path is set when we launch talos via the -e parameter (clip from the log above):
browser_path: C:\slave\test\build\application\firefox\metrotestharness

My recommendation is to hack on cleanupProcess (http://hg.mozilla.org/build/talos/file/tip/talos/ffprocess.py#l55) and add:
if process_name == 'metrotestharness':
    self.extra_prog.append('firefox')

alternatively we could edit the extra_prog data when we create ffprocess, but that is inside of ttest.py and we don't do anything there.  Another alternative would be to edit self.extra_prog in ffprocess_win32 (http://hg.mozilla.org/build/talos/file/tip/talos/ffprocess_win32.py#l17).  My top choice is my recommendation in this comment.
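
In context, that recommendation would look roughly like the sketch below (a stripped-down stand-in for ffprocess.py's cleanupProcess, not the literal patch):

```python
class FFProcessSketch(object):
    """Stand-in for talos ffprocess, showing only the shape of the change."""

    def __init__(self):
        # processes talos already terminates alongside the main one
        self.extra_prog = ["plugin-container", "crashreporter", "dwwim"]

    def cleanup_process(self, process_name):
        # metro runs launch firefox.exe as a child of metrotestharness.exe,
        # so the browser itself must be on the kill list too
        if process_name == "metrotestharness" and "firefox" not in self.extra_prog:
            self.extra_prog.append("firefox")
        for name in [process_name] + self.extra_prog:
            self._terminate(name)

    def _terminate(self, name):
        print("would terminate all processes named %r" % name)

FFProcessSketch().cleanup_process("metrotestharness")
```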
Sorry for the delay in reply; I wanted to make sure this patch made the difference.

After numerous tests all day, I was able to get a consistent pass rate: 20 straight greens (unsure of performance change but metro talos is new with few data points to begin with).

This success rate was due to input from both jimm and jmaher. The first problem was that I had the wrong binary after the most recent talos metroharness exe changeset. The second was ensuring that talos closes both the metro harness and the browser.

I was able to see an improvement after applying a different fix for each of these issues and running them separately. This patch represents both fixes. Following an r+ for this, I'll try to get this live ASAP. The buildbot patch to enable this on m-c already has an r+, so it should move quickly.
Attachment #8364871 - Flags: review?(jmaher)
Note that after these fixes land, if we do see some timeout issues on shutdown, they are likely bug 951120, which I've been working on. Mochitests for metro are currently timing out about 15% of the time, and we've made figuring out why a high priority.
Comment on attachment 8364871 [details] [diff] [review]
940690_metro_talos_major_releases-talos-230114.diff

Review of attachment 8364871 [details] [diff] [review]:
-----------------------------------------------------------------

::: metro/metrotestharness_info.txt
@@ +7,5 @@
>      01/01/14
>      http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32/1388536346/firefox-29.0a1.en-US.win32.tests.zip
> +
> +Current:
> +    23/01/14

I think the date formats are not compatible- earlier on you have mm/dd/yy, and this is dd/mm/yy

::: talos/ffprocess.py
@@ +63,5 @@
> +        # We must inform talos about the sub process, the metro browser itself,
> +        # that is spawned from metrotestharness. The metro browser proc is
> +        # given the same name as the non metro equivalent: 'firefox.exe'
> +        if (process_name == "metrotestharness" and
> +                "firefox" not in self.extra_prog):

I would remove the () around the clause, otherwise this is great stuff.
Attachment #8364871 - Flags: review?(jmaher) → review+
talos patch pushed to default: https://hg.mozilla.org/build/talos/rev/82b7680f9eaf

this patch is to enable that talos rev via talos.json
Attachment #8365167 - Flags: review?(jmaher)
Attachment #8365167 - Flags: review?(jmaher) → review+
Whiteboard: [feature] p=0 → [feature] p=0 [leave open]
Whiteboard: [feature] p=0 [leave open] → [feature] [leave open]
I didn't sync the revision in both places.
Attachment #8366079 - Flags: review?(jmaher)
Comment on attachment 8366079 [details] [diff] [review]
940690_metro_talos_m-i_270114.diff

Review of attachment 8366079 [details] [diff] [review]:
-----------------------------------------------------------------

Did you create a new .zip file? Have you run it on try server? I need confirmation of that before r+.
Attachment #8366079 - Flags: review?(jmaher) → review-
==== Status update ====

This should have been in yesterday. Unfortunately, Cedar has been perma-red for windows builds for the last 5 days. I haven't been able to test this new change, which works so well on my dev-master, even once.

I am tempted to just skip the queue and enable it on m-c branches. But with today's long closures, I'll try for tomorrow.
I think I discovered why windows cedar is broken: web-platform-tests -> https://bugzilla.mozilla.org/show_bug.cgi?id=945222#c31
cedar is fixed. merging m-c -> cedar one more time to get more data results.

will make decision tomorrow based on that.
Wahoo!

Looks like we have a dramatic increase in pass rates. I wanted perfection, but something is still not right. Some occasional timeouts that we have seen before are still creeping in. From what I can tell, having caught this by VNCing to a machine when it happens, the metro browser launches and runs tests but the output, and maybe the process itself, is lost to the talos script.

Here is an example of such a timeout (note, this is not new): https://tbpl.mozilla.org/php/getParsedLog.php?id=33795258&tree=Cedar

I think we are at a point where we either make this more visible today and call it 'good enough' by enabling on m-c, or make it perfect first. I'm fine with either.

However, if we go the route of solving first, I would like to request that we pull in someone from ateam to
dedicate themselves to hashing it out. I have tinkered here and there but I know that someone more familiar with the talos scripts would solve this much faster.

Joel, what's your thoughts? Are the improvements sufficient or should we pull in ateam?
Flags: needinfo?(jmaher)
I did a bunch of retriggers- lets see how stable these are in a couple hours.
Flags: needinfo?(jmaher)
looking at the retriggers: https://tbpl.mozilla.org/?tree=Cedar&rev=c08328d9f6ad&jobname=WINNT%206.2, I don't want to move forward on this, although d-m looks pretty good right now.  Thoughts?
The two blue retries are known and greened on the next attempt.

The remaining large majority are 3600 timeout. I think focus should be put on that.

Joel would you have time to look at this or suggest someone I can approach?
I don't know anybody else who would be a good candidate to look at this.  Can we enable some of the tests now and put the timeouts off for a later time?  I have a few projects that need some attention this quarter and have been offset by investigating other oranges for the last 3 weeks.
Thanks Joel, completely understand.

I am going to enable dromaeojs-metro immediately with the thought that enabling 1 suite is better than enabling none.

I can then give devs access to our machines to debug the way the talos repo and metroharness interact, while updating mozharness/buildbot as changes are made.

I don't see this being a huge time sink since we are 90% of the way there. My guess is we need to modify the way that Talos uses ProcessHandler and metroharness.exe.
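
For whoever picks this up, the interaction in question is roughly mozprocess's ProcessHandler driving metrotestharness.exe; a minimal sketch of that launch path (the command line and paths are illustrative, not what talos actually builds):

```python
from mozprocess import ProcessHandler

output_lines = []

def on_line(line):
    # talos parses markers like __start_report/__end_report out of this stream
    output_lines.append(line)

cmd = [r"C:\slave\test\build\application\firefox\metrotestharness.exe",
       "-firefoxpath",
       r"C:\slave\test\build\application\firefox\firefox.exe"]
proc = ProcessHandler(cmd, processOutputLine=[on_line])
proc.run(timeout=3600)   # mirror the 3600s timeout we keep hitting
status = proc.wait()
print("exit status %r, %d lines of output" % (status, len(output_lines)))
```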
Attachment #8361856 - Attachment is obsolete: true
Attachment #8368777 - Flags: review?(armenzg)
Attachment #8368777 - Flags: review?(armenzg) → review+
Comment on attachment 8368777 [details] [diff] [review]
940690_metro_talos_bbotcfgs_geck29_drom_310114.diff

this is checked in: https://hg.mozilla.org/build/buildbot-configs/rev/77f04568b40d
Attachment #8368777 - Flags: checked-in+
Depends on: 967352
Due to bug 967352, it looks like this will need to be backed out :(

I am not sure why it suddenly never wants to go green. It will need to be investigated after the backout.
Attachment #8370142 - Flags: review?(armenzg)
Attachment #8370142 - Flags: review?(armenzg) → review+
Comment on attachment 8370142 [details] [diff] [review]
940690_metro_talos_bbotcfgs_rm_drom-040214.diff

checked in: https://hg.mozilla.org/build/buildbot-configs/rev/f5a02b58d338
Attachment #8370142 - Flags: checked-in+
buildbot-config patch is in production :)
Can you try running metro talos jobs on a PGO build?
PGO builds are a bit different than optimized builds (non-PGO).
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #73)
> Can you try running metro talos jobs on a PGO build?
> PGO builds are a bit different than optimized builds (non-PGO).

you can see here that PGO was also not passing: https://bugzil.la/967352 However, I did a couple on my dev-master and also caught the same timeout.

=== Status update ===

Dromaeojs-metro failed too often on m-c/aurora and had to be backed out.

Basically, none of these suites are reliable enough to make it onto every release/project branch. :(

In order to proceed:

1) from a buildbot/mozharness perspective, everything is in place
2) someone needs to actively work on/massage the talos repo/metroharness. Ideally this would be ateam[1] when priorities are high enough or time allows
3) I can loan one of our releng machines and also assist in running this on my dev machine (note, all of metro-talos will stay on Cedar as well for dev testing)
4) patches for putting this on major branches have been processed, so once the issue is solved, I can land immediately.

[1] I will be continuing my work on porting desktop builds from buildbot to mozharness (https://bugzil.la/858797). If ateam continues to have other priorities and the 'higher powers' prefer me to work on metro talos instead, I am sure, given enough time, I can figure out the talos system to solve the problem.
Assignee: jlund → nobody
Status: NEW → ASSIGNED
jimm do you think we should close this bug? Should we also remove suites running on cedar?
Flags: needinfo?(jmathies)
ahh I see https://bugzil.la/982034 for a dedicated metro branch.

I will mirror https://bugzil.la/983269 and keep these suites running on cedar until the new branch is set up.

BTW, these talos suites have been running very well on cedar lately.
I recall reading that we will not be working on metro firefox going forward; quite possibly we could discontinue this effort?
FYI - started: Bug 984476 - move mochitest-metro and talos-metro suites running on cedar over to new dedicated metro proj branch
marking as resolved since we will not be putting metro on major release branches.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Flags: needinfo?(jmathies)
Component: General Automation → General