Bug 601798: create tp5 pageset
Status: RESOLVED FIXED (Closed)
Opened 14 years ago, closed 14 years ago
Categories: Testing :: Talos, defect
Tracking: Not tracked
People: Reporter: anodelman, Assigned: anodelman
Attachments (6 files, 3 obsolete files)
- 7.21 KB, patch, jmaher: review+
- 1.33 KB, patch, jmaher: review+
- 2.45 KB, patch, bhearsum: review+
- 8.81 KB, patch, armenzg: review+
- 11.24 KB, patch, anodelman: review+
- 1.77 KB, patch
tp4 is getting old, need a refresh.
This will not be rolled out until after Firefox 4, so as not to affect the release schedule.
As part of a post-mortem we talked about possibly generating the list differently. For sites like facebook the index/public pages don't test the actual pages users use/interact with.
I'm not sure any snapshot will accurately capture what we are looking for.
At the very least we should create a developer test account for facebook (likely linked to the main firefox account, see http://developers.facebook.com/blog/post/35). That account should add/play the top 10 or so apps on facebook. Not sure how that would interact with a snapshot, but that way we don't need someone to give us their page.
It looks like Google may have similar test accounts available (http://code.google.com/googleapps/domain/email_migration/developers_guide_protocol.html). I'm sure we could reach out and get test accounts for most other sites as well. I know other companies have test accounts for various sites and games, so most companies we ask would likely already have a process in place
Is the same set of sites used for mobile as well? The mobile top sites may be different and behave differently depending on UA / capability sniffing.
I think the days of taking a static snapshot of the top URLs and believing it is a representative sample are over FWIW...
(In reply to comment #2)
> I'm not any snapshot will accurately capture what we are looking for.
Not sure that is...
Comment 4 • 14 years ago
(In reply to comment #2)
> I think the days of taking a static snapshot of the top URLs and believing it
> is a representative sample are over FWIW...
Using live sites, however, is dangerous because when Google rolls out changes our Tp numbers will change. I don't know that this is a tenable solution. Do you really think we'll be able to distinguish "someone checked in a regression" from "Google changed their code / servers / caching and our Tp went up"?
Comment 5 • 14 years ago (Assignee)
The local snapshot also removes issues with live sites not responding for whatever reason. We play our web pages from a local Apache server, which removes a lot of noise from the results.
(In reply to comment #4)
> Using live sites, however, is dangerous because when Google rolls out changes
> our Tp numbers will change. I don't know that this is a tenable solution. Do
> you really think we'll be able to distinguish "someone checked in a regression"
> from "Google changed their code / servers / caching and our Tp went up"?
Right, but if google rolls out a change and the Tp numbers get worse, isn't that something we need to know? From the eventual user's point of view Tp did get slower. I guess the difference is I'm focusing on the product as a whole where Tp is perhaps meant to focus on the checkins.
(In reply to comment #5)
> The local snapshop also remove issues with live sites not responding for
> whatever reason. We play our web pages from a local apache server, which
> removes a lot of noise from the results.
I'm not saying going with live sites is the way to go (I definitely think they would be too noisy). I just think we should take a hard look at Tp, what its goals are, and perhaps spin up another system if it's not covering everything we think it should.
Comment 7 • 14 years ago (Assignee)
From a testing standpoint we need something:
- repeatable
- meaningful
Sounds like meaningful is the issue here, both due to pages aging and no longer representing the 'real world' and to pages being non-logged in (i.e., just a login screen instead of any content).
As to pages aging the answer could be to update the tp test set more frequently (we aim for every year or so). For non-logged in I would still see having copies culled from real users as the answer.
I would be careful not to try and make tp the everything of tests. Its basic purpose is to get a feel for how quickly we can load web pages. Other sorts of tests should be designed to cover other areas of interest.
Yep, my point is that Tp isn't really meaningful (from a release standpoint). No regression in Tp doesn't really tell me anything about how the build will react once we release. Well, it tells me on those pages at that particular point in time the build will act no worse, but even that isn't super strong as the live pages might have changed since, the server serving the pages could do something wonky, they could be slow when loaded through a proxy because of buggy proxy code, perhaps the live site forwards to https once you login, etc.
I guess having something to compare against / reason about is better than nothing, and agree perhaps I'm talking about another tool that needs to be written.
Anyway, this isn't really related to the bug at hand, so I can take this meta discussion elsewhere.
Comment 9 • 14 years ago
When running the standalone talos with Tp4 I noticed that some of the pages tried to load content from non-local resources. Since my proxy blocked those attempts the pages didn't finish loading until they timed out, leading to exorbitant loading times. So I think that care should be taken to eliminate any outside links to avoid these kinds of scenarios, since they obviously lead to a lot of noise.
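One way to catch such outside links before a pageset ships is to scan the saved pages for absolute URLs that point away from the local server. The following is a minimal sketch, not part of talos: the function names, the regular expression, and the list of allowed hosts are assumptions made for illustration, and the regex is a rough heuristic rather than a real HTML parser.

```python
import re
from pathlib import Path
from urllib.parse import urlparse

# Rough heuristic: absolute http(s) URLs inside src/href attributes.
EXTERNAL_REF = re.compile(r"""(?:src|href)\s*=\s*["'](https?://[^"'\s>]+)""", re.IGNORECASE)


def find_external_refs(html, allowed_hosts=("localhost", "127.0.0.1")):
    """Return the URLs in `html` whose host is not one of `allowed_hosts`."""
    urls = EXTERNAL_REF.findall(html)
    return [u for u in urls if urlparse(u).hostname not in allowed_hosts]


def scan_pageset(root):
    """Map each .html file under `root` to the non-local URLs it references."""
    report = {}
    for path in Path(root).rglob("*.html"):
        refs = find_external_refs(path.read_text(errors="replace"))
        if refs:
            report[str(path)] = refs
    return report
```

Running `scan_pageset` on the unpacked pageset directory would list the pages that still reach outside; references built by scripts at load time would not be caught by this kind of static scan.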
Updated • 14 years ago (Assignee)
Assignee: nobody → anodelman
Comment 10 • 14 years ago (Assignee)
Attachment #530172 - Flags: review?(jmaher)
Comment 11 • 14 years ago (Assignee)
Add special tp5 case to buildbotcustom; when tp4 is retired we can remove the tp4 code.
Attachment #530176 - Flags: review?(bhearsum)
Comment 12 • 14 years ago
Comment on attachment 530172: [checked in] add tp5 to the graph server db
Review of attachment 530172:
looks good to me
Attachment #530172 - Flags: review?(jmaher) → review+
Updated • 14 years ago
Attachment #530176 - Flags: review?(bhearsum) → review+
Comment 13 • 14 years ago (Assignee)
Attachment #531778 - Flags: review?(bhearsum)
Comment 14 • 14 years ago (Assignee)
Attachment #531779 - Flags: review?(jmaher)
Comment 15 • 14 years ago (Assignee)
Attachment #531778 - Attachment is obsolete: true
Attachment #531787 - Flags: review?(bhearsum)
Attachment #531778 - Flags: review?(bhearsum)
Comment 16 • 14 years ago (Assignee)
Adds support to buildbotcustom for downloading multiple pagesets.
Attachment #530176 - Attachment is obsolete: true
Attachment #531788 - Flags: review?(bhearsum)
Comment 17 • 14 years ago (Assignee)
Comment on attachment 530172: [checked in] add tp5 to the graph server db
changeset: 351:d11c8bf075c1
Attachment #530172 - Attachment description: add tp5 to the graph server db → [checked in]add tp5 to the graph server db
Comment 18 • 14 years ago
Comment on attachment 531779: [checked in] add tp5 to talos sample.config
Review of attachment 531779:
Attachment #531779 - Flags: review?(jmaher) → review+
Updated • 14 years ago
Attachment #531788 - Flags: review?(bhearsum) → review+
Comment 19 • 14 years ago
Comment on attachment 531787: enable tp5 in config.py (take 2)
This patch seems mostly fine, but I have a few questions:
- How long does tp5 take to run?
- How long are we planning to run tp4 and tp5 simultaneously?
- Is it OK to be running these one after the other from a caching perspective? Eg, does running tp5 directly after tp4 change its numbers at all? If so, that'll be an issue when we drop tp4.
I ask about the timings because our test pool is pretty clogged up already these days, and I want to get in front of anything that will put us further behind messaging-wise.
Comment 20 • 14 years ago (Assignee)
- tp5 in my tests took 10-15 minutes, so pretty much the same as tp4
- we'd like to run them side by side for ~2 weeks, to create a baseline and ensure that we get matched regressions
- running them one after another is fine: we switch out profiles between tests and there are also no pages shared between tp4/tp5. Thus, it is as risky as the other tests that we choose to run in sets
Mostly, I wanted to get it up and running so that we can start the timer on discarding tp4. If you'd like we can split the test out, but that would increase the amount of time eaten by it as you'd have the overhead of the machine reboot + setup steps.
Comment 21 • 14 years ago
(In reply to comment #20)
> - running them one after an other is fine - we switch out profiles between
> tests and there are also no pages shared between tp4/tp5. Thus, it is as
> risky as the other tests that we choose to run in sets
Ah, I'd forgotten that we ran other tests like this. No reason to split them out then, indeed!
> - tp5 in my tests took 10-15 minutes, so pretty much the same as tp4
> - we'd like to run them side by side for ~2 weeks, to create a baseline and
> ensure that we get matched regressions
> Mostly, I wanted to get it up and running so that we can start the timer on
> discarding tp4. If you'd like we can split the test out, but that would
> increase the amount of time eaten by it as you'd have the overhead of the
> machine reboot + setup steps.
I don't really feel equipped to yay/nay turning this on, even if the elevated load is just for two weeks.
John, Chris - what do you two think?
Comment 22 • 14 years ago (Assignee)
Now that tp4.zip/tp5.zip are available on the build server I'm doing a final test run today to ensure that everything works from soup to nuts. This will also give a final timing run.
Comment 23 • 14 years ago (Assignee)
All green with patches working in concert.
tp5 takes the same execution time as tp4; on fed64 they both took 10 minutes.
Comment 24 • 14 years ago (Assignee)
Comment on attachment 531779: [checked in] add tp5 to talos sample.config
changeset: 238:c84f630d576f
Attachment #531779 - Attachment description: add tp5 to talos sample.config → [checked in]add tp5 to talos sample.config
Comment 25 • 14 years ago (Assignee)
Found a minor issue with file name length in tp4 on win7; going to fix it and post a new tp5.zip to be added to the build server.
Comment 26 • 14 years ago (Assignee)
Newly posted tp5.zip all green.
This is good to deploy now.
Comment 27 • 14 years ago
(In reply to comment #21)
> (In reply to comment #20)
> > - running them one after an other is fine - we switch out profiles between
> > tests and there are also no pages shared between tp4/tp5. Thus, it is as
> > risky as the other tests that we choose to run in sets
> Ah, I'd forgotten that we ran other tests like this. No reason to split them
> out then, indeed!
Alice, just to be clear, with this "testing tp4,tp5 suites together"
* do you assert there will *not* be any tp5 wobble when we disable tp4?
* can we run tp4+tp5 in some branches but tp4-only in other branches, until we rollout to tp5-only in all branches?
(For example, starting tp4+tp5 on m-c,try but tp4-only on all other branches. Given the record high load we are dealing with now, it's not OK to just double tp load across the board; it seems more prudent to carefully limit doubling our tp load to only the branches where it is needed.)
> > - tp5 in my tests took 10-15 minutes, so pretty much the same as tp4
> > - we'd like to run them side by side for ~2 weeks, to create a baseline and
> > ensure that we get matched regressions
* Is a two-week transition long enough to migrate from tp4 to tp5?
* Given the new rapid release cadence, can we time this changeover to happen between scheduled migrations across branches?
> > Mostly, I wanted to get it up and running so that we can start the timer on
> > discarding tp4.
+1. TP4 is old, so a refresh of pageset is great.
> > If you'd like we can split the test out, but that would
> > increase the amount of time eaten by it as you'd have the overhead of the
> > machine reboot + setup steps.
>
> I don't really feel equipped to yay/nay turning this on, even if the
> elevated load is just for two weeks.
> John, Chris - what do you two think?
Once we figure out a notification and rollout plan (questions above), I'm fine with updating from Tp4 to Tp5.
aki, alice, mfinkle: what about tp5 for mobile? Do we need a Tp5m, or does this new Tp5 as-is work fine on maemo+android?
Comment 28 • 14 years ago
(In reply to comment #27)
> (In reply to comment #21)
> > (In reply to comment #20)
> > > - running them one after an other is fine - we switch out profiles between
> > > tests and there are also no pages shared between tp4/tp5. Thus, it is as
> > > risky as the other tests that we choose to run in sets
> > Ah, I'd forgotten that we ran other tests like this. No reason to split them
> > out then, indeed!
> Alice, just to be clear, with this "testing tp4,tp5 suites together"
> * do you assert there will *not* be any tp5 wobble when we disable tp4?
* do you assert there will *not* be any tp4 wobble when we add tp5? (This would make the difference between needing a tree closure or not).
Comment 29 • 14 years ago
(In reply to comment #27)
> aki, alice, mfinkle: what about tp5 for mobile? Do we need a Tp5m, or does
> this new Tp5 as-is work fine on maemo+android?
We do not want a Tp5m unless we get a lot of mobile pages in the set. We can't use Tp5 for Maemo, it will kill the devices. I'd be happier getting Tp4m working on Maemo.
Comment 30 • 14 years ago (Assignee)
Yes, I assert that there will be no tp4 number wobble by adding tp5. They run using newly generated profiles and totally different pagesets - there is no sharing of cache.
Comment 31 • 14 years ago (Assignee)
Two weeks has been the acceptable switchover time in the past to go from one pageset to another; it is enough time to get a decent baseline and ensure that there is no reason to revert to tp4.
Comment 32 • 14 years ago
(In reply to comment #27)
> (In reply to comment #21)
> > (In reply to comment #20)
> > > - running them one after an other is fine - we switch out profiles between
> > > tests and there are also no pages shared between tp4/tp5. Thus, it is as
> > > risky as the other tests that we choose to run in sets
> > Ah, I'd forgotten that we ran other tests like this. No reason to split them
> > out then, indeed!
> Alice, just to be clear, with this "testing tp4,tp5 suites together"
> * do you assert there will *not* be any tp5 wobble when we disable tp4?
(In reply to comment #30)
> Yes, I assert that there will be no tp4 number wobble by adding tp5. They
> run using newly generated profiles and totally different pagesets - there is
> no sharing of cache.
Great!
> * can we run tp4+tp5 in some branches but tp4-only in other branches, until
> we rollout to tp5-only in all branches?
> (For example starting tp4+tp5 on m-c,try but tp4-only on all other branches.
> Given the record high load we are dealing with now, its not ok to just
> double tp load across the board - it seems more prudent to carefully limit
> doubling our tp load to only the branches where it is needed.)
In meeting just now, Armen suggested running tp4, tp5 side-by-side *only* on tracemonkey branch for the 2 week transition period. If all goes well, then roll out to all other branches.
> > > - tp5 in my tests took 10-15 minutes, so pretty much the same as tp4
> > > - we'd like to run them side by side for ~2 weeks, to create a baseline and
> > > ensure that we get matched regressions
> * Is two week transition long enough to migrate from tp4 to tp5?
(In reply to comment #31)
> Two weeks has been the acceptable switchover time in the past to go from one
> pageset to another, it is enough time to get a decent baseline and ensure
> that there is no reason to revert to tp4.
Who will be making the "no reason to revert" decision?
> * Given the new rapid release cadence, can we time this changeover to happen
> between scheduled migrations across branches?
Updated • 14 years ago
Updated • 14 years ago
Comment 33 • 14 years ago
I will drive this to completion and do the communication with developers.
I am going to enable tp5 on tracemonkey to begin with.
Updated • 14 years ago
Assignee: anodelman → armenzg
Comment 35 • 14 years ago (Assignee)
This should enable tp5 on tracemonkey only. Due to auto-tools currently having no working master I cannot do staging for this patch.
Attachment #537840 - Flags: review?(armenzg)
Comment 36 • 14 years ago
Comment on attachment 537840: (checked-in) enable tp5 on tracemonkey only
This looks good but I doubt it will work without some work in here:
http://hg.mozilla.org/build/buildbotcustom/file/tip/process/factory.py#l8061
and
http://hg.mozilla.org/build/buildbotcustom/file/tip/process/factory.py#l8155
AFAIK there are currently no talos jobs that have more than one suite being run inside of them.
For instance for unittests we iterate over the list of suites that need to be run.
http://hg.mozilla.org/build/buildbotcustom/file/tip/process/factory.py#l7287
I am willing to take tp5 as an extra suite for tracemonkey to make things easier. The ability to run two talos suites on the same jobs is needed but I don't think I should block you on it.
Would you be able to modify the patch to add tp5 to tracemonkey as a separate job rather than two merged suites? This will get us unstuck without blocking ourselves on the multiple-suites-per-job feature.
I will be busy for the next 2/3 days with the release but I can help out.
If you want/need to get yourself ahead of the VM you can use this trick:
https://wiki.mozilla.org/ReleaseEngineering:TestingTechniques#setup_one_master_and_output_the_steps_for_it
I have used it before to avoid setting up masters and slaves while still attaching a diff to a bug showing that my change did exactly what I wanted.
Comment 37 • 14 years ago (Assignee)
The fix you are requesting is in the other patch in this bug, "buildbotcustom fix for tp5 (take 2)" which has been r+ed by bhearsum.
There are already talos jobs that run more than one test - like the dromaeo tests:
'dromaeo': GRAPH_CONFIG + ['--activeTests', 'dromaeo_basics:dromaeo_v8:dromaeo_sunspider:dromaeo_jslib:dromaeo_css:dromaeo_dom'],
This is actually 6 tests run in a row, much like I want to run tp5 directly after tp4.
I believe the patches that I have presented already work as expected and can be deployed.
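To illustrate the point about several tests sharing one job, here is a minimal sketch of how a colon-separated `--activeTests` value can be built and read back in order. The `GRAPH_CONFIG` contents are assumed from the step output that appears elsewhere in this bug, and `suite_options` and `active_tests` are hypothetical helpers for illustration, not functions from buildbotcustom or talos.

```python
# Assumed contents, taken from the addOptions shown in this bug's step output.
GRAPH_CONFIG = [
    "--resultsServer", "graphs.mozilla.org",
    "--resultsLink", "/server/collect.cgi",
]


def suite_options(tests):
    """Build the option list for one job that runs `tests` in the given order."""
    return GRAPH_CONFIG + ["--activeTests", ":".join(tests)]


def active_tests(options):
    """Recover the ordered test names from an option list."""
    value = options[options.index("--activeTests") + 1]
    return value.split(":")


tp_options = suite_options(["tp4", "tp5"])
```

With this shape, adding tp5 to the existing tp job is a change to the test list only; the job count stays the same.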
Comment 38 • 14 years ago (Assignee)
Any ETA here, considering that the review questions have been answered?
Comment 39 • 14 years ago
(In reply to comment #38)
> Any ETA here, considering that the review questions have been answered?
Alice, before we roll anything into production, can you please answer (or find owners for) the remaining questions in comment#27, comment#32?
Comment 40 • 14 years ago
/me fixes dependency
Comment 41 • 14 years ago (Assignee)
- two weeks has been the acceptable switchover time frame before, and it still is now
- testing tp4/tp5 together is not a risk, we do that with other suites
- I will make the call as to whether we need to revert or not
- there is no tp5 for mobile as mobile now uses a custom set of mobile only pages
- we are ready to roll out to tracemonkey only as a first step
I believe that all the questions are answered.
Comment 42 • 14 years ago (Assignee)
As a note, the roll out of tp4/tp5 to tracemonkey won't save us from running them side by side on the rest of the branches. We always need to construct a workable history of any given performance test to make any sense of the results - thus you end up running them side by side for 2 weeks. During that 2 weeks developers can refer to the tp4 numbers for their work, while at the same time we grow a 2 week set of tp5 history.
Comment 43 • 14 years ago
Comment on attachment 537840: (checked-in) enable tp5 on tracemonkey only
This seems good in light of your reply.
I am also running it on staging.
It should show up on tinderbox:
http://tinderbox.mozilla.org/showbuilds.cgi?tree=MozillaTest&noignore=1
I will check tomorrow and see if we can proceed.
By running a comparison [1] I see these new steps being added as expected.
[1] https://wiki.mozilla.org/ReleaseEngineering:TestingTechniques#setup_one_master_and_output_the_steps_for_it
- MozillaUpdateConfig {'addOptions': ['--resultsServer', 'graphs.mozilla.org', '--resultsLink', '/server/collect.cgi', '--activeTests', 'ts_paint:tpaint', '--setPref', 'dom.send_after_paint_to_content=true'], 'addonTester': False, 'branch': 'TraceMonkey', 'branchName': 'TraceMonkey', 'command': None, 'description': None, 'descriptionDone': None, 'env': {'XPCOM_DEBUG_BREAK': 'warn', 'MOZ_NO_REMOTE': '1', 'CYGWINBASE': 'C:\\cygwin', 'PATH': 'C:\\Python24;C:\\Python24\\Scripts;C:\\cygwin\\bin;C:\\WINDOWS\\System32;C:\\program files\\gnuwin32\\bin;C:\\WINDOWS;', 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'NO_EM_RESTART': '1'}, 'executablePath': <buildbot.process.properties.WithProperties>, 'extName': 'addon.xpi', 'haltOnFailure': True, 'log_eval_func': None, 'logfiles': {}, 'remoteExtras': {}, 'remoteProcessName': 'org.mozilla.fennec', 'remoteTests': False, 'usePTY': 'slave-config', 'useSymbols': True, 'workdir': '../talos-data/talos/'} {}
+ DownloadFile {'command': None, 'description': None, 'descriptionDone': None, 'filename_property': None, 'ignore_certs': False, 'log_eval_func': None, 'logfiles': {}, 'url': 'http://build.mozilla.org/talos/zips/plugins.zip', 'url_fn': None, 'url_property': None, 'usePTY': 'slave-config', 'wget_args': None, 'workdir': '../talos-data/talos/base_profile'} {}
+ UnpackFile {'command': None, 'description': None, 'descriptionDone': None, 'filename': 'plugins.zip', 'log_eval_func': None, 'logfiles': {}, 'scripts_dir': '.', 'usePTY': 'slave-config', 'workdir': '../talos-data/talos/base_profile'} {}
+ DownloadFile {'command': None, 'description': None, 'descriptionDone': None, 'filename_property': None, 'ignore_certs': False, 'log_eval_func': None, 'logfiles': {}, 'url': 'http://build.mozilla.org/talos/zips/tp4.zip', 'url_fn': None, 'url_property': None, 'usePTY': 'slave-config', 'wget_args': None, 'workdir': '../talos-data/talos/page_load_test'} {}
+ UnpackFile {'command': None, 'description': None, 'descriptionDone': None, 'filename': 'tp4.zip', 'log_eval_func': None, 'logfiles': {}, 'scripts_dir': '.', 'usePTY': 'slave-config', 'workdir': '../talos-data/talos/page_load_test'} {}
+ DownloadFile {'command': None, 'description': None, 'descriptionDone': None, 'filename_property': None, 'ignore_certs': False, 'log_eval_func': None, 'logfiles': {}, 'url': 'http://build.mozilla.org/talos/zips/tp5.zip', 'url_fn': None, 'url_property': None, 'usePTY': 'slave-config', 'wget_args': None, 'workdir': '../talos-data/talos/page_load_test'} {}
+ UnpackFile {'command': None, 'description': None, 'descriptionDone': None, 'filename': 'tp5.zip', 'log_eval_func': None, 'logfiles': {}, 'scripts_dir': '.', 'usePTY': 'slave-config', 'workdir': '../talos-data/talos/page_load_test'} {}
+ MozillaUpdateConfig {'addOptions': ['--resultsServer', 'graphs.mozilla.org', '--resultsLink', '/server/collect.cgi', '--activeTests', 'tp4:tp5'], 'addonTester': False, 'branch': 'TraceMonkey', 'branchName': 'TraceMonkey', 'command': None, 'description': None, 'descriptionDone': None, 'env': {'XPCOM_DEBUG_BREAK': 'warn', 'MOZ_NO_REMOTE': '1', 'CYGWINBASE': 'C:\\cygwin', 'PATH': 'C:\\Python24;C:\\Python24\\Scripts;C:\\cygwin\\bin;C:\\WINDOWS\\System32;C:\\program files\\gnuwin32\\bin;C:\\WINDOWS;', 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'NO_EM_RESTART': '1'}, 'executablePath': <buildbot.process.properties.WithProperties>, 'extName': 'addon.xpi', 'haltOnFailure': True, 'log_eval_func': None, 'logfiles': {}, 'remoteExtras': {}, 'remoteProcessName': 'org.mozilla.fennec', 'remoteTests': False, 'usePTY': 'slave-config', 'useSymbols': True, 'workdir': '../talos-data/talos/'} {}
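The comparison above boils down to diffing two dumps of a builder's step list, one taken before the patch and one after. A minimal sketch of that idea with Python's standard `difflib` follows; the step strings are shortened, made-up stand-ins for the real step dumps, and `diff_steps` is a hypothetical helper, not the wiki technique's actual tooling.

```python
import difflib


def diff_steps(before, after):
    """Return unified-diff lines between two lists of step descriptions."""
    return list(difflib.unified_diff(
        before, after, fromfile="before", tofile="after", lineterm="", n=0))


# Shortened, illustrative step descriptions (not real buildbot output).
before = [
    "DownloadFile tp4.zip",
    "UnpackFile tp4.zip",
    "MozillaUpdateConfig --activeTests tp4",
]
after = [
    "DownloadFile tp4.zip",
    "UnpackFile tp4.zip",
    "DownloadFile tp5.zip",
    "UnpackFile tp5.zip",
    "MozillaUpdateConfig --activeTests tp4:tp5",
]

for line in diff_steps(before, after):
    print(line)
```

Lines prefixed with `+` are the steps the patch adds, which makes it easy to confirm that only the intended download, unpack, and config steps changed.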
Attachment #537840 - Flags: review?(armenzg) → review+
Comment 44 • 14 years ago
Comment on attachment 531787: enable tp5 in config.py (take 2)
Removing myself from this review request because Armen has taken over the releng side of things (and I'm not sure if this patch is still current).
Attachment #531787 - Flags: review?(bhearsum)
Comment 45 • 14 years ago
I tested this on staging and did not succeed.
I see tp4 & tp5 unpacked to the same place:
c:\talos-slave\test\../talos-data/talos/page_load_test
Not sure if that is wanted/expected.
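Whether sharing one unpack directory is safe depends on whether the two archives contain any identical member paths. Log lines elsewhere in this bug show tp5 pages served from under `page_load_test/tp5/`, which suggests each pageset lives in its own top-level directory, but that is an inference. The sketch below checks it directly; `colliding_paths` and `zips_collide` are hypothetical helper names, and the archive locations are whatever you pass in.

```python
import zipfile


def colliding_paths(names_a, names_b):
    """Return file paths present in both member lists, ignoring directory entries."""
    files_a = {n for n in names_a if not n.endswith("/")}
    files_b = {n for n in names_b if not n.endswith("/")}
    return sorted(files_a & files_b)


def zips_collide(zip_a, zip_b):
    """Return member paths that would overwrite each other if both zips
    were unpacked into the same directory."""
    with zipfile.ZipFile(zip_a) as a, zipfile.ZipFile(zip_b) as b:
        return colliding_paths(a.namelist(), b.namelist())
```

An empty result from `zips_collide("tp4.zip", "tp5.zip")` would mean unpacking both into `page_load_test` cannot clobber files from either pageset.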
python PerfConfigurator.py -v -e ../firefox/firefox -t talos-r3-w7-002 -b TraceMonkey --branchName MozillaTest --resultsServer graphs-stage.mozilla.org --resultsLink /server/collect.cgi --activeTests tp4:tp5 --symbolsPath ../symbols
python run_tests.py --noisy 20110608_1358_config.yml
anode, can you look at the logs and let me know what is going on?
The only one that succeeded is 10.5.
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1307566395.1307567777.7465.gz&fulltext=1
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1307566308.1307567228.4687.gz&fulltext=1
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1307566363.1307567990.8796.gz&fulltext=1
Comment 46 • 14 years ago (Assignee)
Looks like you have two freezes in tp4. I can't say why tp4 would be freezing.
You are running the test without the talos patch listed in this bug - that is why it isn't even attempting to run tp5.
You may need to set up a clean environment and just test tp4 to ensure that you can get a good run out of it, otherwise the issue is with the staging env.
Otherwise, you could log into the machine and attempt to run the test manually and see what the browser is doing.
Comment 47 • 14 years ago
(In reply to comment #46)
> Looks like you have two freezes in tp4. I can't say why tp4 would be
> freezing.
>
> You are running the test without the talos patch listed in this bug - that
> is why it isn't even attempting to run tp5.
>
> You may need to set up a clean environment and just test tp4 to ensure that
> you can get a good run out of it, otherwise the issue is with the staging
> env.
>
> Otherwise, you could log into the machine and attempt to run the test
> manually and see what the browser is doing.
I concur with Alice, it looks like the buildbot side of things is busted. Is there any way we can get alice access to this staging box so we can sort this out more quickly since you're (armen) busy with a release?
Comment 48 • 14 years ago
I mentioned it on IRC.
If access to a machine is needed, a bug should be filed and buildduty/IT would pick that up fairly quickly (hours).
I am going to try to kick this with the talos bundle from bug 658392.
Comment 49 • 14 years ago
In case anyone needs to run this manually, here are the steps. I hope I recovered them properly.
mkdir -p ../talos-data/talos
cd ../talos-data
wget --progress=dot:mega -N http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/tracemonkey-win32/1307480826/firefox-7.0a1.en-US.win32.zip
unzip -o firefox-7.0a1.en-US.win32.zip
cd firefox; chmod -v -R a+x .; cd ..
# wget --progress=dot:mega -N http://build.mozilla.org/talos/zips/talos.zip
wget --progress=dot:mega -N http://people.mozilla.org/~anodelman/taloszips/c84f630d576f/talos.zip
unzip -o talos.zip
mkdir -p talos/page_load_test
cd talos/page_load_test; wget --progress=dot:mega -N http://build.mozilla.org/talos/xpis/pageloader.xpi; cd ../..
mkdir -p talos/base_profile
cd talos/base_profile; wget --progress=dot:mega -N http://build.mozilla.org/talos/zips/plugins.zip; unzip -o plugins.zip; cd ../..
cd talos/page_load_test; wget --progress=dot:mega -N http://build.mozilla.org/talos/zips/tp4.zip; unzip -o tp4.zip; cd ../..
cd talos/page_load_test; wget --progress=dot:mega -N http://build.mozilla.org/talos/zips/tp5.zip; unzip -o tp5.zip; cd ../..
cd talos; python PerfConfigurator.py -v -e ../firefox/firefox -t talos-r3-w7-002 -b TraceMonkey --branchName MozillaTest --resultsServer graphs-stage.mozilla.org --resultsLink /server/collect.cgi --activeTests tp4:tp5 --symbolsPath ../symbols; python run_tests.py --noisy 20110608_1358_config.yml; cd ..
Comment 50 • 14 years ago
The problems in comment 45 occurred because the current talos.zip was used rather than the one that anode posted on bug 658392.
I am setting up http://people.mozilla.com/~armenzg/talos with all the production bundles and have replaced the talos.zip bundle with the one that anode posted.
I will soon trigger a new set of tp jobs and see what happens.
Comment 51 • 14 years ago
Sweet.
This seems to be running now.
We should see the results any time now:
http://tinderbox.mozilla.org/showbuilds.cgi?tree=MozillaTest
anode, I am using:
* attachment 531788
* attachment 537840
* http://people.mozilla.org/~anodelman/taloszips/c84f630d576f/talos.zip
* local hack to point to http://people.mozilla.com/~armenzg/talos
I am going to post now a comment assuming that this cannot really wait until Monday. If it can wait until Monday please let's do so as I don't see why I have to put burden on other team members and IT when I can resume it on Monday (I am done with the release). I will check results tonight and perhaps I can send the email mentioned at the end myself.
anode, can you check the results on the MozillaTest tree once they are done?
anyone from releng, if anode approves the results on tinderbox could you please land the two attachments and deploy the new talos.zip?
We also have to check:
* that tbpl shows tp4 and tp5 properly
* that graph server shows both tp4 and tp5
If everything goes well we can close bug 663192. Otherwise use discernment on what to do next.
If everything lands properly for tracemonkey tomorrow, could we send an email to dev.planning saying:
"We are ready to enable tp5 alongside tp4 for a period of two weeks or less, to set a baseline and give devs enough time to start using tp5 instead of tp4. For a few weeks releng has added more rev3 machines from the win7 64-bit until Win 64-bit builds are fully supported. This gives us enough capacity to handle this extra load and the generally bad wait times.
Please feel free to raise your questions/concerns.
Remember that this is only for 2 weeks and everything will be going back to normal"
Comment 52 • 14 years ago (Assignee)
Green runs overnight:
talos-r3-leopard-001
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1307655293.1307659949.20075.gz&fulltext=1
talos-r3-fed64-002
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1307655308.1307664309.6116.gz&fulltext=1
talos-r3-xp-003
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1307655298.1307657590.9868.gz&fulltext=1
Comment 53 • 14 years ago
I noticed these, in case they are important:
> Running test tp5:
> Started Thu, 09 Jun 2011 15:50:34
> LoadPlugin: failed to initialize shared library libXt.so [libXt.so: cannot open shared object file: No such file or directory]
> LoadPlugin: failed to initialize shared library libXext.so [libXext.so: cannot open shared object file: No such file or directory]
> LoadPlugin: failed to initialize shared library /tmp/tmphprZP2/profile/plugins/libflashplayer.so [/tmp/tmphprZP2/profile/plugins/libflashplayer.so: wrong ELF class: ELFCLASS32]
> Screen width/height:1600/1200
> colorDepth:24
> Browser inner width/height: 1024/682
...
> NOISE: Cycle 10: loaded http://localhost/page_load_test/tp5/yandex.ru/yandex.ru/yandsearch@text=mozilla&lr=21215.html (next: http://localhost/page_load_test/tp5/cgi.ebay.com/cgi.ebay.com/ALL-NEW-KINDLE-3-eBOOK-WIRELESS-READING-DEVICE-W-WIFI-/130496077314@pt=LH_DefaultDomain_0&hash=item1e622c1e02.html)
> Corrupt JPEG data: 8 extraneous bytes before marker 0xe1
...
> NOISE: Cycle 10: loaded http://localhost/page_load_test/tp5/goo.ne.jp/goo.ne.jp/index.html (next: http://localhost/page_load_test/tp5/alipay.com/www.alipay.com/index.html)
> Corrupt JPEG data: 40 extraneous bytes before marker 0xee
This question is unrelated to the deployment of this:
* How is it that tp5 takes 10-15 mins on your machine while both tp4 & tp5 take more than an hour on the test machines? Are you cycling once over the pages instead of 10 times?
Assignee | ||
Comment 54•14 years ago
|
||
I am cycling 10 times. Can you link me to a log showing the longer running time?
Comment 55•14 years ago
|
||
From the two logs you pasted you can see that fed64 took more than 2 hours and leopard took more than an hour.
TinderboxPrint: cycle time: 01:14:41
TinderboxPrint: cycle time: 02:27:33
Xp and Win7 took 36-38mins which is what we were expecting.
I have re-triggered both jobs; let's see what happens. Running on the same slave might not yield a difference though.
Comment 56•14 years ago
|
||
I ran the Leopard jobs again with different slaves and I got consistent cycle times:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1307985208.1307989774.14518.gz&fulltext=1
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1307985560.1307990254.16006.gz&fulltext=1
I have set up staging to trigger:
1) tp4
2) tp5
3) tp4 & tp5
Let's see what is going on.
Comment 57•14 years ago
|
||
What I am trying to answer is this:
On production:
> Completed test tp4:
> Stopped Thu, 09 Jun 2011 17:49:18
> RETURN: cycle time: 00:11:48<br>
On staging:
> Running test tp4:
> Started Mon, 13 Jun 2011 10:23:00
> Completed test tp4:
> Stopped Mon, 13 Jun 2011 10:59:59
That looks to me like a 25-minute increase, which is more than 200%.
On Fedora 64 it is even worse.
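The arithmetic behind that comparison can be sketched as follows. This is only a minimal sketch, assuming the timestamp format shown in the "Started"/"Stopped" lines quoted above; `elapsed` and `percent_increase` are hypothetical helper names and are not part of Talos.

```python
from datetime import datetime, timedelta

# Timestamp format of the "Started"/"Stopped" lines quoted above (assumed).
LOG_TIME_FORMAT = "%a, %d %b %Y %H:%M:%S"


def elapsed(started: str, stopped: str) -> timedelta:
    """Return the wall-clock time between two log timestamps."""
    return (datetime.strptime(stopped, LOG_TIME_FORMAT)
            - datetime.strptime(started, LOG_TIME_FORMAT))


def percent_increase(baseline: timedelta, observed: timedelta) -> float:
    """Return how much longer `observed` is than `baseline`, in percent."""
    return (observed - baseline) / baseline * 100.0


# Production figure taken from "RETURN: cycle time: 00:11:48".
production = timedelta(minutes=11, seconds=48)
# Staging figure derived from the Started/Stopped pair for tp4.
staging = elapsed("Mon, 13 Jun 2011 10:23:00", "Mon, 13 Jun 2011 10:59:59")

print("staging tp4 duration:", staging)
print("increase over production: %.0f%%" % percent_increase(production, staging))
```

With the numbers quoted in this comment, the staging run comes out at just under 37 minutes, a bit more than a 200% increase over the production figure.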
Comment 58•14 years ago
|
||
It seems that the cycle time of tp4 without tp5 is:
cycle time: 00:36:48 [1]
That cycle time should be ~11mins. The only two explanations I can think of are that there is something weird with the new talos.zip or that the two staging slaves have an issue.
I have moved a 3rd leopard machine (talos-r3-leopard-003) to staging to rule out that both of these slaves are in a weird state.
I have also changed the support base URL to point to the current talos.zip bundle [3] rather than the new one that was posted in bug 658392 [4] (according to anode, the only addition to talos.zip was an entry in the config file, so a problem there seems pretty unlikely).
[1] http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1307994525.1307996840.8883.gz&fulltext=1
[2] http://people.mozilla.org/~anodelman/taloszips/c84f630d576f/talos.zip
[3]
[armenzg@dm-wwwbuild01 zips]$ sha1sum *zip
d0df5cea24790dd29de825ce4b77864876a09a2b pagesets.zip
90c3dbfe022fb0e854f7af0329f7036d88461d54 plugins.zip
89e9ef8e23a96fd29978d6c0f696be543c2b3fb6 talos.zip
3f04e7bc80b7bf7add552382802e31ef29133de3 tp4.zip
7be6c7f8ab05416e8ef246b7bb850f293dd53ab7 tp5.zip
[4]
[armenzg@dm-peep01 zips]$ sha1sum *zip
d0df5cea24790dd29de825ce4b77864876a09a2b pagesets.zip
90c3dbfe022fb0e854f7af0329f7036d88461d54 plugins.zip
f2b45a7f42056b897104d46317902bb5468596a9 talos.zip
3f04e7bc80b7bf7add552382802e31ef29133de3 tp4.zip
7be6c7f8ab05416e8ef246b7bb850f293dd53ab7 tp5.zip
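For reference, the two listings can also be compared mechanically. The following is a minimal sketch, assuming the usual `sha1sum` output of a digest followed by a file name; `parse_sha1sum` and `changed_files` are hypothetical helper names.

```python
from __future__ import annotations


def parse_sha1sum(output: str) -> dict[str, str]:
    """Map file name -> digest for each line of `sha1sum` output."""
    digests = {}
    for line in output.strip().splitlines():
        digest, name = line.split(maxsplit=1)
        # sha1sum marks binary-mode entries with a leading "*".
        digests[name.lstrip("*")] = digest
    return digests


def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return names whose digest differs or that exist on one side only."""
    return sorted(name for name in old.keys() | new.keys()
                  if old.get(name) != new.get(name))


# Listing [3] from this comment.
listing_3 = """\
d0df5cea24790dd29de825ce4b77864876a09a2b pagesets.zip
90c3dbfe022fb0e854f7af0329f7036d88461d54 plugins.zip
89e9ef8e23a96fd29978d6c0f696be543c2b3fb6 talos.zip
3f04e7bc80b7bf7add552382802e31ef29133de3 tp4.zip
7be6c7f8ab05416e8ef246b7bb850f293dd53ab7 tp5.zip
"""

# Listing [4] from this comment.
listing_4 = """\
d0df5cea24790dd29de825ce4b77864876a09a2b pagesets.zip
90c3dbfe022fb0e854f7af0329f7036d88461d54 plugins.zip
f2b45a7f42056b897104d46317902bb5468596a9 talos.zip
3f04e7bc80b7bf7add552382802e31ef29133de3 tp4.zip
7be6c7f8ab05416e8ef246b7bb850f293dd53ab7 tp5.zip
"""

print(changed_files(parse_sha1sum(listing_3), parse_sha1sum(listing_4)))
# -> ['talos.zip']
```

Only talos.zip differs between the two hosts; the pageset and plugin bundles are byte-identical.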
Comment 59•14 years ago
|
||
I just got a snow leopard tp4+tp5 run [1]:
> Running test tp4:
> Started Mon, 13 Jun 2011 13:07:31
>
> Completed test tp4:
> Stopped Mon, 13 Jun 2011 13:19:42
> Running test tp5:
> Started Mon, 13 Jun 2011 13:19:42
> Completed test tp5:
> Stopped Mon, 13 Jun 2011 13:56:51
> RETURN: cycle time: 00:49:19<br>
* tp4 -> ~12mins
* tp5 -> ~37mins
It seems that tp4 takes about as long as it does on production, but tp5 takes around *3x* as long as tp4 normally takes.
[1] http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1307995374.1307998618.16096.gz&fulltext=1
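The per-test split above can be reproduced from the log excerpt with a short script. This is a rough sketch, assuming the log keeps the "Running test NAME:" / "Started ..." / "Stopped ..." layout quoted in this comment; `durations_by_test` is a hypothetical helper, not something Talos provides.

```python
from __future__ import annotations

import re
from datetime import datetime, timedelta

# Timestamp format of the "Started"/"Stopped" lines (assumed from the excerpt).
LOG_TIME_FORMAT = "%a, %d %b %Y %H:%M:%S"


def durations_by_test(log: str) -> dict[str, timedelta]:
    """Pair each 'Running test X:' with its Started/Stopped timestamps."""
    durations = {}
    current = started = None
    for raw in log.splitlines():
        line = raw.lstrip("> ").strip()  # drop the quoting prefix
        match = re.match(r"Running test (\w+):", line)
        if match:
            current, started = match.group(1), None
        elif line.startswith("Started ") and current:
            started = datetime.strptime(line[len("Started "):], LOG_TIME_FORMAT)
        elif line.startswith("Stopped ") and current and started:
            stopped = datetime.strptime(line[len("Stopped "):], LOG_TIME_FORMAT)
            durations[current] = stopped - started
            current = started = None
    return durations


excerpt = """\
> Running test tp4:
> Started Mon, 13 Jun 2011 13:07:31
>
> Completed test tp4:
> Stopped Mon, 13 Jun 2011 13:19:42
> Running test tp5:
> Started Mon, 13 Jun 2011 13:19:42
> Completed test tp5:
> Stopped Mon, 13 Jun 2011 13:56:51
"""

for name, duration in durations_by_test(excerpt).items():
    print(name, duration)
```

For this excerpt the script reports a little over 12 minutes for tp4 and a little over 37 minutes for tp5, matching the rounded figures above.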
Assignee | ||
Comment 60•14 years ago
|
||
I tried to re-create the leopard results that you are seeing, but it is going quickly on my staging leopard box.
tools-r3-leopard-001:
Running test tp4:
Started Mon, 13 Jun 2011 14:11:52
Completed test tp4:
Stopped Mon, 13 Jun 2011 14:23:19
Running test tp5:
Started Mon, 13 Jun 2011 14:23:19
Completed test tp5:
Stopped Mon, 13 Jun 2011 14:36:07
So tp4 took about 11 minutes and tp5 about 13 minutes.
Assignee | ||
Comment 61•14 years ago
|
||
I can work on further results tomorrow for the other slow systems that you are seeing, but I'm thinking you have staging issues.
Comment 62•14 years ago
|
||
talos-r3-leopard-003 has been able to pick up a few tp4 jobs and scored good timings with the production zips. I have disabled the other two slaves and am only running leopard-003.
Updated•14 years ago
|
Attachment #531787 -
Attachment is obsolete: true
Updated•14 years ago
|
Attachment #531788 -
Attachment description: buildbotcustom fix for tp5 (take 2) → (checked-in) buildbotcustom fix for tp5 (take 2)
Updated•14 years ago
|
Attachment #537840 -
Attachment description: enable tp5 on tracemonkey only → (checked-in) enable tp5 on tracemonkey only
Comment 63•14 years ago
|
||
They should be picked up on production this morning:
http://hg.mozilla.org/build/buildbot-configs/rev/b797e8aec078
http://hg.mozilla.org/build/buildbotcustom/rev/ecad1f4b880b
Comment 64•14 years ago
|
||
We can land this as soon as everything looks good on TraceMonkey.
I have triggered all the builders for TraceMonkey and tomorrow we should see the results on tbpl.
I used builder_list.py and vimdiff to help me compare the old list of builders with the new one:
> python ~/repos/releng/braindump/buildbot-related/builder_list.py master.cfg > new_builders
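The same comparison can be done without vimdiff. The sketch below assumes that builder_list.py prints one builder name per line, which is not confirmed here; the builder names in the example are hypothetical and only shaped like the ones used elsewhere in this bug, and `read_builders` and `builder_changes` are illustrative helpers.

```python
from __future__ import annotations


def read_builders(text: str) -> set[str]:
    """Return the non-empty, stripped lines of a builder listing."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def builder_changes(old_text: str, new_text: str) -> tuple[list[str], list[str]]:
    """Return (added, removed) builder names between two listings."""
    old, new = read_builders(old_text), read_builders(new_text)
    return sorted(new - old), sorted(old - new)


# Hypothetical listings, for illustration only.
old_builders = """\
Rev3 Fedora 12 tracemonkey talos chrome
Rev3 Fedora 12 tracemonkey talos tp4
"""
new_builders = """\
Rev3 Fedora 12 tracemonkey talos chrome
Rev3 Fedora 12 tracemonkey talos tp
Rev3 Fedora 12 tracemonkey talos tp4
"""

added, removed = builder_changes(old_builders, new_builders)
print("added:", added)
print("removed:", removed)
```

In practice the two strings would be read from files such as the `new_builders` output above and an equivalent listing generated before the config change.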
L
s: talos-r3-fed-054
id:20110613181633
rev:3acacde59381
cycle time: 00:23:20
tp5: 391.98
tp5_pbytes: 439.5MB
tp5_xres: 18.6MB
tp5_rss: 141.3MB
tp5_shutdown: 920.0
tp4: 349.42
tp4_pbytes: 159.7MB
tp4_xres: 427.5KB
tp4_rss: 48.9MB
tp4_shutdown: 728.0
Attachment #539260 -
Flags: review?(anodelman)
Comment 65•14 years ago
|
||
I can see tp5 showing up on graphs:
http://graphs-new.mozilla.org/graph.html#tests=[[89,4,1],[89,4,12],[89,4,13],[89,4,15],[89,4,14]]&sel=none&displayrange=7&datatype=running
anode can you please have a look at bug 664371?
We are ready to go ahead and enable this on every branch as soon as you give the go/no-go.
FTR I landed this patch http://hg.mozilla.org/build/buildbot-configs/rev/fb8a29ea0773#l1.99 which also fixes tp being half-enabled on project_branches without that being explicit. I added 'tp' to the list of suites to be disabled by default on line 1.99.
Assignee | ||
Comment 66•14 years ago
|
||
I'm not too worried about an intermittent fail on a new pageset - these are pages that we haven't been testing against yet so it is probably finding new and wonderful code paths in the browser.
Assignee | ||
Updated•14 years ago
|
Attachment #539260 -
Flags: review?(anodelman) → review+
Comment 67•14 years ago
|
||
FTR these are the times that the jobs take:
Rev3 Fedora 12 tracemonkey talos tp 0:32:08
Rev3 Fedora 12x64 tracemonkey talos tp 0:25:30
Rev3 MacOSX Leopard 10.5.8 tracemonkey talos tp 0:36:35
Rev3 MacOSX Snow Leopard 10.6.2 tracemonkey talos tp 0:36:48
Rev3 WINNT 5.1 tracemonkey talos tp 0:33:43
Rev3 WINNT 6.1 tracemonkey talos tp 0:32:05
Which are decent.
Comment 68•14 years ago
|
||
Comment on attachment 539260 [details] [diff] [review]
(checked-in) [configs] disable tp4 everywhere except older release branches & enable tp (tp4+tp5) everywhere except older release branches
Enabled everywhere except older release branches:
http://hg.mozilla.org/build/buildbot-configs/rev/48bf23b49d2a
This will be picked up in tomorrow's scheduled reconfig.
anode, I leave it in your hands to come back to us when we are ready to disable tp4 and to follow up on any bugs.
Closer to that time I will raise it on dev.planning and on the Tuesday call (June 28th).
Sounds good?
Attachment #539260 -
Attachment description: [configs] disable tp4 everywhere except older release branches & enable tp (tp4+tp5) everywhere except older release branches → (checked-in) [configs] disable tp4 everywhere except older release branches & enable tp (tp4+tp5) everywhere except older release branches
Comment 69•14 years ago
|
||
This got deployed to production a couple of hours ago and it is showing up on tbpl.
http://hg.mozilla.org/build/buildbot-configs/rev/8942ffd33487
I have also announced it on dev.planning and dev.tree-management:
http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/7ad2ba7f8f006d65#
Assignee | ||
Comment 70•14 years ago
|
||
I've compared the Tracemonkey tp4 and tp5 results and things look good. I see them reflecting the same wins/regressions over time and I believe that tp5 is ready to go it alone. Kill off tp4 at will on those branches where the side-by-side testing has been occurring.
Comment 71•14 years ago
|
||
Sweet.
The scheduled date is June 30th. I have announced it at Tuesday's meeting and on the mailing lists.
I guess this bug is done ("creating tp5 and deploying it"), right?
Comment 72•14 years ago
|
||
I am using Bugzilla just to keep everything centralized.
See issue opened on bitbucket as well:
https://bitbucket.org/mconnor/compare-talos/issue/12/add-tp5-support
Attachment #541136 -
Flags: review?(mconnor)
Comment 73•14 years ago
|
||
Comment on attachment 541136 [details] [diff] [review]
add tp5 support for compare-talos
https://bitbucket.org/mconnor/compare-talos/changeset/db884c7f8c25
Attachment #541136 -
Flags: review?(mconnor)
Comment 74•14 years ago
|
||
What's left in here? IIUC this is done.
Assignee | ||
Comment 75•14 years ago
|
||
Tp5 has been deployed so this is now complete.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED