Closed Bug 601798 Opened 14 years ago Closed 14 years ago

create tp5 pageset

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: anodelman, Assigned: anodelman)

Attachments

(6 files, 3 obsolete files)

tp4 is getting old, need a refresh. This will not be rolled out till post firefox4 so as not to affect the release schedule.
As part of a post-mortem we talked about possibly generating the list differently. For sites like facebook the index/public pages don't test the actual pages users use/interact with.
I'm not sure any snapshot will accurately capture what we are looking for. At the very least we should create a developer test account for facebook (likely linked to the main firefox account, see http://developers.facebook.com/blog/post/35). That account should add/play the top 10 or so apps on facebook. Not sure how that would interact with a snapshot, but that way we don't need someone to give us their page. It looks like Google may have similar test accounts available (http://code.google.com/googleapps/domain/email_migration/developers_guide_protocol.html). I'm sure we could reach out and get test accounts for most other sites as well. I know other companies have test accounts for various sites and games, so most companies we ask would likely already have a process in place. Is the same set of sites used for mobile as well? The mobile top sites may be different and behave differently depending on UA / capability sniffing. I think the days of taking a static snapshot of the top URLs and believing it is a representative sample are over FWIW...
(In reply to comment #2) > I'm not any snapshot will accurately capture what we are looking for. Not sure that is...
(In reply to comment #2) > I think the days of taking a static snapshot of the top URLs and believing it > is a representative sample are over FWIW... Using live sites, however, is dangerous because when Google rolls out changes our Tp numbers will change. I don't know that this is a tenable solution. Do you really think we'll be able to distinguish "someone checked in a regression" from "Google changed their code / servers / caching and our Tp went up"?
The local snapshot also removes issues with live sites not responding for whatever reason. We play our web pages from a local Apache server, which removes a lot of noise from the results.
(In reply to comment #4) > Using live sites, however, is dangerous because when Google rolls out changes > our Tp numbers will change. I don't know that this is a tenable solution. Do > you really think we'll be able to distinguish "someone checked in a regression" > from "Google changed their code / servers / caching and our Tp went up"? Right, but if google rolls out a change and the Tp numbers get worse, isn't that something we need to know? From the eventual user's point of view Tp did get slower. I guess the difference is I'm focusing on the product as a whole where Tp is perhaps meant to focus on the checkins. (In reply to comment #5) > The local snapshop also remove issues with live sites not responding for > whatever reason. We play our web pages from a local apache server, which > removes a lot of noise from the results. I'm not saying going with live sites is the way to go (I definitely think they would be too noisy). I just think we should take a hard look at Tp, what its goals are, and perhaps spin up another system if it's not covering everything we think it should.
From a testing standpoint we need something:
- repeatable
- meaningful
Sounds like meaningful is the issue here - both due to pages aging and no longer representing the 'real world' and to pages being non-logged in (ie, just a login screen instead of any content). As to pages aging the answer could be to update the tp test set more frequently (we aim for every year or so). For non-logged in I would still see having copies culled from real users as the answer.
I would be careful not to try and make tp the everything of tests. Its basic purpose is to get a feel for how quickly we can load web pages. Other sorts of tests should be designed to cover other areas of interest.
Yep, my point is that Tp isn't really meaningful (from a release standpoint). No regression in Tp doesn't really tell me anything about how the build will react once we release. Well, it tells me on those pages at that particular point in time the build will act no worse, but even that isn't super strong as the live pages might have changed since, the server serving the pages could do something wonky, they could be slow when loaded through a proxy because of buggy proxy code, perhaps the live site forwards to https once you login, etc. I guess having something to compare against / reason about is better than nothing, and agree perhaps I'm talking about another tool that needs to be written. Anyway, this isn't really related to the bug at hand, so I can take this meta discussion elsewhere.
When running the standalone talos with Tp4 I noticed that some of the pages tried to load content from non-local resources. Since my proxy blocked those attempts the pages didn't finish loading until they timed out, leading to exorbitant loading times. So I think that care should be taken to eliminate any outside links to avoid these kinds of scenarios, since they obviously lead to a lot of noise.
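One way to catch the non-local references described above before a pageset ships is to scan the saved pages for absolute URLs. The following is only a rough sketch: the function names, the regular expression, and the assumption that pages are served from localhost are illustrative and are not part of talos or of the tp5 tooling.

```python
import re
from pathlib import Path

# Matches src="..." or href="..." values that start with http://, https:// or "//".
# This is a heuristic for illustration; it does not parse CSS or inline scripts.
EXTERNAL_REF = re.compile(
    r'''(?:src|href)\s*=\s*["']((?:https?:)?//[^"']+)["']''',
    re.IGNORECASE,
)

def find_external_refs(html, allowed_hosts=("localhost",)):
    """Return URLs in `html` whose host is not in `allowed_hosts`."""
    hits = []
    for url in EXTERNAL_REF.findall(html):
        # Strip the scheme, then the path, then any port number.
        host = url.split("//", 1)[1].split("/", 1)[0].split(":", 1)[0].lower()
        if host not in allowed_hosts:
            hits.append(url)
    return hits

def scan_pageset(root):
    """Map each .html file under `root` to its list of non-local references."""
    report = {}
    for path in Path(root).rglob("*.html"):
        refs = find_external_refs(path.read_text(errors="replace"))
        if refs:
            report[str(path)] = refs
    return report
```

Running scan_pageset on an unpacked pageset directory would list the pages that still reach out to the network, which are the ones likely to stall behind a blocking proxy.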
Assignee: nobody → anodelman
Attached patch buildbotcustom fix for tp5 (obsolete) — Splinter Review
Add special tp5 case to buildbotcustom, when tp4 is retired we can remove the tp4 code.
Attachment #530176 - Flags: review?(bhearsum)
Comment on attachment 530172 [details] [diff] [review] [checked in]add tp5 to the graph server db Review of attachment 530172 [details] [diff] [review]: looks good to me
Attachment #530172 - Flags: review?(jmaher) → review+
Attachment #530176 - Flags: review?(bhearsum) → review+
Depends on: 656405
Attached patch enable tp5 in config.py (obsolete) — Splinter Review
Attachment #531778 - Flags: review?(bhearsum)
Attached patch enable tp5 in config.py (take 2) (obsolete) — Splinter Review
Attachment #531778 - Attachment is obsolete: true
Attachment #531787 - Flags: review?(bhearsum)
Attachment #531778 - Flags: review?(bhearsum)
Adds support to buildbotcustom for downloading multiple pagesets.
Attachment #530176 - Attachment is obsolete: true
Attachment #531788 - Flags: review?(bhearsum)
Depends on: 656487
Comment on attachment 530172 [details] [diff] [review] [checked in]add tp5 to the graph server db changeset: 351:d11c8bf075c1
Attachment #530172 - Attachment description: add tp5 to the graph server db → [checked in]add tp5 to the graph server db
Comment on attachment 531779 [details] [diff] [review] [checked in]add tp5 to talos sample.config Review of attachment 531779 [details] [diff] [review]: -----------------------------------------------------------------
Attachment #531779 - Flags: review?(jmaher) → review+
Attachment #531788 - Flags: review?(bhearsum) → review+
Comment on attachment 531787 [details] [diff] [review] enable tp5 in config.py (take 2)
This patch seems mostly fine, but I have a few questions:
- How long does tp5 take to run?
- How long are we planning to run tp4 and tp5 simultaneously?
- Is it OK to be running these one after the other from a caching perspective? Eg, does running tp5 directly after tp4 change its numbers at all? If so, that'll be an issue when we drop tp4.
I ask about the timings, because our test pool is pretty clogged up already these days and I want to get in front of anything that will put us further behind messaging-wise.
- tp5 in my tests took 10-15 minutes, so pretty much the same as tp4
- we'd like to run them side by side for ~2 weeks, to create a baseline and ensure that we get matched regressions
- running them one after another is fine - we switch out profiles between tests and there are also no pages shared between tp4/tp5. Thus, it is as risky as the other tests that we choose to run in sets
Mostly, I wanted to get it up and running so that we can start the timer on discarding tp4. If you'd like we can split the test out, but that would increase the amount of time eaten by it as you'd have the overhead of the machine reboot + setup steps.
(In reply to comment #20) > - running them one after an other is fine - we switch out profiles between > tests and there are also no pages shared between tp4/tp5. Thus, it is as > risky as the other tests that we choose to run in sets Ah, I'd forgotten that we ran other tests like this. No reason to split them out then, indeed! > - tp5 in my tests took 10-15 minutes, so pretty much the same as tp4 > - we'd like to run them side by side for ~2 weeks, to create a baseline and > ensure that we get matched regressions > Mostly, I wanted to get it up and running so that we can start the timer on > discarding tp4. If you'd like we can split the test out, but that would > increase the amount of time eaten by it as you'd have the overhead of the > machine reboot + setup steps. I don't really feel equipped to yay/nay turning this on, even if the elevated load is just for two weeks. John, Chris - what do you two think?
Now that tp4.zip/tp5.zip are available on the build server I'm doing a final test run today to ensure that everything works from soup to nuts. This will also give a final timing run.
All green with patches working in concert. tp5 takes same execution time as tp4, on fed64 they both took 10 minutes.
Depends on: 658392
Comment on attachment 531779 [details] [diff] [review] [checked in]add tp5 to talos sample.config changeset: 238:c84f630d576f
Attachment #531779 - Attachment description: add tp5 to talos sample.config → [checked in]add tp5 to talos sample.config
Found a minor issue with file name length in tp4 on win7, going to fix and post a new tp5.zip to be added to the build server.
Newly posted tp5.zip all green. This is good to deploy now.
(In reply to comment #21) > (In reply to comment #20) > > - running them one after an other is fine - we switch out profiles between > > tests and there are also no pages shared between tp4/tp5. Thus, it is as > > risky as the other tests that we choose to run in sets > Ah, I'd forgotten that we ran other tests like this. No reason to split them > out then, indeed! Alice, just to be clear, with this "testing tp4,tp5 suites together" * do you assert there will *not* be any tp5 wobble when we disable tp4? * can we run tp4+tp5 in some branches but tp4-only in other branches, until we rollout to tp5-only in all branches? (For example starting tp4+tp5 on m-c,try but tp4-only on all other branches. Given the record high load we are dealing with now, its not ok to just double tp load across the board - it seems more prudent to carefully limit doubling our tp load to only the branches where it is needed.) > > - tp5 in my tests took 10-15 minutes, so pretty much the same as tp4 > > - we'd like to run them side by side for ~2 weeks, to create a baseline and > > ensure that we get matched regressions * Is two week transition long enough to migrate from tp4 to tp5? * Given the new rapid release cadence, can we time this changeover to happen between scheduled migrations across branches? > > Mostly, I wanted to get it up and running so that we can start the timer on > > discarding tp4. +1. TP4 is old, so a refresh of pageset is great. > > If you'd like we can split the test out, but that would > > increase the amount of time eaten by it as you'd have the overhead of the > > machine reboot + setup steps. > > I don't really feel equipped to yay/nay turning this on, even if the > elevated load is just for two weeks. > John, Chris - what do you two think? Once we figure out a notification and rollout plan (questions above), I'm fine with updating from Tp4 to Tp5. aki, alice, mfinkle: what about tp5 for mobile? Do we need a Tp5m, or does this new Tp5 as-is work fine on maemo+android?
(In reply to comment #27) > (In reply to comment #21) > > (In reply to comment #20) > > > - running them one after an other is fine - we switch out profiles between > > > tests and there are also no pages shared between tp4/tp5. Thus, it is as > > > risky as the other tests that we choose to run in sets > > Ah, I'd forgotten that we ran other tests like this. No reason to split them > > out then, indeed! > Alice, just to be clear, with this "testing tp4,tp5 suites together" > * do you assert there will *not* be any tp5 wobble when we disable tp4? * do you assert there will *not* be any tp4 wobble when we add tp5? (This would make the difference between needing a tree closure or not).
(In reply to comment #27) > aki, alice, mfinkle: what about tp5 for mobile? Do we need a Tp5m, or does > this new Tp5 as-is work fine on maemo+android? We do not want a Tp5m unless we get a lot of mobile pages in the set. We can't use Tp5 for Maemo, it will kill the devices. I'd be happier getting Tp4m working on Maemo.
Yes, I assert that there will be no tp4 number wobble by adding tp5. They run using newly generated profiles and totally different pagesets - there is no sharing of cache.
Two weeks has been the acceptable switchover time in the past to go from one pageset to another, it is enough time to get a decent baseline and ensure that there is no reason to revert to tp4.
(In reply to comment #27) > (In reply to comment #21) > > (In reply to comment #20) > > > - running them one after an other is fine - we switch out profiles between > > > tests and there are also no pages shared between tp4/tp5. Thus, it is as > > > risky as the other tests that we choose to run in sets > > Ah, I'd forgotten that we ran other tests like this. No reason to split them > > out then, indeed! > Alice, just to be clear, with this "testing tp4,tp5 suites together" > * do you assert there will *not* be any tp5 wobble when we disable tp4? (In reply to comment #30) > Yes, I assert that there will be no tp4 number wobble by adding tp5. They > run using newly generated profiles and totally different pagesets - there is > no sharing of cache. Great! > * can we run tp4+tp5 in some branches but tp4-only in other branches, until > we rollout to tp5-only in all branches? > (For example starting tp4+tp5 on m-c,try but tp4-only on all other branches. > Given the record high load we are dealing with now, its not ok to just > double tp load across the board - it seems more prudent to carefully limit > doubling our tp load to only the branches where it is needed.) In meeting just now, Armen suggested running tp4, tp5 side-by-side *only* on tracemonkey branch for the 2 week transition period. If all goes well, then roll out to all other branches. > > > - tp5 in my tests took 10-15 minutes, so pretty much the same as tp4 > > > - we'd like to run them side by side for ~2 weeks, to create a baseline and > > > ensure that we get matched regressions > * Is two week transition long enough to migrate from tp4 to tp5? (In reply to comment #31) > Two weeks has been the acceptable switchover time in the past to go from one > pageset to another, it is enough time to get a decent baseline and ensure > that there is no reason to revert to tp4. Who will be making the "no reason to revert" decision? 
> * Given the new rapid release cadence, can we time this changeover to happen > between scheduled migrations across branches?
Blocks: 658392
No longer depends on: 658392
No longer blocks: 658392
Depends on: 658392
I will drive this to completion and do the communication with developers. I am going to enable tp5 on tracemonkey to begin with.
Assignee: anodelman → armenzg
Oops, resetting assignee.
Assignee: armenzg → anodelman
This should enable tp5 on tracemonkey only. Due to auto-tools currently having no working master I cannot do staging for this patch.
Attachment #537840 - Flags: review?(armenzg)
Comment on attachment 537840 [details] [diff] [review] (checked-in) enable tp5 on tracemonkey only
This looks good but I doubt it will work without some work in here: http://hg.mozilla.org/build/buildbotcustom/file/tip/process/factory.py#l8061 and http://hg.mozilla.org/build/buildbotcustom/file/tip/process/factory.py#l8155
AFAIK there are currently no talos jobs that have more than one suite being run inside of them. For instance for unittests we iterate over the list of suites that need to be run. http://hg.mozilla.org/build/buildbotcustom/file/tip/process/factory.py#l7287
I am willing to take tp5 as an extra suite for tracemonkey to make things easier. The ability to run two talos suites on the same job is needed but I don't think I should block you on it. Would you be able to modify the patch to add tp5 to tracemonkey as a separate job rather than two merged suites? This will get us unstuck without blocking ourselves on the multiple-suites-per-job feature.
I will be busy for the next 2/3 days with the release but I can help out. If you want/need to get yourself ahead of the VM you can use this trick: https://wiki.mozilla.org/ReleaseEngineering:TestingTechniques#setup_one_master_and_output_the_steps_for_it
I have used it before to avoid setting up masters and slaves and instead attach a diff to a bug showing that my change did exactly what I wanted.
The fix you are requesting is in the other patch in this bug, "buildbotcustom fix for tp5 (take 2)" which has been r+ed by bhearsum. There are already talos jobs that run more than one test - like the dromaeo tests: 'dromaeo': GRAPH_CONFIG + ['--activeTests', 'dromaeo_basics:dromaeo_v8:dromaeo_sunspider:dromaeo_jslib:dromaeo_css:dromaeo_dom'], This is actually 6 tests run in a row, much like I want to run tp5 directly after tp4. I believe the patches that I have presented already work as expected and can be deployed.
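To make the colon-separated --activeTests convention concrete, here is a small sketch of how such a value maps to an ordered list of tests and back. The helper names are hypothetical; how talos itself parses the option is not shown in this bug, so treat this purely as an illustration of the format.

```python
def split_active_tests(value):
    """Split a colon-separated --activeTests value into an ordered list of names."""
    return [name for name in value.split(":") if name]

def active_tests_args(names):
    """Build the two command-line arguments for a list of test names."""
    return ["--activeTests", ":".join(names)]

# The dromaeo entry quoted above expands to six tests run in a row.
DROMAEO = "dromaeo_basics:dromaeo_v8:dromaeo_sunspider:dromaeo_jslib:dromaeo_css:dromaeo_dom"
dromaeo_tests = split_active_tests(DROMAEO)

# Running tp5 directly after tp4 uses the same convention.
tp_args = active_tests_args(["tp4", "tp5"])
```

Under this reading, "tp4:tp5" is simply a two-element list in which tp5 starts once tp4 has finished.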
Any ETA here, considering that the review questions have been answered?
(In reply to comment #38) > Any ETA here, considering that the review questions have been answered? Alice, before we roll anything into production, can you please answer (or find owners for) the remaining questions in comment#27, comment#32?
/me fixes dependency
Blocks: 658392
No longer depends on: 658392
- two weeks has been the acceptable switch over time frame before, it is now
- testing tp4/tp5 together is not a risk, we do that with other suites
- i will make the call as to if we need to revert or not
- there is no tp5 for mobile as mobile now uses a custom set of mobile only pages
- we are ready to roll out to tracemonkey only as a first step
I believe that all the questions are answered.
As a note, the roll out of tp4/tp5 to tracemonkey won't save us from running them side by side on the rest of the branches. We always need to construct a workable history of any given performance test to make any sense of the results - thus you end up running them side by side for 2 weeks. During that 2 weeks developers can refer to the tp4 numbers for their work, while at the same time we grow a 2 week set of tp5 history.
Comment on attachment 537840 [details] [diff] [review] (checked-in) enable tp5 on tracemonkey only
This seems good in light of your reply. I am also running it on staging. It should show up on tinderbox: http://tinderbox.mozilla.org/showbuilds.cgi?tree=MozillaTest&noignore=1
I will check tomorrow and see if we can proceed.
By running a comparison [1] I see these new steps being added as expected.
[1] https://wiki.mozilla.org/ReleaseEngineering:TestingTechniques#setup_one_master_and_output_the_steps_for_it
- MozillaUpdateConfig {'addOptions': ['--resultsServer', 'graphs.mozilla.org', '--resultsLink', '/server/collect.cgi', '--activeTests', 'ts_paint:tpaint', '--setPref', 'dom.send_after_paint_to_content=true'], 'addonTester': False, 'branch': 'TraceMonkey', 'branchName': 'TraceMonkey', 'command': None, 'description': None, 'descriptionDone': None, 'env': {'XPCOM_DEBUG_BREAK': 'warn', 'MOZ_NO_REMOTE': '1', 'CYGWINBASE': 'C:\\cygwin', 'PATH': 'C:\\Python24;C:\\Python24\\Scripts;C:\\cygwin\\bin;C:\\WINDOWS\\System32;C:\\program files\\gnuwin32\\bin;C:\\WINDOWS;', 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'NO_EM_RESTART': '1'}, 'executablePath': <buildbot.process.properties.WithProperties>, 'extName': 'addon.xpi', 'haltOnFailure': True, 'log_eval_func': None, 'logfiles': {}, 'remoteExtras': {}, 'remoteProcessName': 'org.mozilla.fennec', 'remoteTests': False, 'usePTY': 'slave-config', 'useSymbols': True, 'workdir': '../talos-data/talos/'} {}
+ DownloadFile {'command': None, 'description': None, 'descriptionDone': None, 'filename_property': None, 'ignore_certs': False, 'log_eval_func': None, 'logfiles': {}, 'url': 'http://build.mozilla.org/talos/zips/plugins.zip', 'url_fn': None, 'url_property': None, 'usePTY': 'slave-config', 'wget_args': None, 'workdir': '../talos-data/talos/base_profile'} {}
+ UnpackFile {'command': None, 'description': None, 'descriptionDone': None, 'filename': 'plugins.zip', 'log_eval_func': None, 'logfiles': {}, 'scripts_dir': '.', 'usePTY': 'slave-config', 'workdir': '../talos-data/talos/base_profile'} {}
+ DownloadFile {'command': None, 'description': None, 'descriptionDone': None, 'filename_property': None, 'ignore_certs': False, 'log_eval_func': None, 'logfiles': {}, 'url': 'http://build.mozilla.org/talos/zips/tp4.zip', 'url_fn': None, 'url_property': None, 'usePTY': 'slave-config', 'wget_args': None, 'workdir': '../talos-data/talos/page_load_test'} {}
+ UnpackFile {'command': None, 'description': None, 'descriptionDone': None, 'filename': 'tp4.zip', 'log_eval_func': None, 'logfiles': {}, 'scripts_dir': '.', 'usePTY': 'slave-config', 'workdir': '../talos-data/talos/page_load_test'} {}
+ DownloadFile {'command': None, 'description': None, 'descriptionDone': None, 'filename_property': None, 'ignore_certs': False, 'log_eval_func': None, 'logfiles': {}, 'url': 'http://build.mozilla.org/talos/zips/tp5.zip', 'url_fn': None, 'url_property': None, 'usePTY': 'slave-config', 'wget_args': None, 'workdir': '../talos-data/talos/page_load_test'} {}
+ UnpackFile {'command': None, 'description': None, 'descriptionDone': None, 'filename': 'tp5.zip', 'log_eval_func': None, 'logfiles': {}, 'scripts_dir': '.', 'usePTY': 'slave-config', 'workdir': '../talos-data/talos/page_load_test'} {}
+ MozillaUpdateConfig {'addOptions': ['--resultsServer', 'graphs.mozilla.org', '--resultsLink', '/server/collect.cgi', '--activeTests', 'tp4:tp5'], 'addonTester': False, 'branch': 'TraceMonkey', 'branchName': 'TraceMonkey', 'command': None, 'description': None, 'descriptionDone': None, 'env': {'XPCOM_DEBUG_BREAK': 'warn', 'MOZ_NO_REMOTE': '1', 'CYGWINBASE': 'C:\\cygwin', 'PATH': 'C:\\Python24;C:\\Python24\\Scripts;C:\\cygwin\\bin;C:\\WINDOWS\\System32;C:\\program files\\gnuwin32\\bin;C:\\WINDOWS;', 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'NO_EM_RESTART': '1'}, 'executablePath': <buildbot.process.properties.WithProperties>, 'extName': 'addon.xpi', 'haltOnFailure': True, 'log_eval_func': None, 'logfiles': {}, 'remoteExtras': {}, 'remoteProcessName': 'org.mozilla.fennec', 'remoteTests': False, 'usePTY': 'slave-config', 'useSymbols': True, 'workdir': '../talos-data/talos/'} {}
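The added steps in that comparison follow a regular pattern: one DownloadFile plus one UnpackFile per pageset zip, all in the same page_load_test directory. A minimal sketch of that pattern is below. The pageset_steps helper and its (name, kwargs) tuples are hypothetical stand-ins and not the buildbotcustom code; only the URL prefix and workdir are taken from the output above, and the sketch uses modern Python rather than the Python 2.4 the slaves ran.

```python
def pageset_steps(pagesets,
                  base_url="http://build.mozilla.org/talos/zips",
                  workdir="../talos-data/talos/page_load_test"):
    """Return (step_name, kwargs) pairs: a download and an unpack per pageset."""
    steps = []
    for name in pagesets:
        filename = f"{name}.zip"
        steps.append(("DownloadFile", {"url": f"{base_url}/{filename}",
                                       "workdir": workdir}))
        steps.append(("UnpackFile", {"filename": filename,
                                     "workdir": workdir}))
    return steps

# Both pagesets land in the same directory, as in the step listing.
steps = pageset_steps(["tp4", "tp5"])
```

Because each zip unpacks into its own top-level folder (the tp5 pages are served from page_load_test/tp5/ in the logs), sharing one workdir should not by itself cause the two pagesets to collide.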
Attachment #537840 - Flags: review?(armenzg) → review+
Comment on attachment 531787 [details] [diff] [review] enable tp5 in config.py (take 2) Removing myself from this review request because Armen has taken over the releng side of things (and I'm not sure if this patch is still current).
Attachment #531787 - Flags: review?(bhearsum)
I tested this on staging and did not succeed.
I see tp4 & tp5 unpacked to the same place: c:\talos-slave\test\../talos-data/talos/page_load_test
Not sure if that is wanted/expected.
python PerfConfigurator.py -v -e ../firefox/firefox -t talos-r3-w7-002 -b TraceMonkey --branchName MozillaTest --resultsServer graphs-stage.mozilla.org --resultsLink /server/collect.cgi --activeTests tp4:tp5 --symbolsPath ../symbols
python run_tests.py --noisy 20110608_1358_config.yml
anode can you look at the logs and let me know what is going on? The only one that succeeded is 10.5.
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1307566395.1307567777.7465.gz&fulltext=1
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1307566308.1307567228.4687.gz&fulltext=1
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1307566363.1307567990.8796.gz&fulltext=1
Looks like you have two freezes in tp4. I can't say why tp4 would be freezing. You are running the test without the talos patch listed in this bug - that is why it isn't even attempting to run tp5. You may need to set up a clean environment and just test tp4 to ensure that you can get a good run out of it, otherwise the issue is with the staging env. Otherwise, you could log into the machine and attempt to run the test manually and see what the browser is doing.
(In reply to comment #46) > Looks like you have two freezes in tp4. I can't say why tp4 would be > freezing. > > You are running the test without the talos patch listed in this bug - that > is why it isn't even attempting to run tp5. > > You may need to set up a clean environment and just test tp4 to ensure that > you can get a good run out of it, otherwise the issue is with the staging > env. > > Otherwise, you could log into the machine and attempt to run the test > manually and see what the browser is doing. I concur with Alice, it looks like the buildbot side of things is busted. Is there any way we can get alice access to this staging box so we can sort this out more quickly since you're (armen) busy with a release?
I mentioned it on IRC. If access to a machine is needed a bug should be filed and buildduty/IT would pick that up fairly quickly (hours). I am going to try to kick this with the talos bundle from bug 658392.
Depends on: 663192
In case anyone would need to run this manually. I hope I recovered the steps properly.
mkdir -p ../talos-data/talos
cd ../talos-data
wget --progress=dot:mega -N http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/tracemonkey-win32/1307480826/firefox-7.0a1.en-US.win32.zip
unzip -o firefox-7.0a1.en-US.win32.zip
cd firefox; chmod -v -R a+x .; cd ..
# wget --progress=dot:mega -N http://build.mozilla.org/talos/zips/talos.zip
wget --progress=dot:mega -N http://people.mozilla.org/~anodelman/taloszips/c84f630d576f/talos.zip
unzip -o talos.zip
mkdir -p talos/page_load_test
cd talos/page_load_test; wget --progress=dot:mega -N http://build.mozilla.org/talos/xpis/pageloader.xpi; cd ../..
mkdir -p talos/base_profile
cd talos/base_profile; wget --progress=dot:mega -N http://build.mozilla.org/talos/zips/plugins.zip; unzip -o plugins.zip; cd ../..
cd talos/page_load_test; wget --progress=dot:mega -N http://build.mozilla.org/talos/zips/tp4.zip; unzip -o tp4.zip; cd ../..
cd talos/page_load_test; wget --progress=dot:mega -N http://build.mozilla.org/talos/zips/tp5.zip; unzip -o tp5.zip; cd ../..
cd talos; python PerfConfigurator.py -v -e ../firefox/firefox -t talos-r3-w7-002 -b TraceMonkey --branchName MozillaTest --resultsServer graphs-stage.mozilla.org --resultsLink /server/collect.cgi --activeTests tp4:tp5 --symbolsPath ../symbols; python run_tests.py --noisy 20110608_1358_config.yml; cd ..
The problems in comment 45 arose because the current talos.zip was used rather than the one that anode posted on bug 658392. I am setting up http://people.mozilla.com/~armenzg/talos with all the production bundles, with the talos.zip bundle replaced by the one that anode posted. I will soon trigger a new set of tp jobs and see what happens.
Sweet. This seems to be running now. We should see the results any time soon: http://tinderbox.mozilla.org/showbuilds.cgi?tree=MozillaTest
anode I am using:
* attachment 531788 [details] [diff] [review]
* attachment 537840 [details] [diff] [review]
* http://people.mozilla.org/~anodelman/taloszips/c84f630d576f/talos.zip
* local hack to point to http://people.mozilla.com/~armenzg/talos
I am going to post a comment now assuming that this cannot really wait until Monday. If it can wait until Monday please let's do so, as I don't see why I have to put a burden on other team members and IT when I can resume it on Monday (I am done with the release).
I will check results tonight and perhaps I can send the email mentioned at the end myself.
anode can you check the results on MozillaTest tree once they are done?
anyone from releng, if anode approves the results on tinderbox could you please land the two attachments and deploy the new talos.zip?
We also have to check:
* that tbpl shows tp4 and tp5 properly
* that graph server shows both tp4 and tp5
If everything goes well we can close bug 663192. Otherwise use discernment on what to do next.
If everything lands properly for tracemonkey tomorrow could we send an email to dev.planning saying:
"We are ready to enable tp5 with tp4 for a period of 2 weeks or less to set a baseline and give enough time for devs to start using tp5 instead of tp4. releng has added for a few weeks more rev3 machines from the win7 64-bit until Win 64-bit builds are fully supported. This gives us enough capacity to handle this extra load and the general bad wait times. Please feel free to raise your questions/concerns. Remember that this is only for 2 weeks and everything will be going back to normal"
I noticed these, in case they are important:
> Running test tp5:
> Started Thu, 09 Jun 2011 15:50:34
> LoadPlugin: failed to initialize shared library libXt.so [libXt.so: cannot open shared object file: No such file or directory]
> LoadPlugin: failed to initialize shared library libXext.so [libXext.so: cannot open shared object file: No such file or directory]
> LoadPlugin: failed to initialize shared library /tmp/tmphprZP2/profile/plugins/libflashplayer.so [/tmp/tmphprZP2/profile/plugins/libflashplayer.so: wrong ELF class: ELFCLASS32]
> Screen width/height:1600/1200
> colorDepth:24
> Browser inner width/height: 1024/682
...
> NOISE: Cycle 10: loaded http://localhost/page_load_test/tp5/yandex.ru/yandex.ru/yandsearch@text=mozilla&lr=21215.html (next: http://localhost/page_load_test/tp5/cgi.ebay.com/cgi.ebay.com/ALL-NEW-KINDLE-3-eBOOK-WIRELESS-READING-DEVICE-W-WIFI-/130496077314@pt=LH_DefaultDomain_0&hash=item1e622c1e02.html)
> Corrupt JPEG data: 8 extraneous bytes before marker 0xe1
...
> NOISE: Cycle 10: loaded http://localhost/page_load_test/tp5/goo.ne.jp/goo.ne.jp/index.html (next: http://localhost/page_load_test/tp5/alipay.com/www.alipay.com/index.html)
> Corrupt JPEG data: 40 extraneous bytes before marker 0xee
This question is unrelated to the deployment of this:
* how is it that tp5 takes 10-15 mins on your machine while both tp4 & tp5 take more than an hour on the test machines? Are you cycling once over the pages instead of 10 times?
I am cycling 10 times. Can you link me to a log showing the longer running time?
From the two logs you pasted you can see that fed64 took more than 2 hours and leopard took more than an hour.
TinderboxPrint: cycle time: 01:14:41
TinderboxPrint: cycle time: 02:27:33
Xp and Win7 took 36-38mins which is what we were expecting.
I have re-triggered both jobs and will see what happens. Running on the same slave might not yield a difference though.
I ran the Leopard jobs again with different slaves and I get consistent cycle times:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1307985208.1307989774.14518.gz&fulltext=1
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1307985560.1307990254.16006.gz&fulltext=1
I have set up staging to trigger:
1) tp4
2) tp5
3) tp4 & tp5
Let's see what is going on.
What I am trying to answer is this:
On production:
> Completed test tp4:
> Stopped Thu, 09 Jun 2011 17:49:18
> RETURN: cycle time: 00:11:48<br>
On staging:
> Running test tp4:
> Started Mon, 13 Jun 2011 10:23:00
> Completed test tp4:
> Stopped Mon, 13 Jun 2011 10:59:59
That seems to me like a 25-minute increase, which is more than 200%. On Fedora 64 it is even worse.
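The arithmetic behind that estimate can be checked directly from the figures quoted in this comment. This is just a worked calculation; the variable and function names are illustrative.

```python
from datetime import datetime, timedelta

def parse_hms(text):
    """Turn an HH:MM:SS string such as a 'cycle time' value into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

# Production reports the tp4 cycle time directly.
production = parse_hms("00:11:48")

# Staging only shows start and stop times, so subtract them.
staging = (datetime.strptime("10:59:59", "%H:%M:%S")
           - datetime.strptime("10:23:00", "%H:%M:%S"))

increase = staging - production            # extra time spent on staging
percent = 100 * increase / production      # increase relative to production

print(f"staging {staging}, increase {increase}, {percent:.0f}% slower")
```

The staging run works out to 36:59, which is 25:11 longer than production, or an increase of roughly 213%.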
It seems that the cycle time of tp4 without tp5 is: cycle time: 00:36:48 [1]
That cycle time should be ~11mins.
The only two differences I can think of are that there is something weird with the new talos.zip or that the two staging slaves have an issue.
I have moved a 3rd leopard machine (talos-r3-leopard-003) to staging to rule out that both of these slaves are in a weird state.
I have also changed the support base url to point to the current talos.zip [3] bundle rather than the new one that was posted in bug 658392 [4] (according to anode: the only addition to talos.zip was an entry in the config file so it seems pretty unlikely).
[1] http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1307994525.1307996840.8883.gz&fulltext=1
[2] http://people.mozilla.org/~anodelman/taloszips/c84f630d576f/talos.zip
[3] [armenzg@dm-wwwbuild01 zips]$ sha1sum *zip
d0df5cea24790dd29de825ce4b77864876a09a2b pagesets.zip
90c3dbfe022fb0e854f7af0329f7036d88461d54 plugins.zip
89e9ef8e23a96fd29978d6c0f696be543c2b3fb6 talos.zip
3f04e7bc80b7bf7add552382802e31ef29133de3 tp4.zip
7be6c7f8ab05416e8ef246b7bb850f293dd53ab7 tp5.zip
[4] [armenzg@dm-peep01 zips]$ sha1sum *zip
d0df5cea24790dd29de825ce4b77864876a09a2b pagesets.zip
90c3dbfe022fb0e854f7af0329f7036d88461d54 plugins.zip
f2b45a7f42056b897104d46317902bb5468596a9 talos.zip
3f04e7bc80b7bf7add552382802e31ef29133de3 tp4.zip
7be6c7f8ab05416e8ef246b7bb850f293dd53ab7 tp5.zip
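Comparing the two sha1sum listings by eye is error-prone, so a short sketch of doing it mechanically may help. The digests are copied from the listings in this comment; the helper names are illustrative.

```python
def parse_sha1sum(listing):
    """Parse `sha1sum` output lines ('<digest> <filename>') into a dict."""
    sums = {}
    for line in listing.strip().splitlines():
        digest, name = line.split()
        sums[name] = digest
    return sums

def differing_files(first, second):
    """Return the sorted names present in both listings with different digests."""
    return sorted(name for name in first.keys() & second.keys()
                  if first[name] != second[name])

WWWBUILD = parse_sha1sum("""
d0df5cea24790dd29de825ce4b77864876a09a2b pagesets.zip
90c3dbfe022fb0e854f7af0329f7036d88461d54 plugins.zip
89e9ef8e23a96fd29978d6c0f696be543c2b3fb6 talos.zip
3f04e7bc80b7bf7add552382802e31ef29133de3 tp4.zip
7be6c7f8ab05416e8ef246b7bb850f293dd53ab7 tp5.zip
""")

PEEP = parse_sha1sum("""
d0df5cea24790dd29de825ce4b77864876a09a2b pagesets.zip
90c3dbfe022fb0e854f7af0329f7036d88461d54 plugins.zip
f2b45a7f42056b897104d46317902bb5468596a9 talos.zip
3f04e7bc80b7bf7add552382802e31ef29133de3 tp4.zip
7be6c7f8ab05416e8ef246b7bb850f293dd53ab7 tp5.zip
""")

changed = differing_files(WWWBUILD, PEEP)
```

Only talos.zip differs between the two hosts, so the pagesets and plugins are byte-identical and any timing difference has to come from talos.zip or from the slaves themselves.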
I just got a snow leopard tp4+tp5 run [1]:
> Running test tp4:
> Started Mon, 13 Jun 2011 13:07:31
>
> Completed test tp4:
> Stopped Mon, 13 Jun 2011 13:19:42
> Running test tp5:
> Started Mon, 13 Jun 2011 13:19:42
> Completed test tp5:
> Stopped Mon, 13 Jun 2011 13:56:51
> RETURN: cycle time: 00:49:19<br>
* tp4 -> ~12mins
* tp5 -> ~37mins
It seems that tp4 takes around the same amount of time as it does on production, but tp5 takes around *3x* as long as tp4 normally takes.
[1] http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1307995374.1307998618.16096.gz&fulltext=1
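The per-test split can also be pulled out of the log mechanically. This is only a sketch: it assumes the log excerpt quoted in this comment (with the `>` quote markers stripped) and an English locale for `strptime`. The regular expression and function name are my own, not something talos provides.

```python
import re
from datetime import datetime

# Excerpt of the run quoted in the comment, without the "> " quote markers.
LOG = """\
Running test tp4:
Started Mon, 13 Jun 2011 13:07:31

Completed test tp4:
Stopped Mon, 13 Jun 2011 13:19:42
Running test tp5:
Started Mon, 13 Jun 2011 13:19:42
Completed test tp5:
Stopped Mon, 13 Jun 2011 13:56:51
"""

STAMP = "%a, %d %b %Y %H:%M:%S"

# One match per test: its name plus the Started and Stopped timestamps.
PATTERN = re.compile(
    r"Running test (?P<name>\w+):\s+Started (?P<start>[^\n]+)\s+"
    r"Completed test (?P=name):\s+Stopped (?P<stop>[^\n]+)"
)


def per_test_durations(log: str) -> dict:
    """Return {test name: elapsed timedelta} for every test found in the log."""
    durations = {}
    for match in PATTERN.finditer(log):
        start = datetime.strptime(match.group("start").strip(), STAMP)
        stop = datetime.strptime(match.group("stop").strip(), STAMP)
        durations[match.group("name")] = stop - start
    return durations


durations = per_test_durations(LOG)
ratio = durations["tp5"] / durations["tp4"]

for name, took in durations.items():
    print(f"{name}: {took}")
print(f"tp5 / tp4 = {ratio:.2f}")
```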
I tried to re-create the leopard results that you are seeing, but it runs quickly on my staging leopard box.
tools-r3-leopard-001:
Running test tp4:
Started Mon, 13 Jun 2011 14:11:52
Completed test tp4:
Stopped Mon, 13 Jun 2011 14:23:19
Running test tp5:
Started Mon, 13 Jun 2011 14:23:19
Completed test tp5:
Stopped Mon, 13 Jun 2011 14:36:07
So tp4 takes about 11 minutes and tp5 about 13 minutes.
I can work on further results tomorrow for the other slow systems that you are seeing, but I'm thinking you have staging issues.
talos-r3-leopard-003 has been able to pick up a few tp4 jobs and scored good timings with the production zips. I have disabled the other two slaves and am only running leopard-003.
No longer blocks: 658392
Depends on: 658392
Attachment #531787 - Attachment is obsolete: true
Attachment #531788 - Attachment description: buildbotcustom fix for tp5 (take 2) → (checked-in) buildbotcustom fix for tp5 (take 2)
Attachment #537840 - Attachment description: enable tp5 on tracemonkey only → (checked-in) enable tp5 on tracemonkey only
We can land this as soon as everything looks good on TraceMonkey. I have triggered all the builders for TraceMonkey and tomorrow we should see the results on tbpl. I used builder_list.py and vimdiff to help me compare the old list of builders with the new one:
> python ~/repos/releng/braindump/buildbot-related/builder_list.py master.cfg > new_builders
L s: talos-r3-fed-054 s: talos-r3-fed-054 id:20110613181633 rev:3acacde59381 cycle time: 00:23:20
tp5: 391.98 tp5_pbytes: 439.5MB tp5_xres: 18.6MB tp5_rss: 141.3MB tp5_shutdown: 920.0
tp4: 349.42 tp4_pbytes: 159.7MB tp4_xres: 427.5KB tp4_rss: 48.9MB tp4_shutdown: 728.0
Details: tp5 tp5_pbytes tp5_xres tp5_rss tp5_shutdown tp4 tp4_pbytes tp4_xres tp4_rss tp4_shutdown
Attachment #539260 - Flags: review?(anodelman)
Depends on: 664371
I can see tp5 showing up on graphs: http://graphs-new.mozilla.org/graph.html#tests=[[89,4,1],[89,4,12],[89,4,13],[89,4,15],[89,4,14]]&sel=none&displayrange=7&datatype=running anode can you please have a look at bug 664371? We are ready to go ahead and enable this on every branch as soon as you give the go/no-go. FTR I landed this patch http://hg.mozilla.org/build/buildbot-configs/rev/fb8a29ea0773#l1.99 which happened to fix that tp was being half-enabled on project_branches without being explicit. I added 'tp' to the list of suites to be disabled by default on line 1.99.
I'm not too worried about an intermittent fail on a new pageset - these are pages that we haven't been testing against yet so it is probably finding new and wonderful code paths in the browser.
Attachment #539260 - Flags: review?(anodelman) → review+
FTR this is the time that the jobs take:
Rev3 Fedora 12 tracemonkey talos tp 0:32:08
Rev3 Fedora 12x64 tracemonkey talos tp 0:25:30
Rev3 MacOSX Leopard 10.5.8 tracemonkey talos tp 0:36:35
Rev3 MacOSX Snow Leopard 10.6.2 tracemonkey talos tp 0:36:48
Rev3 WINNT 5.1 tracemonkey talos tp 0:33:43
Rev3 WINNT 6.1 tracemonkey talos tp 0:32:05
Which are decent.
Comment on attachment 539260 (checked-in) [configs] disable tp4 everywhere except older release branches & enable tp (tp4+tp5) everywhere except older release branches Enabled everywhere except older release branches: http://hg.mozilla.org/build/buildbot-configs/rev/48bf23b49d2a This will be picked up in tomorrow's scheduled reconfig. anode, I leave it in your hands to come back to us when we are ready to disable tp4 and to follow up on any bugs. Closer to that time I will raise it on dev.planning and the Tuesday call (June 28th). Sounds good?
Attachment #539260 - Attachment description: [configs] disable tp4 everywhere except older release branches & enable tp (tp4+tp5) everywhere except older release branches → (checked-in) [configs] disable tp4 everywhere except older release branches & enable tp (tp4+tp5) everywhere except older release branches
This got deployed to production a couple of hours ago and it is showing up on tbpl. http://hg.mozilla.org/build/buildbot-configs/rev/8942ffd33487 I have also announced it on dev.planning and dev.tree-management: http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/7ad2ba7f8f006d65#
Blocks: 664831
I've compared the Tracemonkey tp4 and tp5 results and things look good. I see them reflecting the same wins/regressions over time and I believe that tp5 is ready to go it alone. Kill off tp4 at will on those branches where the side-by-side testing has been occurring.
Sweet. The scheduled date is June 30th. I have announced it at the Tuesday meeting and on the mailing lists. I guess this bug is done ("creating tp5 and deploying it"), right?
I am using Bugzilla just to keep everything centralized. See the issue opened on bitbucket as well: https://bitbucket.org/mconnor/compare-talos/issue/12/add-tp5-support
Attachment #541136 - Flags: review?(mconnor)
Depends on: 666711
What's left in here? IIUC this is done.
Tp5 has been deployed so this is now complete.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED