Closed Bug 702351 Opened 13 years ago Closed 13 years ago

deploy talos.zip which includes responsiveness

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: armenzg)

References

Details

(Whiteboard: [talos] This bug is only waiting for attachment 576132 make it into production masters.)

Attachments

(2 files)

bug 696810 failed because we turned on talos responsiveness for all branches. We should ensure that we only turn it on for mozilla-central when we deploy: http://people.mozilla.org/~jmaher/taloszips/c7c8935034a4/talos.zip Now we need to add a '--responsiveness' commandline to the talos options in order to turn it on.
Blocks: 631571, 697555
Double checking: is this safe? no numbers will be shifted, right?
correct, the numbers will be untouched.
is this a non-mobile update? we can't deploy talos to tegras until I fix bug 701979
correct, I will have a talos update for mobile later this week, but this isn't for mobile :)
Assignee: nobody → armenzg
[armenzg@dm-wwwbuild01 ~]$ cd /var/www/html/build/talos/zips/
[armenzg@dm-wwwbuild01 zips]$ ls
ahal-peptest-18141e6.zip      old                  talos.bug696810.zip
flash32_10_3_183_5.zip        pagesets.zip         talos.zip
flash64_11_0_d1_98.zip        peptest.zip          tp4.zip
mozbase.zip                   plugins.zip          tp5.zip
mozilla-mozbase-61b09a2.zip   talos.bug694579.zip
[armenzg@dm-wwwbuild01 zips]$ wget -Otalos.bug702351.zip http://people.mozilla.org/~jmaher/taloszips/c7c8935034a4/talos.zip
--2011-11-16 06:57:41-- http://people.mozilla.org/~jmaher/taloszips/c7c8935034a4/talos.zip
Resolving people.mozilla.org... 10.2.74.108
Connecting to people.mozilla.org|10.2.74.108|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6006106 (5.7M) [application/zip]
Saving to: `talos.bug702351.zip'
100%[======================================>] 6,006,106 --.-K/s in 0.1s
2011-11-16 06:57:42 (52.0 MB/s) - `talos.bug702351.zip' saved [6006106/6006106]
[armenzg@dm-wwwbuild01 zips]$ ls -l talos.zip
lrwxrwxrwx 1 jford build 19 Oct 26 12:46 talos.zip -> talos.bug694579.zip
[armenzg@dm-wwwbuild01 zips]$ mv talos.bug69
talos.bug694579.zip talos.bug696810.zip
[armenzg@dm-wwwbuild01 zips]$ mv talos.bug69* old/ && rm talos.zip && ln -s talos.bug702351.zip talos.zip
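The deployment above repoints the talos.zip symlink with `rm talos.zip && ln -s ...`, which leaves a brief moment where talos.zip does not exist. As a minimal sketch only (not what was run here), the same swap can be done without that gap on a POSIX filesystem by creating the new link under a temporary name and renaming it over the old one. The function names and the temporary link name are hypothetical; the zip file names are the ones from the transcript.

```python
import os
import tempfile


def repoint_symlink(link_path, new_target):
    """Point link_path at new_target with no moment where link_path is missing."""
    directory = os.path.dirname(os.path.abspath(link_path))
    # Hypothetical temporary name, kept in the same directory so that the
    # final rename stays on one filesystem.
    tmp_link = os.path.join(
        directory, ".%s.tmp.%d" % (os.path.basename(link_path), os.getpid())
    )
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_target, tmp_link)
    # Renaming over an existing path is atomic on POSIX filesystems.
    os.replace(tmp_link, link_path)


def demo():
    """Recreate the swap from the transcript inside a throwaway directory."""
    workdir = tempfile.mkdtemp()
    for name in ("talos.bug694579.zip", "talos.bug702351.zip"):
        open(os.path.join(workdir, name), "w").close()
    link = os.path.join(workdir, "talos.zip")
    os.symlink("talos.bug694579.zip", link)
    repoint_symlink(link, "talos.bug702351.zip")
    return os.readlink(link)
```

Backing out is the same call with the previous file name as the target, provided the old zip is still in the directory.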
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [talos]
This has caused a regression: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=tp
Let's back it out.
armenzg: "Error: The name org.freedesktop.UPower was not provided by any .service files"
jmaher: self.primaryPid = self.ffprocess.GetPidsByName(process)[-1]
https://tbpl.mozilla.org/php/getParsedLog.php?id=7434413&tree=Mozilla-Inbound&full=1
Status: RESOLVED → REOPENED
Priority: -- → P1
Resolution: FIXED → ---
Last login: Wed Nov 16 06:56:01 2011 from bm-vpn01.build.sjc1.mozilla.com
[armenzg@dm-wwwbuild01 ~]$ cd /var/www/html/build/talos/zips/
[armenzg@dm-wwwbuild01 zips]$ ls -l talos.zip
lrwxrwxrwx 1 armenzg build 19 Nov 16 06:58 talos.zip -> talos.bug702351.zip
[armenzg@dm-wwwbuild01 zips]$ mv old/talos.bug694579.zip .
[armenzg@dm-wwwbuild01 zips]$ rm talos.zip && ln -s talos.bug694579.zip talos.zip
[armenzg@dm-wwwbuild01 zips]$ ls talos.*
talos.bug694579.zip talos.bug702351.zip talos.zip
[armenzg@dm-wwwbuild01 zips]$ ls -l talos.zip
lrwxrwxrwx 1 armenzg build 19 Nov 16 13:41 talos.zip -> talos.bug694579.zip

I am going to be re-triggering talos jobs.
> self.primaryPid = self.ffprocess.GetPidsByName(process)[-1]

I notice immediately that ffprocess uses subprocess in launch, which will have the subprocess's PID. Instead we search through a subshelled 'ps' output to get the information. This seems problematic.

I don't know why we don't find e.g. 'firefox' as a process. That would be good to know. I have not seen this locally.
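To illustrate the point about subprocess already knowing the PID, here is a minimal sketch using only the standard library. It is not talos code; the helper names are hypothetical, and the child process is the Python interpreter rather than a browser. One caveat, stated as an assumption: if the launched command is a wrapper that starts the real binary as a separate process, the PID returned here is the wrapper's.

```python
import subprocess
import sys


def launch_and_get_pid(command):
    """Start command and return (process handle, PID) without scanning ps output."""
    proc = subprocess.Popen(command, stdout=subprocess.PIPE)
    return proc, proc.pid


def child_reports_same_pid():
    """Ask a child interpreter for its own PID and compare it with Popen's view."""
    proc, pid = launch_and_get_pid(
        [sys.executable, "-c", "import os; print(os.getpid())"]
    )
    out, _ = proc.communicate()
    return int(out.decode().strip()) == pid
```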
(In reply to Jeff Hammel [:jhammel] from comment #8)
> > self.primaryPid = self.ffprocess.GetPidsByName(process)[-1]
>
> I notice immediately that ffprocess uses subprocess in launch which will
> have the subprocess's PID. Instead we search through a subshelled 'ps'
> output to get the information. This seems problematic.
>
> I don't know why we don't find e.g. 'firefox' as a process. That would be
> good to know. I have not seen this locally.

Is this being done because of the fact that on Foopy's we can have a lot of processes that are the same?
We're back to normal. I can see green tp jobs: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=tp&rev=d0c677daedff

FIXED as in "we're back to normal" not as "the requested talos.zip is deployed"

Things like this won't happen once bug 673131 is fixed. I'm starting that bug very soon. Please open a new bug or re-open this once there is a new talos.zip to deploy.
Status: REOPENED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
OK, I uploaded a new build of talos without the python error and with responsiveness disabled on linux only. This should be green: http://people.mozilla.org/~jmaher/taloszips/9baf50c14041/talos.zip

To clarify, this won't change the numbers, so we will not need side-by-side staging. We will need a change to config.py to add the --responsiveness flag for m-c only.
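A rough sketch of what "add the flag for m-c only" could look like. This is not the buildbot-configs code: the branch tuple, the function name and the placeholder base option are all hypothetical, and only the --responsiveness flag itself comes from this bug.

```python
# Hypothetical list of branches that should run responsiveness.
RESPONSIVENESS_BRANCHES = ("mozilla-central",)


def talos_options(branch, base_options):
    """Return the talos command-line options for a branch.

    The flag is appended only for branches listed above, so every other
    branch keeps exactly the options it had before.
    """
    options = list(base_options)  # copy, so the shared defaults are not mutated
    if branch in RESPONSIVENESS_BRANCHES:
        options.append("--responsiveness")
    return options
```

For example, `talos_options("mozilla-inbound", ["--example-option"])` returns the base options unchanged, while the same call for "mozilla-central" gains the extra flag.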
Status: RESOLVED → REOPENED
Priority: P1 → --
Resolution: FIXED → ---
Comment on attachment 575206
patch to turn on responsiveness for tp on m-c (1.0)

Can the new talos.zip be deployed first, and then this config change?
Attachment #575206 - Flags: review?(armenzg) → review+
yes, we can deploy the talos.zip first, then do the config change afterwards.
this is the talos.zip which just finished in staging: http://people.mozilla.org/~jmaher/taloszips/5dfaf26ff78a/talos.zip this talos.zip includes the fix from bug 429592.
(In reply to Mike Taylor [:bear] from comment #9)
> (In reply to Jeff Hammel [:jhammel] from comment #8)
> > > self.primaryPid = self.ffprocess.GetPidsByName(process)[-1]
> >
> > I notice immediately that ffprocess uses subprocess in launch which will
> > have the subprocess's PID. Instead we search through a subshelled 'ps'
> > output to get the information. This seems problematic.
> >
> > I don't know why we don't find e.g. 'firefox' as a process. That would be
> > good to know. I have not seen this locally.
>
> Is this being done because of the fact that on Foopy's we can have a lot of
> processes that are the same?

Probably the opposite. This was fallout from Bug 700722, which is needed for bug 650887. We grep each line of the ps output for the process name (in this case 'firefox'). This will match anything with 'firefox' anywhere in the command (and the less said about how we find PIDs the better; let's just say that if we look for PID=2334 we might get the ps entry for PID 12334).

So the fix in bug 700722 looks more precisely at the basename of the actual executable. For some reason this breaks. No one (AFAIK) knows why, but you can see that the two methods are quite different in effect. I.e. this is covering up a bug, something like we're actually looking for firefox-bin but cmanager is looking for firefox and just happening to find it with its over-ambitious approach.

There are actually a whole lot of other bugs hiding in this mess, but that's the basic idea.
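The difference between the two lookups can be shown on a made-up ps listing. This is a sketch under stated assumptions, not the talos implementation: SAMPLE_PS is invented data, and both function names are hypothetical. The loose version matches the name anywhere in the line; the strict version compares the basename of the first word of the command.

```python
import os

# Invented ps-style output (PID, then full command line).
SAMPLE_PS = """\
  PID COMMAND
 2334 /opt/firefox/firefox-bin -profile /tmp/p
12334 /usr/bin/python run_tests.py --browser firefox
  901 /usr/bin/vim notes-about-firefox.txt
"""


def pids_by_substring(ps_output, name):
    """Loose match: any line containing name anywhere counts."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        if name in line:
            pids.append(int(line.split()[0]))
    return pids


def pids_by_basename(ps_output, name):
    """Strict match: the executable's basename must equal name exactly.

    Assumes the executable path contains no spaces.
    """
    pids = []
    for line in ps_output.splitlines()[1:]:
        fields = line.split(None, 1)
        if len(fields) < 2:
            continue
        pid, command = fields
        executable = command.split()[0]
        if os.path.basename(executable) == name:
            pids.append(int(pid))
    return pids
```

With this data the loose lookup for 'firefox' returns three PIDs, two of them unrelated processes, while the strict lookup for 'firefox' returns an empty list because the executable is named firefox-bin. Indexing an empty list with `[-1]` raises IndexError, which is one plausible way (an assumption, not confirmed in this bug) for the stricter lookup to surface as a Python error.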
[armenzg@dm-wwwbuild01 zips]$ rm talos.bug702351.zip
[armenzg@dm-wwwbuild01 zips]$ wget -Otalos.bug702351.zip http://people.mozilla.org/~jmaher/taloszips/5dfaf26ff78a/talos.zip
--2011-11-18 06:47:55-- http://people.mozilla.org/~jmaher/taloszips/5dfaf26ff78a/talos.zip
Resolving people.mozilla.org... 10.2.74.108
Connecting to people.mozilla.org|10.2.74.108|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6007768 (5.7M) [application/zip]
Saving to: `talos.bug702351.zip'
100%[======================================>] 6,007,768 33.2M/s in 0.2s
2011-11-18 06:47:55 (33.2 MB/s) - `talos.bug702351.zip' saved [6007768/6007768]
[armenzg@dm-wwwbuild01 zips]$ ls -l talos.zip
lrwxrwxrwx 1 armenzg build 19 Nov 16 13:41 talos.zip -> talos.bug694579.zip
[armenzg@dm-wwwbuild01 zips]$ mv talos.bug694579.zip old && rm talos.zip && ln -s talos.bug702351.zip talos.zip
[armenzg@dm-wwwbuild01 zips]$ ls -l talos.zip
lrwxrwxrwx 1 armenzg build 19 Nov 18 06:48 talos.zip -> talos.bug702351.zip

I don't want to start a reconfig, since an 8.0.1 go-to-build might come this morning. The talos.zip can be reverted very easily if we need to recover from it, and it should only affect day-to-day development rather than a release.
Comment on attachment 575206
patch to turn on responsiveness for tp on m-c (1.0)

Landed on "default": http://hg.mozilla.org/build/buildbot-configs/rev/635eb26ac667
This will be picked up on the next reconfig.
Attachment #575206 - Flags: checked-in+
Whiteboard: [talos] → [talos] talos.zip has been landed - waiting for next reconfig
This went to production today.
Are bug 704010, bug 703996, and bug 704380 the result of improved talos code detecting issues, or of a regression in talos causing problems? They all seem to have started after the deployment here.
Attachment #576132 - Flags: review?(jmaher)
jmaher, I would like to back out the talos.zip for a day or two to make sure the bugs that nthomas mentioned are not caused by us. The creation dates of those bugs fit the theory. Makes sense?
Interesting bugs, and they all started showing up at the same time.

Timeline:
11/18 - landed talos.zip
11/20 - 3 bugs showed up
11/21 - did reconfig

I would think something else is causing a problem here, as we are only seeing these issues on m-c and inbound. If I had data points to show problems on other branches, that would be more convincing that this problem is related to the talos.zip. We have improved our overall detection of processes, but the timeline and branches don't show enough evidence for me to assume talos.zip.

Another thing: don't we reboot between test runs? For the two bugs that are related to talos, this error is during initial startup (no test has run yet).
Comment on attachment 576132
We missed "_tests" in our patch

Review of attachment 576132:
-----------------------------------------------------------------
good catch!
Attachment #576132 - Flags: review?(jmaher) → review+
[11:10am] armenzg: jmaher: the problem is that one job gets into trouble and prevents the machine from rebooting
[11:10am] armenzg: which means that it takes the next job without having rebooted
[11:10am] armenzg: jmaher: I know it would be slowing us down but would you be OK if we backed out talos.zip for a day or two?
[11:10am] jmaher: armenzg: oh, I didn't realize that;
[11:10am] philor: if bmo cooperates, I'll be putting a couple of Windows crashes from fx-team in that bug
[11:11am] philor: but yeah, the only way talos is "responsible" is that it's now giving a clear message about why the mochitest crash is screwing it over, instead of the previous message
[11:11am] jmaher: armenzg: I don't have enough evidence to lean me towards the talos.zip
[11:12am] philor: talos has objected to running processes forever, that's not new in the responsiveness thing, is it?
[11:12am] philor: why did I ask that as a question?
[11:12am] philor: that's not new.
[11:13am] jmaher: that is not new
[11:13am] philor: the new thing is saying what process, and wlach landed that in... September?
[11:13am] jmaher: philor: correct
[11:13am] jmaher: we still check for the same processes and use the same discovery technique
[11:13am] ted: the responsiveness thing just fiddles some stuff inside the Tp5 run
[11:14am] jmaher: instead of a true/false return value, we return which process is still running
[11:15am] philor: we're seeing dwwin still running because we're crashing Windows while running mochitests
[11:15am] jmaher: philor: so if that is the case we shouldn't be backing talos.zip out
[11:16am] philor: only to get out of the blame-hose
[11:17am] philor: nobody will want to investigate a thing like bug 704010, only to look for someone to blame, that's why I didn't even file it the first time we started seeing it
[11:18am] jmaher: philor: yeah, we need to figure out how to get the slaves to reboot
[11:18am] jmaher: so our tests are clean
[11:19am] philor: jmaher: well, I'd vote for them not crashing
[11:20am] jmaher: philor: I vote for both
[11:21am] philor: but you could wallpaper over that part by doing an auto-retry when you find a running process
[11:21am] philor: sadly, "Automation Error" went away as a generic retry, not sure what else will do it
[11:25am] armenzg: philor: jmaher what would you suggest me do?
[11:25am] armenzg: I would like to post the convo on the bug
[11:26am] armenzg: philor: which of those 3 bugs you had seen before?
[11:27am] armenzg: so, the newer talos.zip is not the problem but a code change?
[11:27am] philor: armenzg: bug 704010, the M4 flavor, dying in dbaron's huge and over-verbose CSS tests
[11:28am] philor: the new talos.zip is a red herring, I think
[11:29am] philor: yeah, and I'm pretty sure your change didn't reverse time's arrow and cause the mochitest job the slave previously did to crash Windows
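The change jmaher describes at 11:14am (returning which process is still running instead of true/false) can be sketched as follows. This is an illustration only, not the talos code: the function names and the message wording are hypothetical, and the set of running names is passed in rather than discovered, since the discovery technique is said to be unchanged.

```python
def find_running(watched, running):
    """Return the watched process names that are still running.

    An empty list means nothing is left over, so callers that only want a
    yes/no answer can still test the result for truthiness.
    """
    return [name for name in watched if name in running]


def preflight_message(watched, running):
    """Build a failure message naming the leftover processes, or None if clean."""
    leftover = find_running(watched, running)
    if leftover:
        # Hypothetical wording; the point is that the log names the process.
        return "processes still running before the test: %s" % ", ".join(leftover)
    return None
```

With `watched = ["firefox", "dwwin"]` and only dwwin left over from an earlier crash, the message names dwwin, which is what lets someone reading the log trace the failure back to the previous job on that machine.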
Bug 429592 depended on the new talos drop, can I go ahead and land it or do I need to hold off pending some additional decision here?
Comment on attachment 576132
We missed "_tests" in our patch

This landed in "default": http://hg.mozilla.org/build/buildbot-configs/rev/de57a624cc81
There will be another scheduled reconfig on Thursday if it doesn't happen earlier. Once it happens we will mention it here.
Attachment #576132 - Flags: checked-in+
(In reply to Benjamin Smedberg [:bsmedberg] from comment #26)
> Bug 429592 depended on the new talos drop, can I go ahead and land it or do
> I need to hold off pending some additional decision here?

I answered this on IRC. We're not expecting to back out the talos.zip. This bug is only waiting for attachment 576132 to make it into production masters. If the talos.zip were to be backed out, we would let you know.
Whiteboard: [talos] talos.zip has been landed - waiting for next reconfig → [talos] This bug is only waiting for attachment 576132 make it into production masters.
There's a load of red on inbound/m-c now, might this be the cause?
No, not related since it got deployed on Friday. Can you please remove it from the tree status?
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #30)
> No, not related since it got deployed on Friday.

Ok, was looking at the date of the followup in comment 27. Have adjusted the tbpl status message accordingly.
This made it to production today.
(In reply to Ben Hearsum [:bhearsum] from comment #32)
> This made it to production today.

.... by which I mean the patch to buildbot-configs landed. I haven't deployed talos.zip yet.
There is no talos.zip left to deploy here anymore. Only the buildbot-configs change had to land. Thanks Ben.
Status: REOPENED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering