Closed
Bug 702351
Opened 13 years ago
Closed 13 years ago
deploy talos.zip which includes responsiveness
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jmaher, Assigned: armenzg)
References
Details
(Whiteboard: [talos] This bug is only waiting for attachment 576132 make it into production masters.)
Attachments
(2 files)
2.13 KB, patch | armenzg: review+ | armenzg: checked-in+
1.16 KB, patch | jmaher: review+ | armenzg: checked-in+
bug 696810 failed because we turned on talos responsiveness for all branches. We should ensure that we only turn it on for mozilla-central when we deploy:
http://people.mozilla.org/~jmaher/taloszips/c7c8935034a4/talos.zip
We now need to add a '--responsiveness' command-line option to the talos options in order to turn it on.
Reporter | ||
Updated•13 years ago
|
Assignee | ||
Comment 1•13 years ago
|
||
Double checking: is this safe? no numbers will be shifted, right?
Reporter | ||
Comment 2•13 years ago
|
||
correct, the numbers will be untouched.
Comment 3•13 years ago
|
||
is this a non-mobile update?
we can't deploy talos to tegras until I fix bug 701979
Reporter | ||
Comment 4•13 years ago
|
||
correct, I will have a talos update for mobile later this week, but this isn't for mobile :)
Assignee | ||
Updated•13 years ago
|
Assignee: nobody → armenzg
Assignee | ||
Comment 5•13 years ago
|
||
[armenzg@dm-wwwbuild01 ~]$ cd /var/www/html/build/talos/zips/
[armenzg@dm-wwwbuild01 zips]$ ls
ahal-peptest-18141e6.zip old talos.bug696810.zip
flash32_10_3_183_5.zip pagesets.zip talos.zip
flash64_11_0_d1_98.zip peptest.zip tp4.zip
mozbase.zip plugins.zip tp5.zip
mozilla-mozbase-61b09a2.zip talos.bug694579.zip
[armenzg@dm-wwwbuild01 zips]$ wget -Otalos.bug702351.zip http://people.mozilla.org/~jmaher/taloszips/c7c8935034a4/talos.zip
--2011-11-16 06:57:41-- http://people.mozilla.org/~jmaher/taloszips/c7c8935034a4/talos.zip
Resolving people.mozilla.org... 10.2.74.108
Connecting to people.mozilla.org|10.2.74.108|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6006106 (5.7M) [application/zip]
Saving to: `talos.bug702351.zip'
100%[======================================>] 6,006,106 --.-K/s in 0.1s
2011-11-16 06:57:42 (52.0 MB/s) - `talos.bug702351.zip' saved [6006106/6006106]
[armenzg@dm-wwwbuild01 zips]$ ls -l talos.zip
lrwxrwxrwx 1 jford build 19 Oct 26 12:46 talos.zip -> talos.bug694579.zip
[armenzg@dm-wwwbuild01 zips]$ mv talos.bug69
talos.bug694579.zip talos.bug696810.zip
[armenzg@dm-wwwbuild01 zips]$ mv talos.bug69* old/ && rm talos.zip && ln -s talos.bug702351.zip talos.zip
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [talos]
Assignee | ||
Comment 6•13 years ago
|
||
This has caused a regression:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=tp
Let's back it out.
armenzg: "Error: The name org.freedesktop.UPower was not provided by any .service files"
jmaher: self.primaryPid = self.ffprocess.GetPidsByName(process)[-1]
https://tbpl.mozilla.org/php/getParsedLog.php?id=7434413&tree=Mozilla-Inbound&full=1
Status: RESOLVED → REOPENED
Priority: -- → P1
Resolution: FIXED → ---
Assignee | ||
Comment 7•13 years ago
|
||
Last login: Wed Nov 16 06:56:01 2011 from bm-vpn01.build.sjc1.mozilla.com
[armenzg@dm-wwwbuild01 ~]$ cd /var/www/html/build/talos/zips/
[armenzg@dm-wwwbuild01 zips]$ ls -l talos.zip
lrwxrwxrwx 1 armenzg build 19 Nov 16 06:58 talos.zip -> talos.bug702351.zip
[armenzg@dm-wwwbuild01 zips]$ mv old/talos.bug694579.zip .
[armenzg@dm-wwwbuild01 zips]$ rm talos.zip && ln -s talos.bug694579.zip talos.zip
[armenzg@dm-wwwbuild01 zips]$ ls talos.*
talos.bug694579.zip talos.bug702351.zip talos.zip
[armenzg@dm-wwwbuild01 zips]$ ls -l talos.zip
lrwxrwxrwx 1 armenzg build 19 Nov 16 13:41 talos.zip -> talos.bug694579.zip
I am going to be re-triggering talos jobs.
Comment 8•13 years ago
|
||
> self.primaryPid = self.ffprocess.GetPidsByName(process)[-1]
I notice immediately that ffprocess uses subprocess in launch which will have the subprocess's PID. Instead we search through a subshelled 'ps' output to get the information. This seems problematic.
I don't know why we don't find e.g. 'firefox' as a process. That would be good to know. I have not seen this locally.
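A minimal sketch of the alternative hinted at above: taking the PID straight from the object that subprocess returns rather than searching 'ps' output. The helper name launch_and_get_pid is hypothetical and is not part of talos; sys.executable stands in for the browser binary. One caveat, stated as an assumption: if the launched command is a wrapper script that starts the real binary, the PID obtained this way would be the wrapper's.

```python
import subprocess
import sys


def launch_and_get_pid(argv):
    """Start a child process and return (process, pid) without consulting ps.

    Hypothetical helper for illustration; not the talos ffprocess API.
    """
    proc = subprocess.Popen(argv)
    # Popen records the child's PID at launch time, so no lookup by name is needed.
    return proc, proc.pid


if __name__ == "__main__":
    # sys.executable is only a stand-in for the browser in this sketch.
    proc, pid = launch_and_get_pid([sys.executable, "-c", "pass"])
    proc.wait()
    print("launched pid", pid)
```

With a PID recorded at launch, an empty result from a name-based search could no longer cause the `[-1]` lookup to fail.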
Comment 9•13 years ago
|
||
(In reply to Jeff Hammel [:jhammel] from comment #8)
> > self.primaryPid = self.ffprocess.GetPidsByName(process)[-1]
>
> I notice immediately that ffprocess uses subprocess in launch which will
> have the subprocess's PID. Instead we search through a subshelled 'ps'
> output to get the information. This seems problematic.
>
> I don't know why we don't find e.g. 'firefox' as a process. That would be
> good to know. I have not seen this locally.
Is this being done because of the fact that on Foopy's we can have a lot of processes that are the same?
Assignee | ||
Comment 10•13 years ago
|
||
We're back to normal. I can see green tp jobs:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=tp&rev=d0c677daedff
FIXED as in "we're back to normal" not as "the requested talos.zip is deployed"
Things like this won't happen once bug 673131 is fixed. I'm starting that bug very soon.
Please open a new bug or re-open this once there is a new talos.zip to deploy.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 11•13 years ago
|
||
OK, I uploaded a new build of talos without the Python error and with responsiveness disabled on Linux only; this should be green:
http://people.mozilla.org/~jmaher/taloszips/9baf50c14041/talos.zip
To clarify, this won't change the numbers, so we will not need side-by-side staging. We will need a change to config.py to add the --responsiveness flag for m-c only.
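A rough sketch of what "add the flag for m-c only" could look like. This is not the actual buildbot-configs config.py; the function name, its parameters, and the placeholder base options are all assumptions made for illustration. Only the '--responsiveness' flag and the branch name come from this bug.

```python
def talos_options_for(branch, base_options,
                      responsive_branches=("mozilla-central",)):
    """Return the talos options for a branch (hypothetical helper).

    '--responsiveness' is appended only for branches listed in
    responsive_branches; every other branch keeps base_options unchanged.
    """
    opts = list(base_options)  # copy so the shared defaults are not mutated
    if branch in responsive_branches and "--responsiveness" not in opts:
        opts.append("--responsiveness")
    return opts


if __name__ == "__main__":
    base = ["<existing tp options>"]  # placeholder, not real talos flags
    print(talos_options_for("mozilla-central", base))
    print(talos_options_for("mozilla-inbound", base))
```

Copying the base list before appending keeps the flag from leaking into other branches that share the same defaults, which is the failure described for bug 696810.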
Status: RESOLVED → REOPENED
Priority: P1 → --
Resolution: FIXED → ---
Reporter | ||
Comment 12•13 years ago
|
||
Attachment #575206 -
Flags: review?(armenzg)
Assignee | ||
Comment 13•13 years ago
|
||
Comment on attachment 575206 [details] [diff] [review]
patch to turn on responsiveness for tp on m-c (1.0)
Can the new talos.zip be deployed first, and then this config change?
Attachment #575206 -
Flags: review?(armenzg) → review+
Reporter | ||
Comment 14•13 years ago
|
||
yes, we can deploy the talos.zip first, then do the config change afterwards.
Reporter | ||
Comment 15•13 years ago
|
||
this is the talos.zip which just finished in staging:
http://people.mozilla.org/~jmaher/taloszips/5dfaf26ff78a/talos.zip
this talos.zip includes the fix from bug 429592.
Comment 16•13 years ago
|
||
(In reply to Mike Taylor [:bear] from comment #9)
> (In reply to Jeff Hammel [:jhammel] from comment #8)
> > > self.primaryPid = self.ffprocess.GetPidsByName(process)[-1]
> >
> > I notice immediately that ffprocess uses subprocess in launch which will
> > have the subprocess's PID. Instead we search through a subshelled 'ps'
> > output to get the information. This seems problematic.
> >
> > I don't know why we don't find e.g. 'firefox' as a process. That would be
> > good to know. I have not seen this locally.
>
> Is this being done because of the fact that on Foopy's we can have a lot of
> processes that are the same?
Probably the opposite. This was fallout from Bug 700722, which is needed for bug 650887. We grep each line of the ps output for the process name (in this case 'firefox'). This will match anything with 'firefox' anywhere in the command (and the less said about how we find PIDs the better; let's just say that if we look for PID=2334 we might get the ps entry for PID 12334). So the fix in bug 700722 looks more precisely at the basename of the actual executable. For some reason this breaks. No one (AFAIK) knows why, but you can see that the two methods are quite different in effect. I.e. the old approach is covering up a bug, something like: we're actually looking for firefox-bin, but cmanager is looking for firefox and just happens to find it with its over-ambitious approach. There are actually a whole lot of other bugs hiding in this mess, but that's the basic idea.
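To make the difference between the two matching strategies concrete, here is a small sketch. It does not reproduce the talos code: the function names and the sample 'ps' lines are made up for illustration, and the real output format on the test machines may differ.

```python
import os

# Made-up `ps`-style lines (PID followed by the command line), illustration only.
SAMPLE_PS = """\
 2334 /usr/lib/firefox/firefox-bin -profile /tmp/p
12334 python run_tests.py --browser /opt/firefox/firefox
  871 vim notes-about-firefox.txt
"""


def pids_by_substring(ps_output, name):
    """Loose match: any line that mentions `name` anywhere counts."""
    return [int(line.split()[0])
            for line in ps_output.splitlines() if name in line]


def pids_by_basename(ps_output, name):
    """Strict match: the basename of the executable must equal `name`."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and os.path.basename(fields[1]) == name:
            pids.append(int(fields[0]))
    return pids


def loose_pid_match(line, pid):
    """Substring test on the whole line: 2334 also 'matches' 12334."""
    return str(pid) in line


def strict_pid_match(line, pid):
    """Compare the PID column only."""
    fields = line.split()
    return bool(fields) and fields[0] == str(pid)
```

On this sample, the loose search for 'firefox' returns all three PIDs, the strict search for 'firefox' returns nothing, and the strict search for 'firefox-bin' returns only 2334. That is consistent with the guess above that the loose match was hiding a firefox versus firefox-bin mismatch.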
Reporter | ||
Updated•13 years ago
|
Blocks: hang-detector
Assignee | ||
Comment 17•13 years ago
|
||
[armenzg@dm-wwwbuild01 zips]$ rm talos.bug702351.zip
[armenzg@dm-wwwbuild01 zips]$ wget -Otalos.bug702351.zip http://people.mozilla.org/~jmaher/taloszips/5dfaf26ff78a/talos.zip
--2011-11-18 06:47:55-- http://people.mozilla.org/~jmaher/taloszips/5dfaf26ff78a/talos.zip
Resolving people.mozilla.org... 10.2.74.108
Connecting to people.mozilla.org|10.2.74.108|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6007768 (5.7M) [application/zip]
Saving to: `talos.bug702351.zip'
100%[======================================>] 6,007,768 33.2M/s in 0.2s
2011-11-18 06:47:55 (33.2 MB/s) - `talos.bug702351.zip' saved [6007768/6007768]
[armenzg@dm-wwwbuild01 zips]$ ls -l talos.zip
lrwxrwxrwx 1 armenzg build 19 Nov 16 13:41 talos.zip -> talos.bug694579.zip
[armenzg@dm-wwwbuild01 zips]$ mv talos.bug694579.zip old && rm talos.zip && ln -s talos.bug702351.zip talos.zip
[armenzg@dm-wwwbuild01 zips]$ ls -l talos.zip
lrwxrwxrwx 1 armenzg build 19 Nov 18 06:48 talos.zip -> talos.bug702351.zip
I don't want to start a reconfig since an imminent 8.0.1 go-to-build might come this morning. The talos.zip change can be reverted very easily if we need to recover from it, and it should only affect day-to-day development rather than a release.
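The commands above use `rm talos.zip && ln -s ...`, which leaves a short window in which talos.zip does not exist. As an alternative sketch (not what was run here), the link can be re-pointed by creating a temporary symlink and renaming it over the old one. The helper name repoint_symlink and the temporary link naming are assumptions; the sketch assumes a POSIX filesystem and Python 3.3+ for os.replace.

```python
import os
import tempfile


def repoint_symlink(link_path, new_target):
    """Point link_path at new_target without a gap where the link is missing.

    Hypothetical helper: builds a temporary symlink next to link_path, then
    renames it over the existing link (rename replaces atomically on POSIX).
    """
    directory = os.path.dirname(os.path.abspath(link_path))
    tmp_link = os.path.join(
        directory, ".%s.tmp.%d" % (os.path.basename(link_path), os.getpid()))
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted run
    os.symlink(new_target, tmp_link)
    os.replace(tmp_link, link_path)


if __name__ == "__main__":
    # Demo in a scratch directory; the file names only mirror the ones above.
    scratch = tempfile.mkdtemp()
    for name in ("talos.bug694579.zip", "talos.bug702351.zip"):
        open(os.path.join(scratch, name), "w").close()
    link = os.path.join(scratch, "talos.zip")
    os.symlink("talos.bug694579.zip", link)
    repoint_symlink(link, "talos.bug702351.zip")
    print(os.readlink(link))
```

Reverting is the same call with the previous target, which matches the point that the talos.zip change is easy to back out.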
Assignee | ||
Comment 18•13 years ago
|
||
Comment on attachment 575206 [details] [diff] [review]
patch to turn on responsiveness for tp on m-c (1.0)
Landed on "default":
http://hg.mozilla.org/build/buildbot-configs/rev/635eb26ac667
This will be picked up on the next reconfig.
Attachment #575206 -
Flags: checked-in+
Assignee | ||
Updated•13 years ago
|
Whiteboard: [talos] → [talos] talos.zip has been landed - waiting for next reconfig
Comment 19•13 years ago
|
||
This went to production today.
Comment 20•13 years ago
|
||
Are bug 704010, bug 703996, and bug 704380 the result of improved talos code detecting issues, or a regression in talos causing problems ? They all seem to have started after the deployment here.
Assignee | ||
Comment 21•13 years ago
|
||
Attachment #576132 -
Flags: review?(jmaher)
Assignee | ||
Comment 22•13 years ago
|
||
jmaher, I would like to back out the talos.zip for a day or two to make sure the bugs that nthomas mentioned are not caused by us.
The creation date of those bugs fits the theory.
Does that make sense?
Reporter | ||
Comment 23•13 years ago
|
||
interesting bugs and they all started showing up at the same time. Timeline:
11/18 - landed talos.zip
11/20 - 3 bugs showed up
11/21 - did reconfig
I would think something else is causing a problem here as we are only seeing these issues on m-c and inbound. If I had data points to show problems on other branches that would be more convincing that this problem is related to the talos.zip. We have improved our overall detection of processes, but the timeline and branches don't show enough evidence for me to assume talos.zip.
Another thing, don't we reboot between test runs?
For the two bugs that are related to talos, this error occurs during initial startup (no test has run yet).
Reporter | ||
Comment 24•13 years ago
|
||
Comment on attachment 576132 [details] [diff] [review]
We missed "_tests" in our patch
Review of attachment 576132 [details] [diff] [review]:
-----------------------------------------------------------------
good catch!
Attachment #576132 -
Flags: review?(jmaher) → review+
Assignee | ||
Comment 25•13 years ago
|
||
[11:10am] armenzg: jmaher: the problem is that one job gets into trouble and prevents the machine from rebooting
[11:10am] armenzg: which means that it takes the next job without having rebooted
[11:10am] armenzg: jmaher: I know it would be slowing us down but would you be OK if we backed out talos.zip for a day or two?
[11:10am] jmaher: armenzg: oh, I didn't realize that;
[11:10am] philor: if bmo cooperates, I'll be putting a couple of Windows crashes from fx-team in that bug
[11:11am] philor: but yeah, the only way talos is "responsible" is that it's now giving a clear message about why the mochitest crash is screwing it over, instead of the previous message
[11:11am] jmaher: armenzg: I don't have enough evidence to lean me towards the talos.zip
[11:12am] philor: talos has objected to running processes forever, that's not new in the responsiveness thing, is it?
[11:12am] philor: why did I ask that as a question?
[11:12am] philor: that's not new.
[11:13am] jmaher: that is not new
[11:13am] philor: the new thing is saying what process, and wlach landed that in... September?
[11:13am] jmaher: philor: correct
[11:13am] jmaher: we still check for the same processes and use the same discovery technique
[11:13am] ted: the responsiveness thing just fiddles some stuff inside the Tp5 run
[11:14am] jmaher: instead of a true/false return value, we return which process is still running
[11:15am] philor: we're seeing dwwin still running because we're crashing Windows while running mochitests
[11:15am] jmaher: philor: so if that is the case we shouldn't be backing talos.zip out
[11:16am] philor: only to get out of the blame-hose
[11:17am] philor: nobody will want to investigate a thing like bug 704010, only to look for someone to blame, that's why I didn't even file it the first time we started seeing it
[11:18am] jmaher: philor: yeah, we need to figure out how to get the slaves to reboot
[11:18am] jmaher: so our tests are clean
[11:19am] philor: jmaher: well, I'd vote for them not crashing
[11:20am] jmaher: philor: I vote for both
[11:21am] philor: but you could wallpaper over that part by doing an auto-retry when you find a running process
[11:21am] philor: sadly, "Automation Error" went away as a generic retry, not sure what else will do it
[11:25am] armenzg: philor: jmaher what would you suggest me do?
[11:25am] armenzg: I would like to post the convo on the bug
[11:26am] armenzg: philor: which of those 3 bugs you had seen before?
[11:27am] armenzg: so, the newer talos.zip is not the problem but a code change?
[11:27am] philor: armenzg: bug 704010, the M4 flavor, dying in dbaron's huge and over-verbose CSS tests
[11:28am] philor: the new talos.zip is a red herring, I think
[11:29am] philor: yeah, and I'm pretty sure your change didn't reverse time's arrow and cause the mochitest job the slave previously did to crash Windows
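The change jmaher describes in the conversation above (returning which process is still running instead of a true/false value) can be sketched as follows. This is an illustration of the idea only, not the talos implementation; the function name and arguments are hypothetical.

```python
def still_running(expected_names, running_names):
    """Return the names from expected_names that are found in running_names.

    An empty list is falsy, so callers that only want a yes/no answer still
    work, while the log message can name the offending process.
    """
    running = set(running_names)
    return [name for name in expected_names if name in running]


if __name__ == "__main__":
    leftovers = still_running(["firefox", "dwwin"], ["explorer", "dwwin"])
    if leftovers:
        print("processes still running before test start:", leftovers)
```

Naming the leftover process is what made the dwwin case visible, even though the underlying check was not new.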
Comment 26•13 years ago
|
||
Bug 429592 depended on the new talos drop, can I go ahead and land it or do I need to hold off pending some additional decision here?
Assignee | ||
Comment 27•13 years ago
|
||
Comment on attachment 576132 [details] [diff] [review]
We missed "_tests" in our patch
This landed in "default":
http://hg.mozilla.org/build/buildbot-configs/rev/de57a624cc81
There will be another scheduled reconfig on Thursday if it doesn't happen earlier.
Once it happens, we will mention it here.
Attachment #576132 -
Flags: checked-in+
Assignee | ||
Comment 28•13 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #26)
> Bug 429592 depended on the new talos drop, can I go ahead and land it or do
> I need to hold off pending some additional decision here?
I answered this on IRC. We're not expecting to backout the talos.zip.
This bug is only waiting for attachment 576132 [details] [diff] [review] to make it into production masters.
If the talos.zip were to be backed out, we would let you know.
Assignee | ||
Updated•13 years ago
|
Whiteboard: [talos] talos.zip has been landed - waiting for next reconfig → [talos] This bug is only waiting for attachment 576132 make it into production masters.
Comment 29•13 years ago
|
||
There's a load of red on inbound/m-c now, might this be the cause?
Assignee | ||
Comment 30•13 years ago
|
||
No, not related since it got deployed on Friday. Can you please remove it from the tree status?
Comment 31•13 years ago
|
||
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #30)
> No, not related since it got deployed on Friday.
Ok, was looking at the date of the followup in comment 27.
Have adjusted the tbpl status message accordingly.
Comment 32•13 years ago
|
||
This made it to production today.
Comment 33•13 years ago
|
||
(In reply to Ben Hearsum [:bhearsum] from comment #32)
> This made it to production today.
.... by which I mean the patch to buildbot-configs landed. I haven't deployed talos.zip yet.
Assignee | ||
Comment 34•13 years ago
|
||
There is no talos.zip to deploy in here anymore.
Only the buildbot-configs change had to land.
Thanks Ben.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Product: mozilla.org → Release Engineering