Bug 1050769 - upload a new mobile_tp4.zip pageset to the 3 headed remote talos server (round 2)
Status: VERIFIED FIXED
Product: Release Engineering
Classification: Other
Component: Buildduty
Version: other
Platform: ARM Android
Importance: -- normal
Target Milestone: ---
Assigned To: Simone Bruno [:simone]
: Justin Wood (:Callek)
: Chris AtLee [:catlee]
Mentors:
Depends on: 1050161
Blocks: 1026970 1051993
Reported: 2014-08-08 06:45 PDT by Joel Maher ( :jmaher)
Modified: 2014-08-19 09:33 PDT

Attachments
Content of top level folder (230.88 KB, text/plain)
2014-08-19 05:45 PDT, Simone Bruno [:simone]

Description Joel Maher ( :jmaher) 2014-08-08 06:45:09 PDT
+++ This bug was initially created as a clone of Bug #1030166 +++

^ that bug would have relevant information for updating this.

Thanks to the work of :edmorley in bug 1050161, we have another round of cleaned-up network access. Here is the updated mobile_tp4.zip:
http://people.mozilla.org/~jmaher/taloszips/zips/mobile_tp4.zip

shasum mobile_tp4.zip 
7373b491baf27dda89f47365142ba0d2ff6df1c6  mobile_tp4.zip
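
The digest above can also be checked programmatically before the zip is uploaded. The following is a minimal Python sketch, not part of the original report; the function names and the chunk size are illustrative assumptions, and only the expected digest comes from the description.

```python
import hashlib

# SHA-1 of mobile_tp4.zip as published in the description above.
EXPECTED_ZIP_SHA1 = "7373b491baf27dda89f47365142ba0d2ff6df1c6"


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """True if the file's SHA-1 equals `expected_hex` (case-insensitive)."""
    return sha1_of_file(path) == expected_hex.strip().lower()
```

Calling `matches_expected("mobile_tp4.zip", EXPECTED_ZIP_SHA1)` on the downloaded file should return True when the download is intact.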
Comment 1 Justin Wood (:Callek) 2014-08-08 13:19:13 PDT
Done per https://bugzilla.mozilla.org/show_bug.cgi?id=1030166#c1
Comment 2 Ed Morley (UK bank holiday, away until 30th May) [:emorley] 2014-08-12 03:39:42 PDT
It looks like this didn't work - in bug 1051993 I've pushed to try again, and I'm getting the same external connections as before. I've just re-downloaded the zip in comment 0 here and it does include my changes, so I think perhaps the wrong zip was uploaded to relengwebadm.private.scl3

Please could we try the upload again? :-)
Comment 3 Ed Morley (UK bank holiday, away until 30th May) [:emorley] 2014-08-18 04:33:59 PDT
(In reply to Ed Morley [:edmorley] from comment #2)
> It looks like this didn't work - in bug 1051993 I've pushed to try again,
> and I'm getting the same external connections as before. I've just
> re-downloaded the zip in comment 0 here and it does include my changes, so I
> think perhaps the wrong zip was uploaded to relengwebadm.private.scl3
> 
> Please could we try the upload again? :-)
Comment 4 Justin Wood (:Callek) 2014-08-18 21:38:09 PDT
Simone, can you take this on "today"? If not, n-i me back and I'll do it during my day.
Comment 5 Justin Wood (:Callek) 2014-08-18 21:39:27 PDT
Hey Ed,

Can you provide a single "changed file[name]" and a related shasum of said file, so we can verify against the server/extracted fileset that your change is in place as well?
Comment 6 Ed Morley (UK bank holiday, away until 30th May) [:emorley] 2014-08-19 01:07:39 PDT
(In reply to Justin Wood (:Callek) from comment #5)
> Hey Ed,
> 
> Can you provide a single "changed file[name]" and a related shasum of said
> file, so we can verify against the server/extracted fileset that your
> change is in place as well?

One of the changed files:

[/c/tpn]$ shasum amazon.com/www.amazon.com/index.html
9b132c56328821a1ab7d1c5d48d769328061f66a  amazon.com/www.amazon.com/index.html
Comment 7 Simone Bruno [:simone] 2014-08-19 01:59:02 PDT
I uploaded and extracted the zip file. The change seems to be in place:

# sha1sum amazon.com/www.amazon.com/index.html
9b132c56328821a1ab7d1c5d48d769328061f66a  amazon.com/www.amazon.com/index.html
Comment 8 Ed Morley (UK bank holiday, away until 30th May) [:emorley] 2014-08-19 03:05:18 PDT
Thank you :-)
Comment 9 Ed Morley (UK bank holiday, away until 30th May) [:emorley] 2014-08-19 03:52:58 PDT
I'm still seeing the same failures in bug 1051993's new try run.
I've gone over the relevant files with a fine toothcomb and am pretty sure I've not missed any external connections - I just think the pandas are not running the same pageset as the one that was updated in comment 7.

I think we need to confirm:
1) That the zip isn't extracting to the wrong directory structure (i.e. one level too deep or something like that, leaving us with a double pageset, one inside the other).
2) That the pandas are in fact using this server/pageset for the remote-tp4m_nochrome run
3) That there's no caching going on
Comment 10 Ed Morley (UK bank holiday, away until 30th May) [:emorley] 2014-08-19 03:54:49 PDT
For #1, could we get an |ls -al| from the top level of the talos repo, with the output added as a private file here?
Comment 11 Simone Bruno [:simone] 2014-08-19 04:16:54 PDT
:edmorley:

After extracting, the mobile_tp4 folder is located here: /data/releng/src/talos-remote/www/talos-repo/talos/page_load_test/mobile_tp4 (as per instructions provided in https://bugzilla.mozilla.org/show_bug.cgi?id=1030166#c0). The old version of that folder was also located there, so basically I just replaced it with the new version.
Comment 12 Ed Morley (UK bank holiday, away until 30th May) [:emorley] 2014-08-19 05:04:55 PDT
Yeah, I know it should have worked, and that files were overwritten, but that could just mean the previous uploads went to the wrong place too. An |ls -al| would at least help with #1 above.
Comment 13 Simone Bruno [:simone] 2014-08-19 05:45:00 PDT
Created attachment 8475118 [details]
Content of top level folder
Comment 14 Joel Maher ( :jmaher) 2014-08-19 05:51:19 PDT
so the top level folder shows this:
./talos/page_load_test:
total 64
drwxr-xr-x 13 root root 4096 Aug 19 01:59 .
drwxr-xr-x 14 root root 4096 Jun 25 13:44 ..
drwxr-xr-x  2 root root 4096 Jun 20  2013 a11y
drwxr-xr-x  4 root root 4096 May  2 09:57 canvasmark
drwxr-xr-x  3 root root 4096 Dec  1  2011 dhtml
drwxr-xr-x  5 root root 4096 May  2 09:57 dromaeo
drwxr-xr-x  2 root root 4096 Oct  4  2012 kraken
lrwxrwxrwx  1 root root   19 Apr 24 12:04 mobile_tp4 -> ../../../mobile_tp4
-rw-r--r--  1 root root 4281 Jun 29  2012 quit.js
drwxr-xr-x  2 root root 4096 Jun 25 13:44 scroll
drwxr-xr-x  3 root root 4096 Dec  1  2011 svg
drwxr-xr-x  2 root root 4096 Dec  1  2011 svg_opacity
drwxr-xr-x  3 root root 4096 Jul 19  2013 svgx
drwxr-xr-x  4 root root 4096 Jun 25 13:44 tart
lrwxrwxrwx  1 root root   12 Apr 24 12:04 tp4 -> ../../../tp4
-rw-r--r--  1 root root 1781 Dec  1  2011 tp4m.manifest
drwxr-xr-x  2 root root 4096 Oct  4  2012 v8_7

but we want to see the contents of the mobile_tp4 directory.
Comment 15 Ed Morley (UK bank holiday, away until 30th May) [:emorley] 2014-08-19 06:03:11 PDT
Might unzip not be handling the symlinks correctly?

If I've understood the paths in previous comments correctly, the actual pageset is at:

/data/releng/src/talos-remote/www/talos-repo/talos/page_load_test/../../../mobile_tp4

ie:
/data/releng/src/talos-remote/www/mobile_tp4
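
The path arithmetic above follows from how relative symlinks work: the target is interpreted relative to the directory containing the link. A minimal Python sketch of that resolution is given here as an illustration; the helper name is hypothetical, and it resolves paths purely lexically, so on the real host `os.path.realpath` would be the authoritative check because it also follows any intermediate symlinks.

```python
import posixpath


def resolve_relative_symlink(link_path, target):
    """Lexically resolve a symlink `target` for a link located at `link_path`.

    A relative target is joined onto the directory that contains the link;
    an absolute target is only normalised.
    """
    if posixpath.isabs(target):
        return posixpath.normpath(target)
    link_dir = posixpath.dirname(link_path)
    return posixpath.normpath(posixpath.join(link_dir, target))


# Link location and target taken from the directory listing in comment 14.
LINK = "/data/releng/src/talos-remote/www/talos-repo/talos/page_load_test/mobile_tp4"
print(resolve_relative_symlink(LINK, "../../../mobile_tp4"))
# prints /data/releng/src/talos-remote/www/mobile_tp4
```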
Comment 16 Simone Bruno [:simone] 2014-08-19 06:14:00 PDT
Oh! I just extracted the contents of the zip file into /data/releng/src/talos-remote/www/mobile_tp4
Comment 17 Ed Morley (UK bank holiday, away until 30th May) [:emorley] 2014-08-19 06:16:26 PDT
By "I just extracted" do you mean in previous comments you had extracted the zip there, or you've just re-extracted now to this location? :-)
Comment 18 Joel Maher ( :jmaher) 2014-08-19 06:17:50 PDT
we also want to make sure that the unzipping puts the data in mobile_tp4, not mobile_tp4/mobile_tp4.
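
Whether unzipping produces mobile_tp4 or mobile_tp4/mobile_tp4 depends on whether the archive already wraps its files in a top-level mobile_tp4 folder. The following Python sketch inspects the archive before extraction; it is an illustration only, the function names and the "wrapped"/"flat" labels are assumptions, and the thread does not state which layout the real zip has.

```python
import zipfile
from pathlib import PurePosixPath


def top_level_names(zip_source):
    """Return the set of first path components across all archive members."""
    with zipfile.ZipFile(zip_source) as archive:
        return {
            PurePosixPath(name).parts[0]
            for name in archive.namelist()
            if name.strip("/")  # skip empty or root-only names
        }


def extraction_layout(zip_source, pageset_dir="mobile_tp4"):
    """Classify the archive so the right extraction directory can be chosen.

    "wrapped": everything sits under a single `pageset_dir` folder, so the
               zip must be extracted in the parent of `pageset_dir`.
    "flat":    pages sit at the archive root, so the zip must be extracted
               inside `pageset_dir` itself.
    """
    if top_level_names(zip_source) == {pageset_dir}:
        return "wrapped"
    return "flat"
```

Extracting a "wrapped" archive from inside an existing mobile_tp4 directory is what would create the nested mobile_tp4/mobile_tp4 layout.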
Comment 19 Simone Bruno [:simone] 2014-08-19 06:22:56 PDT
"just extracted" means I have re-extracted it now to this location (sorry for the ambiguity).

Data is now under /data/releng/src/talos-remote/www/mobile_tp4 (not mobile_tp4/mobile_tp4)

I don't think this re-extraction will solve the problem, though, since the check in comment 7 was performed in /data/releng/src/talos-remote/www/talos-repo/talos/page_load_test/mobile_tp4.
Comment 20 Ed Morley (UK bank holiday, away until 30th May) [:emorley] 2014-08-19 06:59:43 PDT
Thank you - I've retriggered the jobs again, but as you say, if the shasum worked fine then that was unlikely to be the issue.

Failing that I guess this leaves:

(In reply to Ed Morley [:edmorley] from comment #9)
> 2) That the pandas are in fact using this server/pageset for the
> remote-tp4m_nochrome run
> 3) That there's no caching going on
Comment 21 Ed Morley (UK bank holiday, away until 30th May) [:emorley] 2014-08-19 07:03:23 PDT
https://tbpl.mozilla.org/php/getParsedLog.php?id=46259039&full=1&branch=try
06:52:07     INFO -  08-19 06:50:57.085 I/Gecko   ( 2199): FATAL ERR_R: Non-local network connections are disabled and a connection attempt to g-ecx.images-amazon.com (54.239.132.83) was made.

I guess we could do something radical, like rename the pageset directory for a few mins and see if we get any failures? That would rule out us modifying the wrong location?
Comment 22 Ed Morley (UK bank holiday, away until 30th May) [:emorley] 2014-08-19 07:04:19 PDT
Justin, any ideas? :-)
Comment 23 Ed Morley (UK bank holiday, away until 30th May) [:emorley] 2014-08-19 08:14:17 PDT
15:17 <simone|buildduty> edmorley|sheriffduty: I am ready to try the "rename the pageset directory for a few minutes" idea, if you want
15:21 <edmorley|sheriffduty> simone|buildduty: sure let's give it a go :-)
15:28 <simone|buildduty> edmorley|sheriffduty: page_load_test/mobile_tp4 renamed to m_tp4_

These jobs completed successfully after that:
https://tbpl.mozilla.org/php/getParsedLog.php?id=46262984&tree=Try
https://tbpl.mozilla.org/php/getParsedLog.php?id=46264271&tree=Try
https://tbpl.mozilla.org/php/getParsedLog.php?id=46264306&tree=Try

So the panda talos jobs aren't using the symlink at /data/releng/src/talos-remote/www/talos-repo/talos/page_load_test/mobile_tp4, when they refer to URLs such as:
http://talos-remote.pvt.build.mozilla.org/page_load_test/mobile_tp4/amazon.com/www.amazon.com/index.html
Comment 24 Justin Wood (:Callek) 2014-08-19 08:28:57 PDT
(In reply to Ed Morley [:edmorley] from comment #23)
> 15:17 <simone|buildduty> edmorley|sheriffduty: I am ready to try the "rename
> the pageset directory for a few minutes" idea, if you want
> 15:21 <edmorley|sheriffduty> simone|buildduty: sure let's give it a go :-)
> 15:28 <simone|buildduty> edmorley|sheriffduty: page_load_test/mobile_tp4
> renamed to m_tp4_
> 
> These jobs completed successfully after that:
> https://tbpl.mozilla.org/php/getParsedLog.php?id=46262984&tree=Try

This one at least seems to have failed...

07:44:37     INFO -  INFO : RSS: Main: 149086208
07:44:37     INFO -  Cycle 1(1): loaded http://talos-remote.pvt.build.mozilla.org/page_load_test/mobile_tp4/news.google.com/news.google.com/index.html (next: http://talos-remote.pvt.build.mozilla.org/page_load_test/mobile_tp4/m.news.google.com/news.google.com/index.html)
07:44:37     INFO -  RSS: Main: 167141376
07:44:37     INFO -  Cycle 1(1): loaded http://talos-remote.pvt.build.mozilla.org/page_load_test/mobile_tp4/m.news.google.com/news.google.com/index.html (next: http://talos-remote.pvt.build.mozilla.org/page_load_test/mobile_tp4/amazon.com/www.amazon.com/index.html)
07:44:37     INFO -  RSS: Main: 165490688
07:44:37     INFO -  __startBeforeLaunchTimestamp1408459391330__endBeforeLaunchTimestamp
07:44:37     INFO -  __startAfterTerminationTimestamp1408459417181__endAfterTerminationTimestamp
07:44:37     INFO -  Failed tp4m:
07:44:37     INFO -  		Stopped Tue, 19 Aug 2014 07:44:37
07:44:37    ERROR -  Traceback (most recent call last):


> So the panda talos jobs aren't using the symlink at
> /data/releng/src/talos-remote/www/talos-repo/talos/page_load_test/mobile_tp4,
> when they refer to URLs such as:
> http://talos-remote.pvt.build.mozilla.org/page_load_test/mobile_tp4/amazon.com/www.amazon.com/index.html

Which makes me think this is false
Comment 25 Ed Morley (UK bank holiday, away until 30th May) [:emorley] 2014-08-19 08:30:36 PDT
(In reply to Justin Wood (:Callek) from comment #24)
> (In reply to Ed Morley [:edmorley] from comment #23)
> > 15:17 <simone|buildduty> edmorley|sheriffduty: I am ready to try the "rename
> > the pageset directory for a few minutes" idea, if you want
> > 15:21 <edmorley|sheriffduty> simone|buildduty: sure let's give it a go :-)
> > 15:28 <simone|buildduty> edmorley|sheriffduty: page_load_test/mobile_tp4
> > renamed to m_tp4_
> > 
> > These jobs completed successfully after that:
> > https://tbpl.mozilla.org/php/getParsedLog.php?id=46262984&tree=Try
> 
> This one at least seems to have failed...

For a non-local connection (this is a try run of the bug 1051993 patch), see the full log:

07:44:37     INFO -  08-19 14:43:29.697 I/GeckoDump( 2192): Cycle 1(1): loaded http://talos-remote.pvt.build.mozilla.org/page_load_test/mobile_tp4/m.news.google.com/news.google.com/index.html (next: http://talos-remote.pvt.build.mozilla.org/page_load_test/mobile_tp4/amazon.com/www.amazon.com/index.html)
07:44:37     INFO -  08-19 14:43:29.697 I/GeckoDump( 2192):
07:44:37     INFO -  08-19 14:43:30.869 I/GeckoDump( 2192): RSS: Main: 165490688
07:44:37     INFO -  08-19 14:43:30.869 I/GeckoDump( 2192):
07:44:37     INFO -  08-19 14:43:30.986 D/GeckoSuggestedSites( 2192): Number of suggested sites: 4
07:44:37     INFO -  08-19 14:43:31.267 D/GeckoSuggestedSites( 2192): Number of suggested sites: 4
07:44:37     INFO -  08-19 14:43:31.283 D/GeckoSuggestedSites( 2192): Number of suggested sites: 4
07:44:37     INFO -  08-19 14:43:31.439 V/GeckoFavicons( 2192): Cancelling favicon load 36.
07:44:37     INFO -  08-19 14:43:31.712 I/SUTAgentAndroid( 1889): 10.26.128.20 : activity
07:44:37     INFO -  08-19 14:43:32.048 I/Gecko   ( 2192): FATAL ERR_R: Non-local network connections are disabled and a connection attempt to g-ecx.images-amazon.com (54.230.119.189) was made.

So this is still true:
> > So the panda talos jobs aren't using the symlink at
> > /data/releng/src/talos-remote/www/talos-repo/talos/page_load_test/mobile_tp4,
> > when they refer to URLs such as:
> > http://talos-remote.pvt.build.mozilla.org/page_load_test/mobile_tp4/amazon.com/www.amazon.com/index.html
Comment 26 Ed Morley (UK bank holiday, away until 30th May) [:emorley] 2014-08-19 08:31:57 PDT
If we were using this pageset, renaming it should result in 404s, not successfully loading the page & hitting the external network.
Comment 27 Justin Wood (:Callek) 2014-08-19 08:38:57 PDT
BAH!!!!

So, it looks like we effectively have broken docs/made a mistake. While we did update the location and the zip file, we *did not* deploy it to the webheads, only to the admin node.

(I had to run ./update at /data/releng/src/talos-remote)

I'm going to get a task on myself to update our docs *today*.

That said, this bug should now be fixed, see before/after:

[jwood@foopy72.p5.releng.scl3.mozilla.com ~]$ curl http://talos-remote.pvt.build.mozilla.org/page_load_test/mobile_tp4/amazon.com/www.amazon.com/index.html 2>/dev/null | shasum
95e7711e074e22c4df1bae8fd0fd23fdb6a2f3b3  -
[jwood@foopy72.p5.releng.scl3.mozilla.com ~]$ curl http://talos-remote.pvt.build.mozilla.org/page_load_test/mobile_tp4/amazon.com/www.amazon.com/index.html 2>/dev/null | shasum
9b132c56328821a1ab7d1c5d48d769328061f66a  -
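
The before/after check above fetches the page through the public hostname, so it exercises what the webheads actually serve rather than the files on the admin node. A minimal Python sketch of the same comparison follows; the function names and the timeout are illustrative assumptions, while the URL and digests are the ones quoted in this bug.

```python
import hashlib
import urllib.request

# URL and digests as quoted in this bug.
PAGE_URL = (
    "http://talos-remote.pvt.build.mozilla.org/page_load_test/"
    "mobile_tp4/amazon.com/www.amazon.com/index.html"
)
STALE_SHA1 = "95e7711e074e22c4df1bae8fd0fd23fdb6a2f3b3"    # served before ./update
CURRENT_SHA1 = "9b132c56328821a1ab7d1c5d48d769328061f66a"  # expected after ./update


def sha1_of_url(url, timeout=30):
    """Fetch `url` and return the hex SHA-1 digest of the response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return hashlib.sha1(response.read()).hexdigest()


def served_copy_is_current(url, expected_sha1):
    """True if the content served at `url` hashes to `expected_sha1`."""
    return sha1_of_url(url) == expected_sha1.strip().lower()
```

From a host that can reach the server, `served_copy_is_current(PAGE_URL, CURRENT_SHA1)` returning False would suggest that the webheads are still serving an older copy.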
Comment 28 Ed Morley (UK bank holiday, away until 30th May) [:emorley] 2014-08-19 09:15:33 PDT
Thank you - has worked :-)

(green runs on https://tbpl.mozilla.org/?tree=Try&rev=863127ae067d)
Comment 29 Justin Wood (:Callek) 2014-08-19 09:33:29 PDT
(In reply to Justin Wood (:Callek) from comment #27)
> BAH!!!!
> 
> I'm going to get a task on myself to update our docs *today*.

Untested doc:
https://wiki.mozilla.org/index.php?title=ReleaseEngineering%2FBuildduty%2FOther_Duties&action=historysubmit&diff=1007067&oldid=1000544
