Closed Bug 1255582 Opened 5 years ago Closed 4 years ago

come up with a way to measure the infrastructure so we have more data when random issues crop up

Categories

(Infrastructure & Operations :: MOC: Problems, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: Usul)

References

Details

We continue to hit issues with our Talos tests where they become highly unstable, either on weekends or around random events.  These events are tied 100% to infrastructure, not to a specific in-tree patch.

Infrastructure can mean:
* buildbot
* taskcluster
* puppet
* network
* power
* hardware
* new software on machines
* stability of the OS (reboots, etc.)
* cpu/disk/ram of the OS

We have seen cases in the past where a test fills up the disk and starts affecting all subsequent tests on that platform (for example, Android tests not cleaning up after themselves properly).

We have patterns of different behaviour on weekends vs. weekdays:
https://elvis314.wordpress.com/2014/10/30/a-case-of-the-weekends/

We have seen a 36-hour spike in data noise which resolved itself for no apparent reason (bug 1253715).

There are other unresolved and unknown issues as well.

I do know that AWFY (http://arewefastyet.com/) runs a lot of the exact same tests with almost no noise.  The difference is that they are run on the same machine, under a desk.

To make this actionable, I believe we need the following data:
* network packet rate (by type)
* uptime/reboot logs graphed for each machine
* cpu of each machine over time
* memory usage of each machine over time
* disk usage of each machine over time

Having this data at some regular resolution (every minute, or every few minutes) would give us a lot more to work with and something to help us rule things out and focus investigations in the future; a rough sketch of the kind of per-machine sampling I have in mind is below.
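
To make the idea concrete, here is a minimal sketch (Python, assuming psutil is installed on the test machines) of a collector that samples these numbers once a minute and pushes them to Graphite over carbon's plaintext protocol.  The Graphite endpoint, the metric prefix, and the metric names are placeholders I invented, not an existing setup:

# collect_host_metrics.py -- hedged sketch, not a finished collector.
# Assumes psutil is installed and a carbon plaintext listener is reachable;
# the host, port, and metric prefix below are placeholders.
import socket
import time

import psutil

GRAPHITE_HOST = "graphite.example.com"   # placeholder, not a real endpoint
GRAPHITE_PORT = 2003                     # carbon plaintext protocol default
PREFIX = "hosts.%s" % socket.gethostname().replace(".", "_")
INTERVAL = 60                            # one sample per minute

def sample():
    """Return the per-machine numbers called out in the list above."""
    net = psutil.net_io_counters()
    return {
        "cpu.percent": psutil.cpu_percent(interval=1),
        "mem.percent": psutil.virtual_memory().percent,
        "disk.root.percent": psutil.disk_usage("/").percent,
        "uptime.seconds": int(time.time() - psutil.boot_time()),
        # raw packet counters; Graphite can turn these into rates at render
        # time (e.g. with nonNegativeDerivative)
        "net.packets_sent": net.packets_sent,
        "net.packets_recv": net.packets_recv,
    }

def send(metrics):
    """Push one batch of samples using carbon's plaintext protocol."""
    now = int(time.time())
    lines = ["%s.%s %s %d" % (PREFIX, name, value, now)
             for name, value in metrics.items()]
    payload = ("\n".join(lines) + "\n").encode("ascii")
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5) as sock:
        sock.sendall(payload)

if __name__ == "__main__":
    while True:
        send(sample())
        time.sleep(INTERVAL)

Whether this ends up being a small script like the above or an existing agent (collectd, Diamond, etc.) matters less than having a consistent per-minute series for every machine that we can overlay against the Talos noise graphs.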
Hi Joel,

We already have tens of thousands of monitoring points in place today.  What I'm personally unsure of is the talos architecture (what is talking to what, how and when) to know how best to use those monitors in troubleshooting this.

I'm going to loop in Linda here who manages the MOC and owns our monitoring in case there is something to add - in the meantime if you can point us to the workflow that is getting tripped up here we can look into it.
Assignee: infra → nobody
Component: Infrastructure: Other → MOC: Problems
QA Contact: cshields → lypulong
(In reply to Corey Shields [:cshields] from comment #1)
> Hi Joel,
> 
> We already have tens of thousands of monitoring points in place today.  What
> I'm personally unsure of is the talos architecture (what is talking to what,
> how and when) to know how best to use those monitors in troubleshooting this.
> 
> I'm going to loop in Linda here who manages the MOC and owns our monitoring
> in case there is something to add - in the meantime if you can point us to
> the workflow that is getting tripped up here we can look into it.

Thanks Corey for the heads up - Joel, I am going to assign this bug to one of my engineers, who will work with you to understand the underlying systems, architecture, and behavior, so we can put together a monitoring plan with appropriate fix actions, escalations, a communication plan, and an understanding of impact when issues arise.
Assignee: nobody → ludovic
Sent an invite for a talk next week over Vidyo.
Thanks all for jumping on this.  Ludovic and I have a meeting set up next week to walk through this.

We might have a lot of this data already; if so, maybe we have been looking in the wrong areas in the past.  Looking forward to figuring out a way to monitor this information.

As a note, some of AWFY runs in the datacenter (I had called it out as running under a desk somewhere - that is only partially true).
Example machines (Linux is the worst offender, OS X has had some, Windows seems less so):
talos-linux64-ix-015 
talos-linux64-ix-030
talos-linux64-ix-036
t-yosemite-r7-0102
t-yosemite-r7-0044
t-yosemite-r7-0077
t-xp32-ix-070 
t-xp32-ix-054 
t-xp32-ix-121 


Graphite could have a lot of information in it.
Papertrail has syslog data for OS X/Linux - we could determine reboots from that.  It also appears that we have uptime in the Graphite database; the interface to Graphite is:
https://graphite-scl3.mozilla.org/  (VPN is needed)
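
For reference, one way to pull a series back out of Graphite is its render API with format=json; a hedged sketch is below.  The metric path is a guess at how host data might be namespaced, not the real path in graphite-scl3, and it assumes the requests library plus VPN access:

# graphite_pull.py -- sketch of reading one series from Graphite's render API.
# The metric path is hypothetical; the real namespace in graphite-scl3 may differ.
import requests

GRAPHITE_URL = "https://graphite-scl3.mozilla.org/render"
TARGET = "hosts.talos-linux64-ix-015.cpu.percent"  # placeholder metric path

resp = requests.get(
    GRAPHITE_URL,
    params={"target": TARGET, "from": "-7d", "format": "json"},
    timeout=30,
)
resp.raise_for_status()

for series in resp.json():
    # datapoints are [value, unix_timestamp] pairs; value is None for gaps
    points = [(ts, val) for val, ts in series["datapoints"] if val is not None]
    print(series["target"], "has", len(points), "non-null points over the last week")

Reboots could then be inferred either from the Papertrail syslog entries or from drops in the uptime series.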
Joel, are you getting the things you want/need?
Flags: needinfo?(jmaher)
Depends on: 1271948
Flags: needinfo?(jmaher)
Today I had the opportunity to dive in.  I did find out that the CPU is higher during the periods of noisy data (with some help from :arr and :catlee), but I still have no idea what is going on.

I think we need to go back to the drawing board and figure this out.  I have no idea what is using the CPU when looking at the data in the Graphite database, and it was a lot of hacky work for :arr to get me access to the machine.
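
For whoever picks this up next, the check done here by hand amounts to comparing average CPU inside a suspected noisy window against the rest of the series.  A small sketch, assuming the (timestamp, value) pairs come from a render-API query like the one above; the window bounds and example values are placeholders, not real data:

# Hedged sketch: compare average CPU inside a suspected noisy window with the
# rest of the series.  `points` is a list of (unix_timestamp, cpu_percent)
# pairs; the window bounds are placeholders for whatever period the Talos
# graphs show as noisy.
from statistics import mean

def cpu_in_vs_out(points, window_start, window_end):
    inside = [v for t, v in points if window_start <= t < window_end]
    outside = [v for t, v in points if not (window_start <= t < window_end)]
    if inside and outside:
        print("cpu inside window: %.1f%% (n=%d), outside: %.1f%% (n=%d)"
              % (mean(inside), len(inside), mean(outside), len(outside)))

# example call (hypothetical epoch bounds for a 36-hour window):
# cpu_in_vs_out(points, 1457000000, 1457000000 + 36 * 3600)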
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED