Closed Bug 976138 Opened 11 years ago Closed 11 years ago

stackato apps being marked as down by router2g

Categories

(Infrastructure & Operations :: IT-Managed Tools, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cturra, Assigned: cturra)

Details

this bug is to track the investigation of why some applications are being marked as down by the router node in the production stackato cluster. this doesn't appear to be happening in our dev cluster, which is running the same version of stackato in an almost identical configuration. there have been a couple of bugs on this topic in the past, bug 972589 is the latest for reference. i will be working with activestate to investigate this further.
Let me know if there's anything I can do to help debug - like ping you on IRC when it's down or something.. There's a CRON job (hopefully) set up - a Python script that runs to pull data out of Bugzilla and cache it locally. The only potential reason I've been able to think of is perhaps that CRON job sometimes fails and the framework marks the site as down as a result?
This is getting really frustrating - today the site really was down almost every time I tried to load it.
(In reply to Hallvord R. M. Steen from comment #2)
> This is getting really frustrating - today the site really was down
> almost every time I tried to load it.

i totally understand and am working with activestate on this. bear with us - we'll get this sorted!
activestate has suggested updating our `procps` system packages across the DEAs in the prod cluster. i will be doing a rolling upgrade of those today to see if this sorts us out. Hallvord - currently, the only app that is being repeatedly marked as down is yours (tho, we've seen others in the past). can you point me to the repo for this project so i can also run a copy of the app? i'd like to see if i can dig any deeper into that also.
as promised earlier, i have gone through all the dea/stager nodes and upgraded procps. they're now all running version 1:3.2.8-11ubuntu6.3. i will continue to monitor to see if this makes a difference :)

# dpkg -l | grep procps
ii procps 1:3.2.8-11ubuntu6.3 /proc file system utilities
unfortunately this package upgrade has not resolved our issue. i have scheduled some time tomorrow morning to review our production environment onsite at activestate's office. will report back on the outcome of that meeting.
The code for this site is here https://github.com/hallvors/mobilewebcompat/ I don't think I have anything significant locally that isn't on Github. This script runs as a CRON job every 2 hours or so to fetch data from Bugzilla and re-build the JSON file that contains all the data the site presents: https://github.com/hallvors/mobilewebcompat/blob/master/preproc/buildlists.py Might some failure here cause the framework to mark the site as being down? It should be perfectly fine to just try again two hours later if something fails.. Just means the data will be slightly more stale than usual but most likely nobody will even notice..
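For the "just try again two hours later" behaviour described above, here is a minimal Python sketch of one way a periodic job can keep serving the previous data when a run fails. It is not taken from buildlists.py: the names `write_json_atomically`, `rebuild_cache` and the `fetch` callable are hypothetical, and the approach (write to a temporary file, then swap it into place) is an assumption about how such a job could be made safe, not a description of the actual script.

```python
import json
import os
import tempfile


def write_json_atomically(data, path):
    """Write data as JSON to path without ever exposing a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(data, handle)
        # os.replace swaps the new file in as a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file and leave the old data untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def rebuild_cache(fetch, path):
    """Run fetch() and store its result; on failure keep the old file.

    fetch is a hypothetical callable standing in for the Bugzilla query.
    Returns True if the cache was refreshed, False if the run was skipped.
    """
    try:
        data = fetch()
    except Exception as exc:
        print("fetch failed, keeping previous data: %s" % exc)
        return False
    write_json_atomically(data, path)
    return True
```

With this shape, a failed run only means the JSON file is a bit more stale until the next scheduled attempt.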
i spent some time working onsite with activestate this morning and we've found the root cause. essentially, when puppet was running it was blowing away the iptables nat rules on the dea nodes. this caused the router to mark the nodes as down since it was indeed losing its network routes to the lxc containers. i will work with infra to sort this out and report back. in the meantime, i have disabled puppet so this will no longer happen.
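As an illustration of how this kind of breakage could be spotted, here is a small Python sketch that counts rules in the nat table of `iptables-save` style output. The function name `count_nat_rules` and the sample rule text are hypothetical, and it assumes the container routes show up as DNAT rules; the actual rules Stackato installs are not listed in this bug.

```python
def count_nat_rules(iptables_save_text, target="DNAT"):
    """Count appended rules in the nat table that jump to the given target."""
    in_nat_table = False
    count = 0
    for raw_line in iptables_save_text.splitlines():
        line = raw_line.strip()
        if line.startswith("*"):
            # A line such as "*nat" or "*filter" starts a new table section.
            in_nat_table = (line == "*nat")
        elif in_nat_table and line.startswith("-A "):
            tokens = line.split()
            if "-j" in tokens:
                position = tokens.index("-j")
                if position + 1 < len(tokens) and tokens[position + 1] == target:
                    count += 1
    return count


# Hypothetical sample output; addresses and ports are made up.
HEALTHY = """\
*nat
:PREROUTING ACCEPT [0:0]
-A PREROUTING -p tcp -m tcp --dport 41001 -j DNAT --to-destination 10.0.3.2:80
-A PREROUTING -p tcp -m tcp --dport 41002 -j DNAT --to-destination 10.0.3.3:80
COMMIT
*filter
:INPUT ACCEPT [0:0]
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
COMMIT
"""

FLUSHED = """\
*nat
:PREROUTING ACCEPT [0:0]
COMMIT
"""

print(count_nat_rules(HEALTHY))
print(count_nat_rules(FLUSHED))
```

On a real node the input could come from running `iptables-save -t nat`. A count that drops to zero right after a configuration management run would point at the same cause found here.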
good news! i have worked with the infra team and we did a bunch of work around how puppet managed users/internal nat routes/etc and have now deployed new host management that does *NOT* remove any necessary stackato services. this has now been running for a bit without any observed negative impact to the stackato applications and services. marking as r/fixed :)
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED