Closed
Bug 510952
Opened 15 years ago
Closed 14 years ago
more n810 monitoring
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: mozilla, Assigned: mozilla)
Attachments
(3 files)
5.33 KB, application/octet-stream
272 bytes, application/octet-stream
2.39 KB, text/plain (jhford: review+, mozilla: checked-in+)
* is standalone.txt there
* is buildbot running
** is buildbot connected to a master? which one?
*** does the master think it's connected?
* is hostname valid
* disk usage
* /media/mmc2 read-only
* uptime
* system time

possibly more. that's a good start.
Comment 1•15 years ago (Assignee)
* free: both free mem & verify that swap is there
Comment 2•15 years ago (Assignee)
I'm thinking a shell script that we ssh in and run 1/hr? 1/day? jhford is thinking about putting in start.sh and pinging a cgi and monitoring last heard from timestamp. Whatever works...
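A rough sketch of the "last heard from" idea, in case it helps the discussion. It assumes each device pings some endpoint at startup (and possibly periodically) and the server keeps a per-device timestamp; the function names, the in-memory dict, and the one-hour threshold are all hypothetical, not anything that exists in our tooling.

```python
import time


def record_ping(last_heard, device, now=None):
    """Store the time a device last checked in (hypothetical CGI handler body)."""
    last_heard[device] = time.time() if now is None else now


def stale_devices(last_heard, max_age_s, now=None):
    """Return the sorted names of devices not heard from within max_age_s seconds."""
    now = time.time() if now is None else now
    return sorted(name for name, ts in last_heard.items() if now - ts > max_age_s)


if __name__ == "__main__":
    seen = {}
    record_ping(seen, "maemo-n810-01")
    # Nothing should be stale immediately after a ping.
    print(stale_devices(seen, max_age_s=3600))
```

The trade-off versus ssh-ing in is that a heartbeat only tells us the device is alive and on the network; the per-device checks would still need something running on the device itself.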
Assignee: aki → jford
Comment 3•15 years ago (Assignee)
* is it connected to power
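For illustration, here is a minimal sketch of how several of the on-device checks listed so far might look if run locally on the device. It assumes a Linux /proc filesystem and a Python interpreter on the n810, and the location of standalone.txt is a placeholder; the buildbot and power checks are left out because how to query them is not established in this bug.

```python
import os
import shutil
import socket
import time


def parse_meminfo(text):
    """Return {field: kB} parsed from /proc/meminfo-style text."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def is_read_only(mounts_text, mount_point):
    """True/False if mount_point is mounted ro/rw in /proc/mounts-style text, None if absent."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            return "ro" in fields[3].split(",")
    return None


def parse_uptime(text):
    """Return uptime in seconds from /proc/uptime-style text."""
    return float(text.split()[0])


def read_text(path):
    with open(path) as handle:
        return handle.read()


def collect(standalone_path="standalone.txt", mount_point="/media/mmc2"):
    """Gather the checks into one dict. standalone_path is a hypothetical location."""
    meminfo = parse_meminfo(read_text("/proc/meminfo"))
    try:
        disk = shutil.disk_usage(mount_point)
    except OSError:
        disk = None  # mount point missing or unreadable
    return {
        "standalone_present": os.path.exists(standalone_path),
        "hostname": socket.gethostname(),
        "disk_free_bytes": disk.free if disk else None,
        "read_only": is_read_only(read_text("/proc/mounts"), mount_point),
        "uptime_s": parse_uptime(read_text("/proc/uptime")),
        "system_time": time.time(),
        "mem_free_kb": meminfo.get("MemFree"),
        "swap_total_kb": meminfo.get("SwapTotal"),  # 0 or None suggests swap is missing
    }
```

Keeping the parsers separate from the file reads means the same functions could be fed output captured over ssh from the master instead of being run on the device.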
Comment 4•15 years ago
This is a log scraper to tell us some basic things. Some areas for enhancement are:
- Storing information in a sqlite db
- Implementing a JSON interface on /buildslaves
- web interface
Comment 5•15 years ago
This script uses the scraper to give a quick overview of production's health
Comment 6•15 years ago
These scripts are running at http://maemo-flashing.mv.mozilla.com/n810-production-list and http://maemo-flashing.mv.mozilla.com/n810-status.txt every 5 minutes. You will need VPN access to the MV office to see the files.
Comment 7•15 years ago
Moved these scripts to mobile-master at http://mobile-master/n810-production-lists and http://mobile-master/n810-status, and I am turning off the maemo-flashing ones.
Comment 8•15 years ago (Assignee)
Bob -- this is the list I've got so far. There may be some new things, like is the filesystem/profile corrupt. And you won't have to worry about the buildbot specific ones. I'm going to guess you've got most of this already.
Comment 9•15 years ago
As we are now a lot more stable and have some device status reporting, I am going to call this fixed. We are getting more than a week out of each re-image, and we have some reporting scripts running to find bad n810s. If this becomes a major problem again, I will be happy to do some more work on the monitoring.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Comment 10•15 years ago (Assignee)
I think we need these. The scraper is handy, but has a number of issues:
* doesn't work when one of the buildbot masters is down (empty list)
* no history
* only determines whether the device is connected & heard from recently
** idle devices are seen as "down"
** doesn't detect things like corrupt filesystems

We need these, but who does the work is definitely negotiable; I can pick this up if you're not interested.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 11•15 years ago (Assignee)
re: is hostname valid, I just found maemo-n810-14 had a hostname of 'maemo-n810-11' which was causing all sorts of issues on staging. Probably my f-up, but I'd like to catch that sort of thing. Again, not something that would be caught by the scraper.
Comment 12•15 years ago
In response to these requests I have added the following commands to the SUTAgent:
* devinfo uptime - returns the current uptime for the device.
* devinfo systime - returns the current system time as set on the device.
* prune <directory> - deletes the directory passed in even if not empty.
* dirwritable <directory> - validates that a temp file can be created in the directory passed in.

To test for the existence of a file such as "standalone.txt", you can execute the ls command in the correct directory and check for the file in the returned list. The "disk" command will return total and free disk space. The "devinfo memory" command will return the installed, available, and used memory on the device. Typing help in a telnet window attached to the device on port 20701 will give you a complete list of the commands currently available. This is true for any device running the SUTAgent (currently WinMo and WinCE).
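As a hedged illustration of scripting these commands instead of typing them into telnet, something like the following might work. Only the port (20701) and the command names come from this comment; the newline terminator, the absence of any prompt handling, and the assumption that ls returns one name per line are guesses that would need checking against a real agent. The helper names are made up.

```python
import socket


def exchange(sock, command):
    """Send one command on a connected socket and read until timeout or close."""
    sock.sendall((command + "\n").encode("ascii"))  # terminator is an assumption
    chunks = []
    while True:
        try:
            data = sock.recv(4096)
        except socket.timeout:
            break  # no more output within the timeout window
        if not data:
            break  # peer closed the connection
        chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")


def sut_command(host, command, port=20701, timeout=5.0):
    """Connect to an agent and return the raw text it sends back for one command."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return exchange(sock, command)


def file_in_listing(listing, name):
    """Check for a file name in ls output, assuming one entry per line."""
    return name in (line.strip() for line in listing.splitlines())
```

A call such as `sut_command("some-device", "devinfo uptime")` would return whatever the agent prints, including any banner or prompt text, so callers would likely need to strip that before parsing.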
Comment 13•15 years ago
I have the script I was working on for Maemo SD V5 which can be modified to do these checks. For the host name stuff, it is a matter of running dig -x on the device and parsing the output. We could set the host names automatically at boot which would remove any human element involved in flashing.
Comment 14•15 years ago (Assignee)
Bob -- thank you. Not sure how in sync your version and blassey's versions are, but we can deal with that. John -- cool. My main objection to switching over was losing functionality; if the script has all the functionality we want, then we should go with that. re: dig -x, we should fail gracefully when we add devices that aren't in dns/static ip yet... for new devices especially.
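To make the "fail gracefully" point concrete, here is a small sketch of a hostname check built on dig -x. It assumes dig is installed where the check runs and uses the +short flag to get just the PTR names; the three result strings and the function names are hypothetical. A device with no reverse DNS entry yet comes back as "unknown" rather than as a failure.

```python
import subprocess


def reverse_lookup(ip):
    """Return the raw output of `dig -x <ip> +short`, or "" if dig is unavailable or slow."""
    try:
        result = subprocess.run(
            ["dig", "-x", ip, "+short"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return result.stdout


def check_hostname(local_hostname, ptr_output):
    """Compare the device's hostname with PTR names; returns "ok", "mismatch", or "unknown"."""
    names = [
        line.strip().rstrip(".")
        for line in ptr_output.splitlines()
        if line.strip() and not line.startswith(";")  # skip dig diagnostics
    ]
    if not names:
        return "unknown"  # not in DNS yet: report, but do not treat as broken
    local_short = local_hostname.split(".")[0].lower()
    for name in names:
        if name.split(".")[0].lower() == local_short:
            return "ok"
    return "mismatch"
```

With this shape, the maemo-n810-14 case from comment 11 (device claiming to be maemo-n810-11) would show up as "mismatch", while a freshly added device without a static IP would show up as "unknown".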
Updated•15 years ago (Assignee)
Assignee: jhford → aki
Comment 15•14 years ago (Assignee)
I bet you'll love this. Definitely missed functions.
Attachment #424700 -
Flags: review?(jhford)
Comment 16•14 years ago
Comment on attachment 424700 [details]
quick n dirty monitor
looks good. is this going to be run through ssh from the master?
Attachment #424700 -
Flags: review?(jhford) → review+
Comment 17•14 years ago (Assignee)
Schweet. Yup.
Comment 18•14 years ago (Assignee)
Comment on attachment 424700 [details] quick n dirty monitor http://hg.mozilla.org/build/tools/rev/2f4749fe9663
Attachment #424700 -
Flags: checked-in+
Comment 19•14 years ago (Assignee)
When we first set these up, just keeping them up and running was a herculean task. At this point, with mostly stable power and strong wifi connections, I think stability of numbers is a higher priority. I've reduced the ssh-in frequency from once every several minutes to once every several hours. We'll install dropbear instead of opensshd in bug 546702. I'm going to say new fixes will be wanted in future SD imaging bugs.
Status: REOPENED → RESOLVED
Closed: 15 years ago → 14 years ago
Resolution: --- → INCOMPLETE
Updated•11 years ago
Product: mozilla.org → Release Engineering