Closed Bug 510952 Opened 15 years ago Closed 14 years ago

more n810 monitoring

Categories

(Release Engineering :: General, defect)

ARM
Maemo
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: mozilla, Assigned: mozilla)

References

Details

Attachments

(3 files)

* is standalone.txt there
* is buildbot running
** is buildbot connected to a master? which one?
*** does the master think it's connected?
* is hostname valid
* disk usage
* /media/mmc2 read-only
* uptime
* system time

possibly more. that's a good start.
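A rough sketch of what a few of the checks above could look like if run on the device itself. It assumes a reasonably modern Python 3 and a Linux-style /proc/uptime; the function names and the dictionary keys are made up for illustration, and the path to standalone.txt is left as a parameter because it is not given here.

```python
import os
import shutil
import socket
import tempfile
import time


def read_uptime_seconds(proc_uptime="/proc/uptime"):
    """Return uptime in seconds from a Linux-style /proc/uptime, or None."""
    try:
        with open(proc_uptime) as f:
            return float(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def is_writable(directory):
    """Try to create a temp file; a read-only or missing mount fails here."""
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        return False


def disk_usage_or_none(path):
    """Return (total, free) in bytes, or (None, None) if the path is unusable."""
    try:
        usage = shutil.disk_usage(path)
        return usage.total, usage.free
    except OSError:
        return None, None


def collect_status(standalone_path, mount_point):
    """Gather the basic checks into one dict a monitor could log or compare."""
    total, free = disk_usage_or_none(mount_point)
    return {
        "hostname": socket.gethostname(),
        "standalone_present": os.path.exists(standalone_path),
        "disk_total_bytes": total,
        "disk_free_bytes": free,
        "mount_writable": is_writable(mount_point),
        "uptime_seconds": read_uptime_seconds(),
        "system_time": time.time(),
    }
```

Calling collect_status with the real standalone.txt path and /media/mmc2 as the mount point would cover the standalone.txt, hostname, disk usage, read-only, uptime and system time items. The buildbot checks are not covered.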
Blocks: 499334
* free: both free mem & verify that swap is there
I'm thinking a shell script that we ssh in and run 1/hr? 1/day?
jhford is thinking about putting in start.sh and pinging a cgi and monitoring last heard from timestamp.

Whatever works...
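For the "last heard from" idea, the monitoring side could be as small as the sketch below. This is only an illustration: the function name, the dictionary of timestamps and the threshold are assumptions, not anything that exists in start.sh or a cgi today.

```python
import time


def stale_devices(last_heard, max_age_seconds, now=None):
    """Return the sorted names of devices not heard from within max_age_seconds.

    last_heard maps a device name to the Unix timestamp of its last ping.
    """
    now = time.time() if now is None else now
    return sorted(
        name for name, stamp in last_heard.items()
        if now - stamp > max_age_seconds
    )
```

Each ping would just update the device's entry with the current time, and a periodic job would report whatever stale_devices returns.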
Assignee: aki → jford
* is it connected to power
Attached file scrape.py
This is a log scraper to tell us some basic things.  Some areas for enhancement are:
 -Storing information in a sqlite db
 -Implementing a JSON interface on /buildslaves
 -web interface
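As a sketch of the sqlite enhancement, something like the following would keep a history of what the scraper saw. The table layout and function names are assumptions for illustration; scrape.py does not do this.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the status database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS status ("
        " device TEXT NOT NULL,"
        " checked_at REAL NOT NULL,"
        " connected INTEGER NOT NULL)"
    )
    return conn


def record(conn, device, checked_at, connected):
    """Store one observation of a device."""
    conn.execute(
        "INSERT INTO status (device, checked_at, connected) VALUES (?, ?, ?)",
        (device, checked_at, int(connected)),
    )
    conn.commit()


def history(conn, device):
    """Return [(checked_at, connected), ...] for a device, oldest first."""
    rows = conn.execute(
        "SELECT checked_at, connected FROM status"
        " WHERE device = ? ORDER BY checked_at",
        (device,),
    )
    return [(checked_at, bool(connected)) for checked_at, connected in rows]
```

A JSON or web interface could then be built on top of history() rather than on the live scrape.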
Attached file production-view
This script uses the scraper to give a quick overview of production's health
These scripts are running at http://maemo-flashing.mv.mozilla.com/n810-production-list and http://maemo-flashing.mv.mozilla.com/n810-status.txt every 5 minutes.  You will need VPN access to the MV office to see the files.
Moved these scripts to mobile-master at http://mobile-master/n810-production-lists and http://mobile-master/n810-status, and turning off the maemo-flashing ones.
Bob -- this is the list I've got so far.

There may be some new things, like is the filesystem/profile corrupt.  And you won't have to worry about the buildbot specific ones.  I'm going to guess you've got most of this already.
As we are now a lot more stable and have some device status reporting, I am going to call this fixed.  We are getting more than a week between re-images and we have some reporting scripts running to find bad n810s.  If this becomes a major problem again, I will be happy to do some more work on the monitoring.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
I think we need these.
The scraper is handy, but has a number of issues:

* doesn't work when one of the buildbot masters is down (empty list)
* no history
* only determines whether the device is connected & heard from recently
** idle devices are seen as "down"
** doesn't detect things like corrupt filesystems

We need these, but who does the work is definitely negotiable; I can pick this up if you're not interested.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
re: is hostname valid, I just found maemo-n810-14 had a hostname of 'maemo-n810-11' which was causing all sorts of issues on staging.  Probably my f-up, but I'd like to catch that sort of thing.  Again, not something that would be caught by the scraper.
In response to these requests I have added the following commands to the SUTAgent.

devinfo uptime - returns the current uptime for the device.
devinfo systime - returns the current system time as set on the device.
prune <directory> - deletes the directory passed in even if not empty.
dirwritable <directory> - validates that a temp file can be created in the directory passed in.

To test for the existence of a file such as "standalone.txt", you can execute the ls command in the correct directory and check for the file in the returned list.

The "disk" command will return total and free disk space.

The "devinfo memory" command will return the installed, available, and used memory on the device.

Typing help in a telnet window attached to the device on port 20701 will give you a complete list of the commands currently available. This is true for any device running the SUTAgent (currently WinMo and WinCE).
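For scripted use instead of a telnet window, a minimal client might look like the sketch below. The only details taken from this bug are the port (20701) and the command names; the assumptions that commands are newline-terminated and that the reply can simply be read until the agent stops sending are guesses and would need checking against the real agent. The function names are made up.

```python
import socket


def exchange(sock, command):
    """Send one command line on a connected socket and return the reply text.

    Reads until the peer closes its side or the socket times out.
    """
    sock.sendall(command.encode("ascii") + b"\n")  # assumed line terminator
    chunks = []
    try:
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    except socket.timeout:
        pass  # agent kept the connection open; return what arrived so far
    return b"".join(chunks).decode("utf-8", errors="replace")


def send_command(host, command, port=20701, timeout=5.0):
    """Connect to the agent on host:port, run one command, return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return exchange(sock, command)
```

For example, send_command(device_address, "devinfo uptime") or send_command(device_address, "disk") would return the raw text, which the caller still has to parse.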
I have the script I was working on for Maemo SD V5 which can be modified to do these checks.

For the host name stuff, it is a matter of running dig -x on the device and parsing the output.  We could set the host names automatically at boot which would remove any human element involved in flashing.
Bob -- thank you. Not sure how in sync your version and blassey's versions are, but we can deal with that.

John -- cool. My main objection to switching over was losing functionality; if the script has all the functionality we want, then we should go with that.

re: dig -x, we should fail gracefully when we add devices that aren't in dns/static ip yet... for new devices especially.
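A sketch of the hostname check with the graceful failure described above. It uses Python's socket.gethostbyaddr for the reverse lookup rather than parsing dig -x output, which is a swapped-in technique; the function name and the three result strings are assumptions for illustration.

```python
import socket


def check_hostname(ip_address, local_hostname, resolver=socket.gethostbyaddr):
    """Compare the device's hostname against its reverse DNS entry.

    Returns "ok", "mismatch", or "unknown" when the address has no
    reverse entry yet (for example a newly added device).
    """
    try:
        dns_name = resolver(ip_address)[0]
    except (socket.herror, socket.gaierror):
        return "unknown"  # not in DNS yet: report it, but do not fail hard
    # Compare short names only, so a fully qualified DNS name still matches.
    if dns_name.split(".")[0] == local_hostname.split(".")[0]:
        return "ok"
    return "mismatch"
```

The resolver parameter only exists so the check can be exercised without real DNS; in normal use it would be left at its default. A device like maemo-n810-14 carrying the hostname 'maemo-n810-11' would come back as "mismatch".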
Assignee: jhford → aki
Attached file quick n dirty monitor
I bet you'll love this.
Definitely missed functions.
Attachment #424700 - Flags: review?(jhford)
Comment on attachment 424700 [details]
quick n dirty monitor

looks good.  is this going to be run through ssh from the master?
Attachment #424700 - Flags: review?(jhford) → review+
Schweet. Yup.
When we first set these up, just keeping them up and running was a herculean task.  At this point, with mostly stable power and strong wifi connections, I think stability of numbers is a higher priority.

I've reduced the ssh-in from every several minutes to several hours apart. We'll install dropbear instead of opensshd in bug 546702.  I'm going to say new fixes will be wanted in future SD imaging bugs.
Status: REOPENED → RESOLVED
Closed: 15 years ago → 14 years ago
Resolution: --- → INCOMPLETE
Product: mozilla.org → Release Engineering