Monitor files on BrickFTP server

Status

ASSIGNED
2 years ago
6 days ago

People

(Reporter: lypulong, Assigned: fauweh)

Attachments

(2 attachments)

(Reporter)

Description

2 years ago
Monitoring is being requested for the Vending machines residing at the following office

Please work with Jen and Mike Poessy to get technical data on the machines

Jen/Mike we will need some of the following information:

    Name of Service
    Scenario: What this system or service would trigger for support (examples)
    Event Source: Where the trigger would come from (nagios alert, bugzilla ticket, vendor email)
    What the requesting team will be responsible for
    What the MOC will be responsible for 
    Any relevant SLA or other metrics
    Escalation and Support plan
    Deliverables and timelines for the request

Runbook  (Runbooks definition) templates will be provided and jointly authored by a MOC representative and a representative of the requesting team

    Internal System or Service Runbook Template: Runbook Template

(this is an excerpt from https://mana.mozilla.org/wiki/display/MOC/How+to+request+MOC+support+for+a+new+production+system+or+service)

We will get this set up as quickly as possible. Is there a target date for completion?

Thanks so much,
Linda
(Reporter)

Updated

2 years ago
Flags: needinfo?(mpoessy)
Assignee: nobody → rchilds
Status: NEW → ASSIGNED
(Reporter)

Updated

2 years ago
Assignee: rchilds → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(jhayashi)
Picking this back up, awaiting details so we can proceed.
Assignee: nobody → rchilds
Status: NEW → ASSIGNED
(Reporter)

Comment 2

2 years ago
Hi Shraddha -

We spoke about this in January and I am trying to validate that we did not already complete this work. We are looking for the bug, but if we are already monitoring this flow I will update.

Thank you so much 
Linda
(Reporter)

Updated

2 years ago
Flags: needinfo?(spatil)
Flags: needinfo?(mpoessy)
(In reply to Linda Ypulong [:unixfairy] from comment #2)
> Hi Shraddha -
> 
> We spoke about this in January and I am trying to validate that we did not
> already complete this work.  We are looking for the bug but if we are
> already monitoring this flow I will update.
> 
> Thank you so much 
> Linda

Hi Linda,

Thanks for bringing this across. I'll coordinate with Jen to provide the details requested above so we can move this forward. I don't remember a bug already being in place, but this one is the best place to track the work.
Flags: needinfo?(spatil)
I have a question: which part do we want to monitor? There are a lot of moving pieces to the vending machines. Can you help me understand which part we are concerned about?

1)  Ccure --> badge information to IVM so that only active badges can vend?
2)  Transaction data to Data Team?
3)  Data Team transaction data to Service Now to create the ticket?
Flags: needinfo?(jhayashi)
(Reporter)

Comment 5

2 years ago
(In reply to Jennifer Hayashi [:jen] from comment #4)
> I have a question: which part do we want to monitor? There are a lot of
> moving pieces to the vending machines. Can you help me understand which part
> we are concerned about?
> 
> 1)  Ccure --> badge information to IVM so that only active badges can vend?
> 2)  Transaction data to Data Team?
> 3)  Data Team transaction data to Service Now to create the ticket?

Normally we want to monitor all of the parts above and have runbooks for the different components, so that we understand how critical each component is and can develop appropriate responses, e.g. "try this, and if it does not work, page this person immediately" or "please open a bug or Service Now ticket if this alert is received".

I hope that helps. In short, all of the critical functions you listed should be monitored.
(Reporter)

Comment 6

2 years ago
Jen/Shraddha/Mike do we have architectural diagrams that show the hardware and dataflow for the Vending machine solution?
Flags: needinfo?(spatil)
Flags: needinfo?(mpoessy)
Flags: needinfo?(jhayashi)
Not from start to finish.  Do you have a good example you can share?  I'll work on getting that completed.
Flags: needinfo?(jhayashi)
(Reporter)

Comment 8

2 years ago
(In reply to Jennifer Hayashi [:jen] from comment #7)
> Not from start to finish.  Do you have a good example you can share?  I'll
> work on getting that completed.

https://mana.mozilla.org/wiki/display/SECURITY/IAM#IAM-Designarchitectureanddiagrams
https://mana.mozilla.org/wiki/display/EA/Moztrap+Architecture
https://mana.mozilla.org/wiki/display/SYSADMIN/How+our+puppet+works

are three examples

A good architecture diagram will include the physical architecture diagram with system names and IP and then a separate diagram that has logical data flows
Awesome - thanks Linda.
(Assignee)

Comment 10

2 years ago
Taking bug.

Richard gave us some good background on some of the monitoring concerns so I'll schedule a meeting with the various folks in this bug to chat about what the needs are and what we can provide.
Assignee: rchilds → kferrando
Flags: needinfo?(mpoessy)
Created attachment 8896492 [details]
Vending machine interactions
Keegan - let me know if that's too high level and you need more detail.
(In reply to Linda Ypulong [:unixfairy] from comment #6)
> Jen/Shraddha/Mike do we have architectural diagrams that show the hardware
> and dataflow for the Vending machine solution?

I second the diagram Jen attached. It's quite descriptive, and we can work with Jen if additional info is needed. Thanks!
Flags: needinfo?(spatil)
(Assignee)

Comment 14

2 years ago
Meeting scheduled for Friday (8/18) @ 1PM PDT.
(Assignee)

Comment 15

a year ago
Created attachment 8900863 [details]
VendingMachineMonitoring-ProcessFlow.png

Jen/Shraddha, does this marked up diagram look correct based upon our meeting?

Specifically, I'd like to confirm the colored areas of ownership and comments I've added are accurate.

Shraddha, I'll work with you to figure out which files need to be monitored on the BrickFTP server.
Flags: needinfo?(spatil)
Flags: needinfo?(jhayashi)
The area marked on my end and the architecture diagram overall looks correct to me.
Flags: needinfo?(spatil)
yup!  looks good to me.  Thank you!
Flags: needinfo?(jhayashi)
(Assignee)

Comment 18

a year ago
Ok great! So MOC will only be monitoring file uploads to BrickFTP and then we'll close this bug.

I chatted with Shraddha and we will put some Nagios monitoring in place for the following conditions:

File #1
Path: Root Folder/etl/ivm 
Filename: Mozilla_CorporationYYYY_MM_DD_XX_XX_XX.xml   

Summary:
The date is the current date and each XX is a random number; the file must be an xml file. The file lands at 2:00am Pacific.

File #2
Path: Root Folder/etl/ccure/uploads/BadgeID
Filename: ccure_BadgeID_AllButVendor.txt

Summary: 
Lands at 2:02am Pacific, Boomi processes run 10-20 mins later after the file lands
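
For illustration, the partially known File #1 name could be matched with a date-pinned regular expression. This is a minimal sketch, not the actual check: the helper names `ivm_pattern` and `find_ivm_files` are hypothetical, and it assumes each XX is exactly two digits.

```python
import re
from datetime import date

# Hypothetical helper: builds a regex for the IVM export name described above,
# Mozilla_CorporationYYYY_MM_DD_XX_XX_XX.xml, pinned to a given date.
def ivm_pattern(day: date) -> re.Pattern:
    stamp = day.strftime("%Y_%m_%d")
    # Assumption: each XX is exactly two digits.
    return re.compile(rf"^Mozilla_Corporation{stamp}_\d{{2}}_\d{{2}}_\d{{2}}\.xml$")

def find_ivm_files(names, day):
    """Return the names in `names` that match the IVM pattern for `day`."""
    pattern = ivm_pattern(day)
    return [n for n in names if pattern.match(n)]

# File #2 has a fixed name, so an exact string comparison is enough.
BADGE_FILE = "ccure_BadgeID_AllButVendor.txt"
```

Anchoring the pattern on both ends keeps stray *.tmp files or files from other days from satisfying the check.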
(Assignee)

Comment 19

a year ago
Also, we'll restrict checking/alerting on this for Mon-Fri, business hours PDT and send to boomi@mozilla.pagerduty.com.
(In reply to Keegan Ferrando [:fauweh] from comment #19)
> Also, we'll restrict checking/alerting on this for Mon-Fri, business hours
> PDT and send to boomi@mozilla.pagerduty.com.

One correction: restrict checking/alerting on this for Tue-Fri, business hours PDT
Since the file doesn't land on weekends, Sun and Mon will show no files, which is expected.
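
The corrected Tue-Fri window can be sketched as a small guard. In practice a Nagios time period definition is the more usual place for this; the sketch below only illustrates the rule. It assumes the timestamp passed in is already Pacific local time and that "business hours" means 09:00-17:00, which this bug does not define. `in_alert_window` is a hypothetical name.

```python
from datetime import datetime

ALERT_DAYS = {1, 2, 3, 4}  # Tue-Fri per the correction (Monday is 0)
# Assumption: "business hours" means 09:00-17:00; the bug does not say.
START_HOUR, END_HOUR = 9, 17

def in_alert_window(local_now: datetime) -> bool:
    """True when `local_now` (assumed Pacific local time) is inside the window."""
    return (local_now.weekday() in ALERT_DAYS
            and START_HOUR <= local_now.hour < END_HOUR)
```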
(Assignee)

Comment 23

9 months ago
(In reply to Keegan Ferrando [:fauweh] from comment #18)
> Ok great! So MOC will only be monitoring file uploads to BrickFTP and then
> we'll close this bug.
> 
> I chatted with Shraddha and we will put some Nagios monitoring in place for
> the following conditions:
> 
> File #1
> Path: Root Folder/etl/ivm 
> Filename: Mozilla_CorporationYYYY_MM_DD_XX_XX_XX.xml   
> 
> Summary:
> Where the date is current date and XX are random numbers in filename but it
> has to be an xml file. The file lands at 2:00am Pacific
> 
> File #2
> Path: Root Folder/etl/ccure/uploads/BadgeID
> Filename: ccure_BadgeID_AllButVendor.txt
> 
> Summary: 
> Lands at 2:02am Pacific, Boomi processes run 10-20 mins later after the file
> lands

Shraddha - We're circling back on this. There had been some discussion while you were out on parental leave that Boomi would be replaced and that could impact this monitoring need.

Is this something we should implement, or will/have these processes gone away? This should be pretty low-effort, we'll just need to get the BrickFTP credentials from you and into the puppet secrets repo.
Flags: needinfo?(spatil)
Yes please go forward to implement this task. Boomi decommissioning is in our H2 planning phase but is not concrete yet.

I'll be happy to help if there are any questions during implementation. Thank you.
Flags: needinfo?(spatil)
(Assignee)

Comment 25

8 months ago
I've got the credentials in hiera. It looks like we don't currently do any FTP checks of the kind we need here, and Nagios doesn't ship with anything suitable (check_ftp is just a symlink to check_tcp, which only checks that the server responds).
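
Since check_ftp only confirms that the server responds, a custom plugin is implied here. A minimal sketch of the wrapper such a plugin might use, following the standard Nagios plugin convention (exit codes 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN, status text on the first line of stdout). The names `run_check` and `main` are hypothetical and not taken from the real script.

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

def run_check(check):
    """Run `check`, a callable returning (code, message); map crashes to UNKNOWN."""
    try:
        code, message = check()
    except Exception as exc:  # e.g. a login failure or network error
        code, message = UNKNOWN, f"check failed: {exc}"
    return code, f"{LABELS[code]}: {message}"

def main(check):
    code, line = run_check(check)
    print(line)      # Nagios uses the first line of stdout as the status text
    sys.exit(code)   # and the exit code as the service state
```

Mapping unexpected errors to UNKNOWN rather than CRITICAL keeps a broken check from paging as if the files were missing.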
(Assignee)

Comment 26

8 months ago
Shraddha: There are 65,152 files in the ./ivm/ directory, and that's causing this script to take over a full minute to run as we have to look at each file and try to match on the partially known filename. This is really inefficient. 

Is there any way we can move/delete some of these older files? Perhaps we could move older files to something like ivm-archive or ivm/archive?
Flags: needinfo?(spatil)
Archiving is the best option. I wouldn't mind deleting older files from 2016, and files with the *.tmp extension, too.

Let me know if that would help. Thanks!
Flags: needinfo?(spatil)
(Assignee)

Comment 28

8 months ago
(In reply to Shraddha Patil [:Shraddha Patil] from comment #27)
> Archiving is the best option. I wouldn't mind deleting older files too from
> 2016 and with *.tmp extension
> 
> Would that help, let me know. Thanks!

Alright, about 64k of the files were temp files so I got those deleted. I have archived the remaining xml files from 2016 and 2017 into gzip'd tar files in the 'Archived' directory. This leaves us with 115 files, much more manageable to check now.

I'll start working on the SFTP check now.
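
For reference, the per-year archiving described above could look roughly like this. It is a sketch under assumptions: it works on a local copy of the directory rather than on the BrickFTP server itself, and the function name `archive_by_year` and the archive name `ivm-<year>.tar.gz` are hypothetical.

```python
import re
import tarfile
from collections import defaultdict
from pathlib import Path

# Captures the year from names like Mozilla_Corporation2016_..._XX.xml
YEAR_RE = re.compile(r"^Mozilla_Corporation(\d{4})_.*\.xml$")

def archive_by_year(src: Path, dest: Path, years) -> dict:
    """Pack matching XML files from `src` into dest/ivm-<year>.tar.gz, then delete them."""
    groups = defaultdict(list)
    for path in sorted(src.iterdir()):
        m = YEAR_RE.match(path.name)
        if path.is_file() and m and int(m.group(1)) in years:
            groups[int(m.group(1))].append(path)
    dest.mkdir(parents=True, exist_ok=True)
    counts = {}
    for year, paths in groups.items():
        with tarfile.open(dest / f"ivm-{year}.tar.gz", "w:gz") as tar:
            for p in paths:
                tar.add(p, arcname=p.name)
        for p in paths:  # remove originals only after the archive is closed
            p.unlink()
        counts[year] = len(paths)
    return counts
```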
appreciate it.
(Assignee)

Comment 30

8 months ago
(In reply to Keegan Ferrando [:fauweh] from comment #18)
> File #1
> Path: Root Folder/etl/ivm 
> Filename: Mozilla_CorporationYYYY_MM_DD_XX_XX_XX.xml   
> 
> Summary:
> Where the date is current date and XX are random numbers in filename but it
> has to be an xml file. The file lands at 2:00am Pacific
> 
> File #2
> Path: Root Folder/etl/ccure/uploads/BadgeID
> Filename: ccure_BadgeID_AllButVendor.txt
> 
> Summary: 
> Lands at 2:02am Pacific, Boomi processes run 10-20 mins later after the file
> lands

Shraddha, since these files are only generated once a day around 2am, but escalation shouldn't start until business hours, I'm planning on setting this to only check for files at 9am PST and alert if the files have a create date older than 7 hours (so, if they were created before 2am). I only plan to have this check/alert for a few hours in the morning. There may be some tuning around the file age parameter but the time at which this checks will remain static.

Does this sound reasonable?
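
To make the proposal concrete, the age rule might be expressed as follows. This is a sketch only: `is_stale` and `age_in_hours` are hypothetical names, and the 7-hour threshold is the starting value from this comment and may be tuned.

```python
from datetime import datetime, timedelta

# From the proposal: check at 9am, files expected at about 2am.
MAX_AGE = timedelta(hours=7)

def is_stale(modified: datetime, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """True when the file's timestamp is older than the allowed age."""
    return now - modified > max_age

def age_in_hours(modified: datetime, now: datetime) -> int:
    """Whole hours elapsed between `modified` and `now`."""
    return int((now - modified).total_seconds() // 3600)
```

With a 9:00 check, a file that landed at 2:00 or 2:02 the same morning passes, while the previous day's file (31 hours old) would alert.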
Flags: needinfo?(spatil)
Hi Keegan,

Yes sounds reasonable. Please continue with the effort. Appreciate it.
Flags: needinfo?(spatil)
(Assignee)

Comment 32

8 months ago
Should have this wrapped up next week.

[kferrando@nagios1.private.scl3 bin]$ python brick.py
OK: Mozilla_Corporation2018_06_01_09_00_00.xml is 13 hours old and 3565 bytes. - ccure_BadgeID_AllButVendor.txt file is 6 hours old and 238311 bytes.
(Assignee)

Comment 33

13 days ago

I confirmed with Shraddha that the brickftp service is still part of this flow despite the removal of Boomi. We have added the MDC1/2 NAT addresses to the whitelist for access and will work on getting this monitoring completed and pushed out.

(Assignee)

Comment 34

12 days ago

There were some changes to the files that need to be monitored in brickftp, and I have modified the script to work with them.

When all is well (only short-names provided):

$ ./brickftp.py
OK: ivm is OK, journal is OK, siemens is OK

Age issues (actual filename is provided):

$ ./brickftp.py
CRITICAL: ivm is OK, ccure_BadgeID_AllButVendor test for Siemens.txt is 20 hours old, siemens is OK

File not found:

$ ./brickftp.py
CRITICAL: ivm is OK, FAKE ccure_BadgeID_AllButVendor test for Siemens.txt not found!, siemens is OK
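
The output shapes above can be reproduced with a small aggregation step. This is an illustrative sketch, not the code in brickftp.py: `FileResult`, `describe`, and `summarize` are hypothetical names, and the per-file age limit is an assumed parameter.

```python
from typing import NamedTuple, Optional

class FileResult(NamedTuple):
    short_name: str            # e.g. "ivm"
    filename: str              # actual file name on the server
    age_hours: Optional[int]   # None when the file was not found
    max_age_hours: int         # assumed per-file age limit

def describe(r: FileResult):
    """Return (is_ok, text) for one monitored file."""
    if r.age_hours is None:
        return False, f"{r.filename} not found!"
    if r.age_hours > r.max_age_hours:
        return False, f"{r.filename} is {r.age_hours} hours old"
    return True, f"{r.short_name} is OK"

def summarize(results):
    """Combine per-file results into one status and one status line."""
    parts = [describe(r) for r in results]
    status = "OK" if all(ok for ok, _ in parts) else "CRITICAL"
    return status, f"{status}: " + ", ".join(text for _, text in parts)
```

Healthy files report only their short name, while a failing file reports its actual filename, matching the three cases shown.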

I'll get this out next week.

(Assignee)

Updated

6 days ago
Summary: Monitoring Request: set up monitoring for IT Vending machines → Monitor files on BrickFTP server