Closed Bug 1054050 Opened 10 years ago Closed 8 years ago

rsync should have a timeout in new vcs sync

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: pmoore, Assigned: mcc.ricardo)

References

Details

(Whiteboard: [good first bug][lang=python])

In bug 1053960 we had vcs sync stuck due to an rsync getting stuck. We should have a timeout for the rsync commands used in vcs sync (and indeed any other operation that potentially can get stuck).
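A minimal sketch of what a per-command timeout could look like in Python, assuming a Python 3 interpreter where `subprocess.run` accepts a `timeout` argument. The helper name `run_with_timeout` and the 600-second default are hypothetical and are not taken from the vcs sync code; a sleeping child process stands in for a hung rsync.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds=600):
    """Run cmd (a list of arguments); return its exit code, or None on timeout.

    Hypothetical helper for illustration; not part of mozharness or vcs sync.
    """
    try:
        # subprocess.run kills the child and raises TimeoutExpired if the
        # command has not finished after timeout_seconds.
        completed = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for a hung rsync: a child process that sleeps for 30 seconds.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hung, timeout_seconds=1))
```

A caller could treat a `None` result as a failed sync step and either retry or abort the run; which of those is appropriate for vcs sync is left open here.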
See Also: → 1053960
Whiteboard: [good first bug][lang=python]
Hi Pete. I would like to take a shot at this, but I would require some guidance.
Flags: needinfo?(pmoore)
That's fantastic! I'm very happy to help out. I'm tied up at a planning workweek this week but I can certainly answer questions you might have now, and next week I can hook up with you to give you e.g. a face-to-face overview over skype etc, if that helps.

It might also be worth contacting Kartik from bug 1020613, as he is a new contributor who has just set up a local vcs sync dev environment, so he might be able to help you out with starter questions. Next week my calendar is much emptier so I should have more time. :)

Pete
Flags: needinfo?(pmoore)
Absolutely Pete. I'm pretty swamped myself this week too. But I'll contact Kartik and get this ball rolling by setting up a local dev environment.
Hi Ricardo,

Things are a bit quieter now - let me know if you'd still like to hook up. I'm on build duty next week, but I'm sure we'll be able to find for example a 30 min slot to skype. I am around usually between 00:00 PDT and 09:00 PDT - does this suit your timezone?

Thanks,
Pete
Flags: needinfo?(mcc.ricardo)
Hi Pete,

That would be great. I was also busy, but next week I'll have more time. 00:00 PDT and 09:00 PDT corresponds to 08:00 BST and 17:00 BST. I think we can book a Skype call for 08:30 BST (00:30 PDT). What do you think?

I will be on holiday from the 27/09 to 12/10. I would like to get this ball running before that.
Flags: needinfo?(mcc.ricardo)
That sounds good. Monday I'm not around - shall we say Tuesday 23 Sep at 08:30 BST? My skype name is pete-rwe.

Thanks,
Pete
Perfect Pete. 23/09 8:30 BST.

My Skype name is odracirortsac.
Hi Pete,

Sorry this is taking longer than I anticipated. I've just been swamped with work.

We've been chatting the past few days but I just wanted to leave a note here for reference, for future progress.

I'll be on holiday for the next two weeks, so it might take a little bit longer for this to get completed.

Ricardo
No problem at all Ricardo! Many thanks for your support on this topic! =)

Pete
Hi Pete,

I'm back from holiday. I'll be resuming work on this ticket.

Ricardo
Hey Pete,

I just ran vcs_sync and synced it to this GitHub test repo. Can you please check it out?

https://github.com/mccricardo/mozharness-test

Ricardo
Hey Ricardo!

It looks pretty good!

Since you were on holiday, I discovered: https://github.com/mozilla/build-puppet/blob/23aa7ab10f14f81c09c8caf67536fa65329b393b/modules/aws_manager/templates/kill_long_running.sh.erb which is in our puppet repo, which might help us solve this bug more generically, .... maybe.

Basically we have a mechanism in our puppet code to set up cron jobs that automatically have watcher cron jobs monitoring them, and killing them if they take too long. This would capture any hangs from any part of the process, I believe, so we wouldn't have to special-case every possible hg command that could hang in the vcs sync code.

For example, see the process_timeout entries in https://github.com/mozilla/build-puppet/blob/23aa7ab10f14f81c09c8caf67536fa65329b393b/modules/aws_manager/manifests/cron.pp where the cron jobs are configured.

We could potentially set this up for our vcs sync jobs too.

I have to go now - but maybe something to think about as a more generic solution to this problem.

Disadvantage: no special handling of problems, such as retry - that said, if vcs sync fails due to a hang, on the next iteration it should retry anyway.
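To illustrate the watcher idea, here is a rough sketch in Python. It is not the contents of kill_long_running.sh.erb; the function names, the command-line pattern and the timeout value are all hypothetical. It assumes a `ps` that supports the POSIX `-o pid= -o etime= -o args=` form, where `etime` is printed as `[[dd-]hh:]mm:ss`.

```python
import os
import signal
import subprocess


def parse_etime(etime):
    """Convert ps 'etime' output ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    fields = [int(f) for f in etime.split(":")]
    # Pad to [hours, minutes, seconds] when the hours field is omitted.
    while len(fields) < 3:
        fields.insert(0, 0)
    hours, minutes, seconds = fields
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def find_overdue(ps_output, pattern, timeout_seconds):
    """Return PIDs whose command line contains pattern and whose
    elapsed time exceeds timeout_seconds."""
    overdue = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pid, etime, args = parts
        if pattern in args and parse_etime(etime) > timeout_seconds:
            overdue.append(int(pid))
    return overdue


def kill_overdue(pattern, timeout_seconds):
    """Send SIGTERM to every overdue matching process except this one."""
    # An empty header ("pid=") suppresses the header line for that column.
    ps_output = subprocess.check_output(
        ["ps", "-e", "-o", "pid=", "-o", "etime=", "-o", "args="],
        universal_newlines=True,
    )
    for pid in find_overdue(ps_output, pattern, timeout_seconds):
        if pid != os.getpid():
            os.kill(pid, signal.SIGTERM)
```

A cron entry could then call something like `kill_overdue("<job pattern>", 3600)` every few minutes. As noted above, this gives no special handling such as retry; it only stops the hung run so that the next scheduled iteration can start cleanly.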
Hey Pete,

Just checking out some of the code.

So the basic idea would be that each job would be monitored by a cron job that would kill it after a certain timeout period.

Since I'm not, yet, totally at ease with vcs_sync, let me fire up some questions.

Since I'm running vcs_sync similarly to:

python scripts/vcs-sync/vcs_sync.py  -c configs/vcs_sync/build-repos.py

the idea would be to have a cron job watch that execution and kill it after a timeout period?

I believe vcs_sync has to execute its actions (defined in the config file), and then we should monitor via the cron job whether it has hung or not.
Hey Pete,

I added the script you mentioned above to our code. I tested it and with this script in place, each time vcs_sync is launched, a cron job can be set up to monitor its elapsed time, terminating it if it's taking too long.

I ran vcs_sync on my local machine and it took roughly 25min, so I set it up with a 1h timeout. I don't know if that's the most sensible timeout. You probably have a better notion of what it should be.

You can find the code here: https://github.com/mccricardo/build-mozharness/compare/vcs-sync-timeout
Assignee: nobody → mcc.ricardo
(In reply to Ricardo Castro from comment #13)
> Hey Pete,
> 
> Just checking out some of the code.
> 
> So the basic idea would be that each job would be monitored by a cron job that
> would kill it after a certain timeout period.

Yes, that's right.

> 
> Since I'm not, yet, totally at ease with vcs_sync, let me fire up some
> questions.
> 
> Since I'm running vcs_sync similarly to:
> 
> python scripts/vcs-sync/vcs_sync.py  -c configs/vcs_sync/build-repos.py
> 
> the idea would be to have a cron job watch that execution and kill it after
> a timeout period?

Yes.

> 
> I believe vcs_sync has to execute its actions (defined in the config file),
> and then we should monitor via the cron job whether it has hung or not.

Yes, exactly.
(In reply to Ricardo Castro from comment #14)
> Hey Pete,
> 
> I added the script you mentioned above to our code. I tested it and with
> this script in place, each time vcs_sync is launched, a cron job can be
> set up to monitor its elapsed time, terminating it if it's taking too long.
> 
> I ran vcs_sync on my local machine and it took roughly 25min, so I set it up
> with a 1h timeout. I don't know if that's the most sensible timeout. You
> probably have a better notion of what it should be.
> 
> You can find the code here:
> https://github.com/mccricardo/build-mozharness/compare/vcs-sync-timeout

Thanks for doing this! This works out quite well, because at the moment, Callek is working on puppetizing vcs sync over in bug 927199 - so when this work is done, we will be in a position to update the cron jobs in puppet to use this mechanism. Therefore I will put 927199 as a dependency for this bug, as we first should get vcs sync running in puppet, in order to take advantage of the puppet module we already have for creating monitoring cron jobs, as per e.g. https://github.com/mozilla/build-puppet/blob/23aa7ab10f14f81c09c8caf67536fa65329b393b/modules/aws_manager/manifests/cron.pp
Depends on: 927199
Excellent Pete!
Status: NEW → ASSIGNED
Hey Ricardo,
While we are waiting for bug 927199, would you be interested in getting involved with other bugs? Sorry this one has to stall for a little bit.
Pete
Flags: needinfo?(mcc.ricardo)
Hey Pete,

I'm currently working on Pulse and Mozbench. Do you have any bug you think I might be able to give you a hand with?
Flags: needinfo?(pmoore)
Flags: needinfo?(mcc.ricardo)
Hey Ricardo, that's very kind. We have some bugs here:

https://bugzilla.mozilla.org/buglist.cgi?cmdtype=runnamed&namedcmd=RelEng%20good%20first%20bugs

For example https://bugzilla.mozilla.org/show_bug.cgi?id=1072026 might be quite straightforward, if you fancy a stab!

If you prefer working on Pulse and Mozbench, no worries. Any contributions you are making to Mozilla are gratefully received. Thanks again for your support! =)

Pete
Flags: needinfo?(pmoore)
Cool :) I'll give it a look!
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Component: Tools → General