Change pvtbuilds NFS mount to a transitional volume

RESOLVED FIXED

Status

RESOLVED FIXED
3 years ago
2 years ago

People

(Reporter: gcox, Assigned: cknowles)

Tracking

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/2149] )

(Reporter)

Description

3 years ago
When: next TCW, roughly 15 minutes in duration
System(s) affected: builds / treeherder / partner builds
Notifs: usual TCW comms
Point: cknowles, selenamarie-or-delegate

Plan: To help with the evacuation of product delivery, we are going to unmount the existing pvtbuilds NFS mount (living on soon-to-be-off-warranty hardware), and replace it with a same-named, smaller, empty pvtbuilds mount on supported hardware.  This will buy time for legacy code to be migrated, past our warranty deadline.

"Unmount from all boxes, switch the volume on the filer, remount" covers the window; rollback is the same in the other direction. The original volume will be kept, offline but recoverable, for ~1 week before being deleted.
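
For illustration only, the order of steps in the window could be sketched as a dry run that prints commands rather than running them. The mount point /mnt/pvt_builds is taken from the cron path quoted later in this bug; the host names passed in, the use of ssh, and the assumption that each client has an fstab entry for the mount are hypothetical. The filer-side switch is left as a placeholder because the actual filer commands are not given here.

```shell
# Dry-run sketch: print the window steps in order, execute nothing.
# Host names are hypothetical arguments; MOUNT_POINT is from the cron path in this bug.
MOUNT_POINT=/mnt/pvt_builds

print_window_plan() {
    # Step 1: unmount on every client box.
    for host in "$@"; do
        echo "ssh $host umount $MOUNT_POINT"
    done
    # Step 2: manual step on the filer (actual commands are not given in this bug).
    echo "# switch the pvtbuilds volume on the filer"
    # Step 3: remount on every client (assumes an fstab entry for MOUNT_POINT).
    for host in "$@"; do
        echo "ssh $host mount $MOUNT_POINT"
    done
}
```

Calling `print_window_plan hostA hostB` prints the unmount lines, the filer placeholder, then the mount lines. Rollback would follow the same order, with the filer step switching back to the original volume.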
(Reporter)

Updated

3 years ago
Change Request: --- → ?
Blocks: 1214712

Updated

3 years ago
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/2149]
Reviewed 11/18 and scheduled for 11/21/2015 TCW
Change Request: ? → approved
(Assignee)

Comment 2

3 years ago
Work completed on schedule - :selenamarie confirmed that things look good after the remount - closing out.
(Assignee)

Updated

3 years ago
Assignee: server-ops-webops → cknowles
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED

Comment 3

3 years ago
Just wondering if the remount was mounted with the same permissions?

It seems we have quite a few errors like this:

Return code: 1
Failed to log stats. Exception = [Errno 185090050] _ssl.c:340: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib
Return code: 1
rsync error: error in file IO (code 11) at main.c(587) [Receiver=3.0.9]
rsync: connection unexpectedly closed (9 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
Return code: 12
Unable to rsync /builds/slave/b2g_b2g-in_nexus-4_dep-0000000/build/upload to pvtbuilds.pvt.build.mozilla.org:/pvt/mozilla.org/b2gotoro/tinderbox-builds/b2g-inbound-nexus-4/20151121143005!
Failed to upload /builds/slave/b2g_b2g-in_nexus-4_dep-0000000/build/upload to b2gbld@pvtbuilds.pvt.build.mozilla.org:/pvt/mozilla.org/b2gotoro/tinderbox-builds/b2g-inbound-nexus-4/20151121143005! 

http://ftp.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/b2g-inbound-nexus-4/1448145005/b2g_b2g-inbound_nexus-4_dep-bm73-build1-build139.txt.gz

or 

Cron <b2gbld@upload-cron> nice -n 19 find /mnt/pvt_builds/pvt/mozilla.org/b2gotoro/tinderbox-builds -mindepth 2 -maxdepth 2 -not -wholename '*/mozilla-b2g30_v1_4-hamachi*' -not -wholename '*/*-flame*' -type d -mtime +20 -print0 | xargs -0 rm -rf
From: Cron Daemon <root@upload-cron.private.scl3.mozilla.com>, to: release, 7:00 PM
rm: cannot remove `/mnt/pvt_builds/pvt/mozilla.org/b2gotoro/tinderbox-builds/b2g-inbound-nexus-5-l-eng/20151031143003/logs/localconfig.json': Permission denied
rm: cannot remove `/mnt/pvt_builds/pvt/mozilla.org/b2gotoro/tinderbox-builds/b2g-inbound-nexus-5-l-eng/20151031143003/logs/log_critical.log': Permission denied
rm: cannot remove `/mnt/pvt_builds/pvt/mozilla.org/b2gotoro/tinderbox-builds/b2g-inbound-nexus-5-l-eng/20151031143003/logs/log_error.log': Permission denied
rm: cannot remove `/mnt/pvt_builds/pvt/mozilla.org/b2gotoro/tinderbox-builds/b2g-inbound-nexus-5-l-eng/20151031143003/logs/log_fatal.log': Permission denied
rm: cannot remove `/mnt/pvt_builds/pvt/mozilla.org/b2gotoro/tinderbox-builds/b2g-inbound-nexus-5-l-eng/20151031143003/logs/log_info.log': Permission denied
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Reporter)

Comment 4

3 years ago
The vol is exported from the filer with the same filer perms, and mounted with the same client perms.

However, it looks like the data copied over did not retain the perms of the original:

A temporary / read-only copy of the old volume:
[root@pvtbuilds2.dmz.scl3 ~]# ls -l /tmp/qq/pvt/mozilla.org/b2gotoro/tinderbox-builds/b2g-inbound-nexus-5-l-eng/20151031143003/logs/localconfig.json
-rw-rw-r-- 1 b2gbld b2gbld 5928 Oct 31 23:11 /tmp/qq/pvt/mozilla.org/b2gotoro/tinderbox-builds/b2g-inbound-nexus-5-l-eng/20151031143003/logs/localconfig.json

The new prod volume:
[root@pvtbuilds2.dmz.scl3 ~]# ls -l /mnt/pvt_builds/pvt/mozilla.org/b2gotoro/tinderbox-builds/b2g-inbound-nexus-5-l-eng/20151031143003/logs/localconfig.json
-rw-rw-r-- 1 root root 5928 Oct 31 23:11 /mnt/pvt_builds/pvt/mozilla.org/b2gotoro/tinderbox-builds/b2g-inbound-nexus-5-l-eng/20151031143003/logs/localconfig.json

I can't change these (well, I COULD but I don't know what I'm doing there). Basically, you probably need some mass chowns.
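
To get a sense of how widespread the difference is, the two `ls -l` checks above could be generalized into a report over the whole tree. This is a minimal sketch, not something that was run for this bug: the function name is made up, it assumes GNU `stat` and `find`, and it assumes the old volume is still mounted somewhere readable (such as the temporary copy under /tmp/qq).

```shell
# Sketch: list paths under the new tree whose owner:group:mode differs
# from the same relative path under the old tree. Read-only; changes nothing.
# $1 = old tree root, $2 = new tree root (both hypothetical arguments).
report_perm_drift() (
    old_root=$(cd "$1" && pwd) || exit 1
    cd "$2" || exit 1
    # Symlinks are skipped; paths missing from the old tree are ignored.
    find . ! -type l -exec sh -c '
        old_root=$1; shift
        for rel in "$@"; do
            [ -e "$old_root/$rel" ] || continue
            old=$(stat -c %U:%G:%a "$old_root/$rel")
            new=$(stat -c %U:%G:%a "$rel")
            [ "$old" = "$new" ] || printf "%s\t%s -> %s\n" "$rel" "$old" "$new"
        done
    ' sh "$old_root" {} +
)
```

For example, `report_perm_drift /tmp/qq/pvt/mozilla.org/b2gotoro /mnt/pvt_builds/pvt/mozilla.org/b2gotoro` would print one line per mismatched path.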
Flags: needinfo?(sdeckelmann)
How was the data copied that lost the perms in the first place? If this is causing errors, can we match the file permissions from the old volume (is that the desired end state here)? Either by using rsync (if it's some known exact subset), or by using a script that looks at the files in the new mount point and matches the perms from the old one?
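
A minimal sketch of the "script that matches the perms from the old one" idea, assuming GNU coreutils (`chown --reference` and `chmod --reference`) and that the old volume is mounted read-only alongside the new one. The function name and both arguments are hypothetical, and it would need to run as root to change ownership away from root:root. It is not what was eventually run for this bug.

```shell
# Sketch: for every path in the new tree that also exists in the old tree,
# copy owner, group and mode from the old path onto the new one.
# $1 = old tree root (reference), $2 = new tree root (modified in place).
match_perms() (
    old_root=$(cd "$1" && pwd) || exit 1
    cd "$2" || exit 1
    # Symlinks are skipped; paths that only exist in the new tree are left alone.
    find . ! -type l -exec sh -c '
        old_root=$1; shift
        for rel in "$@"; do
            ref="$old_root/$rel"
            [ -e "$ref" ] || continue
            # chown first: changing ownership can clear setuid/setgid bits,
            # so the mode is restored afterwards.
            chown --reference="$ref" "$rel" &&
            chmod --reference="$ref" "$rel"
        done
    ' sh "$old_root" {} +
)
```

Files created on the new volume after the switch have no counterpart in the old tree, so this approach would not fix them; those would still need an explicit chown.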
Hi, this now affects the mozilla-central, mozilla-inbound and b2g-inbound trees, at least for the device builds, for example:

https://treeherder.mozilla.org/logviewer.html#?job_id=3415635&repo=b2g-inbound

 23:49:28 INFO - rsync: mkdir "/pvt/mozilla.org/b2gotoro/tinderbox-builds/b2g-inbound-flame-kk-eng/20151122215327" failed: Permission denied (13)

so raising this as a blocker, since this is a perma failure on the affected trees
Severity: normal → blocker
closed affected trees due to mass perma failures of the affected buildbot device builds
I'd like to suggest we run these commands to get the tree re-opened:
chown -R b2gbld:b2gbld /mnt/pvt_builds/pvt/mozilla.org/b2gotoro/tinderbox-builds
chown -R b2gbld:b2gbld /mnt/pvt_builds/pvt/mozilla.org/b2gotoro/nightly

There may be other issues but that should cover the bulk of the immediate problem. It's simply setting the group on the top level of the given directories, then fixing root:root ownership of everything within that.
That's on pvtbuilds2.dmz.scl3.
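
Before or after running the chowns, a quick check along these lines could show what is not owned as expected. It is only a sketch: the function name is made up, and it relies on `find`'s `-user` and `-group` tests, which accept either names or numeric IDs.

```shell
# Sketch: print every entry under a directory that is not owned by the
# expected user and group. Read-only; changes nothing.
# $1 = directory, $2 = expected user, $3 = expected group.
list_wrong_owner() {
    find "$1" \( ! -user "$2" -o ! -group "$3" \) -print
}
```

For example, `list_wrong_owner /mnt/pvt_builds/pvt/mozilla.org/b2gotoro/tinderbox-builds b2gbld b2gbld | head` would show a sample of the remaining root:root entries, and empty output after the chown would suggest nothing was missed.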

Comment 10

3 years ago
=============================
Old permissions before change
=============================
[root@pvtbuilds2.dmz.scl3 ~]# ls -ld /mnt/pvt_builds/pvt/mozilla.org/b2gotoro/tinderbox-builds
drwxr-s--- 38 b2gbld root 4096 Nov 22 00:48 /mnt/pvt_builds/pvt/mozilla.org/b2gotoro/tinderbox-builds
[root@pvtbuilds2.dmz.scl3 ~]# ls -ld /mnt/pvt_builds/pvt/mozilla.org/b2gotoro/nightly
drwxr-s--- 20 b2gbld root 4096 Nov 20 19:28 /mnt/pvt_builds/pvt/mozilla.org/b2gotoro/nightly

[root@pvtbuilds2.dmz.scl3 ~]# chown -R b2gbld:b2gbld /mnt/pvt_builds/pvt/mozilla.org/b2gotoro/tinderbox-builds

[root@pvtbuilds2.dmz.scl3 ~]# chown -R b2gbld:b2gbld /mnt/pvt_builds/pvt/mozilla.org/b2gotoro/nightly

=============================
Permissions after change
=============================
[root@pvtbuilds2.dmz.scl3 ~]# ls -ld /mnt/pvt_builds/pvt/mozilla.org/b2gotoro/tinderbox-builds
drwxr-s--- 38 b2gbld b2gbld 4096 Nov 22 00:48 /mnt/pvt_builds/pvt/mozilla.org/b2gotoro/tinderbox-builds

[root@pvtbuilds2.dmz.scl3 ~]# ls -ld /mnt/pvt_builds/pvt/mozilla.org/b2gotoro/nightly
drwxr-s--- 20 b2gbld b2gbld 4096 Nov 20 19:28 /mnt/pvt_builds/pvt/mozilla.org/b2gotoro/nightly
[root@pvtbuilds2.dmz.scl3 ~]#
reopened the trees for now and retriggered builds
(Assignee)

Comment 12

3 years ago
Alright - Let me know if you need anything from the storage side.
(Reporter)

Comment 13

3 years ago
[15:41:10] <gcox> Tomcat|Sheriffduty: Heya, bug 1223956 was marked a blocker overnight.  Is it still blocking, did the chowns fix it, or are we still waiting to learn more?
[15:41:41] <Tomcat|Sheriffduty> gcox: oh its ok now, the fix fixed this 
[15:42:13] <Tomcat|Sheriffduty> and trees are now open again
Severity: blocker → normal
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago → 3 years ago
Flags: needinfo?(sdeckelmann)
Resolution: --- → FIXED
Blocks: 1227170
34 automation job failures were associated with this bug yesterday.

Repository breakdown:
* b2g-inbound: 22
* mozilla-inbound: 6
* mozilla-central: 6

Platform breakdown:
* b2g-device-image: 34

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1223956&startday=2015-11-23&endday=2015-11-23&tree=all
36 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* b2g-inbound: 25
* mozilla-central: 6
* mozilla-inbound: 5

Platform breakdown:
* b2g-device-image: 36

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1223956&startday=2015-11-23&endday=2015-11-29&tree=all
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard