Closed Bug 1323771 Opened 3 years ago Closed 3 years ago

Create replacement VM for bm-l10n-dashboard01 to run l10n jobs and host shared data store

Categories

(Infrastructure & Operations :: IT-Managed Tools, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Pike, Assigned: ericz)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/3876])

+++ This bug was initially created as a clone of Bug #1319603 +++

bm-l10n-dashboard01 is in bad shape, both infra- and security-wise; see bug 1304413.

We now have a VM from bug 1319603, l10n-dashboard2.webapp.scl3.mozilla.com, with the right version of Python, and I have access to it, too.

Known issues that we still need to track down:

<ericz> Puppet is a bit unhappy about nfs mounts and collectd so I have some work to do still but this is a decent starting point.

Also, the network routes to our MySQL servers don't work yet: generic-rw-zeus.db.scl3.mozilla.com, and probably stage-rw-vip.db.scl3.mozilla.com as well.

On my side, I need at least to switch the git submodule URLs from git: to https:.
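
For the submodule switch, here is a minimal sketch of the rewrite. It assumes the submodule URLs live in `.gitmodules` as plain `url = git://...` lines and that each host also serves the repository over https. The function name and the example host in the comment are made up for illustration.

```python
import re

# Matches the start of a "url = git://..." line in .gitmodules and keeps the
# indentation and "url =" prefix, so that only the scheme changes.
_GIT_URL = re.compile(r"^([ \t]*url[ \t]*=[ \t]*)git://", re.MULTILINE)


def rewrite_gitmodules(text):
    """Return .gitmodules content with git:// submodule URLs moved to https://.

    Example (hypothetical host):
        "url = git://example.org/lib.git" -> "url = https://example.org/lib.git"
    """
    return _GIT_URL.sub(r"\1https://", text)
```

After writing the result back to `.gitmodules`, `git submodule sync` copies the new URLs into the git config of existing checkouts.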
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/3876]
Depends on: 1323780
Assignee: server-ops-webops → eziegenhorn
Commit pushed to develop at https://github.com/mozilla/elmo

https://github.com/mozilla/elmo/commit/acb9cf18307300a414b03f7951f285e9b8f2d777
bug 1323771, configure travis, r=mathjazz

Also remove the old hudson helper script, now that we use travis.
Wrong bug, sorry.
Making good progress now.

I got the local master and slave set up; they're talking to the stage db and stage ES, and they start up OK with the most recent versions of the libs.

One thing I noticed is that the newly mounted storage appears to be significantly slower than the storage on the old VM.

A mere `hg ident -i` takes 2 seconds on the old setup, https://l10n.mozilla.org/builds/builders/compare/634825.
On the new one, it takes 25 seconds, https://l10n.allizom.org/builds/builders/compare/634827.

Given that the actual comparison tool isn't impacted, the bottleneck doesn't seem to be file reads, but maybe stat calls?

Eric, is that something you can ask the storage folks about?
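
One way to check the reads-versus-stats hunch is to time both separately over the same set of files. This is a rough, read-only sketch assuming Python 3. The function name and the `max_files` cap are my own, and client-side caching (including whatever the directory walk itself warms up) can skew the numbers, so treat the result as a hint rather than a measurement.

```python
import os
import time


def time_stats_and_reads(root, max_files=1000):
    """Time lstat() calls and full reads over up to max_files files under root.

    Returns (file_count, stat_seconds, read_seconds). Nothing is written.
    """
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
            if len(paths) >= max_files:
                break
        if len(paths) >= max_files:
            break

    # Metadata only: one lstat() per file.
    start = time.perf_counter()
    for path in paths:
        os.lstat(path)
    stat_seconds = time.perf_counter() - start

    # File contents: read each file completely, skipping unreadable entries.
    start = time.perf_counter()
    for path in paths:
        try:
            with open(path, "rb") as fh:
                fh.read()
        except OSError:
            continue
    read_seconds = time.perf_counter() - start

    return len(paths), stat_seconds, read_seconds
```

Running it against a directory under /mnt/l10n_shared and against a local directory of similar size would show whether the stat column is the one that blows up.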

Left-over steps AFAICT right now:
- check IO perf of /mnt/l10n_shared
- clone all needed repos to /mnt/l10n_shared/repos

Once that's running, I'll also need us to set up the cron jobs from the old machine, but we shouldn't run those yet.

I also see that the VM doesn't have a lot of memory free, but while running my test jobs, it didn't start paging, so that seems all good.

I'd like to hold off on the cloning until we know if we can tune IO, as the clones take a very long time right now.
Flags: needinfo?(eziegenhorn)
The volume, export, permissions, IP are exactly the same for all users of /mnt/l10n_shared, so I'm at a loss for what the difference is at first glance.
In the dark, my suspicions are 'some software is different between versions' or 'local disk vs NFS'.  Could you demo the storage difference with a safe+repeatable command-line between the two boxes?
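
A sketch of what such a safe and repeatable demo could look like: run the same read-only command a few times from the same directory on each box and compare the spread. It assumes Python 3 on both machines; the helper names are hypothetical.

```python
import statistics
import subprocess
import time


def time_command(argv, cwd=None, runs=5):
    """Run argv `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; a non-zero exit status raises CalledProcessError.
        subprocess.run(argv, cwd=cwd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce a list of timings to min, median and max."""
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }
```

For example, `summarize(time_command(["hg", "ident", "-i"], cwd=repo))` with `repo` pointing at the same clone on the share from both boxes. The first run is usually the cold-cache one, which is why the min and the max are reported alongside the median.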
Doh, you're right, I forgot that the share is local on the old box. That explains a lot.

I did actually run `hg ident -i` on both shares on the production elmo VM (the one that serves l10n.m.o) so that both are remote, and there the new share is actually slower than the old: 30 seconds on the new, 4 seconds on the old.

I'll mull over workarounds, of which I can think of a few:
- run `hg ident -r .`, which helps to some extent.
- possibly use `hg share` to have the working dir be local. I'll test this and see how much disk space that takes.
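
The two workarounds above, spelled out as command lines. This is only a sketch: the helper names and any paths passed in are placeholders, and it assumes the bundled `share` extension is available and gets enabled per invocation via `--config`. As far as I understand Mercurial, `-r .` names the working directory's parent explicitly, so identify does not have to scan the working directory for local changes.

```python
def ident_working_dir(repo):
    """argv for plain `hg ident -i`, which also checks the working dir for changes."""
    return ["hg", "-R", repo, "identify", "-i"]


def ident_parent_only(repo):
    """argv for the first workaround: identify the parent revision only."""
    return ["hg", "-R", repo, "identify", "-i", "-r", "."]


def share_to_local(shared_repo, local_dir):
    """argv for the second workaround: a local working dir backed by the shared store."""
    return ["hg", "--config", "extensions.share=", "share", shared_repo, local_dir]
```

Each list can be handed to `subprocess.run` as is. With `hg share`, only the working directory and a small `.hg` live locally, while the history stays in the store on the share.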

Unblocked for now, though, thanks for the quick follow-up.
Flags: needinfo?(eziegenhorn)
Eric, can we create a user and group 'a10n:a10n' with uid:gid 1000:1000?

Benefits: 
We have a 1000:1000 user on a10n handling the interactions with the hg repo; using the same uid:gid on the buildbot box makes the .hg/hgrc files trusted on both ends. That's what got me thinking.

Moreover, we wouldn't have to run an old version of buildbot etc. as root :-)

Note: since this box and a10n are both in puppet, I figured it makes more sense to keep the user and group names consistent between them. We could also call it dashboard:dashboard, as it is on bm-l10n-dashboard01, but that seems the wrong direction to be compatible with.
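
To illustrate the trust argument: Mercurial only honours a repository's `.hg/hgrc` if the file belongs to a trusted user or group, and the user running hg counts as trusted. The check below is a simplified sketch of the ownership part only. The function name is hypothetical, and it ignores the `[trusted]` config section, which is the other way to get the same effect.

```python
import os


def hgrc_owned_by_current_user(repo_path):
    """Return True if repo_path/.hg/hgrc exists and is owned by the effective uid."""
    hgrc = os.path.join(repo_path, ".hg", "hgrc")
    try:
        return os.stat(hgrc).st_uid == os.geteuid()
    except FileNotFoundError:
        return False
```

With the same uid on a10n and on the buildbot box, this comes out true on both ends for files written by either of them over the share.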
Flags: needinfo?(eziegenhorn)
Is UID/GID 1000 within the defined range of system UIDs on both boxes?

:jabba, would UID/GID 1000 conflict with LDAP?
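
For the first question, a small sketch of how the local ranges could be read off each box, assuming both use the shadow-utils `/etc/login.defs` with its `UID_MIN`, `UID_MAX`, `SYS_UID_MIN` and `SYS_UID_MAX` keys. The function names are made up, and this says nothing about LDAP, which needs the separate check asked for above.

```python
KEYS = ("UID_MIN", "UID_MAX", "SYS_UID_MIN", "SYS_UID_MAX")


def parse_login_defs(text):
    """Extract the UID range settings from login.defs content."""
    values = {}
    for raw in text.splitlines():
        # Drop comments, then expect "KEY VALUE".
        fields = raw.split("#", 1)[0].split()
        if len(fields) == 2 and fields[0] in KEYS and fields[1].isdigit():
            values[fields[0]] = int(fields[1])
    return values


def classify_uid(uid, defs):
    """Return "system", "regular" or "unknown" for uid given parsed ranges."""
    if "SYS_UID_MIN" in defs and "SYS_UID_MAX" in defs:
        if defs["SYS_UID_MIN"] <= uid <= defs["SYS_UID_MAX"]:
            return "system"
    if "UID_MIN" in defs and "UID_MAX" in defs:
        if defs["UID_MIN"] <= uid <= defs["UID_MAX"]:
            return "regular"
    return "unknown"
```

Feeding it the contents of `/etc/login.defs` from each box and calling `classify_uid(1000, ...)` shows which local range, if any, 1000 falls into there.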
Flags: needinfo?(jdow)
There was a user with uidNumber=1000 in LDAP; however, it was a community member who hasn't been active since 2010, so I've renumbered that user in LDAP, which frees up 1000 for generic use outside of LDAP.
Flags: needinfo?(jdow)
Thanks!
a10n user and group created with uid and gid 1000.
Flags: needinfo?(eziegenhorn)
Similar to bug 1305973 comment 2, I need libffi-devel and libffi.

Eric, can you add those to the install?
Flags: needinfo?(eziegenhorn)
Added
Flags: needinfo?(eziegenhorn)
There's some follow-up configuration work like bug 1343898, but as the machine is now in production, I think this is good to mark FIXED.

Thanks for the help.
Blocks: 1343898
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED