High load on git1.dmz.scl3.mozilla.com

RESOLVED WONTFIX

Status

Developer Services
Git
4 years ago
2 years ago

People

(Reporter: w0ts0n, Assigned: hwine)

Attachments

(2 attachments)

(Reporter)

Description

4 years ago
Seeing lots of high load alerts on git this morning:
nagios-scl3	Wed 02:49:06 PST [5650] git1.dmz.scl3.mozilla.com:Load is CRITICAL: CRITICAL - load average: 132.56, 162.98, 177.21
(Assignee)

Comment 1

4 years ago
Created attachment 8552360 [details]
git1-load-201401210301PT.png

/me got paged by failures to push to git from the vcs-sync system as load spiked to 300

Looks to have started around 0930 UTC
(Assignee)

Comment 2

4 years ago
During load, seeing quite a few requests from an older osx git client via "tail -f access_log" -- looks like that just started "recently":

[root@git1.dmz.scl3 httpd]# egrep -c "\(Apple Git-33\)\"$" access_log*
access_log:1641
access_log-20141228:0
access_log-20150104:0
access_log-20150111:0
access_log-20150118:0
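
For a broader view than a single client string, a rough sketch along these lines ranks all user agents in a log. It assumes the Apache "combined" log format, where the User-Agent is the last double-quoted field; the function name top_agents is made up for illustration and is not something run in this bug.

```shell
# Rank User-Agent strings in an access log by request count.
# Assumption: Apache "combined" format, so the agent is the last quoted field.
# top_agents is a hypothetical helper name.
top_agents() {
    log="$1"
    limit="${2:-10}"
    # Split on double quotes; the second-to-last field is the User-Agent.
    awk -F'"' '{ print $(NF-1) }' "$log" | sort | uniq -c | sort -rn | head -n "$limit"
}
```

Something like "top_agents access_log 5" would then list the five busiest clients, which makes it easier to see whether one client accounts for the spike.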
(Assignee)

Comment 3

4 years ago
/me notes box is configured with only 2GB swap; may want to try increasing it for peak loads like this

also "khugepaged" makes an appearance in top -- issues with it reported on the web seem to match what we're seeing:
 https://bugzilla.redhat.com/show_bug.cgi?id=879801

trying https://bugzilla.redhat.com/show_bug.cgi?id=879801#c17
(Assignee)

Comment 4

4 years ago
Applied:

[root@git1.dmz.scl3 httpd]# cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
[always] madvise never
[root@git1.dmz.scl3 httpd]# echo never > !$
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
[root@git1.dmz.scl3 httpd]# !cat
cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
always madvise [never]
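
The kernel marks the active choice with square brackets, so the setting can also be checked from a script. The sketch below assumes the RHEL 6 path used above and also tries the upstream path /sys/kernel/mm/transparent_hugepage/defrag; thp_active is a made-up helper name. Note that a value echoed into sysfs does not survive a reboot, so keeping the change would mean reapplying it at boot, which is not shown here.

```shell
# Print the active value from a transparent hugepage sysfs file,
# e.g. "never" for a file containing "always madvise [never]".
# thp_active is a hypothetical helper name.
thp_active() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# Report the defrag setting from whichever path exists on this kernel:
# the RHEL 6 path seen in this bug first, then the upstream path.
for f in /sys/kernel/mm/redhat_transparent_hugepage/defrag \
         /sys/kernel/mm/transparent_hugepage/defrag; do
    if [ -r "$f" ]; then
        echo "$f: $(thp_active "$f")"
    fi
done
```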
(Assignee)

Comment 5

4 years ago
Created attachment 8552419 [details]
git1-load-20150121T0419PT.png

Okay, I'm happy with that result :)
(Assignee)

Comment 6

4 years ago
Hmm, less sure comment 4 has anything to do with it. The end of the last event had a similar drop off, and it doesn't appear we took any action. 

See bug 1087640 attachment 8510038 [details]

ni :bkero & :gps to render opinion on leaving change in comment 4 applied, which was based on https://bugzilla.redhat.com/show_bug.cgi?id=879801#c17
Assignee: nobody → hwine
Status: NEW → ASSIGNED
Flags: needinfo?(gps)
Flags: needinfo?(bkero)
OS: Mac OS X → All
Hardware: x86 → All
See Also: → bug 1087640

Comment 7

4 years ago
I don't have an opinion on the kernel change because I'm not familiar with the subject matter.

I reckon this is Git doing repacks somewhere.

Do we have a CRON job doing periodic repacks? This would help prevent random repacks on client-initiated server-side operations and would put us in more control of server behavior.
Flags: needinfo?(gps)
(Assignee)

Comment 8

4 years ago
(In reply to Gregory Szorc [:gps] from comment #7)
> I reckon this is Git doing repacks somewhere.
> 
> Do we have a CRON job doing periodic repacks? This would help prevent random
> repacks on client-initiated server-side operations and would put us in more
> control of server behavior.

No - opened bug 1124754 for this work
See Also: → bug 1124754
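
As a rough illustration of the scheduled repack asked about in comment 7 (not the actual work done in bug 1124754), a script along these lines could be run from cron. It assumes bare repositories in directories named *.git below a single root; the function name repack_all and the GIT override variable are made up for this sketch.

```shell
# Repack every bare repository (directory named *.git) below a root directory.
# Assumptions: the repository layout, the helper name repack_all, and the GIT
# variable, which only exists so the command can be swapped out for a dry run.
repack_all() {
    root="$1"
    find "$root" -type d -name '*.git' -prune -print | while IFS= read -r repo; do
        # -a -d: repack everything into one pack and delete redundant packs; -q: quiet.
        "${GIT:-git}" --git-dir="$repo" repack -a -d -q \
            || echo "repack failed: $repo" >&2
    done
}
```

Setting GIT=echo before calling repack_all prints the commands instead of running them. Scheduling it in a quiet window (say, a nightly crontab entry pointing at a wrapper script, with a path of your choosing) would keep the heavy packing work away from peak client traffic.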

Comment 9

3 years ago
I too don't know enough about the effects of hugepage defragging on system performance on loaded systems to advise on whether to keep it on. Likely, if it is still in this state now, it doesn't make much difference in performance.
Flags: needinfo?(bkero)

Comment 10

3 years ago
Socket timeout errors 

8:42 AM <@nagios-scl3> Tue 08:42:48 PDT [5194] git1.dmz.scl3.mozilla.com:http - gitweb Port 80 is CRITICAL: CRITICAL - Socket timeout after 60 seconds (http://m.mozilla.org/http+-+gitweb+Port+80)

& 

Host: git-zlb.vips.scl3.mozilla.com
Service: HTTP - Port 80
Service State: CRITICAL
(Assignee)

Comment 12

2 years ago
no longer meaningful in light of bug 1277297
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WONTFIX