Closed Bug 1124130 Opened 9 years ago Closed 8 years ago

High load on git1.dmz.scl3.mozilla.com

Component: Developer Services :: Git
Type: defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WONTFIX
Reporter: rwatson
Assignee: hwine
Attachments: 2 files

Seeing lots of high load alerts on git this morning:
nagios-scl3	Wed 02:49:06 PST [5650] git1.dmz.scl3.mozilla.com:Load is CRITICAL: CRITICAL - load average: 132.56, 162.98, 177.21
/me got paged by failures to push to git from the vcs-sync system as load spiked to 300

Looks to have started around 0930 UTC
During load, seeing quite a few requests from an older osx git client via "tail -f access_log" -- looks like that just started "recently":

[root@git1.dmz.scl3 httpd]# egrep -c "\(Apple Git-33\)\"$" access_log*
access_log:1641
access_log-20141228:0
access_log-20150104:0
access_log-20150111:0
access_log-20150118:0
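
To pin down when these requests began, the same match can be bucketed by hour. A minimal sketch, assuming the access log is in Apache combined format with the timestamp in the fourth whitespace-separated field (e.g. "[21/Jan/2015:09:31:00"); agent_hits_per_hour is a hypothetical helper name, not something installed on the box:

```shell
# agent_hits_per_hour LOGFILE PATTERN
# Print "DD/Mon/YYYY:HH count" for log lines matching the extended regex.
# Assumes Apache combined log format (timestamp in field 4).
agent_hits_per_hour() {
    grep -E -- "$2" "$1" |
        awk '{
            # $4 looks like "[21/Jan/2015:09:31:00"; keep date plus hour
            count[substr($4, 2, 14)]++
        }
        END { for (h in count) print h, count[h] }' |
        sort
}
```

Called as agent_hits_per_hour access_log '\(Apple Git-33\)"$', it reuses the pattern from the egrep above. Hours are in whatever time zone the log is written in, and the plain sort is only chronological within a single day.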
/me notes box is configured with only 2GB swap, may want to try increasing it for peak loads like this

also "khugepaged" makes an appearance in top -- issues with it reported on the web seem to match what we're seeing:
https://bugzilla.redhat.com/show_bug.cgi?id=879801

trying https://bugzilla.redhat.com/show_bug.cgi?id=879801#c17
Applied:

[root@git1.dmz.scl3 httpd]# cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
[always] madvise never
[root@git1.dmz.scl3 httpd]# echo never > !$
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
[root@git1.dmz.scl3 httpd]# !cat
cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
always madvise [never]
Okay, I'm happy with that result :)
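
For re-checking the setting later, a small sketch that reads back the active choice (the bracketed word) from a selector file like the one above. thp_dir and thp_active are hypothetical helper names; the two directories tried are the RHEL 6 path seen on this box and the upstream kernel path, which is an assumption about where else this might run:

```shell
# thp_active FILE
# Print the bracketed (active) choice from a sysfs selector file whose
# content looks like "always madvise [never]".
thp_active() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# thp_dir
# Print the transparent hugepage sysfs directory, trying the RHEL 6
# name first and the upstream name second. Returns 1 if neither exists.
thp_dir() {
    for d in /sys/kernel/mm/redhat_transparent_hugepage \
             /sys/kernel/mm/transparent_hugepage; do
        if [ -d "$d" ]; then
            printf '%s\n' "$d"
            return 0
        fi
    done
    return 1
}
```

With these, thp_active "$(thp_dir)/defrag" should print "never" after the change. A value written to sysfs this way does not survive a reboot, so it would need to be reapplied at boot if we decide to keep it.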
Hmm, less sure comment 4 has anything to do with it. The end of the last event had a similar drop-off, and it doesn't appear we took any action then.

See bug 1087640 attachment 8510038 [details]

ni :bkero & :gps to render opinion on leaving change in comment 4 applied, which was based on https://bugzilla.redhat.com/show_bug.cgi?id=879801#c17
Assignee: nobody → hwine
Status: NEW → ASSIGNED
Flags: needinfo?(gps)
Flags: needinfo?(bkero)
OS: Mac OS X → All
Hardware: x86 → All
See Also: → 1087640
I don't have an opinion on the kernel change because I'm not familiar with the subject matter.

I reckon this is Git doing repacks somewhere.

Do we have a CRON job doing periodic repacks? This would help prevent random repacks on client-initiated server-side operations and would put us in more control of server behavior.
Flags: needinfo?(gps)
(In reply to Gregory Szorc [:gps] from comment #7)
> I reckon this is Git doing repacks somewhere.
> 
> Do we have a CRON job doing periodic repacks? This would help prevent random
> repacks on client-initiated server-side operations and would put us in more
> control of server behavior.

No - opened bug 1124754 for this work
See Also: → 1124754
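
A rough sketch of what the scheduled repack suggested in comment 7 might look like, pending the real work in bug 1124754. It assumes the repositories are bare directories named *.git under a single root; list_repos, repack_all, the script path and the repository root in the crontab comment are all hypothetical:

```shell
# Hypothetical crontab entry (Sundays at 03:30), path and root are made up:
#   30 3 * * 0 /usr/local/bin/repack-all.sh /path/to/repos

# list_repos ROOT
# Print every directory named *.git under ROOT without descending into it.
list_repos() {
    find "$1" -type d -name '*.git' -prune -print
}

# repack_all ROOT
# Repack each repository found by list_repos.
repack_all() {
    list_repos "$1" | while IFS= read -r repo; do
        # -a -d: pack everything into a single pack, drop redundant packs
        git --git-dir="$repo" repack -a -d -q </dev/null ||
            echo "repack failed: $repo" >&2
    done
}
```

Running the repacks one at a time keeps the extra load bounded; whether a full repack -a -d or a lighter git gc is the better fit would depend on repository size.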
I too don't know enough about the effects of hugepage defragging on loaded systems to advise on whether to keep it on. If it is still in this state now, it likely doesn't make much difference to performance.
Flags: needinfo?(bkero)
Socket timeout errors 

8:42 AM <@nagios-scl3> Tue 08:42:48 PDT [5194] git1.dmz.scl3.mozilla.com:http - gitweb Port 80 is CRITICAL: CRITICAL - Socket timeout after 60 seconds (http://m.mozilla.org/http+-+gitweb+Port+80)

& 

Host: git-zlb.vips.scl3.mozilla.com
Service: HTTP - Port 80
Service State: CRITICAL
no longer meaningful in light of bug 1277297
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX