Closed Bug 1029206 Opened 10 years ago Closed 10 years ago

vertica4.metrics.scl3 is reporting as Down

Categories

(Data & BI Services Team :: DB: MySQL, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mpressman, Assigned: mpressman)

Details

[dbadmin@vertica4.metrics.scl3 ~]$ /opt/vertica/bin/admintools -t view_cluster
 DB      | Host          | State
---------+---------------+-------
 metrics | 192.168.100.7 | DOWN
 metrics | 192.168.100.8 | UP
 metrics | 192.168.100.9 | UP
After restarting the node on 192.168.100.7 (vertica4.metrics.scl3) the host now reports as up:
[dbadmin@vertica4.metrics.scl3 ~]$ /opt/vertica/bin/admintools -t view_cluster
 DB      | Host | State
---------+------+-------
 metrics | ALL  | UP
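For anyone wanting to check this from a script rather than by eye, here is a minimal sketch that picks the non-UP hosts out of `view_cluster` text. It assumes the three-column, pipe-separated layout pasted above; the function name `down_hosts` and the `SAMPLE` string are only illustrative and are not part of admintools.

```python
# Sketch only: parses text in the layout shown in this bug. The output
# format of other Vertica versions is an assumption and may differ.
SAMPLE = """\
 DB      | Host          | State
---------+---------------+-------
 metrics | 192.168.100.7 | DOWN
 metrics | 192.168.100.8 | UP
 metrics | 192.168.100.9 | UP
"""


def down_hosts(view_cluster_output):
    """Return (db, host) pairs whose State column is anything other than UP."""
    down = []
    for line in view_cluster_output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3:
            continue  # separator row or blank line
        db, host, state = parts
        if db == "DB":
            continue  # header row
        if state != "UP":
            down.append((db, host))
    return down


print(down_hosts(SAMPLE))
```

On the second paste (`metrics | ALL | UP`) the same function returns an empty list.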
Assignee: server-ops-database → mpressman
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
A couple of points from the log. Shutdown occurred at:
2014-06-23 22:25:09.456 unknown:0x7f71b7a60700 [Init] <INFO> Shutdown complete. Exiting.

Startup occurred at 2014-06-23 22:40:
2014-06-23 22:40:52.823 INFO New log
2014-06-23 22:40:52.823 unknown:0x7fa4240dd700 [Init] <INFO> Log /data/metrics/v_metrics_node0001_catalog/vertica.log opened; #1
2014-06-23 22:40:52.823 unknown:0x7fa4240dd700 [Init] <INFO> Processing command line: /opt/vertica/bin/vertica -C metrics -D /data/metrics/v_metrics_node0001_catalog -h 192.168.100.7 -p 5433
2014-06-23 22:40:52.823 unknown:0x7fa4240dd700 [Init] <INFO> Starting up Vertica Analytic Database v6.1.1-0
2014-06-23 22:40:52.823 unknown:0x7fa4240dd700 [Init] <INFO> Project Codename: Bulldozer
2014-06-23 22:40:52.823 unknown:0x7fa4240dd700 [Init] <INFO> vertica(v6.1.1-0) built by release@build2.verticacorp.com from releases/VER_6_1_RELEASE_BUILD_1_0_20130214@109264 on 'Thu Feb 14 14:43:35 2013' $BuildId$
2014-06-23 22:40:52.823 unknown:0x7fa4240dd700 [Init] <INFO> 64-bit Optimized Build
2014-06-23 22:40:52.823 unknown:0x7fa4240dd700 [Init] <INFO> Compiler Version: 4.1.2 20080704 (Red Hat 4.1.2-52)
2014-06-23 22:40:52.825 unknown:0x7fa4240dd700 <LOG> @[initializing]: 00000/5081: Total swap memory used: 0
2014-06-23 22:40:52.825 unknown:0x7fa4240dd700 <LOG> @[initializing]: 00000/4435: Process size resident set: 24100864
2014-06-23 22:40:52.825 unknown:0x7fa4240dd700 <LOG> @[initializing]: 00000/5075: Total Memory free + cache: 59904757760
2014-06-23 22:40:52.836 unknown:0x7fa4240dd700 [Txn] <INFO> Looking for catalog at: /data/metrics/v_metrics_node0001_catalog/Catalog
2014-06-23 22:40:52.837 unknown:0x7fa4240dd700 [Catalog] <INFO> Loading Checkpoint
2014-06-23 22:40:55.899 unknown:0x7fa4240dd700 [Catalog] <INFO> Replaying 1 Txnlogs
2014-06-23 22:40:56.879 unknown:0x7fa4240dd700 [Txn] <INFO> Installing objects...
2014-06-23 22:40:57.708 unknown:0x7fa4240dd700 [Txn] <INFO> Catalog loaded from path: /data/metrics/v_metrics_node0001_catalog/Catalog [379726 objects, GLOBAL version 492036, LOCAL version 388536] (no checkpoint needed)
2014-06-23 22:40:57.711 unknown:0x7fa4240dd700 [Comms] <INFO> Changing my node name to: v_metrics_node0001
2014-06-23 22:40:57.711 unknown:0x7fa4240dd700 [Txn] <INFO> switchToLocalNode: v_metrics_node0001 with path /data/metrics/v_metrics_node0001_catalog/Catalog
2014-06-23 22:40:57.712 unknown:0x7fa4240dd700 [Txn] <INFO> Transaction sequence set, seq num=d6e633, nodeID=a
2014-06-23 22:40:57.712 unknown:0x7fa4240dd700 [Txn] <INFO> Catalog sequence set, seq num=315632f, nodeID=a
2014-06-23 22:40:57.712 unknown:0x7fa4240dd700 [Txn] <INFO> Found my node (v_metrics_node0001) in the catalog
2014-06-23 22:40:57.712 unknown:0x7fa4240dd700 [Txn] <INFO> Catalog info: version=0x78204, number of nodes=3, permanent #=3, K=1
2014-06-23 22:40:57.712 unknown:0x7fa4240dd700 [Txn] <INFO> Catalog info: current epoch=0x41532
2014-06-23 22:40:57.797 unknown:0x7fa4240dd700 [Init] <INFO> Catalog loaded
2014-06-23 22:40:57.797 unknown:0x7fa4240dd700 [Init] <INFO> Listening on port: 5433
2014-06-23 22:40:57.797 unknown:0x7fa4240dd700 [Init] <INFO> About to fork
2014-06-23 22:40:57.799 unknown:0x7fa4240dd700 [Init] <INFO> About to fork again
2014-06-23 22:40:57.802 unknown:0x7fa4240dd700 [Init] <INFO> Completed forking
2014-06-23 22:40:57.802 unknown:0x7fa4240dd700 [Init] <INFO> PID=1469
2014-06-23 22:40:57.802 unknown:0x7fa4240dd700 [Init] <INFO> Start reading DataCollector information
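For reference, the gap between the "Shutdown complete" line and the "New log" line can be worked out with a short script. This is a sketch that assumes every vertica.log line starts with a 23-character `YYYY-MM-DD HH:MM:SS.mmm` timestamp, as in the excerpts above; `log_timestamp` is a hypothetical helper name.

```python
from datetime import datetime

# Assumption: the timestamp occupies the first 23 characters of each line.
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def log_timestamp(line):
    """Parse the leading timestamp of a vertica.log line."""
    return datetime.strptime(line[:23], TS_FORMAT)


# Both lines are copied from the log excerpts in this bug.
shutdown_line = ("2014-06-23 22:25:09.456 unknown:0x7f71b7a60700 "
                 "[Init] <INFO> Shutdown complete. Exiting.")
startup_line = "2014-06-23 22:40:52.823 INFO New log"

gap = log_timestamp(startup_line) - log_timestamp(shutdown_line)
print(gap)  # 0:15:43.367000
```

So the node was out of the cluster for a little under 16 minutes between the logged shutdown and the restart.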
Adding a couple of items for posterity, since we've encountered some unexplained downtime on hosts in the Vertica cluster. Although some of those cases seem to be due to bad memory, there have been other transient crashes whose causes have not been fully identified. In dmesg I noticed a kernel message that has been reported in connection with instability issues:
do_IRQ: 2.162 No irq handler for vector (irq -1)
do_IRQ: 3.211 No irq handler for vector (irq -1)
do_IRQ: 22.86 No irq handler for vector (irq -1)

/var/log/messages shows:
Jun 23 19:19:48 vertica4.metrics.scl3.mozilla.com kernel: do_IRQ: 3.211 No irq handler for vector (irq -1)
This occurred roughly three hours before the Vertica "crash" (the shutdown was logged at 22:25:09).

Another entry shows the same message about an hour after the restart:
Jun 23 23:45:48 vertica4.metrics.scl3.mozilla.com kernel: do_IRQ: 22.86 No irq handler for vector (irq -1)

If we do see another "crash", this may be worth investigating further. https://bugzilla.redhat.com/show_bug.cgi?id=225399 provides some workarounds and patches to alleviate the issue.
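If this does come up again, a quick way to tally these messages is a small script like the one below. It is a sketch, not a vetted tool: the regex is written against the lines pasted in this bug, and reading the two numbers as CPU and vector is my understanding of the kernel message rather than something confirmed here. `irq_events` is a hypothetical helper name.

```python
import re

# Matches e.g. "do_IRQ: 3.211 No irq handler for vector (irq -1)".
# Assumption: the two numbers are CPU and interrupt vector.
IRQ_RE = re.compile(r"do_IRQ: (\d+)\.(\d+) No irq handler for vector")


def irq_events(lines):
    """Return (cpu, vector) tuples for each matching syslog line."""
    events = []
    for line in lines:
        match = IRQ_RE.search(line)
        if match:
            events.append((int(match.group(1)), int(match.group(2))))
    return events


# Sample lines copied from /var/log/messages as quoted in this bug.
sample = [
    "Jun 23 19:19:48 vertica4.metrics.scl3.mozilla.com kernel: "
    "do_IRQ: 3.211 No irq handler for vector (irq -1)",
    "Jun 23 23:45:48 vertica4.metrics.scl3.mozilla.com kernel: "
    "do_IRQ: 22.86 No irq handler for vector (irq -1)",
]

print(irq_events(sample))  # [(3, 211), (22, 86)]
```

In practice the list would be fed from the open log file instead of `sample`, and comparing the entry timestamps against the Vertica shutdown time would show whether the two really correlate.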
Product: mozilla.org → Data & BI Services Team