Nagios Check NRPE : Check Mining RIG Hash

Prerequisites: a mining rig running ethOS, a Nagios server already set up, and nagios-nrpe-server installed on your ethOS mining rig (for help with that step, see: Installing Nagios-Nrpe-Server on ethOS).

 

This procedure fetches the hash rate data from your ethOS rig and alerts you if you are losing cards in your mining rig or if your cards are mining below your defined threshold.

The script is installed on your mining rig and is then called remotely by your Nagios server through NRPE.
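As a rough illustration of what the check decides, the sketch below classifies one line of status text against a threshold. The function name classify_status and the sample strings are assumptions made for this example, and the handling of non-numeric messages is simplified compared with the full script further down.

```python
# Nagios plugin states used by the check
STATE_OK, STATE_WARNING, STATE_UNKNOWN = 0, 1, 3


def classify_status(status_text, min_hash_rate):
    """Return a Nagios state for one line of status text (illustrative only)."""
    # The first word is either the rig hash rate or the start of a message.
    first = status_text.split(" ", 1)[0].strip()
    try:
        hash_rate = float(first)
    except ValueError:
        # Not a number: treat any message as a warning, empty text as unknown.
        return STATE_WARNING if first else STATE_UNKNOWN
    if hash_rate != hash_rate:  # NaN is never equal to itself
        return STATE_UNKNOWN
    return STATE_WARNING if hash_rate <= min_hash_rate else STATE_OK


print(classify_status("244.9", 230.0))
print(classify_status("gpu clock problem", 230.0))
```

Keeping the decision in a small function like this makes it easy to try out thresholds without touching the rig.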

 

REQUIREMENTS:

First, declare the command in your nrpe.cfg file (usually /etc/nagios/nrpe.cfg):

# RIG
command[check_rig-hash]=python /usr/lib/nagios/plugins/check_rig-hash
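NRPE runs the command above and hands its exit code and first line of output back to the Nagios server. Before wiring it into Nagios, it can help to run a plugin by hand and see how its exit code maps to a state. The following is a minimal sketch; run_plugin is a hypothetical helper, and the demo command is a stand-in rather than the real check.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes
STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_plugin(argv):
    """Run a plugin command and return (state_name, first_output_line)."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    lines = out.decode("utf-8", "replace").splitlines()
    first_line = lines[0] if lines else ""
    # Any exit code outside 0-3 is reported as UNKNOWN.
    return STATE_NAMES.get(proc.returncode, "UNKNOWN"), first_line


# Stand-in "plugin" that prints a message and exits with WARNING (1).
demo = [sys.executable, "-c", "import sys; print('WARNING: demo'); sys.exit(1)"]
print(run_plugin(demo))
```

On the rig itself, the argument list would be the same command as in nrpe.cfg, for example ["python", "/usr/lib/nagios/plugins/check_rig-hash"].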

As described in the requirements section of the script, if you want the check to be able to restart your rig, you will need to:

  • Create a gpu_crashReboot.log file containing the value 0 and give it the right ownership:

(as ethos user:)

echo 0 | sudo tee /usr/lib/nagios/plugins/gpu_crashReboot.log
sudo chown nagios:nagios /usr/lib/nagios/plugins/gpu_crashReboot.log
  • Edit Sudoers file using :

(as ethos user:)

sudo visudo
  • Add the following at the end:
# Nagios NRPE/Check_rig-hash
nagios ALL=(ALL) NOPASSWD: ALL
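Once these steps are done, it can be worth confirming that the counter file is actually usable by the user running the check. The sketch below is one way to do that; counter_file_ok is a hypothetical helper, and it assumes you run it as the nagios user.

```python
import os


def counter_file_ok(path):
    """Return True if `path` is readable, writable, and holds an integer."""
    # os.access checks the permissions of the user running this script.
    if not os.access(path, os.R_OK | os.W_OK):
        return False
    with open(path, "r") as handle:
        content = handle.read().strip()
    try:
        int(content)
    except ValueError:
        return False
    return True
```

For example, counter_file_ok("/usr/lib/nagios/plugins/gpu_crashReboot.log") should return True after the setup above.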

Do not forget to adjust the rig's global minimum hash rate [rigMinHashRate] in the script to your needs (in MH/s, written without units):

  rigMinHashRate = XX.X

 

The check itself has to be placed in /usr/lib/nagios/plugins/check_rig-hash:

#!/usr/bin/python
#*****************************************************************
# Author: David Bayle
# Contact: contact@davidbayle.com
# This Python script checks your ethOS mining rig hash rate for Nagios monitoring
#
# Usage: If the script detects a hashrate lower than your threshold, or crashed GPU(s),
# it starts counting up to 5 (5 x 1 min by default) and then forces a reboot.
#
#
# Settings:
# Do not forget to adjust your global minimum hashrate value by rig [rigMinHashRate] to your needs :
# # rigMinHashRate = 110.0
#
# Requirements:
#
# R1) Create a gpu_crashReboot.log file with 0 as value inside and set it with the right permissions :
# (as ethos user:)
# # echo 0 | sudo tee /usr/lib/nagios/plugins/gpu_crashReboot.log
# # sudo chown nagios:nagios /usr/lib/nagios/plugins/gpu_crashReboot.log
#
# R2) Edit the sudoers file using :
# (as ethos user:)
# # sudo -s
# # visudo
#
# and add the following at the end:
# # Nagios NRPE/Check_rig-hash
# nagios ALL=(ALL) NOPASSWD: ALL
#
# Once done, save and quit.
# Script will now be allowed to restart your rig if something goes wrong as explained in Usage section.
# Enjoy ;)
#
#

import os
import sys

STATE_OK = 0
STATE_WARNING = 1
STATE_CRITICAL = 2
STATE_UNKNOWN = 3

RETURN_STATE = STATE_OK

rigMinHashRate = 110.0
rigRebootLogFile = "/usr/lib/nagios/plugins/gpu_crashReboot.log"
rigStatusLogFile = "/var/run/ethos/status.file"
rigStatsLogFile = "/var/run/ethos/stats.file"


# ================================   functions  =============================
def PrintOutput(dumpStr):
  print(dumpStr)


def return_state(state):
  global RETURN_STATE
  RETURN_STATE = state


def ReadStatusFile():
  try:
    # read rig hash rate from ethos status file
    with open(rigStatusLogFile, "r") as pStatusLogFile:
      return pStatusLogFile.read()
  except Exception:
    print("File read error in - " + rigStatusLogFile)
    return None


def ReadRigName():
  try:
    # read rig name from ethos stats file
    with open(rigStatsLogFile, "r") as pStatsLogFile:
      for line in pStatsLogFile:
        if "rack_loc" in line:
          # keep the text after the first colon, without the newline
          returnedString = line.split(":", 1)
          return returnedString[1].replace("\n", "")
  except Exception:
    print("File read error in - " + rigStatsLogFile)
  return None


def WriteRebootCount(count):
  try:
    # write the reboot counter to a file
    with open(rigRebootLogFile, "w") as pLogFile:
      pLogFile.write("%i" % int(count))
  except Exception:
    print("File write error in - " + rigRebootLogFile)


def ReadRebootCount():
  try:
    # read the reboot counter from a file
    with open(rigRebootLogFile, "r") as pLogFile:
      return pLogFile.read().strip()
  except Exception:
    print("File read error in - " + rigRebootLogFile)
    return None


def IsNumber(n):
    is_number = True
    try:
        num = float(n)
        # check for "nan" floats (NaN is the only value not equal to itself)
        is_number = num == num   # same as `not math.isnan(num)`
    except (TypeError, ValueError):
        is_number = False
    return is_number


#================================ RUN ======================================== #

try:
    # read the status file (hash rate or status message)
    rigStats = ReadStatusFile()
    # read the rig name from the stats file
    rigName = ReadRigName()
    # initialise the reboot counter
    gpuRebootCount = int(ReadRebootCount())
except Exception:
    PrintOutput("UNKNOWN - Invalid Status File / Reboot counter read")
    return_state(STATE_UNKNOWN)
    # nothing else can be checked without these values
    sys.exit(STATE_UNKNOWN)

# extract data
try:
    hashRateData = rigStats.split(" ", 1)
    hashRate = hashRateData[0]
except Exception:
    PrintOutput("UNKNOWN - Invalid RIG Hashrate reading")
    return_state(STATE_UNKNOWN)
    sys.exit(STATE_UNKNOWN)


if (IsNumber(hashRateData[0])):

  if (float(hashRate) <= float(rigMinHashRate)):
    PrintOutput("WARNING: RIG HASHRATE LOWER THAN THRESHOLD (" + str(rigMinHashRate) + ") : " + hashRate + " (MH/s)")
    gpuRebootCount = gpuRebootCount + 1
    WriteRebootCount(int(gpuRebootCount))
    return_state(STATE_WARNING)
    if (gpuRebootCount >= 5):
      PrintOutput("CRITICAL: REBOOTING: MINER CRASHED FOR 5 MINs : RIG HASHRATE LOWER THAN THRESHOLD (" + str(rigMinHashRate) + ") : " + hashRate + " (MH/s)")
      WriteRebootCount(0)
      os.system("/opt/ethos/bin/r")
      return_state(STATE_CRITICAL)
      sys.exit(STATE_CRITICAL)

  else:
    WriteRebootCount(0)
    PrintOutput("OK - [" + str(rigName) + "] Global Rig hashrate : " + hashRate + " (MH/s) [Threshold: " +str(rigMinHashRate) + "]")
    return_state(STATE_OK)
    sys.exit(STATE_OK)

else:
  # text following the first word of the status line (empty if there is none)
  statusMsg = hashRateData[1].strip() if len(hashRateData) > 1 else ""

  if (hashRateData[0].strip() == "gpu" and "clock problem" in statusMsg):
    PrintOutput("WARNING: A GPU CRASHED !!!")
    gpuRebootCount = gpuRebootCount + 1
    WriteRebootCount(int(gpuRebootCount))
    return_state(STATE_WARNING)
    if (gpuRebootCount >= 5):
      PrintOutput("CRITICAL: REBOOTING: MINER GPU CRASHED FOR 5 MINs !!!")
      WriteRebootCount(0)
      os.system("/opt/ethos/bin/r")
      return_state(STATE_CRITICAL)
      sys.exit(STATE_CRITICAL)

  if (hashRateData[0].strip() == "possible" and "miner stall" in statusMsg):
    PrintOutput("WARNING: POSSIBLE MINER CRASH !!!")
    gpuRebootCount = gpuRebootCount + 1
    WriteRebootCount(int(gpuRebootCount))
    return_state(STATE_WARNING)
    if (gpuRebootCount >= 5):
      PrintOutput("CRITICAL: REBOOTING: MINER GPU CRASHED FOR 5 MINs !!!")
      WriteRebootCount(0)
      os.system("/opt/ethos/bin/r")
      return_state(STATE_CRITICAL)
      sys.exit(STATE_CRITICAL)

  if (hashRateData[0].strip() == "miner" and "started" in statusMsg):
    PrintOutput("WARNING: Miner Starting")
    gpuRebootCount = gpuRebootCount + 1
    WriteRebootCount(int(gpuRebootCount))
    return_state(STATE_WARNING)
    if (gpuRebootCount >= 5):
      PrintOutput("CRITICAL: REBOOTING: MINER DIDN'T START IN 5 MINs !!!")
      WriteRebootCount(0)
      os.system("/opt/ethos/bin/r")
      return_state(STATE_CRITICAL)
      sys.exit(STATE_CRITICAL)

# exit with the recorded state so that Nagios also sees WARNING results
sys.exit(RETURN_STATE)

 

Running the check produces, for example:

 

 # sudo python /usr/lib/nagios/plugins/check_rig-hash
OK - [RIGten1] Global Rig hashrate : 244.9 (MH/s) [Threshold: 230.0]
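The counting behaviour described in the script header (five consecutive failed checks, then a reboot) can be sketched as a small pure function. The names next_count and REBOOT_AFTER are illustrative and do not appear in the script; the sketch only mirrors the counter handling, not the reboot itself.

```python
REBOOT_AFTER = 5  # consecutive failed checks before a reboot (script default)


def next_count(current_count, check_failed):
    """Return (new_count, should_reboot) after one run of the check."""
    if not check_failed:
        # A healthy run resets the counter.
        return 0, False
    new_count = current_count + 1
    if new_count >= REBOOT_AFTER:
        # The counter is reset before the rig is rebooted.
        return 0, True
    return new_count, False


count = 0
for failed in [True, True, False, True]:
    count, reboot = next_count(count, failed)
    print(count, reboot)
```

With a check interval of one minute, this means the rig has to stay below the threshold for about five minutes in a row before it is restarted.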

Side note: this check no longer uses the ethOS API and panel. It reads the local stats files instead, in order to avoid being blocked by their stats site/API.
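To give an idea of what reading the local stats looks like, here is a small sketch that extracts the rig name from stats text, following the same rack_loc lookup as the script. The function name parse_rack_loc and the sample content are assumptions for illustration; the real stats file contains more fields than shown here.

```python
def parse_rack_loc(stats_text):
    """Return the value of the first `rack_loc` line, or None if absent."""
    for line in stats_text.splitlines():
        if "rack_loc" in line and ":" in line:
            # Keep everything after the first colon.
            return line.split(":", 1)[1].strip()
    return None


# Made-up sample content, only the rack_loc line matters here.
sample = "some_other_field:12345\nrack_loc:RIGten1\n"
print(parse_rack_loc(sample))
```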
 

Tested against ethOS 1.2.3 to 1.3.0

https://github.com/davidbayle/ethos_GPUmonitoring-Nrpe