BenV's notes

Check_MK diskstats on Xen virtual hosts

on Jun. 14, 2012, under Software

Hej look! A new WordPress release…. 3.4…. and it automagically updates, nice going guys 🙂
What’s new? A bunch of stuff I don’t care about, a few more rounded corners … meh.
And apparently they’re green. Oh well. I can be green too, see? 😉

Back to stuff I do care about: Check MK released a new major version a few days ago, and it’s now on version 1.2.0p1.
Among the new stuff are some shiny interface updates (you know, rounded corners and the like), a ton of fixes, new agents/checks (postgresql is among them), a Logwatch Pattern Analyzer and tons more.

The local Nagios machine at work sits behind a flaky ADSL line, so it often reports remote services as flapping whenever the ADSL modem fails. I decided to move those checks to a new Nagios machine outside the office. After creating a new Xen domU with the latest slackware64 and installing my usual tools, I happily followed my old guide to get the latest Nagios/Check_MK up and running.

Of course instead of Nagios 3.2.2 we are now running version 3.4.1, but their plugins are still stuck on 1.4.15. RRDTool got improved as well (even though it still needs some kicking when compiling on slackware64): they’re up to version 1.4.7, with no real changes except for some fixed segfaults and the like.
PNP4Nagios also didn’t sit still as they jumped from version 0.6.6 to a whopping 0.6.17, but those are almost all bugfixes as well with the exception of a new jquery version (which only introduces new bugs :-p).

Anyhow, after installing and configuring all those things, most of it worked instantly. This time I decided to put some stuff in /opt instead of /usr, because these programs are really bad at listening to where to put their garbage: if they want to make a mess, they can do so in their own prefixes under /opt/nagios and /opt/check_mk. However, the “Disk IO Summary” in Check MK broke on half my hosts.
Why half? I figured it had something to do with half of them being old Debian/Pokemon OS hosts, but no.

Check MK complained:

UNKNOWN - invalid output from agent, invalid check parameters or error in implementation of check diskstat. Please set debug_log to a filename in main.mk for enabling exception logging.

The log contained something like this:

Invalid output from plugin or error in check:
Check_MK Version: 1.2.0p1
Date: 2012-14-06 10:17:03
Host: pokemon01.os.org
Service: Disk IO SUMMARY
Check type: diskstat
Item: 'SUMMARY'
Parameters: {}
Traceback (most recent call last):
File "/opt/check_mk/var/precompiled/pokemon01.os.org", line 675, in do_all_checks_on_host
File "/opt/check_mk/var/precompiled/pokemon01.os.org", line 2025, in check_diskstat
File "/opt/check_mk/var/precompiled/pokemon01.os.org", line 1943, in check_diskstat_generic
File "/opt/check_mk/var/precompiled/pokemon01.os.org", line 1900, in check_diskstat_line
ValueError: too many values to unpack

Agent info: [['1339662006']]

Hmmm…. checking the agent source code reveals the problem:

echo '<<<diskstat>>>'
date +%s
egrep ' (x?[shv]d[a-z]*|cciss/c[0-9]+d[0-9]+|dm-[0-9]+|VxVM.*) ' < /proc/diskstats
if which dmsetup >/dev/null ; then
    echo '[dmsetup_info]'
    dmsetup info -c --noheadings --separator ' ' -o name,devno,vg_name,lv_name
fi
if [ -d /dev/vx/dsk ] ; then
    echo '[vx_dsk]'
    stat -c "%t %T %n" /dev/vx/dsk/*/*
fi

Basically it checks /proc/diskstats for a set of known disk names, including Xen virtual disks such as ‘xvda’. However, it’s pretty common to hand out single partitions to a Xen domU instead of complete disks, for example /dev/xvda1 to boot from. Since that is a partition and not a disk, the egrep skips it. The check then sees no disks at all, only the output of the date command, which causes the error above. (This error is fixed in Git, but then you still have no diskstats.)

I’ve read some chatter on the mailing list along the lines of “Why do you want it?” and “We don’t want to use stats on separate partitions”, so I won’t bother with them.

My fix
Add a second egrep to check_mk_agent: search for the diskstat part and paste the following line right after the existing egrep.

egrep ' (xvd[a-z]*[0-9]+) ' < /proc/diskstats
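
To see why the stock pattern skips partitions, here is a small Python sketch that runs both patterns against a few lines in /proc/diskstats layout. The sample lines and their counters are made up for illustration, and Python's re module stands in for egrep here (the two behave the same for these patterns, as far as I can tell).

```python
import re

# Pattern used by the stock agent, copied from its egrep line.
STOCK = re.compile(r" (x?[shv]d[a-z]*|cciss/c[0-9]+d[0-9]+|dm-[0-9]+|VxVM.*) ")
# Extra pattern from the fix: Xen virtual disk partitions such as xvda1.
XEN_PART = re.compile(r" (xvd[a-z]*[0-9]+) ")

# Invented sample lines: major, minor, device name, then counters.
SAMPLE = [
    "   8       0 sda 4520 120 98214 3100 2210 870 51230 9000 0 4100 12100",
    " 202       0 xvda 1200 10 30120 800 640 90 11200 2100 0 1500 2900",
    " 202       1 xvda1 1100 8 28000 700 600 80 10000 2000 0 1400 2700",
    " 202       2 xvda2 90 2 2000 90 40 10 1200 100 0 100 190",
]


def device_names(pattern, lines):
    """Return the device name of every line the pattern selects."""
    names = []
    for line in lines:
        match = pattern.search(line)
        if match:
            names.append(match.group(1))
    return names


stock_hits = device_names(STOCK, SAMPLE)
xen_hits = device_names(XEN_PART, SAMPLE)
# A domU that was only handed partitions has no whole-disk lines at all.
domu_only = device_names(STOCK, SAMPLE[2:])

print("stock agent selects:", stock_hits)
print("extra egrep selects:", xen_hits)
print("stock agent on a partition-only domU:", domu_only)
```

On a partition-only domU the stock pattern selects nothing, which fits the agent info above containing just the timestamp.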

Let's check out the Postgres plugin!
Simply copy plugins/mk_postgres to the host that's running PostgreSQL and put it in the agent's plugin dir. To find out where that is, run: check_mk_agent | grep -i pluginsdir. After that, run the check_mk inventory (-I) and reload Nagios (-O):

nagios@nagios:~$ check_mk -I
postgres_sessions 1 new checks
postgres_stat_database 5 new checks
postgres_stat_database.size 5 new checks
nagios@nagios:~$ check_mk -O
root@nagios:/usr/src/nagios/check_mk-1.2.0p1# check_mk -O
Generating Nagios configuration...OK
Validating Nagios configuration...OK
Precompiling host checks...OK
Reloading Nagios...OK

And voila! After a few seconds we see a bunch of errors 😉
The postgresql sessions and stats seem to work, but the size checks fail.
The debug output looks like this:

Invalid output from plugin or error in check:
Check_MK Version: 1.2.0p1
Date: 2012-14-06 11:18:13
Host: postgresql.somewhere
Service: PostgreSQL DB postgres Size
Check type: postgres_stat_database.size
Item: 'postgres'
Parameters: {}
Traceback (most recent call last):
File "/opt/check_mk/var/precompiled/postgresql.somewhere", line 675, in do_all_checks_on_host
File "/opt/check_mk/var/precompiled/postgresql.somewhere", line 1392, in check_postgres_stat_database_size
File "/opt/check_mk/var/precompiled/postgresql.somewhere", line 901, in get_bytes_human_readable
TypeError: unsupported operand type(s) for /: 'str' and 'float'

Agent info: [['datid',
'datname',
'numbackends',
'xact_commit',
'xact_rollback',
'blks_read',
'blks_hit',
'tup_returned',
'tup_fetched',
'tup_inserted',
'tup_updated',
'tup_deleted',
'conflicts',
'stats_reset',
'datsize'],
# ... snipped out a bunch of number rows

First thing we notice is that the check uses 'datsize' to get the size, which seems reasonable. The get_bytes_human_readable(size) call seems to barf because it receives a string but expects a float.
Let's debug:

nagios@nagios:~$ check_mk --debug --checks=postgres_stat_database.size my.postgres.host
Reading default settings from /opt/check_mk/share/modules/defaults
Reading config file /etc/check_mk/main.mk...
Reading config file /etc/check_mk/conf.d/distributed_wato.mk...
Reading config file /etc/check_mk/conf.d/parents.mk...
Reading config file /etc/check_mk/conf.d/wato/rules.mk...
Check_mk version 1.2.0p1
Calling external programm ssh -i /etc/check_mk/check_mk.key -l root 192.168.1.11 check_mk_agent
Traceback (most recent call last):
File "/opt/check_mk/share/modules/check_mk.py", line 4782, in <module>
do_check(hostname, ipaddress, check_types)
File "/opt/check_mk/share/modules/check_mk_base.py", line 703, in do_check
agent_version, num_success, error_sections, problems = do_all_checks_on_host(hostname, ipaddress, only_check_types)
File "/opt/check_mk/share/modules/check_mk_base.py", line 860, in do_all_checks_on_host
result = check_funktion(item, params, info)
File "/opt/check_mk/share/checks/postgres_stat_database", line 122, in check_postgres_stat_database_size
return (0, "OK - Size is %s" % get_bytes_human_readable(size), [("size", size)])
File "/opt/check_mk/share/modules/check_mk_base.py", line 1134, in get_bytes_human_readable
return '%s%.2fGB' % (prefix, b / base / base / base)
TypeError: unsupported operand type(s) for /: 'str' and 'float'

Yup, Python refuses to divide a string by a float. Can't blame it.
Let's see what 'size' actually contains. After changing the line to: return (0, "OK - Size is %s" % size) it actually runs and outputs:

# check_mk stuff
PostgreSQL DB postgres Size OK - Size is 14:23:35.602872+02

Aha, that looks like part of the 'stats_reset' field instead of the size! Guess that's what you get for using select *, datsize: my version of PostgreSQL (9.1.2 on that machine) has a new field there. The agent output looks like:

datid datname numbackends xact_commit xact_rollback blks_read blks_hit tup_returned tup_fetched tup_inserted tup_updated tup_deleted conflicts stats_reset datsize
11945 postgres 1 200898 0 1691 5434137 64791851 1016690 0 147 0 0 2012-04-05 14:23:35.602872+02 6095672

The problem is immediately obvious: 'stats_reset' returns a timestamp that includes a SPACE, so it splits into two fields and everything after it shifts one column to the right. Don't you just love space-separated output? This is a good reason to prefer tabs :p
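
The shift is easy to reproduce. The sketch below splits the header and the data row from the agent output on whitespace and looks up 'datsize' by its position in the header. That lookup-by-header-position is my assumption about what the check does, not something taken from its source.

```python
# Header and data row exactly as the agent printed them (space separated).
header = ("datid datname numbackends xact_commit xact_rollback blks_read "
          "blks_hit tup_returned tup_fetched tup_inserted tup_updated "
          "tup_deleted conflicts stats_reset datsize").split()
row = ("11945 postgres 1 200898 0 1691 5434137 64791851 1016690 0 147 0 0 "
       "2012-04-05 14:23:35.602872+02 6095672").split()

# The timestamp became two fields, so the row has one column too many.
print(len(header), "header columns,", len(row), "row columns")

# Assumption: the value is picked by the header position of 'datsize'.
size_index = header.index("datsize")
size = row[size_index]
print("value found at the datsize position:", size)

# Dividing that string reproduces the TypeError from the traceback.
error = None
try:
    size / 1024.0
except TypeError as exc:
    error = str(exc)
print("error:", error)

# With a tab separator the timestamp stays in one field and nothing shifts.
fields = row[:13] + [row[13] + " " + row[14]] + row[15:]
tab_row = "\t".join(fields).split("\t")
print("tab separated datsize:", tab_row[size_index])
```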

Anyhow, the simple fix for now is to either swap the columns or simply not select the stats_reset column. Of course one could also fix it properly by using real CSV output or something, but I can't be bothered.
My new plugins/mk_postgres looks like this:

#!/bin/bash

# Only produce output on hosts that actually have a postgres user.
if id postgres >/dev/null ; then
    echo '<<<postgres_sessions>>>'
    echo "select current_query = '', count(*) from pg_stat_activity group by (current_query = '');" | su - postgres -c "psql -A -t -F' '"
    echo '<<<postgres_stat_database>>>'
    # Explicit column list without stats_reset (its value contains a space); sed drops psql's row-count footer.
    echo 'select datid, datname, numbackends, xact_commit, xact_rollback, blks_read, blks_hit, tup_returned, tup_fetched, tup_inserted, tup_updated, tup_deleted, conflicts, pg_database_size(datname) "datsize" from pg_stat_database;' \
    | su - postgres -c "psql -A -F' '" | sed '$d'
fi
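
As a sanity check, here is a Python sketch of what a consumer of that second section could do with the output. The sample text is my reconstruction (the column list from the select above plus the data row shown earlier, minus stats_reset), and human_readable is a hypothetical helper, not Check_MK's get_bytes_human_readable.

```python
# Assumed plugin output before sed: header, one row per database, and the
# row-count footer that psql prints in unaligned mode.
raw = (
    "datid datname numbackends xact_commit xact_rollback blks_read blks_hit "
    "tup_returned tup_fetched tup_inserted tup_updated tup_deleted conflicts "
    "datsize\n"
    "11945 postgres 1 200898 0 1691 5434137 64791851 1016690 0 147 0 0 "
    "6095672\n"
    "(1 row)"
)

# Dropping the last line has the same effect as: sed '$d'
lines = raw.splitlines()[:-1]
header = lines[0].split()

# Header and rows now have the same number of columns, so zip lines them up.
databases = [dict(zip(header, line.split())) for line in lines[1:]]


def human_readable(num_bytes):
    """Hypothetical helper: format a byte count, accepting str or number."""
    value = float(num_bytes)
    for unit in ("B", "kB", "MB", "GB"):
        if value < 1024.0 or unit == "GB":
            return "%.2f %s" % (value, unit)
        value /= 1024.0


for db in databases:
    print(db["datname"], human_readable(db["datsize"]))
```

The header row is kept on purpose: without it there is nothing to map the column names onto.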

And shortly after we have working services with decent output 🙂

Check MK mk_postgres

Now if only someone added the perf-o-meter stuff and a nice PNP4Nagios template I'd be somewhat happy 🙂
Maybe I'll fix that some other day.





3 Comments for this entry

  • camypaj

    Hey Ben, thanks for the great insights in check_mk, it has been really helpful.

    Now, I wanted to apply the postgres part, but I seem to be stuck (I’m a real n00b for python, that is probably the main reason :))
    Anyway, I’m running omd-0.55.20120629 on ubuntu server 12.04, postgres 9.1, and I’ve installed check_postgres version 1.0 manually.
    I’ve copied mk_postgres, inventory went ok, but I get those: “UNKNOWN - invalid output from agent…”.
    After changing to return (0, "OK…"
    I get “PostgreSQL DB datname Size OK - Size is datsize” in check_mk --debug, which tells me that it considers the header line as a db, and tries to evaluate database datname and its datsize.
    I’ve tried to remove it by changing
    su - postgres -c "psql -A -F' '" | sed '$d'
    to
    su - postgres -c "psql -A -t -F' '" | sed '$d'
    but it seems that check_mk needs that header row for something..
    Any idea how can I fix it?
    Should I just ignore “db” datname?
    Is it even possible?
    Thanks! 🙂

  • camypaj

    to answer my own question: it’s a BAD idea to put backups of your mk_postgres script in /usr/lib/check_mk_agent/plugins, named mk_postgres.bak :)) It was creating double output, which seemed like more databases to check_mk. Sorry for the trouble, feel free to remove my comments 🙂
    Thanks again!

  • BenV

    Hej camypaj,

    Thanks for your feedback, which I won’t be removing since it might be helpful to someone else who runs into it 🙂

    Happy to hear you got it working 🙂
