MDADM – cannot open /dev/*: Device or resource busy

So, what a faff. I recently had a RAID card fail in my server, causing 2 of the 3 disks in an array to disappear, which mdadm didn’t take too kindly to. After deducing it was the card and not a double disk failure, I got the disks back online (confirmed by the BIOS), but then came the problem of recovering my RAID 5 array, which was a ROYAL faff.

So, here are my notes!

After getting errors along the lines of ‘cannot open /dev/sdc: Device or resource busy’ and ‘mdadm exited with non-zero exit status 2: (udisks-error-quark, 0)’ via the Disks utility on Ubuntu, the following steps worked for me:

1. Stop the array. This frees the disks from being ‘resource busy’. Bit daft given they aren’t actually running, but anyway:

mdadm --stop /dev/md127

2. Next, run a forced assemble (be sure to get the disks right!):

root@server:/var/log# mdadm --assemble --force /dev/md100 /dev/sdc /dev/sdd /dev/sdg
mdadm: forcing event count in /dev/sdg(0) from 142 upto 152
mdadm: forcing event count in /dev/sdd(1) from 142 upto 152
mdadm: /dev/md100 has been started with 3 drives.
root@server:/var/log# pvs
  PV         VG        Fmt  Attr PSize   PFree
  /dev/md100 VG001     lvm2 a--    3.64t  1.20t
  /dev/md127 VG002     lvm2 a--    2.73t  1.32t
  /dev/sdf5  server-vg lvm2 a--  465.52g 48.00m

And there you have it, you’re back up and running. Hopefully this helps if you’re stuck in the mire of syslog, lsof, etc. trying to figure out what is keeping your disk open!
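
A sanity check worth doing before the forced assemble is comparing the event counts on each member, since that gap (142 versus 152 in the output above) is what --force overrides. The sketch below is a hypothetical helper, not part of mdadm: it assumes `mdadm --examine` prints a line of the form `Events : <number>`, which may differ with older metadata formats.

```shell
# Read `mdadm --examine` output on stdin and print the event count.
# Assumes a line shaped like "         Events : 142".
event_count() {
    awk -F: '/^ *Events *:/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Hypothetical usage on the real disks (run as root):
#   for d in /dev/sdc /dev/sdd /dev/sdg; do
#       printf '%s: %s\n' "$d" "$(mdadm --examine "$d" | event_count)"
#   done

# Demo on sample text in the assumed format:
printf '          State : clean\n         Events : 142\n' | event_count
```

If the counts are only a handful of events apart, as they were here, a forced assemble is a reasonable gamble; a large gap suggests the stale disks hold much older data.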

lm_sensors – Rename sensors

So recently I had temperature issues with my server; long story short – my fan controller molex fell out and thus my server got rather warm rather quickly – oops!

Problem rectified easily, however I wanted to add some more depth to the monitoring of my server temperatures with Opsview. To do this, I used lm_sensors to get the temperatures, which I can then turn into service checks (check the site for a blog on how to do this).

The problem I had, however, was that there were 2 ‘temp1’ sensors, and it wasn’t obvious what these were:

root@server:/media# sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0: +36.0°C (high = +82.0°C, crit = +100.0°C)
Core 1: +35.0°C (high = +82.0°C, crit = +100.0°C)
Core 2: +39.0°C (high = +82.0°C, crit = +100.0°C)
Core 3: +34.0°C (high = +82.0°C, crit = +100.0°C)

it8718-isa-0290
Adapter: ISA adapter
in0: +1.28 V (min = +0.00 V, max = +4.08 V)
in1: +1.86 V (min = +0.00 V, max = +4.08 V)
in2: +3.25 V (min = +0.00 V, max = +4.08 V)
+5V: +2.88 V (min = +0.00 V, max = +4.08 V)
in4: +0.64 V (min = +0.00 V, max = +4.08 V)
in5: +0.08 V (min = +0.00 V, max = +4.08 V)
in6: +0.11 V (min = +0.00 V, max = +4.08 V)
in7: +3.07 V (min = +0.00 V, max = +4.08 V)
Vbat: +3.28 V
fan1: 1268 RPM (min = 0 RPM)
fan2: 0 RPM (min = 0 RPM)
fan3: 1962 RPM (min = 10 RPM)
fan4: 0 RPM (min = 10 RPM)
temp1: +39.0°C (low = +127.0°C, high = +127.0°C) sensor = thermistor
temp2: +29.0°C (low = +127.0°C, high = +60.0°C) sensor = thermal diode
temp3: -2.0°C (low = +127.0°C, high = +127.0°C) sensor = thermistor
intrusion0: ALARM

nouveau-pci-0100
Adapter: PCI adapter
fan1: 0 RPM
temp1: +63.0°C (high = +95.0°C, hyst = +3.0°C)
(crit = +115.0°C, hyst = +2.0°C)
(emerg = +130.0°C, hyst = +10.0°C)

It turns out renaming these sensors is rather easy! Firstly, copy the name of the chip that the sensors are running on. In my case, I wanted to rename ‘temp1’ and ‘temp2’ on it8718-isa-0290 to ‘DIMM1’ and ‘DIMM2’, so I added a new file in /etc/sensors.d/ called ‘mobo’ (you can call it anything you like), and in here I added the following lines:

root@server:/media# cat /etc/sensors.d/mobo
chip "it8718-isa-0290"
label temp1 "DIMM1Temperature"
label temp2 "DIMM2Temperature"

Now, when I run ‘sensors’ I get the correct output:

it8718-isa-0290
Adapter: ISA adapter
in0:                +1.28 V  (min =  +0.00 V, max =  +4.08 V)
in1:                +1.86 V  (min =  +0.00 V, max =  +4.08 V)
in2:                +3.25 V  (min =  +0.00 V, max =  +4.08 V)
+5V:                +2.88 V  (min =  +0.00 V, max =  +4.08 V)
in4:                +0.64 V  (min =  +0.00 V, max =  +4.08 V)
in5:                +0.08 V  (min =  +0.00 V, max =  +4.08 V)
in6:                +0.11 V  (min =  +0.00 V, max =  +4.08 V)
in7:                +3.07 V  (min =  +0.00 V, max =  +4.08 V)
Vbat:               +3.28 V
fan1:              1268 RPM  (min =    0 RPM)
fan2:                 0 RPM  (min =    0 RPM)
fan3:              1962 RPM  (min =   10 RPM)
fan4:                 0 RPM  (min =   10 RPM)
DIMM1 Temperature:  +39.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor
DIMM2 Temperature:  +29.0°C  (low  = +127.0°C, high = +60.0°C)  sensor = thermal diode
temp3:               -2.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor

And that’s that. Now when I run my checks via Opsview I can be sure I’m getting the temperatures from my DIMMs and not from a northbridge sensor or something else:

root@server:/home/sam# sudo /usr/local/nagios/libexec/check_lm_sensors-3.1.1/check_lm_sensors --sanitize --high DIMM1Temperature=70,85
LM_SENSORS OK - DIMM1Temperature=39.0|DIMM1Temperature=39.0;70;85;;
root@server:/home/sam# sudo /usr/local/nagios/libexec/check_lm_sensors-3.1.1/check_lm_sensors --sanitize --high DIMM2Temperature=70,85
LM_SENSORS OK - DIMM2Temperature=29.0|DIMM2Temperature=29.0;70;85;;
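
To confirm a label has taken effect without reading the whole table, a small filter over the ‘sensors’ output can pull out a single reading. This is only a sketch: `sensor_value` is a made-up helper name, and it assumes the `label: value (limits)` layout shown above.

```shell
# Print the numeric reading for one sensor label from `sensors` output on stdin.
# Usage (hypothetical): sensors | sensor_value "DIMM1Temperature"
sensor_value() {
    awk -F: -v label="$1" '
        $1 == label {
            split($2, parts, " ")               # first word after the colon
            gsub(/[^0-9.+-]/, "", parts[1])     # strip the unit, keep sign and digits
            print parts[1]
            exit
        }'
}

# Demo on a sample line in the layout shown above:
printf 'temp3: -2.0°C (low = +127.0°C, high = +127.0°C) sensor = thermistor\n' | sensor_value temp3
```

An empty result means no line carried that exact label, which is a quick hint that the file in /etc/sensors.d/ was not picked up.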

Cool eh..

Ubuntu 13.10 server with VNC

Hi all,

Back from the brink with a few new blog posts. I’ve recently migrated my server from RHEL to Ubuntu, given I don’t have an active subscription anymore, and for the ‘home user’ the packages available (repos, to be precise) for Ubuntu are far better than RHEL/CentOS.

I made a bit of a booboo in installing Ubuntu 13.10 server, which installs headless by default, and I do like to have VNC for that odd occasion you just can’t get something working (KVM networking, for example!). *Grumble grumble*.

So, what I did: installed MATE (GNOME 2, I pine for the good old days), installed VNC, and then configured VNC to use MATE for new sessions.

1. Install MATE:

sudo add-apt-repository "deb http://mirror1.mate-desktop.org/ubuntu saucy main"
sudo apt-get update
sudo apt-get --yes --quiet --allow-unauthenticated install mate-archive-keyring
sudo apt-get update
sudo apt-get install mate-core
sudo apt-get install mate-desktop-environment

This installs the MATE desktop, ready for use (note, if you want to convert your server into a desktop you’ll need to install some sort of login/display manager, gdm for example). I’m only using MATE from a VNC invocation so I don’t need that.

2. Install VNC Server

sudo apt-get install vnc4server

This simply installs a VNC server – not much to explain here really!

3. Configure new VNC sessions to use MATE

VNC stores its config file (xstartup) in the “~/.vnc/” folder. Therefore you need to navigate to that directory first:

cd ~/.vnc/

Then you will need to create a new file, or edit the existing one (if it’s there already), called ‘xstartup’:

nano xstartup 

.. and paste the following into it:

#!/bin/sh
# Uncomment the following two lines for normal desktop:
unset SESSION_MANAGER
unset DBUS_SESSION_BUS_ADDRESS
# exec /etc/X11/xinit/xinitrc
[ -x /etc/vnc/xstartup ] && exec /etc/vnc/xstartup
[ -r $HOME/.Xresources ] && xrdb $HOME/.Xresources
xsetroot -solid grey
vncconfig -iconic &
# x-terminal-emulator -geometry 80x24+10+10 -ls -title "$VNCDESKTOP Desktop" &
# x-window-manager &
mate-session &
# gnome-session --session=ubuntu-2d &

The key line is ‘mate-session &’. This tells your VNC server to start a MATE session for the new ‘desktop’, instead of a bare window manager, GNOME, or any other desktop you have.
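
One thing that can bite if you created xstartup by hand: as far as I’m aware, vncserver runs it as a script, so it needs the execute bit. A minimal sketch (the helper name is made up for illustration):

```shell
# Make a startup script executable and confirm the result.
make_executable() {
    chmod 755 "$1" && [ -x "$1" ] && echo "$1 is executable"
}

# Hypothetical usage:
#   make_executable "$HOME/.vnc/xstartup"
```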

4. Wrap-up

Next, start up a VNC server and open up the firewall:

iptables -I INPUT 3 -s 192.168.0.0/24 -p tcp --dport 5900:5904 -j ACCEPT

In my example, I’m going to be running 4 VNC servers on ports 5901, 5902, 5903 and 5904 (display :N listens on port 5900 + N). We can create these 4 sessions using the colon-number approach:

vncserver :1
vncserver :2
vncserver :3
vncserver :4

(I’m sure those of you who are of that mindset could do a ‘for i in..’, but I’m too lazy).

Finally, get yourself a VNC client, e.g. RealVNC, connect to your server at ipaddress:5901, and voila, your desktop is alive!
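
For the less lazy, the loop might look something like the sketch below. The function names and the DRY_RUN switch are my own invention for illustration; by default it only prints what it would run, along with the port each display ends up on.

```shell
# Map a VNC display number to its TCP port (5900 + N).
vnc_port() {
    echo $((5900 + $1))
}

# Start displays :1 to :4. With DRY_RUN=1 (the default here) the
# commands are printed rather than executed.
start_vnc_sessions() {
    for i in 1 2 3 4; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "vncserver :$i  (TCP port $(vnc_port "$i"))"
        else
            vncserver ":$i"
        fi
    done
}

start_vnc_sessions
```

Run it with DRY_RUN=0 set in the environment to actually start the four sessions.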

File age monitoring with Opsview

Hello all,

No I am not dead – I have just moved into management… I’ll let you come up with the jokes!

Today I’m going to write a technical document on how to monitor the age of a file to ensure that it is newer than a given threshold, e.g. make sure that file ‘X’ is less than ‘5 days’ old. This came up during my day as I wanted to make sure that my diary that I use at home (running on WordPress) is backed up to a remote location successfully once a week, so it pays to be in monitoring today!

Crontab

Firstly, I set up my crontab entry:

[root@rhelserver log]# crontab -l
0 23 * * 0 mysqldump --single-transaction -u sam --password=removed wpblog | gzip > "/media/nfs2/Backups/Diary/Blog-$(date '+\%Y\%m\%d').sql.gz" && echo "Backup completed" > /var/log/diary-backup
[root@rhelserver log]#

Here we are essentially running a mysqldump against the MySQL DB that is running my WordPress installation (wpblog), and storing it on a remote NFS mount point as a .gz file, with a date-stamped file name (so I can roll back if needed).

Also, I am creating a new file in /var/log called ‘diary-backup’. Why? Because my plugin will be executed by the nagios user, and I don’t really want to give it access to my nfs2 share (plus, it is a hassle that I don’t have time to play with). So I’m creating a file in /var/log that I’m going to chmod 755 so that nagios can access it and scrutinize the file age, which, as the file is created after the backup job, will be a real-world representation of the .gz file created.
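
Because Nagios only ever sees the marker file, it matters that the marker is written only when the backup command actually succeeds. A minimal sketch of that idea as a wrapper function (the name `backup_and_mark` is made up, and the mysqldump line in the usage comment is illustrative):

```shell
# Run a command; write the marker file only if the command succeeded.
# Usage: backup_and_mark <marker-file> <command> [args...]
backup_and_mark() {
    marker=$1
    shift
    if "$@"; then
        echo "Backup completed" > "$marker"
    else
        echo "Backup failed, marker left untouched" >&2
        return 1
    fi
}

# Hypothetical usage from a script called by cron:
#   backup_and_mark /var/log/diary-backup \
#       sh -c 'mysqldump --single-transaction wpblog | gzip > /path/to/dump.sql.gz'
```

If the dump fails, the marker keeps its old timestamp, so the file age check eventually goes critical instead of reporting a backup that never happened.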

Opsview plugin

For this exercise, I used the ‘check_file_age’ plugin that ships with Opsview – however the standard output was rather annoying and not very humanised – for example:

root@opsview-monitor:/usr/local/nagios/libexec# ./check_file_age -w 691199 -c 691200 /home/sam/.bash_history
FILE_AGE OK: /home/sam/.bash_history is 425197 seconds old and 434 bytes

This isn’t very useful to me, as I am not a computer and can’t work out if 425,000 seconds is a good thing or a bad thing :) So, I modified the check_file_age plugin with the help of this guide here - http://www.krzywanski.net/archives/429 – essentially, replace the line:

print "FILE_AGE $result: $opt_f is $age seconds old and $size bytes\n";

with

my $days = $age/86400;
$days = sprintf("%.1f", $days);
print "FILE_AGE $result: $opt_f is $age seconds ($days days) old and $size bytes\n";

So that we output ‘days’ as well as seconds. Next, I tested my command locally on my WordPress server:

[root@rhelserver log]# su - nagios
[nagios@rhelserver ~]$ cd /usr/local/nagios/libexec/
[nagios@rhelserver libexec]$ ./check_file_age -c 691200 /var/log/diary-backup
FILE_AGE OK: /var/log/diary-backup is 961 seconds (0.0 days) old and 17 bytes

(If you’re curious, 691200 seconds is 8 days.) So here we can see that the nagios user has access to the file in question, and we are getting data in a usable format, i.e. days, not seconds.

Next, we need to create the NRPE entry, so the command above can be executed remotely by the Opsview monitoring server. Doing this is very simple: just add a line similar to the one below to your /usr/local/nagios/etc/nrpe_local/override.cfg file (if this doesn’t exist, just create one):
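
For picking thresholds, the same arithmetic can be done in the shell. This is just a sketch with made-up helper names, mirroring the 86400-seconds-per-day division used in the plugin change:

```shell
# Convert whole days to seconds (86400 seconds per day).
days_to_seconds() {
    echo $(($1 * 86400))
}

# Convert seconds to days, rounded to one decimal place.
seconds_to_days() {
    awk -v s="$1" 'BEGIN { printf "%.1f\n", s / 86400 }'
}

days_to_seconds 8        # the -c threshold used above
seconds_to_days 425197   # the age from the earlier example
```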

[nagios@rhelserver libexec]$ tail -n1 /usr/local/nagios/etc/nrpe_local/override.cfg
command[diary_backup]=/usr/local/nagios/libexec/check_file_age -c 691200 /var/log/diary-backup

The ‘diary_backup’ element is the command we will be executing from Opsview. Finally, give the opsview-agent a bounce to apply the changes:

[nagios@rhelserver libexec]$ exit
logout
[root@rhelserver log]# /etc/init.d/opsview-agent restart
NRPE stopped
NRPE started
[root@rhelserver log]#

We can now test this locally:

[root@rhelserver log]# cd /usr/local/nagios/libexec/
[root@rhelserver libexec]# ./check_nrpe -H localhost -c diary_backup
FILE_AGE WARNING: /var/log/diary-backup is 1272 seconds (0.0 days) old and 17 bytes
[root@rhelserver libexec]#

Voila, it’s working.
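
The WARNING state there comes down to thresholds rather than NRPE. The sketch below mimics the comparison with a made-up function; the 240-second figure is my assumption about the plugin’s built-in warning default when -w is omitted, so check ./check_file_age --help on your own install.

```shell
# Classify a file age in seconds against warning and critical thresholds.
# Usage: file_age_state <age> <warn> <crit>
file_age_state() {
    age=$1
    warn=$2
    crit=$3
    if [ "$age" -gt "$crit" ]; then
        echo CRITICAL
    elif [ "$age" -gt "$warn" ]; then
        echo WARNING
    else
        echo OK
    fi
}

file_age_state 1272 240 691200      # assumed default -w of 240 seconds
file_age_state 1272 604800 691200   # explicit -w of 7 days
```

Passing an explicit -w just below the critical value (7 days here) keeps the check green until the weekly backup is genuinely overdue.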

Bring it all together in the GUI

So lastly, we need to log in to the Opsview GUI and bring this all together. Firstly, create a new service check with the plugin as ‘check_nrpe’ and the arguments as ‘-H $HOSTADDRESS$ -c diary_backup’. Then, add this to your host (the WordPress server in my example). Finally, give it a reload and it will now be running and monitoring your backup.

There are then hundreds of things you can do. For example, be notified when it goes critical or warning (ignore the warning above, I didn’t set a -w flag, whoops), or show it in a keyword (Monitoring > Keywords) as I have done at home.

Conclusion

So there you have it: I am now monitoring my diary backup cron job with Opsview to make sure it completes every week. You can use this for anything: logs, files, logins, you name it. Happy hunting!