
Setting Up a Fault-Tolerant Cluster for the ISP Billing System "UTM5"
on Gentoo Linux

This article describes how to set up a fault-tolerant cluster for the NetUP UTM5 ISP billing system on two physical servers. The operating system is Gentoo Linux, and the database management system is MySQL.

Figure 1. General cluster scheme

Each server has two Ethernet cards and two hard drives of the same capacity.

The servers communicate with each other over their internal Ethernet cards at 1 Gbit/s. For higher reliability, the internal link may be a crossover Ethernet cable with no intermediate switch. The external Ethernet cards are connected to a switch, through which the cluster is connected to the network. External devices reach the cluster at a single shared IP address, 192.168.0.200. Only one server holds this address at any moment. If that server fails, the address is automatically assigned to the other (reserve) server, and the cluster remains accessible as before. Fault detection is performed by the heartbeat package [2].

Network settings for server #1:

Host name: netup1
IP address on the internal Ethernet card (interface eth1): 172.16.0.1
IP address on the external interface: 192.168.0.200 (eth0:1), configured automatically by the heartbeat package.

Network settings for server #2:

Host name: netup2
IP address on the internal Ethernet card (interface eth1): 172.16.0.2
IP address on the external interface: 192.168.0.200 (eth0:1), configured automatically by the heartbeat package.

Gentoo Linux is installed on the first hard drive, /dev/sda. The second hard drive, /dev/sdb, is used for data synchronization between the servers by means of the drbd package [1]. Install the package with the command:

emerge drbd

After successful installation, create the configuration file /etc/drbd.conf with the following content:

resource r0 {
  protocol C;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    degr-wfc-timeout 120;    # 2 minutes.
  }

  disk {
    on-io-error   detach;
  }

  net {
  }

  syncer {
    rate 200M;
    group 1;
    al-extents 257;
  }

 

  on netup2 {
    device     /dev/drbd0;
    disk       /dev/sdb1;
    address    172.16.0.2:7788;
    meta-disk  internal;
  }

  on netup1 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   172.16.0.1:7788;
    meta-disk internal;
  }
}

A commented example of the configuration file is provided in /usr/share/doc/drbd-0.7.11/drbd.conf.gz.

With these settings, data is synchronized on the partition /dev/sdb1. The partition must always be accessed through the device /dev/drbd0; otherwise the data will not be synchronized.
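drbd selects the "on" section whose name matches the output of uname -n, so a mistyped host name is a common reason for the configuration not being applied. The following is a minimal sketch of a check for that, assuming the file layout shown above; the helper names conf_hosts and host_in_conf are hypothetical and are not part of the drbd package.

```shell
#!/bin/sh
# List the host names used in the "on <host> { ... }" sections of a drbd.conf.
# conf_hosts and host_in_conf are hypothetical helper names.
conf_hosts() {
    # $1 = path to the configuration file
    sed -n 's/^[[:space:]]*on[[:space:]][[:space:]]*\([^[:space:]{]*\)[[:space:]]*{.*$/\1/p' "$1"
}

# Succeed if the given host name has its own "on" section in the file.
host_in_conf() {
    # $1 = host name, $2 = path to the configuration file
    conf_hosts "$2" | grep -qxF "$1"
}

# Check the local host against the installed configuration, if there is one.
conf=/etc/drbd.conf
if [ -r "$conf" ]; then
    if host_in_conf "$(uname -n)" "$conf"; then
        echo "$(uname -n): section found in $conf"
    else
        echo "$(uname -n): no matching 'on' section in $conf" >&2
    fi
fi
```

The pattern requires whitespace after "on", so the line "on-io-error detach;" is not mistaken for a host section.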

To start drbd, run the following command on both servers:

/etc/init.d/drbd start

On server #1, run the command:

drbdadm -- --do-what-I-say primary all

If all settings are correct, from this moment the partition /dev/sdb1 is synchronized between both servers. The status can be viewed with the command:

/etc/init.d/drbd status

Example of the output of the command:

drbd driver OK; device status:
version: 0.7.11 (api:77/proto:74)
SVN Revision: 1807 build by netup@netup1, 2006-01-17 00:52:49
0: cs:Connected st:Primary/Secondary ld:Consistent

In this output, 'cs:Connected' together with 'ld:Consistent' means that all data is synchronized between both servers in the cluster. While synchronization is still in progress, the output also shows the current progress and an estimated time to completion.
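For scripted monitoring, the fields of the status line can be extracted individually. The sketch below shows one way to do this in POSIX shell; the function name drbd_field is hypothetical, and the sample line is copied from the example output above rather than read from a live system.

```shell
#!/bin/sh
# Print the value of one field (cs, st or ld) from a DRBD 0.7 status line.
# drbd_field is a hypothetical helper name, not part of the drbd package.
drbd_field() {
    # $1 = field name, $2 = status line
    printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1://p"
}

# Sample line taken from the example output above.
line='0: cs:Connected st:Primary/Secondary ld:Consistent'

echo "connection: $(drbd_field cs "$line")"
echo "roles:      $(drbd_field st "$line")"
echo "local data: $(drbd_field ld "$line")"
```

The 'st' value lists the local role first and the peer role second, so the same healthy cluster reads Secondary/Primary when viewed from the other server.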

Next, format the synchronized partition with the reiserfs file system and create the directory where the partition will be mounted. To do so, run the following commands on server #1:

mkreiserfs /dev/drbd0
mkdir /mnt/sync

On server #2, run the command:

mkdir /mnt/sync

The next step in setting up the fault-tolerant cluster is to install and configure the heartbeat package. To install the package, run the commands:

echo "sys-cluster/heartbeat ~x86" >> /etc/portage/package.keywords
emerge sys-cluster/heartbeat

After successful installation, create the configuration files.

On server #1, create the file /etc/ha.d/ha.cf with the following content:

logfacility     local0
ucast eth1 172.16.0.2
auto_failback on
node netup1 netup2

On server #2, create the file /etc/ha.d/ha.cf with the following content:

logfacility     local0
ucast eth1 172.16.0.1
auto_failback on
node netup1 netup2
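The two ha.cf files differ only in the peer address on the ucast line. As an illustration, the sketch below prints the file content for either node from one function, which avoids editing the files by hand; ha_cf_for is a hypothetical helper name, while the host names and addresses are the ones used in this article.

```shell
#!/bin/sh
# Print the ha.cf content for one of the two cluster nodes.
# The only per-node difference is the peer address on the ucast line.
# ha_cf_for is a hypothetical helper name.
ha_cf_for() {
    # $1 = local host name (netup1 or netup2)
    case $1 in
        netup1) peer=172.16.0.2 ;;
        netup2) peer=172.16.0.1 ;;
        *) echo "unknown node: $1" >&2; return 1 ;;
    esac
    printf 'logfacility     local0\nucast eth1 %s\nauto_failback on\nnode netup1 netup2\n' "$peer"
}

# Preview the file for server #1; redirect the output to /etc/ha.d/ha.cf
# on that server to install it.
ha_cf_for netup1
```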

The ha.cf files set the names and IP addresses of the servers in the cluster. Then create the configuration file /etc/ha.d/haresources with the following content:

netup1 192.168.0.200/24/eth0:1 drbddisk Filesystem::/dev/drbd0::/mnt/sync::reiserfs apache2 mysql utm5_core utm5_radius

Make sure that this file is identical on both servers. The first field, netup1, names the server that normally holds the resources. The file then sets the external IP address of the cluster (192.168.0.200), the subnet mask length (24), and the interface (eth0:1) on which the address is brought up. It also lists the resources to start when the server becomes the leading one in the cluster.

Resources are started in the order in which they are listed. The first one is drbddisk, which switches the server to primary mode for drbd. After that the partition /dev/drbd0 can be mounted; this is done by the second resource, Filesystem. Its parameters are the device (/dev/drbd0), the mount point (/mnt/sync), and the file system type (reiserfs). Once these two resources have started, the directory /mnt/sync contains the data synchronized between the two servers. Data written to this directory is automatically duplicated on the reserve server. If the leading server fails, the reserve server holds the same data that the leading server had before the fault.
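Because the start order follows the order of the fields, it can help to print the line as a numbered list before deploying it. The following sketch does that for the haresources line shown above; start_order is a hypothetical helper name.

```shell
#!/bin/sh
# Print the resources of a haresources line in start order, one per line.
# The first field is the node name; everything after it is a resource.
# start_order is a hypothetical helper name.
start_order() {
    # $1 = one line from /etc/ha.d/haresources
    set -f            # do not expand wildcards while splitting
    set -- $1         # split the line on whitespace
    set +f
    shift             # drop the node name
    n=1
    for res in "$@"; do
        echo "$n. $res"
        n=$((n + 1))
    done
}

# The haresources line from this article.
line='netup1 192.168.0.200/24/eth0:1 drbddisk Filesystem::/dev/drbd0::/mnt/sync::reiserfs apache2 mysql utm5_core utm5_radius'
start_order "$line"
```

When heartbeat releases the resources, it stops them in the reverse of this order, so the services are shut down before the file system is unmounted.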

Then the application services are started in the listed order: the web server 'apache2', the database server 'mysql', the billing system core 'utm5_core', and the RADIUS server 'utm5_radius'. During operation, NetUP UTM5 records all billing information in the MySQL database, so to synchronize this data the directory /var/lib/mysql must be moved to the synchronized partition mounted at /mnt/sync. Perform this operation only while the mysql service is stopped. Also set the new path in the [mysqld] section of the configuration file '/etc/mysql/my.cnf':

datadir = /mnt/sync/mysql

Thus, once the 'mysql' service runs with its data directory on /mnt/sync, subscriber data, charge-offs, and other billing information stored in the database are synchronized with the reserve server.
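The move of the data directory can be sketched as follows. This is only an illustration under the assumptions stated in the comments: relocate_dir is a hypothetical helper name, and keeping the original directory under a .orig suffix is a precaution added here, not a step from the procedure above.

```shell
#!/bin/sh
# Copy a data directory to its new location and keep the original as a backup.
# relocate_dir is a hypothetical helper name; paths are passed as arguments.
relocate_dir() {
    # $1 = current directory, $2 = new location (must not exist yet)
    if [ ! -d "$1" ]; then
        echo "source directory not found: $1" >&2
        return 1
    fi
    if [ -e "$2" ]; then
        echo "target already exists: $2" >&2
        return 1
    fi
    # -p preserves ownership and permissions, which the mysql user relies on
    cp -Rp "$1" "$2" && mv "$1" "$1.orig"
}

# Intended use on the leading server, as root, with /mnt/sync mounted:
#   /etc/init.d/mysql stop
#   relocate_dir /var/lib/mysql /mnt/sync/mysql
#   (then set datadir in /etc/mysql/my.cnf as described above)
```

The move is needed only once, on the server where /mnt/sync is currently mounted; the reserve server receives the data through drbd and needs only the my.cnf change.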

For heartbeat to work correctly, it is also necessary to create the file /etc/ha.d/authkeys, which holds the key used to authenticate communication between the servers. The file specifies the key type and the key itself:

auth 1
1 sha1 somethinglong

This file must also be identical on both servers in the cluster. heartbeat expects it to be readable by root only (mode 600).
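The key 'somethinglong' in the example is a placeholder. One possible way to generate a random key and write the file with restrictive permissions is sketched below; write_authkeys is a hypothetical helper name, and the 16-byte key length is an arbitrary choice made for this example.

```shell
#!/bin/sh
# Write a heartbeat authkeys file with a random sha1 key, readable by owner only.
# write_authkeys is a hypothetical helper name.
write_authkeys() {
    # $1 = output path
    # 16 random bytes printed as 32 hexadecimal characters
    key=$(dd if=/dev/urandom bs=16 count=1 2>/dev/null | od -An -tx1 | tr -d ' \n')
    (
        umask 077     # a newly created file gets mode 600
        printf 'auth 1\n1 sha1 %s\n' "$key" > "$1"
    )
    chmod 600 "$1"    # also covers the case where the file already existed
}

# Write to a scratch file first; the same file is then copied to
# /etc/ha.d/authkeys on both servers so that the keys match.
out="${TMPDIR:-/tmp}/authkeys.$$"
write_authkeys "$out"
echo "wrote $out"
```

Generate the key on one server only and copy the resulting file to the other; running the function on each server separately would produce two different keys.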

This completes the configuration of the 'heartbeat' package. It can now be started on both servers with the command:

/etc/init.d/heartbeat start

To check that the cluster works, use the utilities ifconfig, df, and ps. The leading server should have the following:

  1. the configured interface eth0:1 with the IP address 192.168.0.200
  2. the mounted directory /mnt/sync
  3. running services apache2, mysql, utm5_core, utm5_radius

The reserve server should have none of the above. To make the reserve server the leading one, stop the 'heartbeat' service on the leading server, or emulate a hardware failure by shutting the server down. The reserve server then takes over the shared IP address 192.168.0.200, mounts the directory '/mnt/sync', and launches the services apache2, mysql, utm5_core, and utm5_radius. On a NetUP test stand, restoring cluster operation after a fault on the leading server took no more than 30 seconds. The fault-tolerant cluster thus minimizes interruptions in the operation of the billing system and improves the service provided to subscribers.
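One of the checks listed above, whether /mnt/sync is mounted, can be scripted as in the sketch below. The helper name is_mounted is hypothetical; the function reads a mounts table in the /proc/mounts format, and the optional second argument exists only so that the function can be tried on a sample file.

```shell
#!/bin/sh
# Succeed if the given directory appears as a mount point in a mounts table.
# is_mounted is a hypothetical helper; the table defaults to /proc/mounts.
is_mounted() {
    # $1 = mount point, $2 = mounts table (optional)
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Report whether this server currently looks like the leading one.
if is_mounted /mnt/sync 2>/dev/null; then
    echo "/mnt/sync is mounted: this server holds the cluster resources"
else
    echo "/mnt/sync is not mounted: this server is in reserve"
fi
```

This covers only the second item of the list; the shared IP address and the running services still have to be checked with ifconfig and ps.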

To start the services automatically after the servers reboot, run the following commands on both servers:

rc-update add drbd default
rc-update add heartbeat default

Data Synchronization Problems and "Split-Brain" State

If the connection between the servers fails for some reason and both servers become leading at the same moment, the data on the synchronized partition may diverge between the servers. This situation is called "split-brain". In this case the administrator must resolve the conflict manually.

The problem can be recognized from the status of the drbd package, which is shown by the command:

/etc/init.d/drbd status

During the conflict, the output on the leading server contains the following line:

0: cs:StandAlone st:Primary/Unknown ld:Consistent

During the conflict, the output on the reserve server contains the following line:

0: cs:StandAlone st:Secondary/Unknown ld:Consistent
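The StandAlone connection state can be detected automatically, for example from a monitoring script. The following is a minimal sketch that works on a status line passed as text; is_standalone is a hypothetical helper name. Note that StandAlone is also reported after a manual disconnect, so the check indicates that attention is needed rather than proving a split-brain.

```shell
#!/bin/sh
# Succeed if a DRBD 0.7 status line shows the StandAlone connection state.
# is_standalone is a hypothetical helper name.
is_standalone() {
    # $1 = status line
    case " $1 " in
        *" cs:StandAlone "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Sample lines: the conflict state quoted above and a healthy state.
for line in \
    '0: cs:StandAlone st:Primary/Unknown ld:Consistent' \
    '0: cs:Connected st:Primary/Secondary ld:Consistent'
do
    if is_standalone "$line"; then
        echo "needs attention: $line"
    else
        echo "ok:              $line"
    fi
done
```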

In this situation the administrator must determine which server holds the most up-to-date data. That data is kept and copied to the other server. Changes made on the other server are treated as outdated and will be lost. The commands below resolve the conflict.

On both servers, run the commands:

drbdadm disconnect all
/etc/init.d/heartbeat stop

Then on the server with the outdated data run the command:

drbdadm secondary all

and after that run the same command on the server with the up-to-date data:

drbdadm secondary all

Then on the server with the up-to-date data run the command:

drbdadm -- --human primary all

After these commands, reconnect the devices on both servers with the command:

drbdadm connect all

The outdated data is then automatically overwritten, and both servers end up with up-to-date, fully identical data. The result can be checked with the status command:

/etc/init.d/drbd status

Output on the server that is now secondary (the one whose data was discarded) should be the following:

drbd driver OK; device status:
version: 0.7.11 (api:77/proto:74)
SVN Revision: 1807 build by netup@netup1, 2006-01-17 00:52:49
0: cs:Connected st:Secondary/Primary ld:Consistent

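The string "st:Secondary/Primary ld:Consistent" shows that the data is fully synchronized between the servers and the conflict has been resolved. On the server whose data was kept, the roles appear in the opposite order, st:Primary/Secondary. Note that heartbeat was stopped on both servers during this procedure, so it has to be started again before the cluster resumes normal operation.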
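The recovery sequence from this section can be summarized per server as a dry-run sketch. Nothing is executed: the function only prints the commands quoted above, in order, for the chosen role. The name recovery_plan and the role names keep and discard are hypothetical. The plan cannot express the ordering between the two servers, so the rule from the text still applies: switch the server with the outdated data to secondary first.

```shell
#!/bin/sh
# Print, in order, the commands from this section for one server.
# Nothing is executed; recovery_plan is a hypothetical helper name.
recovery_plan() {
    # $1 = "keep" for the server whose data is kept,
    #      "discard" for the server whose changes are dropped
    case $1 in
        keep|discard) ;;
        *) echo "usage: recovery_plan keep|discard" >&2; return 1 ;;
    esac
    echo 'drbdadm disconnect all'
    echo '/etc/init.d/heartbeat stop'
    echo 'drbdadm secondary all'
    if [ "$1" = keep ]; then
        # only the server with the up-to-date data becomes primary
        echo 'drbdadm -- --human primary all'
    fi
    echo 'drbdadm connect all'
    echo '/etc/init.d/drbd status'
}

recovery_plan keep
```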

 

[1] Web page of the drbd package: http://www.drbd.org/

[2] Web page of the heartbeat package: http://linux-ha.org/HeartbeatProgram

 

All rights are reserved by NetUP (c) 2001-2007 (www.netup.biz).
Republishing is not allowed without explicit permission of NetUP (info@netup.biz).
All trademarks used are properties of their respective owners.