Heartbeat is a project from Linux-HA. http://www.linux-ha.org/wiki/Heartbeat.
Heartbeat is helpful when we have a pair (or more) of servers configured to replicate to each other in a circular fashion. For example, with MySQL replication, if the master fails we have to fail over manually or write a script to manage it. With heartbeat, the slave is promoted to master automatically, the application goes directly to the slave (which is now the master, since the original master has crashed), and the original master goes offline.
How do we install heartbeat on Linux?
How do we configure it, and how can we use it with MySQL replication?
For example, I have a database named “MyDB” in a cluster named “MyCluster”. It is hosted on a pair of servers, db01-a.mycluster.com and db01-b.mycluster.com. MySQL replication is already set up between the servers. However, in some cases we need to be able to simulate a failure of the master host, fail the service over to the slave server, and then fail back to the master again. This is where HB (Heartbeat) becomes useful.
In general, the networking setup on these servers looks like this (I will use the db01-X names for the example). Each server has multiple network interfaces (at least two, most of the time three). On each server, one of those interfaces is used for serving data to the rest of the cluster; we call this the “production network” for that cluster. The second interface is used for “admin network” purposes such as administrative monitoring. In most cases there is also a dedicated network connection over the third set of NICs, used for HB.
db01-a – prod IP: 10.0.0.10 – heartbeat IP: 172.30.0.10
db01-b – prod IP: 10.0.0.20 – heartbeat IP: 172.30.0.20
In DNS, db01-a.mycluster would be 10.0.0.10 and db01-b.mycluster would be 10.0.0.20. We also have a DNS entry for ‘db01.mycluster’, with an IP of 10.0.0.1. So at this point, we have a pair of servers which are supposed to provide database service for ‘db01.mycluster’, but we have nothing that connects the two things together. To connect them, the server that provides the service holds 2 IPs on the prod network: its own and the service IP.
So now we have two database servers with the same data (db01-a and db01-b), and one of them is also “pretending” to be ‘db01.mycluster’. Now the web server can connect to the ‘db01.mycluster’ service. Any database changes that the web server makes are performed on db01-a, and because of replication they are also mirrored on db01-b right away.
When you configure HB, you need to tell it the following information:
1) what two machines are in a pair (db01-a and db01-b)
2) what service VIP they are responsible for (db01)
3) which one should be considered the primary server (by default, the ‘-a’ server is the primary one)
4) what path(s) should db01-a and db01-b use to communicate with each other and see if they are healthy
Once you do that, heartbeat uses a NIC alias to associate db01-a with the db01 service, so instead of you doing this manually, HB does it automatically. The heartbeat IPs are used for item 4: they are the path over which the two servers check on each other.
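The four pieces of information above can be sketched as configuration text. This is a minimal sketch, assuming the classic Heartbeat file layout (an ha.cf file plus a haresources file); the heartbeat NIC name eth2 and the keepalive/deadtime values are assumptions for illustration and do not come from this article. The node names are assumed to match what `uname -n` prints on each server.

```python
def render_ha_cf(nodes, hb_device, peer_hb_ip, keepalive=2, deadtime=30):
    """Return ha.cf-style text for one node of the pair (illustrative only)."""
    lines = [
        f"keepalive {keepalive}",           # seconds between heartbeats (assumed value)
        f"deadtime {deadtime}",             # seconds of silence before the peer is declared dead (assumed value)
        f"ucast {hb_device} {peer_hb_ip}",  # item 4: send heartbeats to the partner over the dedicated link
        "auto_failback off",                # do not move the service back automatically
    ]
    lines += [f"node {name}" for name in nodes]  # item 1: the two machines in the pair
    return "\n".join(lines) + "\n"


def render_haresources(primary, vip):
    """Return the haresources line: preferred primary (item 3) and service VIP (item 2)."""
    return f"{primary} {vip}\n"


if __name__ == "__main__":
    # Text for db01-a, whose partner db01-b has heartbeat IP 172.30.0.20.
    print(render_ha_cf(["db01-a", "db01-b"], "eth2", "172.30.0.20"), end="")
    print(render_haresources("db01-a", "10.0.0.1"), end="")
```

On db01-b the same call would use 172.30.0.10 as the partner address, while the haresources line stays identical on both servers.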
Each server runs a HB daemon; its purpose is to ‘ping’ the other server in the pair to check if it is alive. If one of the servers dies (e.g. kernel panic), it will stop responding to those ‘pings’ and the other server will assume that it has died.
How does heartbeat work?
As per our example, on both machines we configure heartbeat so that db01-a is the primary server for the db01 service and the 172.30.0.X network is used for heartbeat traffic (let us assume we do not have a serial connection). Once we have heartbeat configured on both servers, we start heartbeat. When we start it on db01-a, this will happen:
- It will try to ping db01-b, its partner
- It will not get any responses from it (we have not started HB on db01-b yet)
- It will see that it is supposed to be the master, because its configuration tells it to be
- It will see that it is supposed to be the master, because its partner is not responding to heartbeat traffic.
- It will add a virtual NIC to its production network NIC and give it the IP of the db01.mycluster service. At this time it will become the primary server for the db01.mycluster service.
We have three IP addresses in DNS:
db01.mycluster == 10.0.0.1
db01-a.mycluster == 10.0.0.10
db01-b.mycluster == 10.0.0.20
Let’s assume eth0 is the NIC for the production network:
On the first server, we configure eth0 to have IP == 10.0.0.10
On the second server, we configure eth0 to have IP == 10.0.0.20
On both servers, we configure HB to say VIP == 10.0.0.1
When HB starts on db01-a, it will see that it is the primary server, and will create a virtual NIC eth0:0 with IP == 10.0.0.1. So now db01-a has two NICs, the real NIC eth0 and the virtual NIC eth0:0. Now we start HB on db01-b. When we do, this will happen:
- It will try to ping db01-a, its partner
- It will get responses back from it (HB is already running on db01-a)
- It will see that it is not supposed to be the primary server, because its configuration says db01-a is the primary one.
- It will see that it is not supposed to be the primary server, because its partner (db01-a) is responding.
- It will assume the standby role for the db01 service.
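The two start-up sequences above follow the same small decision rule. The sketch below is a simplified model of that rule, not Heartbeat's actual code; the function name `decide_role` and its parameters are hypothetical and chosen only for illustration.

```python
def decide_role(is_preferred_primary, peer_alive, peer_holds_vip=False):
    """Return "primary" or "standby" for a node that has just started HB.

    is_preferred_primary: the configuration names this node as primary.
    peer_alive:           the partner answers heartbeat traffic.
    peer_holds_vip:       the partner already serves the db01 service IP.
    """
    if not peer_alive:
        # Nobody else can serve db01, so this node takes the VIP.
        return "primary"
    if peer_holds_vip:
        # The partner is healthy and already serving: stay out of the way.
        return "standby"
    # Both nodes are up and the VIP is free: the configuration decides.
    return "primary" if is_preferred_primary else "standby"


if __name__ == "__main__":
    # db01-a starts first; db01-b is not running HB yet.
    print("db01-a:", decide_role(is_preferred_primary=True, peer_alive=False))
    # db01-b starts second; db01-a responds and already holds 10.0.0.1.
    print("db01-b:", decide_role(is_preferred_primary=False, peer_alive=True,
                                 peer_holds_vip=True))
```

The same rule also covers a preferred primary that boots while its partner is already serving the VIP: it stays on standby until someone switches the service back by hand.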
Now, when you start HB on a machine, it doesn’t ‘ping’ just once, it continues to ping until you stop HB. When we turned on HB on db01-a, it continued to ping db01-b even after it assumed the role of the primary server. When we started HB on db01-b, it began to receive those ‘pings’ from db01-a and began to send responses, indicating that it was alive. At the same time, it began sending its own ‘pings’ to db01-a, which started receiving them and also started responding to them. When HB finished starting up on db01-b, both servers were able to ‘ping’ each other and respond that they are both ‘alive’.
Now, let’s check what happens if the db01-a server goes down:
- heartbeat on db01-b keeps pinging db01-a
- db01-a stops responding
- db01-b decides db01-a is dead
- db01-b decides it needs to become the primary server
- db01-b creates a virtual NIC eth0:0 and assigns it the db01.mycluster IP (10.0.0.1)
It now has 1 real, physical NIC eth0 with IP=10.0.0.20 and 1 virtual NIC eth0:0 with IP=10.0.0.1.
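The failover steps above can be sketched from db01-b's point of view as a timeout check followed by adding the service IP. This is a simplified model under stated assumptions: the class `PeerMonitor`, the helper `addresses_on_db01_b`, and the 30-second dead time are hypothetical and are not taken from Heartbeat itself. Timestamps are passed in explicitly so the example is deterministic.

```python
class PeerMonitor:
    """Tracks when the partner was last heard from."""

    def __init__(self, deadtime, started_at=0.0):
        self.deadtime = deadtime      # seconds of silence tolerated (assumed value)
        self.last_seen = started_at   # time of the last heartbeat received

    def heartbeat_received(self, now):
        self.last_seen = now

    def peer_is_dead(self, now):
        return now - self.last_seen > self.deadtime


def addresses_on_db01_b(monitor, now, own_ip="10.0.0.20", vip="10.0.0.1"):
    """Return the addresses db01-b should hold at time `now`."""
    addresses = {"eth0": own_ip}      # the real NIC always keeps its own IP
    if monitor.peer_is_dead(now):
        addresses["eth0:0"] = vip     # take over the db01.mycluster service IP
    return addresses


if __name__ == "__main__":
    monitor = PeerMonitor(deadtime=30)
    monitor.heartbeat_received(now=10)           # last reply from db01-a
    print(addresses_on_db01_b(monitor, now=20))  # db01-a still considered alive
    print(addresses_on_db01_b(monitor, now=45))  # 35 s of silence: take the VIP
```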
For all other machines on the production network, nothing has happened. The service ‘db01.mycluster’ still has the IP == 10.0.0.1. There is still some server on the production network which is responding to that service IP, and it is still serving mysql data. There is very little (almost zero) interruption of service and no need to change DNS.
On the mysql side: since replication was turned on, db01-b has all the same data that db01-a had. Perhaps a very small amount of data is missing (something that was sent to db01-a but was never committed when we disconnected power), but the service did not go away, and it was automatically “fixed”, without the need to call a human and without loss of time.
Don’t you think it’s great?
Keep in Mind:
We should not allow automatic fail-back. On most servers, the heartbeat daemon starts automatically when the machine boots. Assume db01-a was turned off for 1 hour. When we boot it up, heartbeat starts automatically, and if we allowed automatic fail-back, db01-a would become primary again while its data is still out of sync, because mysql would not be ready. So first we have to finish “fixing” db01-a.
For the same reason, we do not allow mysql to start replication automatically: perhaps the database is corrupted and needs to be refreshed. First we need to make sure mysql is ready to be started up on db01-a. Then, when we’re sure it is healthy, we start replication from db01-b to db01-a and monitor it until it is in sync again. Once db01-a has caught up with db01-b, we can switch back and tell heartbeat to make db01-a primary again.
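The “is db01-a caught up?” check before failing back can be sketched as a small function. This is a sketch under assumptions: the input is assumed to be one row of `SHOW SLAVE STATUS` output from db01-a, already fetched into a dict by whatever MySQL client you use; the function name `ready_to_fail_back` and the lag threshold are hypothetical.

```python
def ready_to_fail_back(slave_status, max_lag_seconds=0):
    """Return True only if db01-a is replicating from db01-b and has caught up.

    slave_status: dict holding one row of SHOW SLAVE STATUS from db01-a.
    """
    # Both replication threads must be running.
    if slave_status.get("Slave_IO_Running") != "Yes":
        return False
    if slave_status.get("Slave_SQL_Running") != "Yes":
        return False
    # Seconds_Behind_Master is NULL (None here) when the lag is unknown.
    lag = slave_status.get("Seconds_Behind_Master")
    if lag is None:
        return False
    return int(lag) <= max_lag_seconds


if __name__ == "__main__":
    still_catching_up = {"Slave_IO_Running": "Yes",
                         "Slave_SQL_Running": "Yes",
                         "Seconds_Behind_Master": 1800}
    in_sync = {"Slave_IO_Running": "Yes",
               "Slave_SQL_Running": "Yes",
               "Seconds_Behind_Master": 0}
    print(ready_to_fail_back(still_catching_up))
    print(ready_to_fail_back(in_sync))
```

A check like this only tells a person when it is safe to switch back; the switch itself stays a manual step, in line with the advice above.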