15. High Availability

15.1. High Availability of the Robin control plane

The Robin CNS management plane consists of services that manage the physical resources of a cluster and the applications deployed to it. These management services do not run directly on the host. Rather, they run in containers deployed on each of the nodes in the cluster.

A number of the services, referred to as Agent services, are active on all nodes in the cluster. These services are responsible for handling the deployment of application resources on the node and for monitoring the health of the node. The Agent services include:

  • robin-agent

  • iomgr-server

  • monitor-server

Each master node in a Robin CNS cluster (up to a maximum of three) is automatically assigned the role of Manager, with the first Manager becoming the MASTER and each additional Manager becoming a SLAVE. Two management services run on every Manager node:

  • The robin-watchdog service is responsible for maintaining the integrity of the Robin CNS management plane. On the MASTER Manager, robin-watchdog makes sure that all management services that should be running are up and healthy. On SLAVE Managers, robin-watchdog monitors the state of the MASTER and stands ready to take over as MASTER if the current MASTER fails.

  • The postgresql-9.6 service acts as a datastore for all Robin Platform control data. The instance of postgresql-9.6 running on the MASTER Manager is responsible for committing all database transactions. The instances running on the other two Managers are read-only SLAVEs (one synchronous and the other asynchronous). Before a database transaction can be committed, the affected data blocks must be flushed to disk on the MASTER node and on the node where the synchronous SLAVE is running. This ensures that there will be no data loss should the MASTER node (the MASTER instance of postgresql-9.6) go down.
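The commit rule described above can be illustrated with a small model. This is only a sketch of the rule as stated in the text, not Robin or postgresql code: the `PgInstance` class and the `commit` function are hypothetical names, and the asynchronous SLAVE is modeled as a best-effort write that never blocks a commit.

```python
from dataclasses import dataclass, field


@dataclass
class PgInstance:
    """Hypothetical stand-in for one postgresql-9.6 instance on a Manager."""
    name: str
    up: bool = True
    flushed: list = field(default_factory=list)

    def flush(self, txn: str) -> bool:
        # A node that is down cannot flush anything to disk.
        if not self.up:
            return False
        self.flushed.append(txn)
        return True


def commit(txn: str, master: PgInstance, sync_slave: PgInstance,
           async_slave: PgInstance) -> bool:
    """Commit only if both the MASTER and the synchronous SLAVE can flush."""
    if not (master.up and sync_slave.up):
        return False  # the rule in the text cannot be satisfied
    master.flush(txn)
    sync_slave.flush(txn)
    # Assumption: the asynchronous SLAVE is best effort and never blocks.
    async_slave.flush(txn)
    return True
```

In this model, a committed transaction always exists on two nodes, which is why losing the MASTER alone does not lose data.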

A number of Robin CNS management services only run on the MASTER Manager node. These services, which handle the management of all cluster resources, deployed applications, etc., include:

  • stormgr-server

  • robin-server

  • robin-file-server

  • robin-event-server
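The placement of services by node role, as listed in this section, can be summarized in a short lookup. This is a sketch for illustration only: the service names come from the lists above, while the role label `WORKER` (a node that is not a Manager) and the function `expected_services` are assumptions introduced here.

```python
# Service names are taken from the lists in this section.
AGENT_SERVICES = ["robin-agent", "iomgr-server", "monitor-server"]
MANAGER_SERVICES = ["robin-watchdog", "postgresql-9.6"]
MASTER_ONLY_SERVICES = [
    "stormgr-server",
    "robin-server",
    "robin-file-server",
    "robin-event-server",
]


def expected_services(role: str) -> list:
    """Return the services expected on a node with the given role.

    Roles: "MASTER", "SLAVE", or "WORKER" (hypothetical label for a
    node that is not a Manager).
    """
    if role == "WORKER":
        return list(AGENT_SERVICES)
    if role == "SLAVE":
        return AGENT_SERVICES + MANAGER_SERVICES
    if role == "MASTER":
        return AGENT_SERVICES + MANAGER_SERVICES + MASTER_ONLY_SERVICES
    raise ValueError(f"unknown role: {role}")
```

For example, `expected_services("SLAVE")` includes robin-watchdog and postgresql-9.6 but none of the MASTER-only services.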

Disaster recovery

In the event that the MASTER Manager fails or becomes unhealthy (catastrophic hardware failure, network isolation, failure of a key management service, etc.), one of the SLAVE Manager nodes will take over as MASTER. This way, data integrity is maintained (for key metadata related to the management of storage for the Robin cluster and for deployed applications) and a mechanism is provided for recovering from hard failures. When a fail-over does occur, the central management plane will not be operational for a short while, because the fail-over involves steps such as converting one of the SLAVE instances of the postgresql database into the MASTER instance.
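The fail-over decision can be sketched as follows. The text does not say how the new MASTER is selected, so this sketch assumes that the Manager hosting the synchronous postgresql SLAVE is preferred, since it holds every committed transaction. The `Manager` class, the `fail_over` function, and the `DOWN` role marker are hypothetical and do not reflect the actual robin-watchdog implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Manager:
    name: str
    role: str           # "MASTER", "SLAVE", or "DOWN" (hypothetical marker)
    healthy: bool
    sync: bool = False  # True if this node hosts the synchronous SLAVE


def fail_over(managers: List[Manager]) -> Optional[Manager]:
    """Return the MASTER after checking health, promoting a SLAVE if needed."""
    master = next((m for m in managers if m.role == "MASTER"), None)
    if master is not None and master.healthy:
        return master  # nothing to do

    candidates = [m for m in managers if m.role == "SLAVE" and m.healthy]
    if not candidates:
        return None  # no Manager is available to take over

    # Assumption: prefer the synchronous SLAVE (False sorts before True).
    new_master = min(candidates, key=lambda m: not m.sync)
    if master is not None:
        master.role = "DOWN"
    new_master.role = "MASTER"
    new_master.sync = False
    return new_master
```

While a promotion like this is in progress, requests to the management plane would not be served, which corresponds to the brief outage described above.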

Note

A Highly Available Robin CNS cluster can tolerate the loss of one Manager node (MASTER or SLAVE). If a second Manager node fails, it will no longer be possible to commit database transactions, because no synchronous postgresql SLAVE instance would remain.
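The failure tolerance in this note can be checked by enumerating every combination of failed Managers. This is a sketch under one assumption taken from the note: commits need a MASTER plus a synchronous SLAVE, so at least two Managers must be healthy. The Manager names and both functions are hypothetical.

```python
from itertools import combinations

MANAGERS = ("manager-1", "manager-2", "manager-3")  # hypothetical names


def can_commit(healthy) -> bool:
    # A MASTER and a synchronous SLAVE are both required, so at least
    # two healthy Managers must remain.
    return len(healthy) >= 2


def survivable_failures(managers=MANAGERS) -> dict:
    """Map a failure count to whether every such failure still allows commits."""
    result = {}
    for k in range(len(managers) + 1):
        result[k] = all(
            can_commit(set(managers) - set(failed))
            for failed in combinations(managers, k)
        )
    return result


print(survivable_failures())
# {0: True, 1: True, 2: False, 3: False}
```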