Clustering and Failover

Failover is the simplest form of resilience you can add to your Platform 6 deployment.

Failover provides a mechanism whereby a single instance node is ‘active’ while one or more instance nodes wait in ‘passive’ mode, ready to take over in the event of an active instance node failure.

A simple heartbeat is monitored by all passive instance nodes and, should the heartbeat fail, one of the passive instance nodes is elected as the new active instance node and starts.

A failed active instance node that is restarted will return to the cluster as a passive instance node.

Note

Failover should not be confused with clustering multiple active Platform 6 instances, which provides both horizontal scalability and resilience. Instance clustering providing horizontal scaling is part of the Platform 6 architectural design and will be delivered in a future release. Keep an eye on the product road map for details.

Requirements For Failover

  • a single secure PostgreSQL instance deployed on a separate host from any Platform 6 instance. Consider a database-as-a-service provider or a privately managed failover cluster configuration such as: https://www.enterprisedb.com/enterprise-postgres/edb-postgres-failover-manager
  • at least two Platform 6 instances running on separate hosts, ideally on separate providers and in separate locations
  • all instances configured to connect to a single PostgreSQL instance (or cluster)
  • all instances configured to use a common filesystem: $P6CORE_DATA
  • an external DNS-based router/balancer to route all requests to the currently active instance
  • a resilient configuration of all external systems that may trigger routing events in Platform 6, for example a blockchain node or an external database

Failover Process Overview

Assume we have two Platform 6 instances configured to join the same cluster, both connecting to a single resilient PostgreSQL cluster or service:

  1. Start instance 1. It will block, waiting for two instance nodes to join the cluster.
  2. Start instance 2. It will block, waiting for two instance nodes to join the cluster.
  3. The instance nodes detect that two nodes have joined the cluster, so between them they elect a leader (the new active instance node).
  4. The active instance node continues to start and execute normally while emitting a ‘heartbeat’ to the cluster.
  5. The passive instance node blocks, simply monitoring the regular heartbeat from the active instance node.

If the active node fails, i.e. if a passive node does not receive a heartbeat indication for a period of time:

  1. The passive instance nodes ‘call another election’. The node that is elected leader becomes the new active node and proceeds as above.

If the failed instance node is restarted:

  1. The restarted instance node will see that a new leader has been elected and fall back to being a passive instance node, monitoring heartbeats.

Per Instance Configuration

Failover configuration is defined in the application.conf file found in $P6CORE_DATA/conf:

failover: {
  "enabled": true
}

Other configuration values are:

Property                                Default  Description
failover.cluster.size                   2        Total number of instance nodes in this failover deployment
failover.heartbeat.fail.max.duration    30000    Milliseconds without a heartbeat before a new election is called
failover.join.wait.mins                 2        Minutes to wait for the defined number of failover nodes to join the cluster before giving up
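
For example, a sketch of a two-node deployment that waits five minutes for both nodes to join and calls an election after 45 seconds without a heartbeat might look like this. The property names are those documented above; the values are illustrative, and the dotted-path form assumes the standard HOCON syntax is accepted by the configuration loader:

failover: {
  "enabled": true
}
failover.cluster.size: 2
failover.heartbeat.fail.max.duration: 45000
failover.join.wait.mins: 5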

In addition to the failover configuration, Hazelcast configuration is required to allow the interconnection of all instance nodes into a single cluster. Again, this is defined in application.conf:

hazelcast: {
  "hazelcast": {
    "instance-name": "p6_node_{IDOFTHISNODE}"
  },
  "group": {
    "name": "p6"
  },
  "network": {
    "port": 5900,
    "public-address": "{THIS.NODES.PUBLIC.IP}",
    "join": {
      "multicast": {
        "enabled": false
      },
      "tcp-ip": {
        "enabled": true,
        "members": ["{THIS.NODES.PUBLIC.IP}", "{IP.OF.OTHER.NODE.1}", "{IP.OF.OTHER.NODE.n}"]
      }
    }
  }
}

There are many ways to connect a cluster using Hazelcast. This example shows the basic interconnection of nodes using TCP/IP. See: https://docs.hazelcast.org/docs/latest/manual/html-single/#setting-up-clusters and https://docs.hazelcast.org/docs/latest/manual/html-single/#discovering-members-by-tcp

In the above example:

  1. Update p6_node_{IDOFTHISNODE} to contain a unique ID for each node (something useful as an identity in a log)
  2. Replace {THIS.NODES.PUBLIC.IP} with the public IP address of the node being configured
  3. Replace "{IP.OF.OTHER.NODE.1}", "{IP.OF.OTHER.NODE.n}" with the public IP addresses of all other nodes in the cluster (a filled-in example is shown below)
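
As an illustration, node 1 of a two-node cluster might end up with a configuration like the following sketch (the node name and the 198.51.100.x addresses are purely illustrative placeholders):

hazelcast: {
  "hazelcast": {
    "instance-name": "p6_node_1"
  },
  "group": {
    "name": "p6"
  },
  "network": {
    "port": 5900,
    "public-address": "198.51.100.10",
    "join": {
      "multicast": {
        "enabled": false
      },
      "tcp-ip": {
        "enabled": true,
        "members": ["198.51.100.10", "198.51.100.20"]
      }
    }
  }
}

Node 2 would use the same settings with its own instance-name (for example p6_node_2) and its own address in public-address; the members list stays the same on both nodes.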

Firewall Configuration

Firewall requirements depend very much on the type of Hazelcast cluster configuration you have used, but if you follow the simple TCP/IP example above you will need to open ports 5900-6000 to all node addresses used in your cluster. This allows interconnection between all nodes.

Recovering From Network Failures

Heartbeat detection and the passive node election mechanism are a compromise in failover design.

More complex systems require at least three sentinel (or consensus) nodes to make election decisions and hold state. Consensus-based failover monitoring is both complex and error prone to configure. As such, Platform 6 failover trades this extra level of resilience for simplicity of deployment.

Therefore there are partial network failure conditions that you should be aware of.

If an active instance node loses its connection to the cluster network for long enough for the passive instance nodes to elect a new leader, it is possible that, if and when the network connection to the original active node is restored, the cluster will briefly have two active nodes executing. In this situation, once the cluster has stabilised again, the recovered active node will detect the election of a new leader and ‘resign’ by shutting itself down.

In these circumstances, two active nodes can co-exist for the time it takes to stabilise the cluster (plus 10 seconds).