Run P6 in failover mode

Failover is the simplest form of resilience you can add to your Platform 6 deployment.

Failover keeps a single instance node ‘Active’ while one or more other instance nodes wait in ‘Passive’ mode, ready to take over if the active instance node fails.

All passive instance nodes monitor a simple heartbeat. Should the heartbeat fail, one of the passive instance nodes is elected as the new active instance node and starts.

A failed active instance node that is restarted will return to the cluster as a passive instance node.

Note

Failover should not be confused with clustering multiple active Platform 6 instances, which provides both horizontal scalability and resilience. Instance clustering with horizontal scaling is part of the Platform 6 architectural design and will be delivered in a future release. Keep an eye on the product roadmap for details.

Requirements For Failover

A single secure PostgreSQL instance deployed on a separate host from any Platform 6 instance. Consider a database-as-a-service provider or a privately managed failover cluster configuration such as: https://www.enterprisedb.com/enterprise-postgres/edb-postgres-failover-manager
At least two Platform 6 nodes running on separate hosts and ideally separate providers and locations.
All nodes configured to connect to a single PostgreSQL instance (or cluster).
All nodes configured to use a common filesystem for $P6CORE_DATA, except for the configuration directory $P6CORE_DATA/conf and the logs directory $P6CORE_DATA/logs.
An external DNS-based router/balancer to route all requests to the currently active node.
A resilient configuration of all external systems that may trigger routing events in Platform 6, for example a blockchain node or an external database.

A working prototype of the above configuration, for educational purposes only, can be run locally on a single machine using Docker Compose.

Failover Process Overview

Assume two Platform 6 nodes are configured to join the same cluster and both connect to a single resilient PostgreSQL cluster or service:

Start node 1. It will block waiting for two instance nodes to join the cluster.
Start node 2. It will block waiting for two instance nodes to join the cluster.
The instance nodes detect two nodes have joined the cluster so between them they elect a leader (the new active instance node).
The active instance node continues to start and execute normally while emitting a ‘heartbeat’ to the cluster.
The passive instance node blocks, simply monitoring a regular heartbeat from the active instance node. It won’t accept any incoming HTTP requests, so from a load balancer’s perspective it appears offline and all traffic is directed to the active instance node.

If the active node fails, i.e. if the passive node does not receive a heartbeat for a period of time:

The passive instance ‘calls another election’. The leader that is elected will become the active node and proceed as above. That leader is detected by the load balancer which redirects all traffic to it.

If the failed instance node is restarted:

The restarted instance node will see a new leader has been elected and fall back to being a passive instance node, monitoring heartbeats.
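
The passive node's side of this process can be sketched as a simple timer check. This is a minimal illustration only, not Platform 6's actual implementation: the class and method names are hypothetical, the 30000 ms default mirrors the failover.heartbeat.fail.max.duration property, and treating the limit as a strict "greater than" comparison is an assumption.

```python
from dataclasses import dataclass

# Mirrors the documented default of failover.heartbeat.fail.max.duration (ms).
DEFAULT_FAIL_MAX_MS = 30_000


@dataclass
class HeartbeatMonitor:
    """Hypothetical model of a passive node watching the active node."""

    last_heartbeat_ms: int
    fail_max_ms: int = DEFAULT_FAIL_MAX_MS

    def on_heartbeat(self, now_ms: int) -> None:
        # Record the arrival time of the latest heartbeat.
        self.last_heartbeat_ms = now_ms

    def should_call_election(self, now_ms: int) -> bool:
        # Assumption: an election is called once the silence exceeds the limit.
        return now_ms - self.last_heartbeat_ms > self.fail_max_ms
```

In this sketch a passive node would call on_heartbeat each time a heartbeat arrives and poll should_call_election on a timer. A restarted node would start with a fresh monitor and remain passive for as long as heartbeats from the new leader keep arriving.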

Per Instance Configuration

Failover configuration is defined in the application.conf file found in $P6CORE_DATA/conf:

failover {
  enabled = true
}

Other configuration values are:

Property	Value	Description
failover.cluster.size	(default: 2)	Total number of instance nodes in this failover deployment
failover.heartbeat.fail.max.duration	(default: 30000)	Milliseconds without a heartbeat before a new election is called
failover.join.wait.mins	(default: 2)	Minutes to wait for the defined number of failover nodes to join the cluster before giving up

In addition to the failover configuration, Hazelcast configuration is required to interconnect all instance nodes into a single cluster. Again, this is defined in application.conf:

hazelcast {
  hazelcast {
    # Choose something meaningful such as p6core1...n
    instance-name = {ID_OF_THIS_NODE}
  }
  group {
    name = p6
  }
  network {
    port = 5900
    # Or HOSTNAME (container name if running in Docker Compose)
    public-address = {THIS.NODES.PUBLIC.IP}
    join {
      multicast {
        enabled = false
      }
      tcp-ip {
        enabled = true
        # Or HOSTNAME, specifying the port prevents Hazelcast from testing all available ports
        members = ["{THIS.NODES.PUBLIC.IP}:5900", "{IP.OF.OTHER.NODE.1}:5900", "{IP.OF.OTHER.NODE.n}:5900"]
      }
    }
  }
}

There are many ways to connect a cluster using Hazelcast. This example shows the basic interconnection of nodes using TCP/IP. See: https://docs.hazelcast.org/docs/latest/manual/html-single/#setting-up-clusters and https://docs.hazelcast.org/docs/latest/manual/html-single/#discovering-members-by-tcp

In the above example:

Replace {ID_OF_THIS_NODE} with a unique id for each node (something useful as an identity in a log)
Replace {THIS.NODES.PUBLIC.IP} with the public IP address or hostname of the node being configured
Replace "{IP.OF.OTHER.NODE.1}", "{IP.OF.OTHER.NODE.n}" with the public IP addresses or hostnames of all other nodes in the cluster
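
Because each node lists itself together with all other nodes, every node ends up with the same set of members. As a hedged convenience sketch, a small helper can render that list once for reuse in each node's application.conf. The function name and the example addresses are hypothetical, and the port default matches the 5900 used in the example above.

```python
from typing import Sequence


def tcp_ip_members(hosts: Sequence[str], port: int = 5900) -> str:
    """Render host names or IPs as a HOCON list of "host:port" strings."""
    # Each entry is quoted and given an explicit port, as in the example config.
    entries = ", ".join(f'"{host}:{port}"' for host in hosts)
    return f"[{entries}]"
```

For two hypothetical nodes at 10.0.0.1 and 10.0.0.2, tcp_ip_members(["10.0.0.1", "10.0.0.2"]) produces a value that can be pasted after "members =" in the tcp-ip block.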

Firewall Configuration

The firewall rules you need depend on the type of Hazelcast cluster configuration you use. If you follow the simple TCP/IP example above, you will need to open port 5900 to all node addresses used in your cluster. This allows interconnection between all nodes.
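
One way to verify the firewall rules is to attempt a plain TCP connection from each node to every other node. The sketch below uses only the Python standard library and is not part of Platform 6. The function name is hypothetical, and a successful result only shows that something is listening on the port, so the target node must already be running.

```python
import socket


def can_reach(host: str, port: int = 5900, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Running can_reach against each address in the members list from each host gives a quick indication of whether port 5900 is open in both directions.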

Recovering Network Failures

Heartbeat detection and the passive node election mechanism are a compromise in failover design.

More complex systems require at least three sentinel (or consensus) nodes to make election decisions and hold state. Configuring consensus-based failover monitoring is both complex and error prone. Platform 6 failover therefore trades this extra level of resilience for simplicity of deployment.

As a result, there are partial network failure conditions that you should be aware of. If an active instance node loses its connection to the cluster network for long enough for the passive instance nodes to elect a new leader, then when (or if) the network connection to the original active node is restored, the cluster can have two active nodes executing. In this situation, once the cluster has stabilised again, the recovered active node will detect the election of a new leader and ‘resign’ by shutting itself down.

In these circumstances, two active nodes can co-exist for the time it takes to stabilise the cluster (plus 10 seconds).
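
The resignation rule described above can be expressed as a single check. This is an illustrative sketch under stated assumptions, not Platform 6's internal logic: the function and parameter names are hypothetical, and node ids such as p6core1 follow the naming suggested in the Hazelcast example.

```python
def should_resign(node_id: str, believes_active: bool, elected_leader_id: str) -> bool:
    """Hypothetical check run by a node once the cluster has stabilised."""
    # A node that still thinks it is active but is not the elected leader
    # steps down, leaving a single active node in the cluster.
    return believes_active and elected_leader_id != node_id
```

In the scenario above, the recovered node p6core1 would evaluate should_resign("p6core1", True, "p6core2") and shut itself down, while p6core2 continues as the active node.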