Self-healing Apache Zookeeper cluster

Dmitriy Pavlov
Published in Universal Language
Mar 18, 2020


kudos to my wife for that image https://www.instagram.com/gala_just_paint_it/

Introduction

If you manage your own Apache Zookeeper cluster in production and have handled at least one maintenance on it, you definitely know that each version of Zookeeper has its hidden pitfalls:

  • Some versions have a bug due to which a freshly restarted node can’t join the quorum, because the cluster claims it has a smaller id [1]
  • The lack of DNS re-resolving forces you to do a rolling restart of every Zookeeper node just to refresh the list of cluster member IPs
  • Other versions have problems with re-using the same Zookeeper node IDs. Because of this, you had to migrate from server IDs 1,2,3 to 4,5,6 and reconfigure the whole cluster with a node restart.

In general, Apache Zookeeper is one of the most stable things in an infrastructure. However, if you need to touch it, even a small change can cause a lot of work.
But there is light at the end of the tunnel. Thanks to the hard work of the Zookeeper developers and community, we got Apache Zookeeper 3.5.X. Among the tons of fixes applied since the 3.4.X versions, we got two main features:

  • Dynamic cluster reconfiguration [2]. Yes, this means that you don’t need to restart every node to add a node to the cluster or remove one from it (see the sketch right after this list)
  • Re-resolving the DNS name of a node after the connection to it was interrupted. This means that even if you re-create one of the Zookeeper nodes from scratch and it gets a new IP address, you won’t need to restart any of the active Zookeeper nodes. They will find the new node’s IP automatically. More details about this feature can be found in the official tracker [3], [4]
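
To give a feel for the first feature, here is a minimal sketch of dynamic reconfiguration with zkCli. The host names and server id are illustrative; also note that since 3.5.3 reconfiguration must be explicitly enabled with reconfigEnabled=true and is restricted to authorized users:

# Add a new participant without restarting any existing node
# (server spec: server.<id>=<host>:<quorum_port>:<election_port>;<client_port>)
zkCli.sh -server zoo1:2181 reconfig -add "server.4=zoo4:2888:3888;2181"

# Remove a node from the ensemble by its server id
zkCli.sh -server zoo1:2181 reconfig -remove 4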

This is really great news. We had been monitoring the development of Zookeeper 3.5 for a long time and were waiting exactly for these two features. Below I will show why they are so important.

Main purpose

Have you ever searched for the keywords “zookeeper autoscaling” or “self-healing zookeeper cluster”? Have you ever dreamed about a dynamic Zookeeper cluster that keeps itself up and running without any action from your side? With Apache Zookeeper 3.5.6+ we can get this. We can do even more if we flavor it with AWS ECS and AWS Cloud Map. We will get:

  • A completely self-healing Zookeeper cluster on ECS
  • Automatic Zookeeper node registration in Route53
  • Automatic restart of a Zookeeper Docker container by the ECS agent in case it goes down
  • Automatic generation of the Zookeeper myid based on ECS service names

Setup details

In our setup we are going to use the following services and tools:

  • AWS ECS
  • AWS Cloud Map
  • AWSVPC network mode for task definitions
  • Terraform
  • Special metadata URI to automatically generate Zookeeper myid file

Below I will shortly describe each of these pieces and the reason for using them:

  • Why do we need ECS?
    ECS will be used as the manager for our Zookeeper cluster. It will start, monitor and restart containers. ECS service discovery will register each Zookeeper service in AWS Cloud Map, so we will get a stable DNS name to put into the Zookeeper config file
  • Why do we need AWS Cloud Map (Discovery Service)?
    Each Zookeeper node should know about all the other nodes in the cluster, which is why you need to specify the address of each node in the config file. Due to the dynamic nature of our environment we can’t use IP addresses, so we need DNS names instead. This is exactly what we will use AWS Cloud Map for. As mentioned earlier, ECS will register each service with its own DNS name and will keep the records updated (see the config sketch after this list).
  • Why do we need AWSVPC network mode in ECS?
    AWSVPC is a special Docker network mode that can be used within an ECS cluster. In this mode, ECS attaches a dedicated ENI (elastic network interface) to each task, so each task container gets its own private IP on that ENI. Only this network mode gives us a standard DNS A record in the discovery service for each of our ECS services. These DNS names will be used in the Zookeeper config file.
  • Terraform?
    Our company uses Terraform to manage infrastructure as code, so it is the native tool for us to work with AWS.
  • How to generate Zookeeper myid from ECS?
    Unfortunately, even with all these updates, Zookeeper still requires each node to have its own unique id, called myid. Usually you would generate it from the instance hostname, but you can’t rely on hostnames in autoscaling groups, so we need an alternative way. In our solution, we decided to use ECS service names for this purpose. In the example below we will have three ECS services named zoo1, zoo2, zoo3.
    The number in the service name will be used as the Zookeeper ID. To get it, we will use a special metadata URI provided, but not documented, by AWS. The default URI for ECS metadata is http://169.254.170.2/v3/ECS-TASK-ID, but if you add the suffix taskWithTags and query http://169.254.170.2/v3/ECS-TASK-ID/taskWithTags you will get the same metadata plus two additional fields (see the query sketch after this list):
    "aws:ecs:clusterName":"zookeeper"
    "aws:ecs:serviceName":"zoo2"
    Note: each ECS service should be created with the option “Enable ECS managed tags”. More details about ECS tagging can be found at [5]. You should also enable the long ARN format for ECS services and tasks, as described at https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-account-settings.html#ecs-resource-ids
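
To make the Cloud Map part concrete, here is a minimal sketch of the server list in zoo.cfg once discovery is in place. The zookeeper.local namespace name is an assumption for illustration; substitute the private namespace you actually create:

# zoo.cfg (3.5.x syntax: quorum_port:election_port;client_port)
# zoo1/zoo2/zoo3 are the ECS service names registered in the
# hypothetical Cloud Map private namespace zookeeper.local
server.1=zoo1.zookeeper.local:2888:3888;2181
server.2=zoo2.zookeeper.local:2888:3888;2181
server.3=zoo3.zookeeper.local:2888:3888;2181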
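
And here is a quick sketch of the taskWithTags query itself, assuming curl and jq are available inside the container. The full URI, including the task ID, is injected by the ECS agent as $ECS_CONTAINER_METADATA_URI:

# Fetch the task metadata together with the ECS managed tags
curl -s "$ECS_CONTAINER_METADATA_URI/taskWithTags" | jq -r '.TaskTags."aws:ecs:serviceName"'
# prints: zoo2 (for a task that belongs to the zoo2 service)

# Strip everything but the digit to get the Zookeeper id
curl -s "$ECS_CONTAINER_METADATA_URI/taskWithTags" \
  | jq -r '.TaskTags."aws:ecs:serviceName"' | grep -o '[1-9]'
# prints: 2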

Unexpected problem

While researching and preparing this project we ran into an unexpected problem with the default settings of the AWS Cloud Map DNS namespace. The problem is the very long caching of a negative DNS reply before a new record becomes visible, especially if you don’t use AWS DNS directly but have your own DNS servers which just forward some requests to Route53 zones. To save time I am not going to go into too much detail here; we may write a separate post about it. Meanwhile, we were able to handle it using a local pdns-recursor. In our solution, pdns-recursor is used to override the default negative TTL and forward requests directly to the AWS DNS servers.

Cluster Setup

Finally, we are done with all that boring background and can build our self-healing Zookeeper cluster.

The Terraform code which creates all the needed resources can be found on GitHub — https://github.com/dethmix/self-healing-zookeeper

In the same repo, in the docker folder, you will find a modified Dockerfile for building a Zookeeper image which has:

  • Option to enable/disable dynamic reconfiguration
  • Option to enable monitoring commands
  • Patch to create Zookeeper ID from ECS service name
  • Option to enable/disable ACL

To build your cluster, just fill out the variable.tf file with the appropriate values for your AWS account (VPC ID, security group, a name for the ECS cluster) and execute the magic commands terraform plan and then terraform apply; the run is sketched below. That is all. Your AWS account now contains a self-healing Zookeeper cluster running on autoscaling instances.
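
The whole run boils down to a few commands, assuming your AWS credentials are already configured:

# from the root of the cloned repo, after filling in variable.tf
terraform init    # download the AWS provider and initialize the working directory
terraform plan    # review the resources that will be created
terraform apply   # create the ECS cluster, Cloud Map namespace and Zookeeper services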

Now let me highlight a few interesting moments from the mentioned GitHub repo.

One of the main changes to the default docker/docker-entrypoint.sh was the following line:

echo `wget -qO- $ECS_CONTAINER_METADATA_URI/taskWithTags | jq -r '.TaskTags."aws:ecs:serviceName"' | grep -o "[1-9]"` > "$ZOO_DATA_DIR/myid"

It queries and parses the task metadata to get the ECS service index and uses it as the Zookeeper myid.

The file files/user_data.sh contains quite a lot of code for a proper pdns-recursor installation and setup. Unfortunately, the default pdns-recursor version from EPEL will not work on Amazon Linux 2; to install it you will need to update the protobuf package. While the pdns-recursor config is small, you need to remember that it should be accessible from a Docker container that is started on a dedicated network interface. Because of this you can’t use localhost addresses in the setup and have to configure pdns-recursor to listen on the primary private IP of the instance. A sketch of such a config follows.
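
Here is a minimal sketch of the relevant recursor.conf settings. The addresses and the zone name are assumptions for illustration: 10.0.1.10 stands for the instance’s primary private IP, 10.0.0.2 for the VPC’s AWS DNS resolver, and zookeeper.local for your Cloud Map namespace:

# /etc/pdns-recursor/recursor.conf (sketch)
local-address=10.0.1.10                  # listen on the primary private IP, not on localhost
allow-from=10.0.0.0/8                    # accept queries from the task ENIs
forward-zones=zookeeper.local=10.0.0.2   # send the Cloud Map zone straight to the AWS resolver
max-negative-ttl=5                       # cap negative-answer caching at 5 seconds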

Now let’s answer one of the main questions which I am sure most of you want to ask: does this Zookeeper cluster support autoscaling? Will the cluster be automatically extended and shrunk if you add/remove nodes in the autoscaling group? The answer is no, it won’t. However, we can use the dynamic reconfiguration feature for this purpose. The general procedure to extend/shrink such a cluster is the following:

  • Extend/shrink the autoscaling group
  • Update the ECS cluster and add/remove the appropriate ECS services
  • Manually update the configuration of the cluster by adding/removing the new nodes (use zkCli for this, as in the reconfig sketch above)
  • Update the task definition for the services which are still running, so that they all have the same list of nodes

I am completely sure that by adding an AWS Lambda function which monitors the state of the autoscaling group and dynamically reconfigures the Zookeeper cluster, we can get fully automatic scaling and shrinking of the Zookeeper cluster.

Conclusions

I think a self-healing Apache Zookeeper cluster is the last step before a completely dynamic and autoscaled Zookeeper cluster. I hope this post will help you create a durable Zookeeper setup. The more dynamic clusters we create, the sooner we will find and fix the issues related to Zookeeper scaling.
