Setting up a local Elasticsearch cluster with VirtualBox and Vagrant

Setting up an Elasticsearch lab in your development environment is an important step in learning the ins and outs of the stack. Downloading the Elasticsearch zip and running it from the resulting directory will get you up and running, but that gives you only a single node. If you want to experiment with a multi-node cluster, you need to run several instances of Elasticsearch at the same time.

Deploying a small three-node cluster is simple to do on a mid-range laptop using VirtualBox and Vagrant. VirtualBox runs the VMs, and Vagrant orchestrates and provisions them.
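
As a rough illustration of what provisioning has to produce for each VM, the sketch below renders the handful of elasticsearch.yml settings a three-node cluster typically needs. The node names, IP addresses and cluster name are hypothetical, and the discovery settings assume Elasticsearch 7 or later. It is not taken from the repository mentioned later, which may do this differently.

```python
# Hypothetical node names and host-only network addresses; adjust to your lab.
NODES = {
    "node1": "192.168.56.11",
    "node2": "192.168.56.12",
    "node3": "192.168.56.13",
}


def render_config(node_name, nodes=NODES, cluster_name="lab-cluster"):
    """Return elasticsearch.yml text for one node (settings assume ES 7+)."""
    seed_hosts = ", ".join(f'"{ip}"' for ip in nodes.values())
    masters = ", ".join(f'"{name}"' for name in nodes)
    lines = [
        f"cluster.name: {cluster_name}",
        f"node.name: {node_name}",
        # Bind to the VM's own address so the other nodes can reach it.
        f"network.host: {nodes[node_name]}",
        f"discovery.seed_hosts: [{seed_hosts}]",
        f"cluster.initial_master_nodes: [{masters}]",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for name in NODES:
        print(f"# --- {name} ---")
        print(render_config(name))
```

Every node gets the same cluster name and seed host list; only the node name and bind address differ, which is why this kind of setup is easy to template from a provisioning script.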

Why not just run Elasticsearch multiple times on the same machine?

A node is an instance of Elasticsearch running in a JVM. A machine is a physical host, or a VM in our case. Elasticsearch runs on the assumption that there will be one node per machine.

Elasticsearch will bind to certain ports when it starts up:

  • The transport interface, which is used for inter-node communication, binds to a port in the default range of 9300-9399.
  • The HTTP interface, used for accepting API calls, uses the default range 9200-9299.

If you’re running multiple nodes on one host operating system, each node will try the ports in the default ranges in turn until it finds one it can bind to. The cluster may start, but you won’t know in advance which port each node is using.
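
The effect is easy to reproduce with plain sockets. The following is a minimal sketch of the "try each port until one binds" behaviour, not Elasticsearch's actual implementation. The function name and the 19200-19299 range are made up for the example, and the range is kept away from 9200-9299 so it won't clash with a real node.

```python
import socket


def first_free_port(start, end, host="127.0.0.1"):
    """Bind to the first available port in [start, end]; return (socket, port)."""
    for port in range(start, end + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            sock.close()  # port already taken, try the next one
            continue
        sock.listen()
        return sock, port
    raise RuntimeError(f"no free port in {start}-{end}")


if __name__ == "__main__":
    held = []
    for node in ("node-a", "node-b", "node-c"):
        sock, port = first_free_port(19200, 19299)
        held.append(sock)  # keep the socket open so the port stays taken
        print(f"{node} bound to port {port}")
    for sock in held:
        sock.close()
```

Which port each "node" ends up on depends on start order and on whatever else is already listening, which is exactly the uncertainty described above.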

Elasticsearch also sizes some of its defaults, such as thread pools, from the hardware it detects. Each node on a shared machine assumes it has all of that hardware to itself, so running multiple nodes on the same machine leaves every one of them configured for resources it doesn’t actually have.
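
One concrete case is thread pool sizing. The Elasticsearch documentation gives the default search pool size as int((allocated processors * 3) / 2) + 1 and the default write pool size as the number of allocated processors. The sketch below applies those two formulas to a hypothetical 4-core machine. Treat the formulas as the documented defaults for recent versions rather than a guarantee for every release; the helper names are mine.

```python
def search_pool_size(processors):
    """Default search thread pool size, per the documented formula."""
    return (processors * 3) // 2 + 1


def write_pool_size(processors):
    """Default write thread pool size: one thread per processor."""
    return processors


def total_threads(nodes, processors):
    """Search + write threads across all nodes that each detect `processors` cores."""
    per_node = search_pool_size(processors) + write_pool_size(processors)
    return nodes * per_node


if __name__ == "__main__":
    cores = 4  # hypothetical laptop
    print("one node:   ", total_threads(1, cores), "threads")
    print("three nodes:", total_threads(3, cores), "threads on the same", cores, "cores")
```

The node.processors setting can override the detected count, but giving each node its own VM avoids having to compensate by hand.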

Virtual machines vs containers

You could run your cluster in virtual machines or containers. I’ve chosen VMs for several reasons:

  • After attending the Elastic Engineer I & II courses, I wanted an environment similar to the one I used there. This would allow me to re-run the lab exercises at home.
  • I wanted SSH access to each node.
  • It’s closer to how a real production cluster would be deployed.
  • If you are practicing for the certification exam, it’s very similar to the environment you’ll be using during the exam.
  • I find it easier to configure.

Building the lab

I have put all the requirements and instructions in a GitHub repository. There are also some notes on how and why the system settings are set up the way they are.

If you give it a try and have any feedback, please open an issue.

What’s next

I’ll be writing a series of posts based around the Elastic Certified Engineer exam objectives. These posts aren’t purely exam preparation, however. The exam objectives cover the fundamentals of what you should know if you are deploying or maintaining a cluster, so they’re worth learning even if you’re not attempting the certification exam.

Now that we have a suitable environment, we need some data. There is plenty of open data available, but finding a raw data set that provides enough scope for several tasks is challenging, and working out a Logstash pipeline to ingest it can be a whole project by itself. I’m working through one now that should be suitable.