Importing Lab Data

If you followed along and set up an Elasticsearch lab, it’s now time to get some data into the cluster so we can begin experimenting with searches, aggregations, reindexing, and more.

Choosing a dataset

Finding a dataset to work with is quite a challenge. I had these criteria in mind when hunting:

Free for anyone to use without requiring registration
Time-based data points
Fields suitable for use in aggregations
Easy to understand
High volume of data points

The dataset I selected contains records of cycle journeys made in London. You can rent a bike in London by picking one up at one of hundreds of docking stations around the capital, cycling to another docking station near your destination, and dropping it off. The bikes are referred to as ‘Boris Bikes’, after the then Mayor of London, Boris Johnson, who introduced the scheme.

Transport for London (TfL), who maintain the bikes, docking stations and infrastructure, make a lot of data available. This data includes details of all journeys made on these bikes: where they were picked up and dropped off, associated timestamps, a bike identifier, rental ID, and journey duration.

There is roughly one file per week of rentals. Each file contains over 200,000 records, and the data goes all the way back to late 2015. We won’t need to ingest all the data, but there’s nothing stopping you if you have no better use for your hard drive!

I’m going to be using this data to explain how to craft searches and aggregations, reindex data, create time-based indices, perform shard filtering, and much more.

Ingesting with Logstash

Logstash will be used to get data from CSV files into Elasticsearch. Ingesting data using Logstash is a big topic so I won’t cover it in any detail here. All that’s required is to follow along with the instructions; you can delete Logstash once we’re done if you don’t want to use it again.

Logstash will be run from the host machine and the data will go to Elasticsearch in our cluster, which currently has a single node.

Download Logstash

The Elastic Stack we’re running is version 7.4, so the corresponding version of Logstash is recommended. There are several options for installation listed on the Logstash download page.

Clone or download the repository

The pipeline instructing Logstash how to transform the CSV files is in my GitHub repository. Either git clone or download the contents.

Create a directory for Logstash to monitor

Logstash needs to be given an absolute path (it won’t accept a relative one), so an environment variable will be used to define this location. Create a directory anywhere on your machine and add an environment variable called CYCLE_JOURNEY_CSV_PATH containing the absolute path to that directory followed by a *.csv pattern, so that it matches the CSV files placed there.

mkdir /Users/username/CSVs
export CYCLE_JOURNEY_CSV_PATH='/Users/username/CSVs/*.csv'
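
Since Logstash rejects relative paths, it may be worth confirming the variable before going any further. The following is a minimal sketch, assuming a POSIX shell on macOS or Linux; the helper name is_absolute_path is made up for illustration and is not part of Logstash.

```shell
# Hypothetical helper (not part of Logstash): succeeds when its argument
# starts with "/", i.e. looks like an absolute POSIX path.
is_absolute_path() {
  case "$1" in
    /*) return 0 ;;
    *)  return 1 ;;
  esac
}

# Report on whatever CYCLE_JOURNEY_CSV_PATH currently holds.
if is_absolute_path "${CYCLE_JOURNEY_CSV_PATH:-}"; then
  echo "OK: ${CYCLE_JOURNEY_CSV_PATH} is an absolute path"
else
  echo "CYCLE_JOURNEY_CSV_PATH is unset or is not an absolute path"
fi
```

The check only looks at the leading slash; it does not verify that the directory exists or that any files match the pattern.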

Bring up Elasticsearch

vagrant up will bring up our single node and the Kibana VM. Then run vagrant ssh node1 and start Elasticsearch.

> vagrant up
Bringing machine 'node1' up with 'virtualbox' provider...
Bringing machine 'node4' up with 'virtualbox' provider...
...
> vagrant ssh node1
[vagrant@node1 ~]> elasticsearch-7.4.0/bin/elasticsearch
...
[INFO ][o.e.c.r.a.AllocationService] [node1] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[.kibana_1][0]]]).
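
The health reported in the log can also be checked from the host machine with the cluster health API. This is a rough sketch, assuming the node is reachable at the same address used for the search query in this post (10.0.200.101:9200); the function names are made up for illustration, and the sed expression is a simple text match rather than a real JSON parser.

```shell
# Address of the lab node; override ES_URL if your node listens elsewhere.
ES_URL="${ES_URL:-http://10.0.200.101:9200}"

# Hypothetical helper: pull the value of the "status" field out of the
# JSON read from standard input.
extract_status() {
  sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p'
}

# Ask Elasticsearch for cluster health and print just the status,
# which is one of green, yellow, or red.
cluster_status() {
  curl -s -XGET "$ES_URL/_cluster/health" | extract_status
}
```

Running cluster_status on the host should print a single word once Elasticsearch has started; no output at all usually means the node could not be reached.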

Ingest

On the host machine, start Logstash using the pipeline configuration in the repository cloned earlier.

> logstash -f tfl-cycle-journey-pipeline.conf
...
[logstash.agent           ] Successfully started Logstash API endpoint {:port=>9600}

Download some cycle journey files from Transport for London. It may be sensible to get a few sequential files to get a month of data. For example, files beginning 172, 173, 174, 175, and 176.
Place these files into the directory that CYCLE_JOURNEY_CSV_PATH points to. Logstash will find them and process them one by one. Your computer’s fans may spin up at some point in the process.

After a few minutes, the files will have finished importing. Logstash won’t give any indication that it’s complete, as it carries on monitoring the directory for new files. Once the import is done, it can be terminated by pressing Ctrl+C.
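
Because Logstash keeps watching the directory, one way to judge whether the import has finished is to poll the index's document count and wait for it to stop rising. The sketch below uses the _count API, assuming the cycle-journeys index name and node address from the search query in this post; the function names and the five-second default interval are made up for illustration.

```shell
# Address of the lab node; override ES_URL if your node listens elsewhere.
ES_URL="${ES_URL:-http://10.0.200.101:9200}"

# Hypothetical helper: pull the number out of a _count response such as
# {"count":123,"_shards":{...}} read from standard input.
extract_count() {
  sed -n 's/.*"count" *: *\([0-9][0-9]*\).*/\1/p'
}

# Print the current number of documents in the cycle-journeys index.
journey_count() {
  curl -s -XGET "$ES_URL/cycle-journeys/_count" | extract_count
}

# Print the count every N seconds (default 5) until interrupted with Ctrl+C.
watch_journey_count() {
  while true; do
    journey_count
    sleep "${1:-5}"
  done
}
```

Run watch_journey_count in a second terminal while Logstash is working. When the number has stayed the same for several polls, the import has probably finished.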

Query Elasticsearch to make sure we have data in the index.

curl -XGET 'http://10.0.200.101:9200/cycle-journeys/_search?pretty'

You should get back some journey details from the new index.

{
  "took" : 19,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 1.0,
    "hits" : [
      ... *** Journey details here ***
    ]
  }
}
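
The total of 10000 with a relation of "gte" does not mean the index holds exactly 10,000 documents. By default, Elasticsearch 7.x only counts hits accurately up to 10,000 and reports "at least this many" beyond that. A minimal sketch of asking for the exact figure, assuming the same node address and index name as the query above (the function name is made up for illustration):

```shell
# Address of the lab node; override ES_URL if your node listens elsewhere.
ES_URL="${ES_URL:-http://10.0.200.101:9200}"

# Search with track_total_hits=true so hits.total.value is exact, and
# size=0 so no journey documents are returned, only the totals.
exact_total() {
  curl -s -XGET "$ES_URL/cycle-journeys/_search?pretty&size=0&track_total_hits=true"
}
```

Calling exact_total should return a response whose relation is "eq" and whose value is the exact number of journeys indexed.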

We are now ready to have a look at what these documents really are, how and where they’re stored, and how we can search them.