This is the first round of Elasticsearch exercises. In this set, we will load in the data and get the index ready to start cleaning up the documents.
Please get in touch if you have any questions or feedback.
I may post solutions once I’ve done a couple more rounds of exercises with this data; there is plenty of scope for interesting questions covering a wide range of the Elasticsearch APIs. This round covers:
- Creating indices
- Defining mappings
- Ingest pipelines
- Delete by query
Configure Elasticsearch with the following criteria and start Elasticsearch:
Configure Kibana to point to your Elasticsearch node and start Kibana.
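For a local setup this usually only needs one setting in `kibana.yml`; adjust the host and port for your own cluster:

```yaml
# kibana.yml - tell Kibana where to find Elasticsearch
elasticsearch.hosts: ["http://localhost:9200"]
```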
There will be exercises covering the Bulk API later but, for now, we will only be using the Bulk API to get a dataset into the cluster.
Download the archive containing the data with `curl -O https://s3.amazonaws.com/elasticsearch-exercises.whatgeorgemade/olympic-events.tar.gz` and extract it with `tar -xzf olympic-events.tar.gz`. Ensure the `bulk-post-files.sh` script is executable with `chmod +x bulk-post-files.sh`.
Each `ndjson` file is ready to be used with the Bulk API. The `bulk-post-files.sh` script will create an index before iterating over all the `ndjson` files in the directory and `POST`ing each of them to the `_bulk` endpoint. The script takes two optional arguments: the index name to use and the node URL, with defaults of `olympic-events` and `http://localhost:9200` respectively. Change the node URL as required for your cluster.
There is no error handling in the `bulk-post-files.sh` script; it will be refined over time to be more robust.
```
./bulk-post-files.sh "olympic-events" "http://localhost:9200"
```
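If you are curious what the script is doing, a minimal sketch looks roughly like this. This is not the actual script; the index settings body is an assumption, apart from the `1m` refresh interval mentioned below.

```bash
#!/usr/bin/env bash
# Rough sketch of bulk-post-files.sh - illustrative only, not the real script.
INDEX="${1:-olympic-events}"
NODE="${2:-http://localhost:9200}"

# Create the index; the post notes it is created with a 1m refresh interval.
curl -X PUT "$NODE/$INDEX" -H 'Content-Type: application/json' \
  -d '{"settings": {"refresh_interval": "1m"}}'

# POST each ndjson file to the _bulk endpoint.
for f in *.ndjson; do
  curl -s -X POST "$NODE/$INDEX/_bulk" \
    -H 'Content-Type: application/x-ndjson' \
    --data-binary "@$f" > /dev/null
done
```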
The bulk post script created the index with a `1m` refresh interval. If you see zero documents, try again a minute later.
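Alternatively, you can force a refresh instead of waiting; the refresh API is a standard call:

```
POST /olympic-events/_refresh
```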
Validate that the data was imported correctly by using a single API call to show the index name, index health, number of documents, and the size of the primary store. The details in the response must be in that order, with headers, and for the new index only.
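One possible approach is the cat indices API, which lets you select and order columns and include headers; a sketch:

```
GET _cat/indices/olympic-events?v&h=index,health,docs.count,pri.store.size
```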
The cluster health is yellow. Use a cluster API that can explain the problem.
Change the cluster or index settings as required to get the cluster to a green status.
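The allocation explain API reports why shards are unassigned, and on a single-node cluster the usual fix is dropping the replica count; a sketch:

```
GET _cluster/allocation/explain

PUT /olympic-events/_settings
{
  "number_of_replicas": 0
}
```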
Look at how Elasticsearch has applied very general-purpose mappings to the data. Why has it chosen to use the `text` type for the `Age` field? Find all unique values for the `Age` field (there are fewer than 100) and look for any suspicious values.
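One way to list the unique values is a terms aggregation on the `keyword` sub-field that dynamic mapping creates for text fields; a sketch, assuming the default `Age.keyword` sub-field exists:

```
GET olympic-events/_search
{
  "size": 0,
  "aggs": {
    "unique_ages": {
      "terms": { "field": "Age.keyword", "size": 100 }
    }
  }
}
```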
We will be deleting data in the next exercise; making a backup is always prudent. Without making any changes to the data, reindex the `olympic-events` index into a new backup index.
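A minimal reindex request; the destination index name here is only an assumption:

```
# "olympic-events-backup" is an assumed name for the backup index
POST _reindex
{
  "source": { "index": "olympic-events" },
  "dest": { "index": "olympic-events-backup" }
}
```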
The `Height` and `Weight` fields suffer from the same problem as the `Age` field. Later exercises will require numeric-type queries for these fields, so we want to exclude any document we can’t use in our analyses. In a single request, delete all documents from the `olympic-events` index that have a value of `NA` for either the `Height` or `Weight` field.
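A delete-by-query sketch; it assumes both fields are still dynamically mapped as `text`, so a simple `match` on `NA` finds the affected documents:

```
# assumes Height and Weight are still mapped as text at this point
POST olympic-events/_delete_by_query
{
  "query": {
    "bool": {
      "should": [
        { "match": { "Height": "NA" } },
        { "match": { "Weight": "NA" } }
      ],
      "minimum_should_match": 1
    }
  }
}
```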
Notice how the `Games` field contains both the Olympic year and season. Create an ingest pipeline called `split_games` that will split this field into two new fields, `year` and `season`, and remove the original `Games` field.
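One way to build this is a `dissect` processor followed by a `remove` processor; this sketch assumes `Games` values have the form "2012 Summer":

```
# assumes Games values look like "<year> <season>"
PUT _ingest/pipeline/split_games
{
  "description": "Split Games into year and season",
  "processors": [
    { "dissect": { "field": "Games", "pattern": "%{year} %{season}" } },
    { "remove": { "field": "Games" } }
  ]
}
```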
Ensure your new pipeline is working correctly by simulating it with these values:
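The original list of test values is not reproduced above, but a simulate call has this shape; the sample values below are made up for illustration:

```
# the Games values here are illustrative examples only
POST _ingest/pipeline/split_games/_simulate
{
  "docs": [
    { "_source": { "Games": "2012 Summer" } },
    { "_source": { "Games": "1998 Winter" } }
  ]
}
```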
We’ll now start to clean up the mappings. Create a new index called `olympic-events-fixed` with 1 shard, 0 replicas, and the following mapping:
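The field list for the mapping is not reproduced above; the request shape, with a few placeholder field types as assumptions, would look like this:

```
# the properties below are placeholders, not the exercise's actual mapping
PUT olympic-events-fixed
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "Age": { "type": "short" },
      "Height": { "type": "short" },
      "Weight": { "type": "float" }
    }
  }
}
```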