Sum aggregation accepts boolean fields

The problem

A question on the Elasticsearch Slack today received an interesting answer that used a trick I haven’t noticed before. The trick is noteworthy as it has some handy use-cases.

The question boiled down to finding the number of documents where a field value is true, then using that count in a bucket_script pipeline aggregation.

Ben Trent - an ML Engineer at Elastic - replied with a trick involving the sum aggregation type. When applied to a boolean field, sum will treat values of false as 0 and true as 1. This results in it returning the number of documents with a value of true.

The Gist Ben provided gave a great example of how this can be used to address the OP’s problem. I’ll go over how it works but change the mapping to be a bit more specific.

The solution

Setup

I’ll create the mapping and index some documents first:

PUT products/
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }, 
  "mappings": {
    "properties": {
      "productCategory": {
        "type": "keyword"
      },
      "inStock": {
        "type": "boolean"
      }
    }
  }
}

PUT products/_doc/1
{
  "productCategory": "book",
  "inStock": true
}

PUT products/_doc/2
{
  "productCategory": "book",
  "inStock": false
}

PUT products/_doc/3
{
  "productCategory": "book",
  "inStock": true
}

PUT products/_doc/4
{
  "productCategory": "toy",
  "inStock": true
}

PUT products/_doc/5
{
  "productCategory": "toy",
  "inStock": true
}

PUT products/_doc/6
{
  "productCategory": "toy",
  "inStock": false
}

PUT products/_doc/7
{
  "productCategory": "toy",
  "inStock": false
}

Start at the top

I’ll start by analysing the products by category.

GET products/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "terms": {
        "field": "productCategory"
      }
    }
  }
}

This creates a bucket per category - I’m removing the metadata and hits blocks from all results for brevity:

{
  "aggregations" : {
    "categories" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "toy",
          "doc_count" : 4
        },
        {
          "key" : "book",
          "doc_count" : 3
        }
      ]
    }
  }
}

We need to know the following in order to work out the percentage of products that are in stock:

Number of products in the category with a value of true for the inStock field
Total number of products in the category

Counting in stock products

Finding the number of in stock products is the more tricky of the two but this is where the sum aggregation will make life much easier. To see how it works, use a terms aggregation on the inStock field:

GET products/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "terms": {
        "field": "productCategory"
      },
      "aggs": {
        "inStock": {
          "terms": {
            "field": "inStock"
          }
        }
      }
    }
  }
}

The bucket keys in the results show how Elasticsearch is treating the values:

{
  "aggregations" : {
    "categories" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "toy",
          "doc_count" : 4,
          "inStock" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : 0,
                "key_as_string" : "false",
                "doc_count" : 2
              },
              {
                "key" : 1,
                "key_as_string" : "true",
                "doc_count" : 2
              }
            ]
          }
        },
        {
          "key" : "book",
          "doc_count" : 3,
          "inStock" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : 1,
                "key_as_string" : "true",
                "doc_count" : 2
              },
              {
                "key" : 0,
                "key_as_string" : "false",
                "doc_count" : 1
              }
            ]
          }
        }
      ]
    }
  }
}

Switching the terms to a sum will give us the number we need straight away, as Elasticsearch adds up all the ones (in stock) and zeroes (not in stock):

GET products/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "terms": {
        "field": "productCategory"
      },
      "aggs": {
        "numInStock": {
          "sum": {
            "field": "inStock"
          }
        }
      }
    }
  }
}

{
  "aggregations" : {
    "categories" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "toy",
          "doc_count" : 4,
          "numInStock" : {
            "value" : 2.0,
            "value_as_string" : "true"
          }
        },
        {
          "key" : "book",
          "doc_count" : 3,
          "numInStock" : {
            "value" : 2.0,
            "value_as_string" : "true"
          }
        }
      ]
    }
  }
}

Counting all products

Ben used a value_count aggregation to count the number of documents in the categories buckets. This is an unnecessary step; the document count is already available for us to use in a pipeline aggregation.

Calculating the percentage

Here’s the pipeline aggregation to calculate the percentage of in stock products:

GET products/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "terms": {
        "field": "productCategory"
      },
      "aggs": {
        "numInStock": {
          "sum": {
            "field": "inStock"
          }
        },
        "percInStock": {
          "bucket_script": {
            "buckets_path": {
              "inStock": "numInStock",
              "total": "_count"
            },
            "script": "(params.inStock / params.total) * 100"
          }
        }
      }
    }
  }
}

The buckets_path section may look a bit strange.

A bucket_script pipeline aggregation is a parent pipeline aggregation, meaning it can compute new values from the output of its parent aggregation. In this case, the parent is the categories aggregation. _count is a special bucket path that will use the document count as its input. This is the number of products in the category bucket.

numInStock is a metric aggregation, so using it in a buckets_path will use the metric aggregation value.

The script simply does the calculation to find the percentage and the output is just what we need:

{
  "aggregations" : {
    "categories" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "toy",
          "doc_count" : 4,
          "numInStock" : {
            "value" : 2.0,
            "value_as_string" : "true"
          },
          "percInStock" : {
            "value" : 50.0
          }
        },
        {
          "key" : "book",
          "doc_count" : 3,
          "numInStock" : {
            "value" : 2.0,
            "value_as_string" : "true"
          },
          "percInStock" : {
            "value" : 66.66666666666666
          }
        }
      ]
    }
  }
}

Summary

Using sum aggregations on boolean fields is a really useful trick for finding the number of documents where that field value is true.

Despite pipeline aggregations being removed from the Elastic Certified Engineer exam, they’re an incredibly useful feature and I will be writing more about them in the future.