Understanding `should` clauses

A common mistake I’ve seen, and made, is misunderstanding how should clauses in a bool query work. They’re often thought of as the OR part of your query, but that’s true only some of the time.

It’s important to know exactly what should does: bool queries are the bread and butter of most Elasticsearch queries, and getting it wrong can return more documents than you expect. There can be certification exam questions on the mechanics, too.

should this match?

To demonstrate this issue, we’ll create an index, add some documents, then run some queries and dig into the results.

PUT myindex
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }, 
  "mappings": {
    "properties": {
      "category": {
        "type": "keyword"
      },
      "comment": {
        "type": "text"
      }
    }
  }
}

POST myindex/_doc
{
  "category": "video",
  "comment": "The video was good"
}

POST myindex/_doc
{
  "category": "video",
  "comment": "The video was terrible"
}

POST myindex/_doc
{
  "category": "video",
  "comment": "I didn't watch the video"
}

POST myindex/_doc
{
  "category": "blog",
  "comment": "Which blog?"
}

POST myindex/_doc
{
  "category": "blog",
  "comment": "The blog is good"
}

POST myindex/_doc
{
  "category": "blog",
  "comment": "I time my watch by the regularity of the posts"
}

We want to find all the documents where the comment contains either watch or good, which has should query written all over it.

GET myindex/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "comment": "watch"
          }
        },
        {
          "match": {
            "comment": "good"
          }
        }
      ]
    }
  }
}

The results are exactly what we’re expecting; all the documents returned contain either watch or good:

{
  ...
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : 1.1077526,
    "hits" : [
      {
        "_index" : "myindex",
        "_type" : "_doc",
        "_id" : "IdL6YHEB_3GqsWyQHD8Y",
        "_score" : 1.1077526,
        "_source" : {
          "category" : "video",
          "comment" : "The video was good"
        }
      },
      {
        "_index" : "myindex",
        "_type" : "_doc",
        "_id" : "JdL6YHEB_3GqsWyQHj-z",
        "_score" : 1.1077526,
        "_source" : {
          "category" : "blog",
          "comment" : "The blog is good"
        }
      },
      {
        "_index" : "myindex",
        "_type" : "_doc",
        "_id" : "I9L6YHEB_3GqsWyQHj89",
        "_score" : 1.0152972,
        "_source" : {
          "category" : "video",
          "comment" : "I didn't watch the video"
        }
      },
      {
        "_index" : "myindex",
        "_type" : "_doc",
        "_id" : "JtL6YHEB_3GqsWyQHz8B",
        "_score" : 0.71635467,
        "_source" : {
          "category" : "blog",
          "comment" : "I time my watch by the regularity of the posts"
        }
      }
    ]
  }
}

We now want to narrow those results down to comments in the video category. Most people will simply add a must block to the top of the bool query and expect it to do exactly what we’re asking:

GET myindex/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "category": "video"
          }
        }
      ],
      "should": [
        {
          "match": {
            "comment": "watch"
          }
        },
        {
          "match": {
            "comment": "good"
          }
        }
      ]
    }
  }
}

Looking at the results, we can see that this is where the trouble starts:

{
  ...
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.8008997,
    "hits" : [
      {
        "_index" : "myindex",
        "_type" : "_doc",
        "_id" : "IdL6YHEB_3GqsWyQHD8Y",
        "_score" : 1.8008997,
        "_source" : {
          "category" : "video",
          "comment" : "The video was good"
        }
      },
      {
        "_index" : "myindex",
        "_type" : "_doc",
        "_id" : "I9L6YHEB_3GqsWyQHj89",
        "_score" : 1.7084444,
        "_source" : {
          "category" : "video",
          "comment" : "I didn't watch the video"
        }
      },
      {
        "_index" : "myindex",
        "_type" : "_doc",
        "_id" : "ItL6YHEB_3GqsWyQHT_r",
        "_score" : 0.6931472,
        "_source" : {
          "category" : "video",
          "comment" : "The video was terrible"
        }
      }
    ]
  }
}

minimum_should_match and its changing default value

You were probably expecting two documents to match; what’s that third document doing in the results? It is a video comment but it doesn’t contain either watch or good. Understanding why this happens requires knowing what the minimum_should_match parameter does and that it has different default values depending on what else is in the bool query.

minimum_should_match specifies how many of the should clauses in our query must match a document, either as a number or as a percentage. For example, if we have four should clauses and we want at least two of them to match a document, we can specify either 2 or 50%. Under most circumstances, you’ll want at least one of the clauses to match. Even if you are aware of what minimum_should_match does, the default value is what trips people up.
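As a rough sketch of the number-versus-percentage point, the query below puts four should clauses on the comment field of our index and asks for at least half of them to match. The four search words are picked purely for illustration; with four clauses, "50%" asks for the same thing as 2.

```console
# Illustrative only: the four search words are arbitrary choices.
# With four should clauses, "50%" requires at least two of them to match.
GET myindex/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "comment": "video"
          }
        },
        {
          "match": {
            "comment": "blog"
          }
        },
        {
          "match": {
            "comment": "watch"
          }
        },
        {
          "match": {
            "comment": "good"
          }
        }
      ],
      "minimum_should_match": "50%"
    }
  }
}
```

Against the six sample documents, this should return only comments that contain at least two of the four words, such as "The video was good". None of this explains a surprise extra document, though; for that we need to look at the default.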

This is the key part in the docs:

If the bool query includes at least one should clause and no must or filter clauses, the default value is 1. Otherwise, the default value is 0.

In the first query, which contained only should clauses, minimum_should_match defaulted to 1, so we got the expected results.

In our query above, however, our should is combined with a must. As we don’t specify a value for minimum_should_match, the default is 0. Therefore, none of our should clauses are required to match a document. Documents are filtered down by the must clause alone; anything that does match the should clauses only gets a higher score. The third document in the results is a video document but doesn’t match any of the should clauses. It is therefore returned as a match, but with a much lower score than the other two that do match the shoulds.
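One way to check this for yourself is to label each clause with _name; Elasticsearch then lists the labels that matched in a matched_queries array on every hit. The sketch below reruns the must and should query with labels added. The label names (category_video, comment_watch, comment_good) are made up for this example.

```console
# Same must + should query as before, with a made-up label on each clause.
# Each hit in the response carries a matched_queries array naming the clauses it matched.
GET myindex/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "category": {
              "query": "video",
              "_name": "category_video"
            }
          }
        }
      ],
      "should": [
        {
          "match": {
            "comment": {
              "query": "watch",
              "_name": "comment_watch"
            }
          }
        },
        {
          "match": {
            "comment": {
              "query": "good",
              "_name": "comment_good"
            }
          }
        }
      ]
    }
  }
}
```

If the explanation above holds, the hit for "The video was terrible" should list only category_video, which shows it got in through the must clause alone.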

Several fixes

There are several ways to fix the query but the easiest one is to simply apply a minimum_should_match of 1:

GET myindex/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "category": "video"
          }
        }
      ],
      "should": [
        {
          "match": {
            "comment": "watch"
          }
        },
        {
          "match": {
            "comment": "good"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

We’ll now get the results we were expecting:

{
  ...
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.8008997,
    "hits" : [
      {
        "_index" : "myindex",
        "_type" : "_doc",
        "_id" : "IdL6YHEB_3GqsWyQHD8Y",
        "_score" : 1.8008997,
        "_source" : {
          "category" : "video",
          "comment" : "The video was good"
        }
      },
      {
        "_index" : "myindex",
        "_type" : "_doc",
        "_id" : "I9L6YHEB_3GqsWyQHj89",
        "_score" : 1.7084444,
        "_source" : {
          "category" : "video",
          "comment" : "I didn't watch the video"
        }
      }
    ]
  }
}

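Another fix, without using minimum_should_match, is putting the should in a nested bool inside the must. The inner bool contains only should clauses, so its own default is 1, and because it sits inside the must, it is required to match: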

GET myindex/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "category": "video"
          }
        },
        {
          "bool": {
            "should": [
              {
                "match": {
                  "comment": "watch"
                }
              },
              {
                "match": {
                  "comment": "good"
                }
              }
            ]
          }
        }
      ]
    }
  }
}

In our case, as we only have video and blog comments, we could have used a must_not and also omitted the minimum_should_match. A must_not clause doesn’t count as a must or filter, so the default stays at 1:

GET myindex/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "match": {
            "category": "blog"
          }
        }
      ],
      "should": [
        {
          "match": {
            "comment": "watch"
          }
        },
        {
          "match": {
            "comment": "good"
          }
        }
      ]
    }
  }
}

Which one you choose depends on your use case, but the clearest way is to simply apply a minimum_should_match. Then there’s no need to try to remember which default value the bool query is using.
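The rule quoted from the docs treats filter the same way as must, so the same care applies if you move the category restriction into a filter clause, a common choice when the restriction shouldn’t affect scoring. Here is a sketch of that variant, assuming a term query on the keyword field is what you want:

```console
# Variant of the first fix: the category restriction sits in filter, which does not contribute to the score.
# A filter clause also drops the default to 0, so minimum_should_match is set explicitly.
GET myindex/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "category": "video"
          }
        }
      ],
      "should": [
        {
          "match": {
            "comment": "watch"
          }
        },
        {
          "match": {
            "comment": "good"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}
```

The scores will differ from the must version because the filter clause adds nothing to _score, but the same two video comments should come back.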

Conclusion

It’s easy to see where we’re getting incorrect results in this contrived example. In the wild, when you’re filtering down billions of documents, it’s harder to spot that you’re getting documents you’re not expecting.

bool queries are used everywhere and it’s important to know how they work so you can make your queries efficient, relevant, and - most importantly - correct.