Terms aggregations
The terms aggregation dynamically creates a bucket for each unique term of a field.
The following example uses the terms aggregation to find the number of documents per response code in web log data:
GET opensearch_dashboards_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "response_codes": {
      "terms": {
        "field": "response.keyword",
        "size": 10
      }
    }
  }
}
Example response
...
  "aggregations" : {
    "response_codes" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "200",
          "doc_count" : 12832
        },
        {
          "key" : "404",
          "doc_count" : 801
        },
        {
          "key" : "503",
          "doc_count" : 441
        }
      ]
    }
  }
}
Each unique value is returned in the key field of its bucket, and doc_count specifies the number of documents in each bucket. By default, the buckets are sorted in descending order of doc_count.
The response also includes two keys named doc_count_error_upper_bound and sum_other_doc_count.
The terms aggregation returns the top unique terms. So, if the data has many unique terms, then some of them might not appear in the results. The sum_other_doc_count field is the sum of the documents that are left out of the response. In this case, the number is 0 because all the unique values appear in the response.
The doc_count_error_upper_bound field represents the maximum possible count for a unique value that’s left out of the final results. Use this field to estimate the error margin for the count.
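As a rough illustration of how these fields fit together, the following Python sketch reads the buckets from the example response above. The helper name bucket_counts is hypothetical, and the response is assumed to have already been parsed from JSON into a dictionary.

```python
def bucket_counts(response, agg_name):
    """Map each bucket key to its doc_count for one terms aggregation."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Aggregation part of the example response, as a parsed dictionary.
response = {
    "aggregations": {
        "response_codes": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "200", "doc_count": 12832},
                {"key": "404", "doc_count": 801},
                {"key": "503", "doc_count": 441},
            ],
        }
    }
}

agg = response["aggregations"]["response_codes"]
counts = bucket_counts(response, "response_codes")

# Counts in the returned buckets plus the counts that were left out.
total_counted = sum(counts.values()) + agg["sum_other_doc_count"]

print(counts)         # {'200': 12832, '404': 801, '503': 441}
print(total_counted)  # 14074
```

Because sum_other_doc_count is 0 here, total_counted is just the sum of the three bucket counts.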
The count might not be accurate. The coordinating node that’s responsible for the aggregation asks each shard for its top unique terms. Imagine a scenario where the size parameter is 3. The terms aggregation requests the top 3 unique terms from each shard. The coordinating node then merges the shard results to compute the final result. If a term is not among the top 3 on a given shard, that shard’s documents for the term are not included in the merge, so the term’s final count can be too low, or the term can be missing from the response altogether.
This is especially true if size is set to a low number. Because the default size is 10, an error is unlikely to happen. If you don’t need high accuracy and want to increase the performance, you can reduce the size.
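The scenario above can be sketched in a few lines of Python. This is a simplified model, not how OpenSearch is implemented: it assumes each shard returns exactly its top size terms (ignoring the shard_size parameter), and the per-shard counts and function names are hypothetical.

```python
from collections import Counter


def shard_top_terms(shard_counts, size):
    """Return the shard's `size` most frequent terms as (term, count) pairs."""
    return Counter(shard_counts).most_common(size)


def merge_top_terms(shards, size):
    """Simplified coordinating-node merge: sum the per-shard top lists,
    then keep the overall top `size` terms."""
    merged = Counter()
    for shard_counts in shards:
        for term, count in shard_top_terms(shard_counts, size):
            merged[term] += count
    return merged.most_common(size)


# Hypothetical term counts on two shards.
shard_a = {"200": 50, "404": 30, "503": 20, "500": 19}
shard_b = {"200": 40, "500": 25, "301": 22, "404": 21}

approx = merge_top_terms([shard_a, shard_b], size=3)
exact = (Counter(shard_a) + Counter(shard_b)).most_common(3)

print(approx)  # [('200', 90), ('404', 30), ('500', 25)]
print(exact)   # [('200', 90), ('404', 51), ('500', 44)]
```

In this made-up data, 404 is only fourth on shard_b and 500 is only fourth on shard_a, so the merged counts for both terms are lower than the exact counts.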
Account for pre-aggregated data
The doc_count field reports the number of individual documents aggregated in a bucket, but on its own it cannot correctly count documents that store pre-aggregated data. To account for pre-aggregated data and accurately calculate the number of documents in a bucket, add a _doc_count field to each summary document and set it to the number of original documents that the summary document represents. When a document includes the _doc_count field, all bucket aggregations recognize its value and increase the bucket’s doc_count by that amount. Keep these considerations in mind when using the _doc_count field:
- The field does not support nested arrays; only positive integers can be used.
- If a document does not contain the _doc_count field, the aggregation uses the document to increase the count by 1.
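The counting rule described above can be approximated with a short Python sketch. It is only a model of the behavior, not OpenSearch code: the function name bucket_doc_counts is hypothetical, and the last document (with no _doc_count field) is made up to show the fallback to 1.

```python
from collections import Counter


def bucket_doc_counts(docs, field):
    """Group documents by `field`, adding each document's _doc_count
    to its bucket, or 1 when the document has no _doc_count field."""
    counts = Counter()
    for doc in docs:
        counts[doc[field]] += doc.get("_doc_count", 1)
    return dict(counts)


docs = [
    {"response_code": 404, "_doc_count": 20},
    {"response_code": 404, "_doc_count": 10},
    {"response_code": 200, "_doc_count": 300},
    {"response_code": 503},  # hypothetical document without _doc_count
]

result = bucket_doc_counts(docs, "response_code")
print(result)  # {404: 30, 200: 300, 503: 1}
```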
OpenSearch features that rely on an accurate document count illustrate the importance of using the _doc_count field. To see how this field can be used to support other search tools, refer to Index rollups, an OpenSearch feature for the Index Management (IM) plugin that stores documents with pre-aggregated data in rollup indexes.
Example request
PUT /my_index/_doc/1
{
  "response_code": 404,
  "date": "2022-08-05",
  "_doc_count": 20
}

PUT /my_index/_doc/2
{
  "response_code": 404,
  "date": "2022-08-06",
  "_doc_count": 10
}

PUT /my_index/_doc/3
{
  "response_code": 200,
  "date": "2022-08-06",
  "_doc_count": 300
}

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "response_codes": {
      "terms": {
        "field": "response_code"
      }
    }
  }
}
Example response
{
  "took" : 20,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "response_codes" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 200,
          "doc_count" : 300
        },
        {
          "key" : 404,
          "doc_count" : 30
        }
      ]
    }
  }
}