How to make a match query with aggregations?

Oerlikon · February 20, 2024, 12:55pm

Hi all. I'm working on a university project in which I use Elasticsearch to find similar texts with a "match" query.
Here is an example of one document from my dataset:

{
"_ident": "5/425/2020",
"versions": [
  {
    "version": "20250506",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": " Podnikateľ Horskej záchrannej služby, policajtov, príslušníkov finančnej správy a profesionálnych vojakov.14)"
  }
]
}

This is the mapping I use:

PUT /test_index
{
  "mappings": {
    "properties": {
      "_ident": {
        "type": "keyword"
      },
      "versions": {
        "type": "nested",
        "properties": {
          "version": {
            "type": "integer"
          },
          "headlines": {
            "type": "nested",
            "properties": {
              "paragraf_id": {
                "type": "keyword"
              },
              "title": {
                "type": "text"
              }
            }
          },
          "text": {
            "type": "text"
          }
        }
      }
    }
  }
}

I need a query that finds all documents that are similar in the "text" field, but returns only those whose "version" is less than a number I specify. In addition, if two or more of the matching documents share the same "_ident", the response should contain only one of them: the one with the largest "version" that is still less than the specified number. (Note: the "version" field holds the date on which the law was adopted, stored as an integer.)

For greater clarity, I will give a practical example.
Here I have four documents:

{
"_ident": "5/425/2020",
"versions": [
  {
    "version": "20210506",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}

{
"_ident": "5/425/2020",
"versions": [
  {
    "version": "20190506",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}

{
"_ident": "5/425/2020",
"versions": [
  {
    "version": "20170406",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}

{
"_ident": "5/7/2023",
"versions": [
  {
    "version": "20170406",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}

I want to get the documents that match the word "comes" and whose version is less than "20210506", so the response should contain the document with "_ident": "5/425/2020", "version": "20190506" and the document with "_ident": "5/7/2023", "version": "20170406".

I made this query, but it doesn't work:

GET /test_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "versions",
            "query": {
              "match": {
                "versions.text": "comes"
              }
            }
          }
        },
        {
          "script": {
            "script": {
              "source": """
                def threshold = params.threshold;
                def maxVersion = doc['versions.version'].size() > 0 ? doc['versions.version'].value : 0;
                return maxVersion < threshold;
              """,
              "params": {
                "threshold": 20210101
              }
            }
          }
        }
      ]
    }
  }
}

Could you please help me write a query that works?

RabBit_BR · February 21, 2024, 12:03pm

Hi @Oerlikon

Did you try a filter with a range query?

{
  "query": {
    "bool": {
      "filter": [
        {
          "nested": {
            "path": "versions",
            "query": {
              "range": {
                "versions.version": {
                  "lt": "20210506"
                }
              }
            }
          }
        }
      ], 
      "must": [
        {
          "nested": {
            "path": "versions",
            "query": {
              "match": {
                "versions.text": "comes"
              }
            }
          }
        }
      ]
    }
  }
}
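One caveat with the query above: the two "nested" clauses are evaluated independently, so a document can match when one entry of "versions" contains the word and a different entry satisfies the range. A minimal sketch, assuming both conditions should hold for the same "versions" entry, puts them into a single nested query. The "inner_hits" part is optional and only there to show which entry matched.

```
# Sketch only: match and range must be satisfied by the same nested entry.
# Index name, field names and the threshold are taken from the posts above.
GET /test_index/_search
{
  "query": {
    "nested": {
      "path": "versions",
      "query": {
        "bool": {
          "must": [
            { "match": { "versions.text": "comes" } }
          ],
          "filter": [
            { "range": { "versions.version": { "lt": 20210506 } } }
          ]
        }
      },
      "inner_hits": {}
    }
  }
}
```

With one entry per document, as in the sample data, this returns the same documents as the two-clause version. The difference only shows up once a document carries several versions.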

Oerlikon · February 21, 2024, 12:58pm

Thanks for your answer, but it only partially solves my problem. That is my fault: I now see that my example was not precise enough to explain what I want. In addition to the range filter, if two or more documents have the same "_ident", only one of them should be returned: the one with the highest "version" that is still below the value given in the range query.

Here is the example:
I have four documents:

{
"_ident": "5/425/2020",
"versions": [
  {
    "version": "20190506",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}

{
"_ident": "5/425/2020",
"versions": [
  {
    "version": "20200501",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}

{
"_ident": "5/425/2020",
"versions": [
  {
    "version": "20250506",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}

{
"_ident": "1/1/2022",
"versions": [
  {
    "version": "20190506",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}

Suppose I want to find documents with "version" less than 20210101. The query should then return just these two documents:

{
"_ident": "1/1/2022",
"versions": [
  {
    "version": "20190506",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}

{
"_ident": "5/425/2020",
"versions": [
  {
    "version": "20200501",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}

Is it even possible to make such a query?

RabBit_BR · February 21, 2024, 1:41pm

Got it. Take a look at collapsing search results (the "collapse" parameter). You can get both results that way, but the total hit count of the query will still include the documents that were collapsed.
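A rough, untested sketch of the collapse approach, assuming "_ident" stays a single-valued keyword field as in the mapping above: keep the match and the range in one nested query, sort by the nested version in descending order, and collapse on "_ident" so that only the top-sorted document per "_ident" comes back. The threshold 20210101 is the one from the example. Sorting by version replaces the relevance ordering, which may or may not be acceptable here.

```
# Sketch only: one hit per _ident, namely the one with the highest
# version below the threshold. Not verified against a real cluster.
GET /test_index/_search
{
  "query": {
    "nested": {
      "path": "versions",
      "query": {
        "bool": {
          "must": [
            { "match": { "versions.text": "comes" } }
          ],
          "filter": [
            { "range": { "versions.version": { "lt": 20210101 } } }
          ]
        }
      }
    }
  },
  "collapse": {
    "field": "_ident"
  },
  "sort": [
    {
      "versions.version": {
        "order": "desc",
        "mode": "max",
        "nested": {
          "path": "versions",
          "filter": {
            "range": { "versions.version": { "lt": 20210101 } }
          }
        }
      }
    }
  ]
}
```

For the four sample documents this should leave "5/425/2020" with version 20200501 and "1/1/2022" with version 20190506, since 20250506 is filtered out by the range and 20190506 sorts below 20200501 within the same "_ident".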

The other option is to use scripts, but they are expensive, and working with a nested object inside a script is complicated.
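Since the thread title asks about aggregations: another possibility, again only as an untested sketch, is a "terms" aggregation on "_ident" with a "top_hits" sub-aggregation that keeps the single highest version per bucket. The aggregation names "by_ident" and "latest_version" and the bucket size of 100 are placeholders chosen for illustration, not something from the posts above.

```
# Sketch only: one bucket per _ident, each holding its newest matching
# document below the threshold. Names and sizes are placeholders.
GET /test_index/_search
{
  "size": 0,
  "query": {
    "nested": {
      "path": "versions",
      "query": {
        "bool": {
          "must": [
            { "match": { "versions.text": "comes" } }
          ],
          "filter": [
            { "range": { "versions.version": { "lt": 20210101 } } }
          ]
        }
      }
    }
  },
  "aggs": {
    "by_ident": {
      "terms": { "field": "_ident", "size": 100 },
      "aggs": {
        "latest_version": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "versions.version": {
                  "order": "desc",
                  "nested": {
                    "path": "versions",
                    "filter": {
                      "range": { "versions.version": { "lt": 20210101 } }
                    }
                  }
                }
              }
            ]
          }
        }
      }
    }
  }
}
```

The selected documents would then be found under aggregations.by_ident.buckets, each bucket carrying its document in latest_version.hits.hits. With many distinct "_ident" values the terms size has to be raised or paged, so this is likely the heavier of the options.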

Oerlikon · February 21, 2024, 2:30pm

Thank you for your advice, I'll try both possible solutions.

system · March 20, 2024, 2:31pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
