How to make a match query with aggregations?

Hi all. I'm working on my university project, where I use Elasticsearch to find similar texts using a "match" query.
Here is an example of one document from my dataset:

{
"_ident": "5/425/2020",
"versions": [
  {
    "version": "20250506",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": " Podnikateľ Horskej záchrannej služby, policajtov, príslušníkov finančnej správy a profesionálnych vojakov.14)"
  }
]
}

This is the mapping I use:

PUT /test_index
{
  "mappings": {
    "properties": {
      "_ident": {
        "type": "keyword"
      },
      "versions": {
        "type": "nested",
        "properties": {
          "version": {
            "type": "integer"
          },
          "headlines": {
            "type": "nested",
            "properties": {
              "paragraf_id": {
                "type": "keyword"
              },
              "title": {
                "type": "text"
              }
            }
          },
          "text": {
            "type": "text"
          }
        }
      }
    }
  }
}

I need a query that finds all documents with a similar "text" field and returns only those whose "version" is less than a number I specify. In addition, if two or more of the matching documents have the same value in the "_ident" field, the response should contain only one of them: the one with the largest "version" that is still less than the specified number. (Note: the number in the "version" field is the date the law was adopted, written as an integer.)

For greater clarity, I will give a practical example.
Here I have four documents:

{
"_ident": "5/425/2020",
"versions": [
  {
    "version": "20210506",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}
{
"_ident": "5/425/2020",
"versions": [
  {
    "version": "20190506",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}
{
"_ident": "5/425/2020",
"versions": [
  {
    "version": "20170406",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}
{
"_ident": "5/7/2023",
"versions": [
  {
    "version": "20170406",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}

I want to get the documents that match the word "comes" and whose version is less than 20210506, so the response should contain the document with "_ident": "5/425/2020", "version": "20190506" and the document with "_ident": "5/7/2023", "version": "20170406".
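To pin down the rule, here is a minimal client-side sketch in Python. It is not an Elasticsearch query; it only restates the selection logic described above (keep versions below the threshold, then keep the largest one per "_ident"). The function name `latest_version_below` is hypothetical, the sample documents are abbreviated to the fields the rule needs, and text matching is assumed to be done by Elasticsearch beforehand.

```python
from typing import Any, Dict, List


def latest_version_below(docs: List[Dict[str, Any]], threshold: int) -> Dict[str, int]:
    """Return, per _ident, the largest version that is strictly below threshold."""
    best: Dict[str, int] = {}
    for doc in docs:
        for entry in doc["versions"]:
            version = int(entry["version"])  # versions are stored as digit strings
            if version >= threshold:
                continue  # not below the requested date
            ident = doc["_ident"]
            if ident not in best or version > best[ident]:
                best[ident] = version
    return best


# The four example documents, abbreviated (headlines omitted).
text = "Sixty-Four comes asking for bread."
docs = [
    {"_ident": "5/425/2020", "versions": [{"version": "20210506", "text": text}]},
    {"_ident": "5/425/2020", "versions": [{"version": "20190506", "text": text}]},
    {"_ident": "5/425/2020", "versions": [{"version": "20170406", "text": text}]},
    {"_ident": "5/7/2023", "versions": [{"version": "20170406", "text": text}]},
]

result = latest_version_below(docs, 20210506)
print(result)  # {'5/425/2020': 20190506, '5/7/2023': 20170406}
```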

I made this query, but it doesn't work:

GET /test_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "versions",
            "query": {
              "match": {
                "versions.text": "comes"
              }
            }
          }
        },
        {
          "script": {
            "script": {
              "source": """
                def threshold = params.threshold;
                def maxVersion = doc['versions.version'].size() > 0 ? doc['versions.version'].value : 0;
                return maxVersion < threshold;
              """,
              "params": {
                "threshold": 20210101
              }
            }
          }
        }
      ]
    }
  }
}

Could you help me please to make a query that will work?

Hi @Oerlikon

Did you try a filter with a Range Query?

{
  "query": {
    "bool": {
      "filter": [
        {
          "nested": {
            "path": "versions",
            "query": {
              "range": {
                "versions.version": {
                  "lt": "20210506"
                }
              }
            }
          }
        }
      ], 
      "must": [
        {
          "nested": {
            "path": "versions",
            "query": {
              "match": {
                "versions.text": "comes"
              }
            }
          }
        }
      ]
    }
  }
}
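One detail worth keeping in mind: two separate "nested" clauses are evaluated independently, so in a document with several entries in "versions" the range could be satisfied by one entry and the match by another. With the sample data (one entry per document) this makes no difference. A minimal sketch of a variant that puts both conditions inside a single "nested" clause, written as Python that prints the request body; the helper name `nested_match_below` is hypothetical and the body has not been run against a cluster here.

```python
import json


def nested_match_below(text: str, threshold: int) -> dict:
    """Build a search body where match and range apply to the same nested entry."""
    return {
        "query": {
            "nested": {
                "path": "versions",
                "query": {
                    "bool": {
                        # scored full-text condition
                        "must": [{"match": {"versions.text": text}}],
                        # non-scoring version limit on the same nested object
                        "filter": [{"range": {"versions.version": {"lt": threshold}}}],
                    }
                },
            }
        }
    }


# Paste the printed JSON after: GET /test_index/_search
print(json.dumps(nested_match_below("comes", 20210506), indent=2))
```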

Thanks for your answer, but it only partially solves my problem. That is my fault: I now see that my example was not precise enough to explain what I want. In addition, if two or more documents have the same "_ident", only one of them should be returned: the one whose "version" is smaller than the value given in the Range Query but greater than the "version" of every other such document with the same "_ident".

Here is the example:
I have four documents:

{
"_ident": "5/425/2020",
"versions": [
  {
    "version": "20190506",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}
{
"_ident": "5/425/2020",
"versions": [
  {
    "version": "20200501",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}
{
"_ident": "5/425/2020",
"versions": [
  {
    "version": "20250506",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}
{
"_ident": "1/1/2022",
"versions": [
  {
    "version": "20190506",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}

Suppose I want to find documents with "version" less than 20210101. The query should then return just these two documents:

{
"_ident": "1/1/2022",
"versions": [
  {
    "version": "20190506",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}
{
"_ident": "5/425/2020",
"versions": [
  {
    "version": "20200501",
    "headlines": [
      {},
      {
        "paragraf_id": "predpis.nadpis",
        "title": "o štátnom rozpočte na rok 2021"
      }
    ],
    "text": "Sixty-Four comes asking for bread."
  }
]
}

Is it even possible to make such a query?

Got it. Take a look at collapsing the search results. With it you can get both of the results you expect, but the query total will still count the documents that were collapsed.
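A minimal sketch of what a collapse-based request could look like, written as Python that prints the request body. It assumes an Elasticsearch version that supports the "collapse" search parameter and the "nested" option in "sort"; the helper name `collapsed_search_body` is hypothetical, and the body has not been run against a cluster here. The idea is to collapse on "_ident" and sort by the nested "versions.version" in descending order, so the document kept for each "_ident" is the one with the highest version below the threshold.

```python
import json


def collapsed_search_body(text: str, threshold: int) -> dict:
    """Build a search body: match text, limit version, keep one hit per _ident."""
    below = {"range": {"versions.version": {"lt": threshold}}}
    return {
        "query": {
            "nested": {
                "path": "versions",
                "query": {
                    "bool": {
                        "must": [{"match": {"versions.text": text}}],
                        "filter": [below],
                    }
                },
            }
        },
        # one hit per distinct _ident (a keyword field in the mapping)
        "collapse": {"field": "_ident"},
        # the top-sorted document of each group is the one that is returned
        "sort": [
            {
                "versions.version": {
                    "order": "desc",
                    "mode": "max",
                    "nested": {"path": "versions", "filter": below},
                }
            }
        ],
    }


# Paste the printed JSON after: GET /test_index/_search
print(json.dumps(collapsed_search_body("comes", 20210101), indent=2))
```

Note that sorting by version replaces the default relevance ordering, so hits come back ordered by version rather than by how well the text matches.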

The other option is to try scripts, but they are expensive, and working with a nested object in a script is complicated.

Thank you for your advice, I'll try both possible solutions.
