A question regarding querying only the index's latest documents

nivo · October 9, 2021, 10:36am

Hi everyone,

I'm doing a POC on a rolling indexes alternative approach in order to avoid deleting documents.
I encounter some difficulties querying the data I need and I'm not sure if that's just me looking at the query from the wrong perspective or simply not something Elasticsearch is designed to do.

Background:
Currently we're maintaining a single latest-data-only index as I described here - Questions about handling delete operations on a latest-data-only index.

We have an index per customer that contains documents that represents the customer's files on each folder.

Each file is represented by the following properties: fileId, fileType, folderId and modifiedDate and the UI is presenting each customer's latest files for it to filter according to folderId or fileType.

Since we're not deleting documents in this POC but keep indexing them -
Several versions of each file will necessarily exist on each folder, the only thing that helps us to differ between versions of same documents that represents the same file is the modifiedDate property.

When a file is deleted from the folder and we scan it - it would simply won't share the latest modified date as the rest of files.

So the unique identifier of each file is its fileId, folderId and modifiedDate. (and the index itself of course).

An example dataset _bulk creation:

POST _bulk
{"index":{"_index":"files_index","_id": "6ac3d9fe-006c-422e-a396-53e899b1f4e3"}}
{ "fileId": "1", "modifiedDate": "20211007_111111", "folderId": "A", "fileType": "jpg" }
{"index":{"_index":"files_index","_id": "16d8f337-0760-420e-b245-fea10ddcb36d"}}
{ "fileId": "1", "modifiedDate": "20211008_111111", "folderId": "A", "fileType": "jpg" }
{"index":{"_index":"files_index","_id": "6156ce51-c333-4d1a-b651-5f624b2ea343"}}
{ "fileId": "1","modifiedDate": "20211009_111111","folderId": "A", "fileType": "jpg"}
{"index":{"_index":"files_index","_id": "6156ce51-c333-1234-b651-5f624b2ea343"}}
{ "fileId": "1A","modifiedDate": "20211009_111111","folderId": "A", "fileType": "png"}
{"index":{"_index":"files_index","_id": "a4a681de-2b70-4b4e-b183-2e3278044402"}}
{"fileId": "2","modifiedDate": "20211007_111111","folderId": "B", "fileType": "png"}
{"index":{"_index":"files_index","_id": "76a1563a-0aef-43c2-90ac-4b627a8b7c9a"}}
{"fileId": "2","modifiedDate": "20211010_111111","folderId": "B", "fileType": "png"}
{"index":{"_index":"files_index","_id": "76a1563a-0aef-1234-90ac-4b627a8b7c9a"}}
{"fileId": "2A","modifiedDate": "20211010_111111","folderId": "B", "fileType": "svg"}
{"index":{"_index":"files_index","_id": "215e857b-ace7-4373-a6dc-71a6b28a814c"}}
{ "fileId": "2", "modifiedDate": "20211011_111111", "folderId": "B", "fileType": "png" }
{"index":{"_index":"files_index","_id": "9d31749f-f55e-4c71-96f9-06e01cf0024e"}}
{ "fileId": "3", "modifiedDate": "20211006_111111", "folderId": "C", "fileType": "svg"}
{"index":{"_index":"files_index","_id": "3ce385fd-a656-432a-95f6-bb22013fbe5e"}}
{ "fileId": "3", "modifiedDate": "20211009_111111", "folderId": "C", "fileType": "svg"}
{"index":{"_index":"files_index","_id": "d558a4c0-e76e-4397-b83a-7b6639a57b6b"}}
{ "fileId": "3", "modifiedDate": "20211012_111111", "folderId": "C", "fileType": "jpg"}
{"index":{"_index":"files_index","_id": "d558a4c0-1234-4397-b83a-7b6639a57b6b"}}
{ "fileId": "3A", "modifiedDate": "20211012_111111", "folderId": "C", "fileType": "png"}
{"index":{"_index":"files_index","_id": "6b358eff-900b-4e1f-9e7b-8be40ea379f2"}}
{ "fileId": "4", "modifiedDate": "20211008_111111", "folderId": "D", "fileType": "jpg"}
{"index":{"_index":"files_index","_id": "74db1ff0-7ff8-4144-89a5-96d51508256a"}}
{"fileId": "4", "modifiedDate": "20211010_111111", "folderId": "D", "fileType": "jpg"}
{"index":{"_index":"files_index","_id": "44717bf7-c0c3-4435-ac66-2c25968839a6"}}
{ "fileId": "4", "modifiedDate": "20211012_111111", "folderId": "D", "fileType": "png"}

The result that I'm after is to get only the latest modified files from each folder.

So if we're looking at our dataset for example, this is the result that I'm after:

{ "fileId": "1","modifiedDate": "20211009_111111","folderId": "A", "fileType": "jpg"}
{ "fileId": "1A","modifiedDate": "20211009_111111","folderId": "A", "fileType": "png"}
{ "fileId": "2", "modifiedDate": "20211011_111111", "folderId": "B", "fileType": "png" }
{ "fileId": "3", "modifiedDate": "20211012_111111", "folderId": "C", "fileType": "jpg"}
{ "fileId": "3A", "modifiedDate": "20211012_111111", "folderId": "C", "fileType": "png"}
{ "fileId": "4", "modifiedDate": "20211012_111111", "folderId": "D", "fileType": "png"}

I've tried 3 approaches:

Terms aggregation
Is there a way to have a query based on the aggregation results on the same query?
for example: aggs to get each folder's most current modified date and to have a query for the documents of each aggs result, I couldn't find anyway to do that...

query:

GET /files_index/_search
{
  "_source": ["fileId","modifiedDate","folderId", "fileType"], 
  "size": 0, 
  "aggs": {
    "folder_aggs": {
      "terms": {
        "field": "folderId.keyword",
        "size": 100
      }
      , "aggs": {
        "modified_date_aggs": {
          "terms": {
            "field": "modifiedDate.keyword",
            "size": 1,
            "order": {
              "_key": "desc"
            }
            
          }
        }
      }
    }
  }
}

Nested collapse
This was a step towards the right direction, but I couldn't find a way to have only the latest modifiedDates and their documents.
I guess I need 3 nested collapses which from what I encountered - isn't supported (isn't it?).

query:

GET /files_index/_search
{
  "_source": ["fileId","modifiedDate","folderId", "fileType"], 
  "size": 100, 
  "collapse": {
    "field": "folderId.keyword",
    
    "inner_hits":{
      "name":"latest_modified",
      "collapse":{
        "field":"modifiedDate.keyword"
      },
      "sort": [{"modifiedDate.keyword":"desc"}]
     
    }
  },
  "sort":[{"folderId.keyword":"asc"}]
  
}

Bool query with shoulds and filter
I can calculate which each folder's latest modifiedDate and query that with pairs.
This actually works! but feels kind of hecky (maybe only for me).

But it has some caveats - I looked at the documentation for limitations and I see that there's a soft limit of 128 characters query length limit and 32 filters max.

query:

GET /files_index/_search
{
  "_source": ["fileId","modifiedDate","folderId", "fileType"], 
  "size": 100, 
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "filter": [
              { "term": { "folderId.keyword": "A" }},
              {"term": { "modifiedDate.keyword": "20211009_111111" }}
              ]
          }
        },
        {
          "bool": {
            "filter": [
              { "term": { "folderId.keyword": "B" }},
              {"term": { "modifiedDate.keyword": "20211011_111111" }}
              ]
          }
        },
        {
          "bool": {
            "filter": [
              { "term": { "folderId.keyword": "C" }},
              {"term": { "modifiedDate.keyword": "20211012_111111" }}
              ]
          }
        },
        {
          "bool": {
            "filter": [
              { "term": { "folderId.keyword": "D" }},
              {"term": { "modifiedDate.keyword": "20211012_111111" }}
              ]
          }
        }
      ]
    }
  }, 
  "sort":[{"folderId.keyword":"asc"}]
}

Please enlighten me,
Isn't there a more elegant way?
Have I missed something?
Why is this simple query so hard to achieve?

I know I could have an index per folder and that could make it a bit easier but I'd be managing a lot of indexes per customer instead of one.

It feels that maybe elasticserach wasn't meant for those kind of queries (which I must say, fairly trivial), am I wrong?

Thanks a lot ahead,

Niv

Mark_Harwood · October 10, 2021, 7:57am

The “last known state” problem is best tackled using the transforms api to create an “entity-centric” index from event-centric data.

nivo · October 10, 2021, 9:36pm

Hi Mark,
Thanks for the reply

I didn't knew the transforms api , sounds very interesting!

I'll look into it,

Thanks!
Niv

warkolm · October 10, 2021, 10:13pm

I just wanted to drop a huge thank you for writing such an awesome question. You've given a level of detail and context that really helps us understand the question and easily provide an answer!

I also wanted to note that the documentation you are linking to is for App Search, which is based on Elasticsearch but has a different set of limitations.

nivo · October 10, 2021, 10:21pm

Thank you warkolm!

Tried my best to make my question understandable and so glad that it was (and not only for me )

Noted about the documentation mismatch, thanks for that.

Is the right one then?

Any limitations on query size / filters amount? (couldn't find it, maybe it's limitless? )

Have a great rest of the day,
Niv

warkolm · October 13, 2021, 12:28am

There is Terms query | Elasticsearch Guide [8.11] | Elastic;

By default, Elasticsearch limits the terms query to a maximum of 65,536 terms. This includes terms fetched using terms lookup. You can change this limit using the index.max_terms_count setting.

nivo · October 13, 2021, 4:10pm

Thank you!

nivo · October 21, 2021, 10:35pm

Hi everyone,

I tried using the Transform APIs | Elasticsearch Guide [7.15] | Elastic as @Mark_Harwood suggested but unfortunately our Elasticsearch version (7.7) doesn't support it.

I did find a solution that works though, with nested aggregations and a filter

Hope you'll find it useful:

GET /files_index/_search
{
  "size": 0,
  "aggs": {
    "types": {
      "terms": {
        "field": "folderId.keyword"
      },
      "aggs": {
        "latestModifiedDate": {
          "terms": {
            "size": 1,
            "field": "modifiedDate.keyword",
            "order": {
              "_key": "desc"
            }
          },
          "aggs": {
            "myFilter": {
              "filters": {
                "filters": {
                  "fileTypeFilter": {
                    "term": {
                      "fileType.keyword": "jpg"
                    }
                  }
                }
              },
              "aggs": {
                "myDocs": {
                  "top_hits": {
                    "size": 100
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Have a great day,
Niv

system · November 18, 2021, 10:35pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Delete old documents Elasticsearch	4	344	December 8, 2020
Questions about handling delete operations on a latest-data-only index Elasticsearch	6	727	November 9, 2021
Deleting old versions Elasticsearch	7	7045	July 6, 2017
Querying the latest versions of your document Elasticsearch	3	463	October 5, 2018
Store last X documents per id Elasticsearch	11	215	May 3, 2024

A question regarding querying only the index's latest documents

Related topics