A question regarding querying only the index's latest documents

Hi everyone,

I'm doing a POC on a rolling indexes alternative approach in order to avoid deleting documents.
I encounter some difficulties querying the data I need and I'm not sure if that's just me looking at the query from the wrong perspective or simply not something Elasticsearch is designed to do.

Background:
Currently we're maintaining a single latest-data-only index as I described here - Questions about handling delete operations on a latest-data-only index.

We have an index per customer that contains documents that represents the customer's files on each folder.

Each file is represented by the following properties: fileId, fileType, folderId and modifiedDate and the UI is presenting each customer's latest files for it to filter according to folderId or fileType.

Since we're not deleting documents in this POC but keep indexing them -
Several versions of each file will necessarily exist on each folder, the only thing that helps us to differ between versions of same documents that represents the same file is the modifiedDate property.

When a file is deleted from the folder and we scan it - it would simply won't share the latest modified date as the rest of files.

So the unique identifier of each file is its fileId, folderId and modifiedDate. (and the index itself of course).

An example dataset _bulk creation:

POST _bulk
{"index":{"_index":"files_index","_id": "6ac3d9fe-006c-422e-a396-53e899b1f4e3"}}
{ "fileId": "1", "modifiedDate": "20211007_111111", "folderId": "A", "fileType": "jpg" }
{"index":{"_index":"files_index","_id": "16d8f337-0760-420e-b245-fea10ddcb36d"}}
{ "fileId": "1", "modifiedDate": "20211008_111111", "folderId": "A", "fileType": "jpg" }
{"index":{"_index":"files_index","_id": "6156ce51-c333-4d1a-b651-5f624b2ea343"}}
{ "fileId": "1","modifiedDate": "20211009_111111","folderId": "A", "fileType": "jpg"}
{"index":{"_index":"files_index","_id": "6156ce51-c333-1234-b651-5f624b2ea343"}}
{ "fileId": "1A","modifiedDate": "20211009_111111","folderId": "A", "fileType": "png"}
{"index":{"_index":"files_index","_id": "a4a681de-2b70-4b4e-b183-2e3278044402"}}
{"fileId": "2","modifiedDate": "20211007_111111","folderId": "B", "fileType": "png"}
{"index":{"_index":"files_index","_id": "76a1563a-0aef-43c2-90ac-4b627a8b7c9a"}}
{"fileId": "2","modifiedDate": "20211010_111111","folderId": "B", "fileType": "png"}
{"index":{"_index":"files_index","_id": "76a1563a-0aef-1234-90ac-4b627a8b7c9a"}}
{"fileId": "2A","modifiedDate": "20211010_111111","folderId": "B", "fileType": "svg"}
{"index":{"_index":"files_index","_id": "215e857b-ace7-4373-a6dc-71a6b28a814c"}}
{ "fileId": "2", "modifiedDate": "20211011_111111", "folderId": "B", "fileType": "png" }
{"index":{"_index":"files_index","_id": "9d31749f-f55e-4c71-96f9-06e01cf0024e"}}
{ "fileId": "3", "modifiedDate": "20211006_111111", "folderId": "C", "fileType": "svg"}
{"index":{"_index":"files_index","_id": "3ce385fd-a656-432a-95f6-bb22013fbe5e"}}
{ "fileId": "3", "modifiedDate": "20211009_111111", "folderId": "C", "fileType": "svg"}
{"index":{"_index":"files_index","_id": "d558a4c0-e76e-4397-b83a-7b6639a57b6b"}}
{ "fileId": "3", "modifiedDate": "20211012_111111", "folderId": "C", "fileType": "jpg"}
{"index":{"_index":"files_index","_id": "d558a4c0-1234-4397-b83a-7b6639a57b6b"}}
{ "fileId": "3A", "modifiedDate": "20211012_111111", "folderId": "C", "fileType": "png"}
{"index":{"_index":"files_index","_id": "6b358eff-900b-4e1f-9e7b-8be40ea379f2"}}
{ "fileId": "4", "modifiedDate": "20211008_111111", "folderId": "D", "fileType": "jpg"}
{"index":{"_index":"files_index","_id": "74db1ff0-7ff8-4144-89a5-96d51508256a"}}
{"fileId": "4", "modifiedDate": "20211010_111111", "folderId": "D", "fileType": "jpg"}
{"index":{"_index":"files_index","_id": "44717bf7-c0c3-4435-ac66-2c25968839a6"}}
{ "fileId": "4", "modifiedDate": "20211012_111111", "folderId": "D", "fileType": "png"}

The result that I'm after is to get only the latest modified files from each folder.

So if we're looking at our dataset for example, this is the result that I'm after:

{ "fileId": "1","modifiedDate": "20211009_111111","folderId": "A", "fileType": "jpg"}
{ "fileId": "1A","modifiedDate": "20211009_111111","folderId": "A", "fileType": "png"}
{ "fileId": "2", "modifiedDate": "20211011_111111", "folderId": "B", "fileType": "png" }
{ "fileId": "3", "modifiedDate": "20211012_111111", "folderId": "C", "fileType": "jpg"}
{ "fileId": "3A", "modifiedDate": "20211012_111111", "folderId": "C", "fileType": "png"}
{ "fileId": "4", "modifiedDate": "20211012_111111", "folderId": "D", "fileType": "png"}

I've tried 3 approaches:

  1. Terms aggregation
    Is there a way to have a query based on the aggregation results on the same query?
    for example: aggs to get each folder's most current modified date and to have a query for the documents of each aggs result, I couldn't find anyway to do that...

query:

GET /files_index/_search
{
  "_source": ["fileId","modifiedDate","folderId", "fileType"], 
  "size": 0, 
  "aggs": {
    "folder_aggs": {
      "terms": {
        "field": "folderId.keyword",
        "size": 100
      }
      , "aggs": {
        "modified_date_aggs": {
          "terms": {
            "field": "modifiedDate.keyword",
            "size": 1,
            "order": {
              "_key": "desc"
            }
            
          }
        }
      }
    }
  }
}
  1. Nested collapse
    This was a step towards the right direction, but I couldn't find a way to have only the latest modifiedDates and their documents.
    I guess I need 3 nested collapses which from what I encountered - isn't supported (isn't it?).

query:

GET /files_index/_search
{
  "_source": ["fileId","modifiedDate","folderId", "fileType"], 
  "size": 100, 
  "collapse": {
    "field": "folderId.keyword",
    
    "inner_hits":{
      "name":"latest_modified",
      "collapse":{
        "field":"modifiedDate.keyword"
      },
      "sort": [{"modifiedDate.keyword":"desc"}]
     
    }
  },
  "sort":[{"folderId.keyword":"asc"}]
  
}
  1. Bool query with shoulds and filter
    I can calculate which each folder's latest modifiedDate and query that with pairs.
    This actually works! :slight_smile: but feels kind of hecky (maybe only for me).

But it has some caveats - I looked at the documentation for limitations and I see that there's a soft limit of 128 characters query length limit and 32 filters max.

query:

GET /files_index/_search
{
  "_source": ["fileId","modifiedDate","folderId", "fileType"], 
  "size": 100, 
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "filter": [
              { "term": { "folderId.keyword": "A" }},
              {"term": { "modifiedDate.keyword": "20211009_111111" }}
              ]
          }
        },
        {
          "bool": {
            "filter": [
              { "term": { "folderId.keyword": "B" }},
              {"term": { "modifiedDate.keyword": "20211011_111111" }}
              ]
          }
        },
        {
          "bool": {
            "filter": [
              { "term": { "folderId.keyword": "C" }},
              {"term": { "modifiedDate.keyword": "20211012_111111" }}
              ]
          }
        },
        {
          "bool": {
            "filter": [
              { "term": { "folderId.keyword": "D" }},
              {"term": { "modifiedDate.keyword": "20211012_111111" }}
              ]
          }
        }
      ]
    }
  }, 
  "sort":[{"folderId.keyword":"asc"}]
}

Please enlighten me,
Isn't there a more elegant way?
Have I missed something?
Why is this simple query so hard to achieve?

I know I could have an index per folder and that could make it a bit easier but I'd be managing a lot of indexes per customer instead of one.

It feels that maybe elasticserach wasn't meant for those kind of queries (which I must say, fairly trivial), am I wrong?

Thanks a lot ahead,

Niv

1 Like

The “last known state” problem is best tackled using the transforms api to create an “entity-centric” index from event-centric data.

2 Likes

Hi Mark,
Thanks for the reply :slight_smile:

I didn't knew the transforms api :bulb: , sounds very interesting!

I'll look into it,

Thanks!
Niv

1 Like

I just wanted to drop a huge thank you for writing such an awesome question. You've given a level of detail and context that really helps us understand the question and easily provide an answer!

I also wanted to note that the documentation you are linking to is for App Search, which is based on Elasticsearch but has a different set of limitations.

2 Likes

Thank you warkolm! :slight_smile:

Tried my best to make my question understandable and so glad that it was (and not only for me :slight_smile: )

Noted about the documentation mismatch, thanks for that.

Is the right one then?

Any limitations on query size / filters amount? (couldn't find it, maybe it's limitless? :slight_smile: )

Have a great rest of the day,
Niv

There is Terms query | Elasticsearch Guide [8.11] | Elastic;

By default, Elasticsearch limits the terms query to a maximum of 65,536 terms. This includes terms fetched using terms lookup. You can change this limit using the index.max_terms_count setting.

1 Like

Thank you!

Hi everyone,

I tried using the Transform APIs | Elasticsearch Guide [7.15] | Elastic as @Mark_Harwood suggested but unfortunately our Elasticsearch version (7.7) doesn't support it.

I did find a solution that works though, with nested aggregations and a filter :slight_smile:

Hope you'll find it useful:

GET /files_index/_search
{
  "size": 0,
  "aggs": {
    "types": {
      "terms": {
        "field": "folderId.keyword"
      },
      "aggs": {
        "latestModifiedDate": {
          "terms": {
            "size": 1,
            "field": "modifiedDate.keyword",
            "order": {
              "_key": "desc"
            }
          },
          "aggs": {
            "myFilter": {
              "filters": {
                "filters": {
                  "fileTypeFilter": {
                    "term": {
                      "fileType.keyword": "jpg"
                    }
                  }
                }
              },
              "aggs": {
                "myDocs": {
                  "top_hits": {
                    "size": 100
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Have a great day,
Niv

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.