Hi everyone,
I'm doing a POC on an alternative to the rolling-indexes approach, in order to avoid deleting documents.
I'm encountering some difficulties querying the data I need, and I'm not sure whether that's just me looking at the problem from the wrong perspective or simply not something Elasticsearch is designed to do.
Background:
Currently we're maintaining a single latest-data-only index as I described here - Questions about handling delete operations on a latest-data-only index.
We have an index per customer that contains documents representing the customer's files in each folder.
Each file is represented by the following properties: fileId, fileType, folderId and modifiedDate, and the UI presents each customer's latest files so the customer can filter them by folderId or fileType.
Since in this POC we don't delete documents but keep indexing new ones, several versions of each file will necessarily exist for each folder. The only thing that lets us distinguish between documents representing the same file is the modifiedDate property.
When a file is deleted from a folder and we scan that folder, the file simply won't share the latest modified date with the rest of the files.
So the unique identifier of each file is its fileId, folderId and modifiedDate (and the index itself, of course).
An example dataset, created via _bulk:
POST _bulk
{"index":{"_index":"files_index","_id": "6ac3d9fe-006c-422e-a396-53e899b1f4e3"}}
{ "fileId": "1", "modifiedDate": "20211007_111111", "folderId": "A", "fileType": "jpg" }
{"index":{"_index":"files_index","_id": "16d8f337-0760-420e-b245-fea10ddcb36d"}}
{ "fileId": "1", "modifiedDate": "20211008_111111", "folderId": "A", "fileType": "jpg" }
{"index":{"_index":"files_index","_id": "6156ce51-c333-4d1a-b651-5f624b2ea343"}}
{ "fileId": "1","modifiedDate": "20211009_111111","folderId": "A", "fileType": "jpg"}
{"index":{"_index":"files_index","_id": "6156ce51-c333-1234-b651-5f624b2ea343"}}
{ "fileId": "1A","modifiedDate": "20211009_111111","folderId": "A", "fileType": "png"}
{"index":{"_index":"files_index","_id": "a4a681de-2b70-4b4e-b183-2e3278044402"}}
{"fileId": "2","modifiedDate": "20211007_111111","folderId": "B", "fileType": "png"}
{"index":{"_index":"files_index","_id": "76a1563a-0aef-43c2-90ac-4b627a8b7c9a"}}
{"fileId": "2","modifiedDate": "20211010_111111","folderId": "B", "fileType": "png"}
{"index":{"_index":"files_index","_id": "76a1563a-0aef-1234-90ac-4b627a8b7c9a"}}
{"fileId": "2A","modifiedDate": "20211010_111111","folderId": "B", "fileType": "svg"}
{"index":{"_index":"files_index","_id": "215e857b-ace7-4373-a6dc-71a6b28a814c"}}
{ "fileId": "2", "modifiedDate": "20211011_111111", "folderId": "B", "fileType": "png" }
{"index":{"_index":"files_index","_id": "9d31749f-f55e-4c71-96f9-06e01cf0024e"}}
{ "fileId": "3", "modifiedDate": "20211006_111111", "folderId": "C", "fileType": "svg"}
{"index":{"_index":"files_index","_id": "3ce385fd-a656-432a-95f6-bb22013fbe5e"}}
{ "fileId": "3", "modifiedDate": "20211009_111111", "folderId": "C", "fileType": "svg"}
{"index":{"_index":"files_index","_id": "d558a4c0-e76e-4397-b83a-7b6639a57b6b"}}
{ "fileId": "3", "modifiedDate": "20211012_111111", "folderId": "C", "fileType": "jpg"}
{"index":{"_index":"files_index","_id": "d558a4c0-1234-4397-b83a-7b6639a57b6b"}}
{ "fileId": "3A", "modifiedDate": "20211012_111111", "folderId": "C", "fileType": "png"}
{"index":{"_index":"files_index","_id": "6b358eff-900b-4e1f-9e7b-8be40ea379f2"}}
{ "fileId": "4", "modifiedDate": "20211008_111111", "folderId": "D", "fileType": "jpg"}
{"index":{"_index":"files_index","_id": "74db1ff0-7ff8-4144-89a5-96d51508256a"}}
{"fileId": "4", "modifiedDate": "20211010_111111", "folderId": "D", "fileType": "jpg"}
{"index":{"_index":"files_index","_id": "44717bf7-c0c3-4435-ac66-2c25968839a6"}}
{ "fileId": "4", "modifiedDate": "20211012_111111", "folderId": "D", "fileType": "png"}
The result I'm after is only the latest-modified files from each folder.
So looking at our example dataset, this is the result I'd like to get:
{ "fileId": "1","modifiedDate": "20211009_111111","folderId": "A", "fileType": "jpg"}
{ "fileId": "1A","modifiedDate": "20211009_111111","folderId": "A", "fileType": "png"}
{ "fileId": "2", "modifiedDate": "20211011_111111", "folderId": "B", "fileType": "png" }
{ "fileId": "3", "modifiedDate": "20211012_111111", "folderId": "C", "fileType": "jpg"}
{ "fileId": "3A", "modifiedDate": "20211012_111111", "folderId": "C", "fileType": "png"}
{ "fileId": "4", "modifiedDate": "20211012_111111", "folderId": "D", "fileType": "png"}
I've tried 3 approaches:
- Terms aggregation
Is there a way to run a query based on the aggregation results within the same request?
For example: an aggregation to get each folder's most recent modified date, and then a query that returns the documents matching each bucket. I couldn't find a clean way to do that...
query:
GET /files_index/_search
{
  "_source": ["fileId", "modifiedDate", "folderId", "fileType"],
  "size": 0,
  "aggs": {
    "folder_aggs": {
      "terms": {
        "field": "folderId.keyword",
        "size": 100
      },
      "aggs": {
        "modified_date_aggs": {
          "terms": {
            "field": "modifiedDate.keyword",
            "size": 1,
            "order": {
              "_key": "desc"
            }
          }
        }
      }
    }
  }
}
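To illustrate what I mean, the closest I could sketch within a single request is adding a top_hits sub-aggregation under the latest-date bucket, roughly like below (untested) - but then the documents come back buried inside the aggregation response (and capped by top_hits' size) rather than as regular hits, which isn't really what I'm after:
GET /files_index/_search
{
  "size": 0,
  "aggs": {
    "folder_aggs": {
      "terms": { "field": "folderId.keyword", "size": 100 },
      "aggs": {
        "modified_date_aggs": {
          "terms": {
            "field": "modifiedDate.keyword",
            "size": 1,
            "order": { "_key": "desc" }
          },
          "aggs": {
            "latest_docs": {
              "top_hits": {
                "size": 100,
                "_source": ["fileId", "modifiedDate", "folderId", "fileType"]
              }
            }
          }
        }
      }
    }
  }
}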
- Nested collapse
This was a step in the right direction, but I couldn't find a way to get only the latest modifiedDates and their documents.
I guess I'd need 3 nested collapses, which from what I've seen isn't supported (is it?).
query:
GET /files_index/_search
{
  "_source": ["fileId", "modifiedDate", "folderId", "fileType"],
  "size": 100,
  "collapse": {
    "field": "folderId.keyword",
    "inner_hits": {
      "name": "latest_modified",
      "collapse": {
        "field": "modifiedDate.keyword"
      },
      "sort": [{ "modifiedDate.keyword": "desc" }]
    }
  },
  "sort": [{ "folderId.keyword": "asc" }]
}
- Bool query with shoulds and filters
I can calculate each folder's latest modifiedDate up front and then query for those (folderId, modifiedDate) pairs (see the aggregation sketch after the query below for how I'd compute the dates).
This actually works! But it feels kind of hacky (maybe only to me).
It also has some caveats - I looked at the documentation for limitations and I see there's a soft limit of 128 characters on the query length and a maximum of 32 filters.
query:
GET /files_index/_search
{
  "_source": ["fileId", "modifiedDate", "folderId", "fileType"],
  "size": 100,
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "filter": [
              { "term": { "folderId.keyword": "A" } },
              { "term": { "modifiedDate.keyword": "20211009_111111" } }
            ]
          }
        },
        {
          "bool": {
            "filter": [
              { "term": { "folderId.keyword": "B" } },
              { "term": { "modifiedDate.keyword": "20211011_111111" } }
            ]
          }
        },
        {
          "bool": {
            "filter": [
              { "term": { "folderId.keyword": "C" } },
              { "term": { "modifiedDate.keyword": "20211012_111111" } }
            ]
          }
        },
        {
          "bool": {
            "filter": [
              { "term": { "folderId.keyword": "D" } },
              { "term": { "modifiedDate.keyword": "20211012_111111" } }
            ]
          }
        }
      ]
    }
  },
  "sort": [{ "folderId.keyword": "asc" }]
}
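For the first step (computing each folder's latest modifiedDate) I'd use an aggregation like the one from my first attempt; to stay scalable beyond the terms size, I'm thinking of paginating it with a composite aggregation, roughly like this (a sketch, not verified), paging with the after key and building the should clauses from each bucket's folder key and its single latest_modified key:
GET /files_index/_search
{
  "size": 0,
  "aggs": {
    "folders": {
      "composite": {
        "size": 500,
        "sources": [
          { "folder": { "terms": { "field": "folderId.keyword" } } }
        ]
      },
      "aggs": {
        "latest_modified": {
          "terms": {
            "field": "modifiedDate.keyword",
            "size": 1,
            "order": { "_key": "desc" }
          }
        }
      }
    }
  }
}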
Please enlighten me,
Isn't there a more elegant way?
Have I missed something?
Why is this simple query so hard to achieve?
I know I could have an index per folder, and that would make things a bit easier, but then I'd be managing a lot of indexes per customer instead of one.
It feels like maybe Elasticsearch wasn't meant for this kind of query (which, I must say, seems fairly trivial) - am I wrong?
Thanks a lot ahead,
Niv