Looking for guidance on how to create a DSL query

Hello, please, I am looking for guidance on how to perform a simple search. I have the following set of data:

POST /processing_records/_bulk
{"index":{}}
{"processing_id":"1234","file_name":"file1.xls","start_processing":"06/07/2024","status_processing":"processing"}
{"index":{}}
{"processing_id":"1234","file_name":"file1.xls","start_processing":"06/07/2024","end_processing":"06/07/2024","status_processing":"processed"}
{"index":{}}
{"processing_id":"1235","file_name":"file2.xls","start_processing":"06/07/2024","status_processing":"processing"}
{"index":{}}
{"processing_id":"1235","file_name":"file2.xls","start_processing":"06/07/2024","end_processing":"06/07/2024","status_processing":"processed"}
{"index":{}}
{"processing_id":"1236","file_name":"file2.xls","start_processing":"06/07/2024","status_processing":"processing"}
{"index":{}}
{"processing_id":"1236","file_name":"file2.xls","start_processing":"06/07/2024","end_processing":"06/07/2024","status_processing":"processed"}
{"index":{}}
{"processing_id":"1237","file_name":"file4.xls","start_processing":"06/07/2024","status_processing":"processing"}

if helps, the set of data would look like this in CSV

processing_id, file_name, start_processing, end_processing, status_processing
1234, file1.xls, 06/jul/2024, empty, processing
1234, file1.xls, 06/jul/2024, 06/jul/2024, processed
1235, file2.xls, 06/jul/2024, empty, processing
1235, file2.xls, 06/jul/2024, 06/jul/2024, processed 
1236, file2.xls, 06/jul/2024, empty, processing
1236, file2.xls, 06/jul/2024, 06/jul/2024, processed
1237, file4.xls, 06/jul/2024, empty, processing

as you can see the same processing id will appear twice, once when is processing and a second when is processed, in this case and in this given moment, only the record 1237 is still processing and is not processed.

in a SQL form, in order to find this record i would run something like this:

SELECT * FROM processing_records 
WHERE status_processing = 'processing' AND 
		processing_id not IN (SELECT processing_id FROM processing_records WHERE status_processing = 'processed' )

I tried a few ways to get this done in DSL as the example below, however it is not working,

I also tried to SQL convert using the APIs but again this is not supported.

Any advices?

Thanks in advance

Hi @RobertBM

Sometimes when we find a index modeling already defined, it becomes difficult to apply changes.
I believe you would have less work if there is only one record for each file, where you would update the status_processing field when the file finishes processing. Your query would be much simpler.

I don't know if with aggregation you will get the desired result, but a possible solution would be to search for records with 'processing' status and then run a second query filtering the processing_ids that do not have 'processed' status.

Another point you can check is whether the latest version of elasticsearch supports the SQL query 'not IN' clause

Here are some ways to do this. One of them is to use only the processing_ids where the document count is greater than 1. We can add a bucket_selector condition to the aggregation. This will filter the results according to the document count within each processing_id:

POST /processing_records/_search
{
  "size": 0,
  "aggs": {
    "processing_ids": {
      "terms": {
        "field": "processing_id.keyword",
        "size": 10000
      },
      "aggs": {
        "status_processing": {
          "terms": {
            "field": "status_processing.keyword"
          },
          "aggs": {
            "latest_record": {
              "top_hits": {
                "size": 1,
                "sort": [
                  {
                    "start_processing.keyword": {
                      "order": "desc"
                    }
                  }
                ]
              }
            }
          }
        },
        "processing_id_filter": {
          "bucket_selector": {
            "buckets_path": {
              "docCount": "_count"
            },
            "script": "params.docCount == 1"
          }
        }
      }
    }
  }
}

See if this helps.

1 Like

hello @Alex_Salgado-Elastic thank you so much for taking the time to respond this thread.
as far as i could test, this solution matches perfectly with what i was trying to achieve.
again, thank you

1 Like