Query with multiple conditions on denormalized data

Hi, I seem to be stuck formulating a certain query on a denormalized matrix.

The original dataset looks like this:

id fieldA fieldB fieldC
"0001" 0.1 0.2 0.3
"0002" 0.4 0.5 0.6
...

In such a structure I can easily find all ids for which multiple fields fulfill a certain condition by using multiple filters in my query and then just returning the id field of matching documents. Something like this (untested):

"_source": ["id"],
"query": {
    "bool": {
        "filter": [
            {"range": {"fieldA": {"gt": 0.5}}},
            {"range": {"fieldB": {"gt": 0.3}}}
        ]
    }
}

Unfortunately, this dataset has far too many ids and fields to be stored in this way, so we denormalized it:

id field value
"0001" fieldA 0.1
"0001" fieldB 0.4
"0002" fieldA 0.2
"0002" fieldB 0.5
...

This structure works great for aggregations, but filtering on multiple conditions like above seems impossible. I can formulate a query that does (feature == fieldA AND value > 0.5) OR (feature == fieldB AND value > 0.3), where I get all IDs for which at least one of the field conditions holds true. Using a "AND" instead of "OR" will also not work, because a document cannot contain a the field with value fieldA and fieldB at the same time.

I thought about maybe using the OR query from above, then terms aggregate and check which of the buckets is full (=> all conditions matched), but that seems awkward.

Do you have any suggestions?

I used script and a bucket selector to do what I described at the end of my last post. This is rather awkward but it works. If someone can think of a nicer approach, please let me know!

{
    "size": 0,
    "query": {
        "bool": {
            "should": [
                {'bool': {'filter': [
                    {'term': {'field': {'value': 'fieldA'}}},
                    {'range': {'value': {'gt': 0.5}}}
                ]}}, 
                {'bool': {'filter': [
                    {'term': {'field': {'value': 'fieldB'}}}, 
                    {'range': {'value': {'gt': 0.3}}}
                ]}}
            ]
        }
    },
    "aggs": {
        "ids": {
            "terms": {
                "field": "id",
                "size": N_IDS            
            },
            "aggs": {
                "min_bucket_selector": {
                    "bucket_selector": {
                        "buckets_path": {
                            "count": "_count"
                        },
                        "script": {
                            "source": f"params.count == 2"
                        }
                    }
                }

            }
        }
    }
}

Basically:

  1. include all fields for which at least one of the conditions matches
  2. term-aggregate id field
  3. exclude all buckets that are not max-sized (all conditions match)

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.