Cross-field atleast functionality (return documents where a term has x occurrences)

Hello!

I'm working for a client in the patent domain. We have a feature called atleast that returns all documents in which a term occurs at least x times. In the past a plugin was written for this (also mentioned here). It was built for ES 6.1. As you can see in the code provided in the old post, it works in a cross-field manner. I'll repost the code here for convenience:

    DELETE test
    PUT test
    {
      "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "standard"
            }
          }
        }
      },
      "mappings": {
        "test": {
          "properties": {
            "abstract": {
              "type": "text",
              "analyzer": "my_analyzer", 
              "index_options": "offsets"
            },
            "title": {
              "type": "text",
              "analyzer": "my_analyzer", 
              "index_options": "offsets"
            }
          }
        }
      }
    }

    POST test/test/1
    {
      "title": "Activity of a cell signaling pathway TGF-b in a subject ...",
      "abstract": "The present invention relates to a computer-implemented method for inferring activity of a TGF-β cellular signaling pathway in a subject ..."
    }

    POST test/test/2
    {
      "title": "Activity of a cell signaling pathway TGF-b in a subject ...",
      "abstract": "This doesn't have the search text referred to later"
    }

    GET test/_search
    {
      "query": {
        "bool": {
          "filter": {
            "script": {
              "script": {
                "source": "atleast",
                "lang": "byron_scripts",
                "params": {
                  "fields": [
                    "abstract",
                    "title"
                  ],
                  "term": "signaling pathway",
                  "occurrences": 2
                }
              }
            }
          }
        }
      }
    }

The query returns the first document, since the text occurs once in each of the two fields, which makes 2 occurrences in total.
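
To make the intended semantics concrete, here is a minimal Python sketch of what we mean by cross-field atleast. It is not the plugin code: the helper names `count_phrase` and `atleast` are made up for illustration, and the regex word split is only a rough, case-sensitive stand-in for the standard tokenizer.

```python
import re


def count_phrase(text, phrase):
    """Count how often the phrase occurs as consecutive tokens in text.

    Assumption: a plain regex word split approximates the analyzer.
    """
    tokens = re.findall(r"\w+", text)
    words = re.findall(r"\w+", phrase)
    n = len(words)
    if n == 0:
        return 0
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words)


def atleast(doc, fields, term, occurrences):
    """True if the term occurs at least `occurrences` times across all fields combined."""
    total = sum(count_phrase(doc.get(field, ""), term) for field in fields)
    return total >= occurrences


# The two sample documents from the request above.
docs = {
    "1": {
        "title": "Activity of a cell signaling pathway TGF-b in a subject ...",
        "abstract": "The present invention relates to a computer-implemented method "
                    "for inferring activity of a TGF-β cellular signaling pathway "
                    "in a subject ...",
    },
    "2": {
        "title": "Activity of a cell signaling pathway TGF-b in a subject ...",
        "abstract": "This doesn't have the search text referred to later",
    },
}

matching = [
    doc_id
    for doc_id, doc in docs.items()
    if atleast(doc, ["abstract", "title"], "signaling pathway", 2)
]
print(matching)  # ['1']
```

The important part is that the counts are summed over the fields before they are compared with the threshold, rather than being checked per field.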

We recently upgraded to 7.9 and were hoping to get rid of this plugin and make use of the intervals query. We were able to recreate the functionality for multiple fields, but unfortunately the intervals query doesn't support cross-field matching the way multi_match does, as far as we could tell from our investigation.

Here is some example code:

    DELETE test
    PUT test
    {
      "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1
      }
    }

    POST test/_doc/1
    {
      "other.field.text1" : "my my text ",
      "other.field.text2" : "my my text",
      "other.field.text3" : "my my text"
    }
    POST test/_doc/2
    {
      "other.field.text1" : "hallo my text",
      "other.field.text2" : "hallo my text",
      "other.field.text3" : "hallo my text"
    }
    POST test/_doc/3
    {
      "other.field.text1" : "my my  text",
      "other.field.text2" : "my my  text",
      "other.field.text3" : "my my  text"
    }
    POST test/_doc/4
    {
      "other.field.text1" : "my text",
      "other.field.text2" : "my text",
      "other.field.text3" : "my text"
    }

    POST test/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "intervals": {
                "other.field.text1": {
                  "all_of": {
                    "ordered": true,
                    "max_gaps": 10000000,
                    "intervals": [
                      {
                        "match": {
                          "query": "my"
                        }
                      },
                      {
                        "match": {
                          "query": "my"
                        }
                      }
                    ]
                  }
                }
              }
            },
            {
              "intervals": {
                "other.field.text2": {
                  "all_of": {
                    "ordered": true,
                    "max_gaps": 10000000,
                    "intervals": [
                      {
                        "match": {
                          "query": "my"
                        }
                      },
                      {
                        "match": {
                          "query": "my"
                        }
                      }
                    ]
                  }
                }
              }
            }
          ],
          "minimum_should_match": 1,
          "boost": 1.0
        }
      }
    }

The above query returns all documents where the text1 field or the text2 field on its own contains the term "my" at least 2 times. But there seems to be no way to say "give me all documents where the term "my" occurs at least 2 times in these fields combined", which would return all 4 documents.
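
The gap can be shown with a small Python model of the four sample documents. This is only a sketch: it counts whitespace-separated tokens instead of running Elasticsearch, and the names `per_field` and `cross_field` are made up for this example.

```python
def count_term(text, term):
    """Count exact whitespace-separated occurrences of term in text."""
    return text.split().count(term)


# Only the two fields used in the query above are modelled here.
docs = {
    1: {"other.field.text1": "my my text ", "other.field.text2": "my my text"},
    2: {"other.field.text1": "hallo my text", "other.field.text2": "hallo my text"},
    3: {"other.field.text1": "my my  text", "other.field.text2": "my my  text"},
    4: {"other.field.text1": "my text", "other.field.text2": "my text"},
}
fields = ["other.field.text1", "other.field.text2"]

# Per-field check: some single field must reach the threshold by itself.
per_field = [
    doc_id
    for doc_id, doc in docs.items()
    if any(count_term(doc[f], "my") >= 2 for f in fields)
]

# Cross-field check: the counts are summed over the fields first.
cross_field = [
    doc_id
    for doc_id, doc in docs.items()
    if sum(count_term(doc[f], "my") for f in fields) >= 2
]

print(per_field)    # [1, 3]
print(cross_field)  # [1, 2, 3, 4]
```

Documents 2 and 4 contain "my" only once per field, so they pass only when the counts are added up across fields, which is the behaviour we are looking for.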

So I have 2 questions:

  1. Is there any way to solve this with the intervals query that we missed?
  2. Is there any other query, or maybe a script, introduced in ES 7 that we missed and that could provide this functionality?

Side note: creating a new field that contains the text from multiple fields is not an option, due to the large number of multi-field combinations and the resulting index size.
