Multiword Search With Stemming

Hi all,

I am new to ES (version 7.9.0) and I have a case where I am searching with a string that contains more than one word, and I want the returned docs to contain all of the words in the search phrase, including stemmed versions of those words.

For instance, I need to search by "liver tumors" and I would like to find all docs that have both "liver" AND "tumor", or "liver" AND "tumors", and so forth. I would like to exclude docs that have only "liver(s)" or "tumor(s)", but not both.

A simple search does not meet my needs: some docs contain only "liver" or only "tumor", but 50 times or so, and score higher, whereas the docs I want contain "liver tumors" far fewer times and score lower.

I cannot post my real data, but have some mocked up data to share.

POST /_bulk
{ "create" : { "_index" : "resumes", "_id" : "1" } }
{ "resume_text" : "liver tumor" }
{ "create" : { "_index" : "resumes", "_id" : "2" } }
{ "resume_text" : "liver tumors"}
{ "create" : { "_index" : "resumes", "_id" : "3" } }
{ "resume_text" : "brain tumor"}
{ "create" : { "_index" : "resumes", "_id" : "4" } }
{ "resume_text" : "liver disease" }
{ "create" : { "_index" : "resumes", "_id" : "5" } }
{ "resume_text" : "something else" }
{ "create" : { "_index" : "resumes", "_id" : "6" } }
{ "resume_text" : "liver function and kidney tumors" }

The simple case

GET /resumes/_search
{
  "query": {
    "match": {
      "resume_text": {
        "query": "liver tumors"
      }
    }
  }
}

As one would expect, this search returns every document that contains "liver" or "tumors", i.e. 1, 2, 4 and 6.

Using "operator": "AND"

GET /resumes/_search
{
  "query": {
    "match": {
      "resume_text": {
        "query": "liver tumors",
        "operator": "AND"
      }
    }
  }
}

This returns only the documents that contain both "liver" and "tumors", i.e. 2 and 6.

Using "bool": "must"

GET /resumes/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "resume_text": {
              "query": "liver"
            }
          }
        },
        {
          "match": {
            "resume_text": {
              "query": "tumors"
            }
          }
        }
      ]
    }
  }
}

This behaves exactly like "operator": "AND".

What I would like to get are documents containing ("liver" or "livers") and ("tumor" or "tumors"), even though my user types only "liver tumors".

As a bonus, I'd also love to match the phrase "liver tumor", "liver tumors", etc., but will settle for the above since that will get my users close enough.

Any help would be greatly appreciated.

Thanks in advance.

Welcome!

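I'd map the field with the english analyzer, which stems terms at both index time and search time, so that "tumor" and "tumors" end up as the same token: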

DELETE /resumes
PUT /resumes
{
  "mappings": {
    "properties": {
      "resume_text": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}

POST /resumes/_bulk
{ "create" : { "_id" : "1" } }
{ "resume_text" : "liver tumor" }
{ "create" : { "_id" : "2" } }
{ "resume_text" : "liver tumors"}
{ "create" : { "_id" : "3" } }
{ "resume_text" : "brain tumor"}
{ "create" : { "_id" : "4" } }
{ "resume_text" : "liver disease" }
{ "create" : { "_id" : "5" } }
{ "resume_text" : "something else" }
{ "create" : { "_id" : "6" } }
{ "resume_text" : "liver function and kidney tumors" }

GET /resumes/_search
{
  "query": {
    "match": {
      "resume_text": {
        "query": "liver tumors",
        "operator": "AND"
      }
    }
  }
}
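
To check that stemming is actually happening, one option is to run the search text through the field's analyzer with the _analyze API (this assumes the resumes index was created with the mapping above):

GET /resumes/_analyze
{
  "field": "resume_text",
  "text": "liver tumors"
}

With the english analyzer the returned tokens should be stems rather than the raw words, which is why a query for "tumors" can match a document that contains "tumor".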

Then also try it with:

GET /resumes/_search
{
  "query": {
    "match_phrase": {
      "resume_text": {
        "query": "liver tumors"
      }
    }
  }
}

It might be a better fit for your use case.
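
One caveat: match_phrase only matches when the stemmed terms appear next to each other, so a document like 6 ("liver function and kidney tumors") is not returned. If near matches should count too, the slop parameter loosens the adjacency requirement. A sketch, where the value 3 is only an illustrative assumption to tune against your own data:

GET /resumes/_search
{
  "query": {
    "match_phrase": {
      "resume_text": {
        "query": "liver tumors",
        "slop": 3
      }
    }
  }
}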

And finally, this could be even better:

GET /resumes/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "resume_text": {
              "query": "liver tumors"
            }
          }
        },
        {
          "match": {
            "resume_text": {
              "query": "liver tumors",
              "operator": "AND"
            }
          }
        },
        {
          "match": {
            "resume_text": {
              "query": "liver tumors"
            }
          }
        }
      ]
    }
  }
}

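Note the ordering of the results: a document that matches more of the should clauses scores higher, so the exact phrase matches come first, then the document containing both terms, then the partial matches.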

{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : 2.8155408,
    "hits" : [
      {
        "_index" : "resumes",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.8155408,
        "_source" : {
          "resume_text" : "liver tumor"
        }
      },
      {
        "_index" : "resumes",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 2.8155408,
        "_source" : {
          "resume_text" : "liver tumors"
        }
      },
      {
        "_index" : "resumes",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 1.3676834,
        "_source" : {
          "resume_text" : "liver function and kidney tumors"
        }
      },
      {
        "_index" : "resumes",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.46925682,
        "_source" : {
          "resume_text" : "brain tumor"
        }
      },
      {
        "_index" : "resumes",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.46925682,
        "_source" : {
          "resume_text" : "liver disease"
        }
      }
    ]
  }
}
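
The query above still returns documents 3 and 4, which contain only one of the two words. If those must be excluded, as in the original question, one possible variant (a sketch, assuming the same english-analyzed mapping) moves the AND match into a must clause and keeps the phrase match as a should clause, so that it only boosts the score:

GET /resumes/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "resume_text": {
              "query": "liver tumors",
              "operator": "AND"
            }
          }
        }
      ],
      "should": [
        {
          "match_phrase": {
            "resume_text": {
              "query": "liver tumors"
            }
          }
        }
      ]
    }
  }
}

On the sample data this should leave only documents 1, 2 and 6, with the exact phrase matches ranked first.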