How to combine Elasticsearch function score query and text proximity scoring with weight?

This question was originally posted on StackOverflow.
I have not received any response there, so I'm reposting it here.


I'd like to use a function score query together with weighted text proximity scoring. However, the query does not seem to calculate the score of the "match_phrase" clause in "query.function_score.functions" correctly.

For example, let's say I'm building a curated media site and want to put up a banner link titled "Financial articles in 2017".

I'd like to filter and score as follows:

  • Filter
    • Articles must be created in 2017.
    • The category must be "finance".
  • Scoring
    • The more often an article has been "favorited", the higher its score.
    • If the article has a comment within the last month, it gets a higher score.
    • If the article has certain tags, it gets a higher score.
      • (the list of tags to match might contain more than 100 words)

The data has the following preconditions:

  • Preconditions
    • The dataset contains more than 2 million documents.
    • Each article must have one "category".
    • An article might have one or more "tags".
      • A single article might have more than 1000 tags.
    • "tags_text" is a text string: the tags sorted in alphabetical order and joined by whitespace.
    • "favorite" is the number of people who have marked the article as a "favorite" (like Facebook's Like button).
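The "tags_text" convention in the preconditions can be sketched in a few lines of Python. This only illustrates the stated rule (sort alphabetically, join with a single space); the helper name build_tags_text is hypothetical and is not part of any indexing pipeline described in this post.

```python
def build_tags_text(tags):
    """Sort tags alphabetically and join them with single spaces."""
    return " ".join(sorted(tags))


# Tags of articles 1 and 3 from the example data in this post
print(build_tags_text(["fintech", "uk", "london"]))
print(build_tags_text(["bitcoin", "bubble", "btc", "mtgox", "wizsec"]))
```

Because the tags are sorted, the position of a tag inside "tags_text" depends on which other tags the article has, not on any relationship between the tags themselves.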

Example data and query

// create index
curl -XPUT 'http://localhost:9200/blog'

Then put the articles:

// create articles
curl -H "Content-Type: application/json" -XPUT http://localhost:9200/blog/article/1 -d '
{
  "article_id": 1,
  "title": "Fintech company list in London",
  "tags": ["fintech", "uk", "london"],
  "tags_text": "fintech london uk",
  "category": "finance",
  "created_at": "2016-12-01T00:00:00Z",
  "last_comment_at": null,
  "favorite": 100
}'

curl -H "Content-Type: application/json" -XPUT http://localhost:9200/blog/article/2 -d '
{
  "article_id": 2,
  "title": "World economy",
  "tags": ["world", "economy", "regression", "war"],
  "tags_text": "economy regression war world",
  "category": "finance",
  "created_at": "2017-02-15T00:00:00Z",
  "last_comment_at": "2017-11-01T00:00:00Z",
  "favorite": 20
}'

curl -H "Content-Type: application/json" -XPUT http://localhost:9200/blog/article/3 -d '
{
  "article_id": 3,
  "title": "Bitcoin bubble",
  "tags": ["bitcoin", "bubble", "btc", "mtgox", "wizsec"],
  "tags_text": "bitcoin btc bubble mtgox wizsec",
  "category": "finance",
  "created_at": "2017-08-03T00:00:00Z",
  "last_comment_at": null,
  "favorite": 50
}'

curl -H "Content-Type: application/json" -XPUT http://localhost:9200/blog/article/4 -d '
{
  "article_id": 4,
  "title": "Virtual currency in China",
  "tags": ["bitcoin", "ico", "china"],
  "tags_text": "bitcoin china ico",
  "category": "finance",
  "created_at": "2017-09-03T00:00:00Z",
  "last_comment_at": null,
  "favorite": 10
}'

curl -H "Content-Type: application/json" -XPUT http://localhost:9200/blog/article/5 -d '
{
  "article_id": 5,
  "title": "Average FX rate in 2017-10",
  "tags": ["fx", "currency", "doller"],
  "tags_text": "currency doller fx",
  "category": "finance",
  "created_at": "2017-11-01T00:00:00Z",
  "last_comment_at": null,
  "favorite": 10
}'

curl -H "Content-Type: application/json" -XPUT http://localhost:9200/blog/article/6 -d '
{
  "article_id": 6,
  "title": "Cat and Dog",
  "tags": ["pet", "cat", "dog", "family"],
  "tags_text": "cat dog family pet",
  "category": "pet",
  "created_at": "2017-11-02T00:00:00Z",
  "last_comment_at": null,
  "favorite": 500
}'

Then execute the query.

Note: Originally, "functions.filter.range.last_comment_at.from" was set to "now-30d". To pin the time and get reproducible results, assume now is 2017-11-06 and use "2017-11-06||-30d" instead.
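As a rough check of that pinned window, the lower bound can be computed by hand. The sketch below is plain Python rather than Elasticsearch date math, and it assumes that time zones can be ignored and that "2017-11-06" means midnight at the start of that day.

```python
from datetime import datetime, timedelta

# Assumed "now", as pinned in the note
now = datetime(2017, 11, 6)
lower_bound = now - timedelta(days=30)  # corresponds to "2017-11-06||-30d"

# last_comment_at of article 2, the only article that has a comment
last_comment_at = datetime(2017, 11, 1)

# include_lower is true, so the bound itself counts as inside the window
in_window = lower_bound <= last_comment_at
print(lower_bound.date(), in_window)
```

Under these assumptions the window starts on 2017-10-07, so only article 2 should receive the 0.3 weight for a recent comment.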

curl -H "Content-Type: application/json" -XGET 'http://localhost:9200/blog/article/_search' -d '
{
  "_source": {
    "includes": ["article_id", "title", "tags_text"]
  },
  "query": {
    "function_score": {
      "functions": [
        {
          "field_value_factor": {
            "factor": 1,
            "modifier": "log",
            "field": "favorite"
          },
          "weight": 0.3
        },
        {
          "filter": {
            "range": {
              "last_comment_at": {
                "from": "2017-11-06||-30d",
                "to": null,
                "include_lower": true,
                "include_upper": false
              }
            }
          },
          "weight": 0.3
        },
        {
          "filter": {
            "match_phrase": {
              "tags_text": {
                "query": "bitcoin fintech smartphone",
                "slop": 100
              }
            }
          },
          "weight": 0.4
        }
      ],
      "query": {
        "bool": {
          "filter": [
            {"term": {"category": "finance"} },
            {
              "range": {
                "created_at": {
                  "from": "2017-01-01T00:00:00",
                  "to": "2017-12-31T23:59:59",
                  "include_lower": true,
                  "include_upper": true
                }
              }
            }
          ],
          "must": {
            "match_all": {}
          }
        }
      },
      "score_mode": "sum"
    }
  }
}'

The results are as follows:

{
  "hits": {
    "total": 4,
    "max_score": 0.69030905,
    "hits": [
      {
        "_index": "blog",
        "_type": "article",
        "_id": "2",
        "_score": 0.69030905,
        "_source": {
          "article_id": 2,
          "tags_text": "economy regression war world",
          "title": "World economy"
        }
      },
      {
        "_index": "blog",
        "_type": "article",
        "_id": "3",
        "_score": 0.509691,
        "_source": {
          "article_id": 3,
          "tags_text": "bitcoin btc bubble mtgox wizsec",
          "title": "Bitcoin bubble"
        }
      },
      {
        "_index": "blog",
        "_type": "article",
        "_id": "5",
        "_score": 0.3,
        "_source": {
          "article_id": 5,
          "tags_text": "currency doller fx",
          "title": "Average FX rate in 2017-10"
        }
      },
      {
        "_index": "blog",
        "_type": "article",
        "_id": "4",
        "_score": 0.3,
        "_source": {
          "article_id": 4,
          "tags_text": "bitcoin china ico",
          "title": "Virtual currency in China"
        }
      }
    ]
  }
}
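These scores can be reproduced by hand from the first two functions alone, which is consistent with the third function contributing nothing. The sketch below assumes that the "log" modifier is a base-10 logarithm, that the filtered "match_all" query scores 1.0, and that the default "boost_mode" of "multiply" applies. The helper name expected_score is hypothetical.

```python
import math


def expected_score(favorite, recent_comment, tags_phrase_match=False):
    """Sum of the weighted function values (score_mode "sum") times a query score of 1.0."""
    score = 0.3 * math.log10(favorite)  # field_value_factor with modifier "log"
    if recent_comment:
        score += 0.3                    # range filter on last_comment_at
    if tags_phrase_match:
        score += 0.4                    # match_phrase filter on tags_text
    return 1.0 * score


# (article_id, favorite, recent_comment) in the order of the hits
for article_id, favorite, recent in [(2, 20, True), (3, 50, False), (5, 10, False), (4, 10, False)]:
    print(article_id, round(expected_score(favorite, recent), 6))
```

With tags_phrase_match left at False for every article, the computed values agree with the returned "_score" values to about six decimal places.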

I checked the results with "explain", but it seems that the "match_phrase" query on the "tags_text" field does not affect the score at all.

How can I use weighted similarity scoring together with a function score query? (I checked with ES v2.4.0 and v6.3.0.)
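One possible explanation, offered as an assumption rather than a confirmed diagnosis: a "match_phrase" query matches only documents that contain every term of the phrase, with "slop" relaxing term positions but not term presence, and a "filter" inside "functions" only decides whether the constant weight is applied. The sketch below is plain Python, not Elasticsearch code. It approximates that rule with whitespace tokenization, and the name contains_all_terms is hypothetical.

```python
def contains_all_terms(tags_text, phrase):
    """True only if every whitespace-separated term of phrase occurs in tags_text."""
    return set(phrase.split()) <= set(tags_text.split())


phrase = "bitcoin fintech smartphone"
tags_by_article = {
    2: "economy regression war world",
    3: "bitcoin btc bubble mtgox wizsec",
    4: "bitcoin china ico",
    5: "currency doller fx",
}
for article_id, tags_text in tags_by_article.items():
    print(article_id, contains_all_terms(tags_text, phrase))
```

Under this reading, none of the four hits satisfies the filter, so the 0.4 weight is never added, even for articles 3 and 4, which share the "bitcoin" tag.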

Thanks,
