Elasticsearch: Understanding Match query


(Rahul Nama) #1

Hi Team

Can you please explain the difference between the two queries below?

  1. GET /_search
     {
         "query": {
             "match" : {
                 "message" : "this is a test"
             }
         }
     }

  2. GET /_search
     {
         "query": {
             "match" : {
                 "message" : {
                     "query" : "this is a test",
                     "operator" : "and"
                 }
             }
         }
     }

I've indexed a few PDF files, and when I use the second query I get more relevant results.

Can someone explain the difference?


(David Pilato) #2

The first one is equivalent to

GET /_search
{
    "query": {
        "match" : {
            "message" : {
                "query" : "this is a test",
                "operator" : "or"
            }
        }
    }
}
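
To make the equivalence concrete, here is a toy sketch in Python of how the two operators decide whether a document matches. It is not Elasticsearch code: `tokenize` and `matches` are hypothetical helpers, the lowercase/whitespace tokenizer only stands in for a real analyzer, the sample documents are invented, and scoring is ignored entirely.

```python
def tokenize(text):
    """Crude stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def matches(doc_text, query_text, operator="or"):
    """Return True if the document matches under the given operator.

    "or"  -> at least one query term occurs in the document (the default).
    "and" -> every query term occurs in the document.
    """
    doc_terms = set(tokenize(doc_text))
    query_terms = tokenize(query_text)
    if operator == "and":
        return all(term in doc_terms for term in query_terms)
    return any(term in doc_terms for term in query_terms)


# Invented sample documents, for illustration only.
docs = [
    "This is a test",
    "A test of the installer",
    "Nothing relevant here",
]

or_hits = [d for d in docs if matches(d, "this is a test")]
and_hits = [d for d in docs if matches(d, "this is a test", operator="and")]

print(or_hits)   # the first two documents share at least one term
print(and_hits)  # only the first document contains every term
```

With "or", a document that shares a single common word such as "a" already matches, which is one plausible reason the "and" form feels more relevant.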

(Rahul Nama) #3

Hi @dadoonet

Got it.

Is there any difference if we don't specify the operator in the second form?

  1. "message" : "this is a test"

  2. "message" : {
         "query" : "this is a test"
     }

I ask because when I use forms 1 and 2 in a match query, the results vary a lot.

Thanks
Rahul


(David Pilato) #4

No, I don't think it makes any difference.


(Rahul Nama) #5

hi @dadoonet

I'm facing an issue.

I have deployed Elasticsearch on Windows and on Linux (the same set of documents on both nodes, but the nodes are independent) with the same settings and mappings.

But when I run a query, the results on Windows and the results on Linux are completely different.

Any idea what causes this behavior?

Thanks
rahul


(David Pilato) #6

No. You need to share both results from both systems.

Please format your code, logs or configuration files using the </> icon as explained in this guide and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.


(Rahul Nama) #7

Okay @dadoonet, I will follow the instructions.

Issue: I indexed 14 PDF files with the same settings and mappings on 2 ES nodes, but I get different results when I query them.

Case-1: Elasticsearch deployed on Amazon EC2(Windows)

Indexed 14 pdf files

query:

indexname: testindex

{
    "_source" : "url",
    "query": {
        "match" : {
            "content" : {
                "query" : "windows install",
                "operator": "and"
            }
        }
    }
}

Response:

The last segment of each URL is the name of the file.

"hits": [
      {
        "_index": "testindex",
        "_type": "_doc",
        "_id": "5",
        "_score": 2.230532,
        "_source": {
          "url": "http://127.0.0.1:5000/js/Linux/linux _faq_3_manual.pdf"
        }
      },
      {
        "_index": "testindex",
        "_type": "_doc",
        "_id": "8",
        "_score": 2.084747,
        "_source": {
          "url": "http://127.0.0.1:5000/js/Linux/the-linux-faq.pdf"
        }
      }
]

Case-2: Elasticsearch deployed on Redhat Linux

Indexed same 14 pdf files

Index name: testindex

query:

{
    "_source" : "url",
    "query": {
        "match" : {
            "content" : {
                "query" : "windows install",
                "operator": "and"
            }
        }
    }
}

results:

"hits": [
            {  
                "_index": "testindex",
                "_type": "_doc",
                "_id": "11",
                "_score": 2.6487362,
                "_source": {
                    "url": "http://filesystemwef.com/Windows_Issues/31831392.pdf"
                }
            },
            {
                "_index": "testindex",
                "_type": "_doc",
                "_id": "12",
                "_score": 1.2416239,
                "_source": {
                    "url": "http://http://filesystemwef.com/Windows_Issues/357786482.pdf"
                }
            }
]

(David Pilato) #8

We can see that the computed _score is different. By default, elasticsearch sorts by _score so the ordering seems correct here.

Sadly I don't have the full response object, so I'm just guessing here.
Maybe you have more than one shard and the distribution of your documents across shards is different in one case than in the other. Also, the total number of documents may be different in one system than in the other.

Some ideas:

  • Run the same test with only one shard
  • Or use DFS: ?search_type=dfs_query_then_fetch
  • Check that you have exactly the same documents in both systems
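
Why the shard layout can change `_score`: with the default search type, each shard scores with term statistics (such as document frequency) taken from its own documents only, whereas `dfs_query_then_fetch` gathers those statistics across shards first. The Python sketch below is a rough illustration, assuming the BM25 inverse document frequency formula used by Lucene. The shard contents are made up, and `idf` is a hypothetical helper, not an Elasticsearch API.

```python
import math


def idf(doc_count, doc_freq):
    """BM25-style inverse document frequency for one term."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Made-up layouts: the same 14 documents spread over two shards in two ways.
# Each entry is (documents on the shard, documents there containing the term).
layout_a = [(7, 1), (7, 5)]
layout_b = [(7, 3), (7, 3)]

# Local statistics: every shard uses only its own counts.
local_a = [round(idf(n, df), 3) for n, df in layout_a]
local_b = [round(idf(n, df), 3) for n, df in layout_b]


def global_idf(layout):
    """Sum the counts over all shards before computing the IDF."""
    total_docs = sum(n for n, _ in layout)
    total_freq = sum(df for _, df in layout)
    return round(idf(total_docs, total_freq), 3)


global_a = global_idf(layout_a)
global_b = global_idf(layout_b)

print(local_a, local_b)    # per-shard values differ between the layouts
print(global_a, global_b)  # global values agree: 14 documents, 6 matching
```

The same term gets a different weight depending on which shard a document landed on, so two clusters holding identical documents can still rank them differently.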

(Rahul Nama) #9

@dadoonet

You are right. Both indices have 5 shards.

Is there an API that shows how many documents are in each shard?

Using dfs_query_then_fetch gives the same results on both nodes, and the results are also more relevant. But is it recommended in production?

Thanks for pointing out dfs_query_then_fetch; I hadn't seen it before. The Elasticsearch concepts really make sense. Elasticsearch offers a lot of flexibility: the more you understand it and the more of its features you use, the more relevant your search becomes.

There is still a lot to learn. Thanks to the whole Elastic team for such sensible features.
:slight_smile:

-Rahul


(David Pilato) #10

You can, but if you don't have much data, it's always better to use a single shard.
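
For reference, the number of primary shards is set when an index is created, through the `index.number_of_shards` setting. Below is a minimal Python sketch that builds such a request body and prints it in the console style used in this thread. The index name `testindex_single` is made up for illustration, and the mappings of the original index would still have to be added before reindexing the files.

```python
import json

# Hypothetical index name; it does not appear in the thread.
index_name = "testindex_single"

# Settings body asking for a single primary shard.
create_index_body = {
    "settings": {
        "index": {
            "number_of_shards": 1
        }
    }
}

# Render the request as "METHOD /path" followed by the JSON body.
request = "PUT /" + index_name + "\n" + json.dumps(create_index_body, indent=2)
print(request)
```

To see how documents are currently spread over shards, the cat shards API (for example `GET _cat/shards/testindex?v`) lists a document count per shard.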


(Rahul Nama) #11

hi @dadoonet

Using one shard gives more relevant results. Thank you for that.

If possible, can you also suggest a solution to the problem below?

All the indexed documents relate only to Windows and Linux issues. Now, whenever a user searches for "mobile issues", Elasticsearch still returns results because the query matches "issues", and it might also match "mobile" somewhere in the documents.

Reference:

search query:

GET pdfminerone/_search
{
    "size": 10,
    "_source": "url",
    "query": {
        "match": {
            "content": "mobile issues"
        }
    }
}

Response:

"hits": [
      {
        "_index": "pdfminerone",
        "_type": "_doc",
        "_id": "12",
        "_score": 4.144372,
        "_source": {
          "url": "http://127.0.0.1:5000/js/Windows_Issues/357786482.pdf"
        }
      },
      {
        "_index": "pdfminerone",
        "_type": "_doc",
        "_id": "10",
        "_score": 2.7226787,
        "_source": {
          "url": "http://127.0.0.1:5000/js/Linux/linux _faq_2_manual.pdf"
        }
      }]

The first document, with a score of about 4.14, is about Windows issues and says nothing about mobile issues.

But if we recommend that URL, the user will waste time looking for mobile issues in it.

How can we avoid such scenarios?


(David Pilato) #12

By default, Elasticsearch combines the query terms with "or", but you can change that to "and" with something like

GET /_search
{
    "query": {
        "match" : {
            "Field" : {
                "query" : "text",
                "operator" : "and"
            }
        }
    }
}

(Rahul Nama) #13

Makes sense, but I still see similar results.

Query-1

GET testbooks/_search
{
    "size": 10,
    "_source": "url",
    "query": {
        "match": {
            "content": "mobile issues"
        }
    }
}

response:

"hits": {
    "total": 9,
    "max_score": 4.144372,
    "hits": [
      {
        "_index": "testbooks",
        "_type": "_doc",
        "_id": "3",
        "_score": 4.144372,
        "_source": {
          "url": "/Windows_Issues/357786482.pdf"
        }
      },
      {
        "_index": "testbooks",
        "_type": "_doc",
        "_id": "8",
        "_score": 2.7226787,
        "_source": {
          "url": "/Linux/linux _faq_2_manual.pdf"
        }
      }]

Query-2:

GET testbooks/_search
{
    "_source": "url",
    "query": {
        "match" : {
            "content" : {
                "query" : "mobile issues",
                "operator" : "and"
            }
        }
    }
}

Response:

"hits": {
    "total": 2,
    "max_score": 4.144372,
    "hits": [
      {
        "_index": "testbooks",
        "_type": "_doc",
        "_id": "3",
        "_score": 4.144372,
        "_source": {
          "url": "/Windows_Issues/357786482.pdf"
        }
      },
      {
        "_index": "testbooks",
        "_type": "_doc",
        "_id": "8",
        "_score": 2.7226787,
        "_source": {
          "url": "/Linux/linux _faq_2_manual.pdf"
        }
      }
    ]

(David Pilato) #14

There is no content field in your example so I don't see how this works.


(Rahul Nama) #15

hi @dadoonet

Yes, I agree. Will both queries return the same score if both keywords (mobile, issues) appear in a document even once?

-Rahul


(Rahul Nama) #16

hi @dadoonet.

I've indexed 14 books, of which two or three talk about the internet.

search query:

GET testbooks/_search
{
    "_source": "url",
    "explain": true,
    "query": {
        "match" : {
            "content" : {
                "query" : "unable to connect to the internet",
                "operator" : "and"
            }
        }
    }
}

When I run this query, I get a document that is not relevant to the internet, even though documents that are more relevant to the internet are available in ES.

In the document ES returned, the keyword unable is repeated 60 times and the word connected is repeated 100 times, but internet appears only 2 times.

Still, it comes first in the results. How can I avoid such scenarios?

Please suggest.

Note: I could post the results, but they contain a lot of text, so I didn't.
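
One direction worth exploring here (it was not suggested in the thread itself) is a `match_phrase` query, which requires the terms to appear next to each other and in order instead of anywhere in the document. The toy Python sketch below contrasts the two behaviours. `tokenize`, `matches_all` and `matches_phrase` are hypothetical helpers, not Elasticsearch APIs, and the two sample documents are invented.

```python
def tokenize(text):
    """Crude stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def matches_all(doc_text, query_text):
    """Operator "and": every query term occurs somewhere in the document."""
    doc_terms = set(tokenize(doc_text))
    return all(term in doc_terms for term in tokenize(query_text))


def matches_phrase(doc_text, query_text):
    """Phrase semantics: the query terms occur consecutively and in order."""
    doc = tokenize(doc_text)
    query = tokenize(query_text)
    width = len(query)
    return any(doc[i:i + width] == query for i in range(len(doc) - width + 1))


query = "unable to connect to the internet"

# Invented documents: the first scatters the terms, the second has the phrase.
scattered = "unable to print and slow to connect so check the internet cable"
exact = "the laptop is unable to connect to the internet after the update"

# Both documents satisfy "and", but only the second contains the phrase.
print(matches_all(scattered, query), matches_phrase(scattered, query))
print(matches_all(exact, query), matches_phrase(exact, query))
```

In the query DSL this would mean trying `match_phrase` on the `content` field in place of `match`. The `explain` output already requested in the query above should also show how much each term contributes to the top document's score.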


(system) closed #17

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.