Elasticsearch fails to return some documents


#1

I have this data:

{"_index": "simple", "_type": "motorcycle", "_source": {"date": "2018-04-28T13:16", "price": 59900, "sellerName": "Lelles MC AB", "description": "KTM 690 Duke (Abs) M\u00e4tarst\u00e4llning: 450 mil F\u00e4rg: Vit Typ: Touring/Landsv\u00e4g Info: Mycket fin Duke 690 med rensad bakdel och handskydd.", "location": "Uppsala", "id": 345, "title": "KTM 690 Duke (Abs)", "modelYear": 2016, "url": "https://www.blocket.se/uppsala/KTM_690_Duke__Abs__79079911.htm?ca=11&w=3", "vehicleType": "Touring"}}
{"_index": "simple", "_type": "motorcycle", "_source": {"date": "2018-04-28T14:00", "price": 12900, "sellerName": "Hondo", "description": "Hej! D\u00e5 va det dags att s\u00e4lja p\u00e4rlan. Det som \u00e4r gjort med crossen \u00e4r Nytt bakd\u00e4ck. Nya bromsbel\u00e4gg bak. Nytt sadel\u00f6verdrag. Kolvbytet gjord f\u00f6r 25 timmar sen. Inga l\u00e4ckage. Extra k\u00e5pset ing\u00e5r. Vid en smidig aff\u00e4r s\u00e5 ing\u00e5r en haspl\u00e5t. Crossen startar alltid p\u00e5 f\u00f6rsta eller andra kicken. Vid mer info f\u00e5r ni g\u00e4rna ringa p\u00e5 telefon mvh", "location": "Uddevalla", "id": 319, "title": "Honda Cr 125", "modelYear": 2001, "url": "https://www.blocket.se/goteborg/Honda_Cr_125_79080992.htm?ca=11&w=3", "vehicleType": "Cross/enduro"}}
{"_index": "simple", "_type": "motorcycle", "_source": {"date": "2018-04-28T14:15", "price": 22000, "sellerName": "Martin", "description": "G\u00e5tt - 2284mil.Haft sedan 2008 \u00e4r servad regelbunden p\u00e5 mc-firma. V\u00e4lsk\u00f6tt. Startar och g\u00e5r fint. Allt original. Vinterf\u00f6rvaring i garage. Besiktad senast maj -17. Ring eller maila", "location": "Norrk\u00f6ping", "id": 314, "title": "Honda VT 600C", "modelYear": 1999, "url": "https://www.blocket.se/ostergotland/Honda_VT_600C_79081306.htm?ca=11&w=3", "vehicleType": "Custom"}}

and I'm running this Python code to index it:

import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

    def create_index(self, file_path):
        """
            Takes path to file containing JSON-formatted data
            and indexes into Elasticsearch index.
        """
        self.es = Elasticsearch()
        
        print('Creating index "{}"'.format(INDEX_NAME))

        request_body = {
            "settings": {
                "index": {
                    "number_of_shards": 1,
                    "number_of_replicas": 0
                }
            },
            "mappings": {
                "motorcycle": {
                    "properties": {
                        "location": {
                            "type": "text",
                            "analyzer": "swedish"
                        },
                        "description": {
                            "type": "text",
                            "analyzer": "swedish"
                        }
                    }
                }
            }
        }
        self.es.indices.create(index=INDEX_NAME, body=request_body)
        # Each line of the file is one bulk action; the file is closed after indexing.
        with open(file_path, "r") as f_in:
            actions = (json.loads(line) for line in f_in)
            print("Performed bulk index: {}".format(bulk(self.es, actions)))
        self.es.indices.refresh(index="simple")

Now I'm trying to query the index using Postman for all documents with location:Uppsala, the location of the first document (I ran the same query from Python with the same result):

POST to localhost:9200/simple/_search:
{
    "query": {
        "bool": {
            "filter": [
                {
                    "term": {
                        "location": "uppsala"
                    }
                }
            ]
        }
    }
}

It returns nothing. The same thing happens if I change the location to uddevalla, which is also in the original data (second document).

However, if I change location to norrköping, it returns the third document, which it should do.

What is the reason behind this erratic behaviour?

UPDATE:

The documents that the location filter fails to return do not seem to show up for any query at all. For example, this query:

{
    "query": {
        "bool": {
            "filter": [],
            "must": {
                "multi_match": {
                    "fields": [
                        "title^1.0",
                        "description"
                    ],
                    "operator": "or",
                    "query": "honda",
                    "type": "cross_fields"
                }
            }
        }
    }
}

only returns one result (the one with location:Norrköping), while it should return two (the one with location:Uddevalla should also be returned).


(Christoph) #2

Hi @Sam1993,

I will just zoom in on one of the problems, since the rest is probably related. When you index the location field with the value "Uppsala" using the "swedish" analyzer, it gets stemmed. You can see this using the "_analyze" endpoint:

POST /simple/_analyze
{
  "analyzer": "swedish",
  "text": "Uppsala"
}

-->

{
  "tokens": [
    {
      "token": "uppsal",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
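
The same check can be scripted from Python. This is only a sketch: the helper name `analyzed_tokens` is made up for illustration, and it assumes a client version whose `indices.analyze` accepts `body=`, in the same style as the `indices.create` call in the question.

```python
def analyzed_tokens(es, index, analyzer, text):
    """Return the tokens an analyzer produces for `text`, via the _analyze API.

    `es` is assumed to be an elasticsearch.Elasticsearch client (or any object
    exposing the same `indices.analyze` method).
    """
    response = es.indices.analyze(
        index=index,
        body={"analyzer": analyzer, "text": text},
    )
    # Keep only the token strings; offsets and positions are dropped.
    return [token["token"] for token in response["tokens"]]


# Usage against a running cluster (not executed here):
#   from elasticsearch import Elasticsearch
#   analyzed_tokens(Elasticsearch(), "simple", "swedish", "Uppsala")
# Given the _analyze response shown above, this should yield ["uppsal"].
```

Running it for each location value in the data set is a quick way to see which tokens actually ended up in the index.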

Now, the "term" query you are using finds documents that contain the exact term specified in the inverted index. The query term is not analyzed, so "uppsala" is compared against the indexed token "uppsal", and there is no match. If you use "term": { "location": "uppsal" } or a "match" query, which is meant for full-text search and does get analyzed, it will return the first document you posted.
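
As a rough sketch of those two alternatives, the request bodies could be built like this in Python. The helper `location_filter` and its `analyzed` flag are hypothetical names used only for illustration; the field name and the bool/filter structure are taken from the query in the question.

```python
def location_filter(location, analyzed=True):
    """Build a search body that filters on the `location` field.

    analyzed=True  -> "match": the query text goes through the field's analyzer,
                      so it is reduced to the same token form that was indexed.
    analyzed=False -> "term": the value is compared verbatim against the
                      inverted index, so it must already be the indexed token.
    """
    clause = "match" if analyzed else "term"
    return {"query": {"bool": {"filter": [{clause: {"location": location}}]}}}


match_body = location_filter("Uppsala")                # analyzed at query time
term_body = location_filter("uppsal", analyzed=False)  # exact indexed token

# Either body could then be sent with a client, for example:
#   es.search(index="simple", body=match_body)
```

The "match" variant is usually the easier one to maintain, because the caller does not need to know what the analyzer does to each value.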

I suggest re-reading about the differences between "term" and "match" queries, and also about the analyzer you are using. I think this will answer most of the other questions above. Hope this helps.


(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.