How can Percolate perform queries on a not-analyzed index?


(Sarath R Nair) #1

After a lot of digging around, I came to ask here. Let me start with a simple use case.

curl -XPUT 'localhost:9200/my-index?pretty' -H 'Content-Type: application/json' -d'
{
    "mappings": {
        "doctype": {
            "properties": {
                "message": {
                    "type": "text"
                }
            }
        },
        "queries": {
            "properties": {
                "query": {
                    "type": "percolator"
                }
            }
        }
    }
}
'

curl -XPUT 'localhost:9200/my-index/queries/2?refresh&pretty' -H 'Content-Type: application/json' -d'
{
    "query" : {
        "match_phrase" : {
            "message" : "pub/sub"
        }
    }
}
'


curl -XPUT 'localhost:9200/my-index/queries/1?refresh&pretty' -H 'Content-Type: application/json' -d'
{
    "query" : {
        "match_phrase" : {
            "message" : "x++"
        }
    }
}
'

Now my problem is that if I execute

curl -XGET 'localhost:9200/my-index/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query" : {
        "percolate" : {
            "field" : "query",
            "document_type" : "doctype",
            "document" : {
                "message" : "A new bonsai pub sub tree in the office x"
            }
        }
    }
}
'

I get two matches: one for the pub/sub query and one for the x++ query, because the document contains "pub sub" and "x". I know that is because of the analyzer, which drops the special characters. But even if I change the mapping of the field to

curl -XPUT 'localhost:9200/my-index?pretty' -H 'Content-Type: application/json' -d'
{
    "mappings": {
        "doctype": {
            "properties": {
                "message": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        },
        "queries": {
            "properties": {
                "query": {
                    "type": "percolator"
                }
            }
        }
    }
}
'

then the same document, "message" : "A new bonsai pub sub tree in the office x", gives zero matches, because the entire text is passed through as a single not_analyzed term.

Simply put, is there any way to solve this issue? I only want matches for those queries, phrase or non-phrase, that are indexed without removing special characters like / and +.


(Val Crettaz) #2

By default, the text field uses the standard analyzer. If you use the whitespace analyzer instead, then your input will simply be split on whitespace (but the tokens will not be lowercased)

"mappings": {
    "doctype": {
        "properties": {
            "message": {
                "type": "text",
                "analyzer": "whitespace"
            }
        }
    },
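
To see why the choice of analyzer matters here, the following is a rough Python sketch of how the three options tokenize the query strings. It does not call Elasticsearch: `standard_like`, `whitespace` and `keyword` are hypothetical helper names, and `standard_like` only approximates the standard analyzer (the real one follows Unicode word-boundary rules and keeps some inner punctuation). On a running cluster, the `_analyze` API will show the actual tokens.

```python
import re


def standard_like(text):
    # Approximation of the standard analyzer (an assumption, not the real
    # implementation): keep runs of letters/digits and lowercase them.
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


def whitespace(text):
    # Whitespace analyzer: split on whitespace only, keep case and symbols.
    return text.split()


def keyword(text):
    # not_analyzed / keyword behaviour: the whole input is a single term.
    return [text]


for sample in ("pub/sub", "x++", "A new bonsai pub sub tree in the office x"):
    print(repr(sample))
    print("  standard-like:", standard_like(sample))
    print("  whitespace:   ", whitespace(sample))
    print("  keyword:      ", keyword(sample))
```

Under these assumptions, the standard-like tokenizer reduces "pub/sub" to the tokens pub and sub and "x++" to x, which is why both queries matched the first document, while the keyword case leaves the whole sentence as one term that neither query can equal.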

If you also want the tokens to be lowercased, then you need to create a custom analyzer:

curl -XPUT 'localhost:9200/my-index?pretty' -H 'Content-Type: application/json' -d'{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "doctype": {
      "properties": {
        "message": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    },
    "queries": {
      "properties": {
        "query": {
          "type": "percolator"
        }
      }
    }
  }
}'

Then this will only match the pub/sub query:

curl -XGET 'localhost:9200/my-index/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query" : {
        "percolate" : {
            "field" : "query",
            "document_type" : "doctype",
            "document" : {
                "message" : "A new bonsai pub/sub tree in the office x"
            }
        }
    }
}
'
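
As a sanity check on that result, here is a minimal Python simulation of the percolation, assuming the custom analyzer behaves as a whitespace split followed by lowercasing. The names `analyze` and `phrase_matches` are made up for this sketch, and the phrase check is a simplified stand-in for match_phrase (consecutive tokens in order, no slop), not Elasticsearch's own logic.

```python
def analyze(text):
    # Mirrors my_analyzer under the stated assumption:
    # whitespace tokenizer followed by a lowercase filter.
    return [tok.lower() for tok in text.split()]


def phrase_matches(query_text, doc_text):
    # Simplified match_phrase: the query tokens must appear in the
    # document as one consecutive run, in the same order.
    q = analyze(query_text)
    d = analyze(doc_text)
    n = len(q)
    return n > 0 and any(d[i:i + n] == q for i in range(len(d) - n + 1))


# The two registered percolator queries from the thread, keyed by id.
queries = {"1": "x++", "2": "pub/sub"}
doc = "A new bonsai pub/sub tree in the office x"

matched = sorted(qid for qid, text in queries.items() if phrase_matches(text, doc))
print(matched)  # ['2']
```

Only query 2 matches because "pub/sub" survives as one token on both sides, whereas the document token x is not equal to the query token x++.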

(Sarath R Nair) #3

Thank you so much, Val Crettaz. Amazing and very precise explanation.


(Val Crettaz) #4

Awesome, glad it helped :wink:

