query_string bug in Elasticsearch 0.90.3: is it really a bug?


(dark_shadow) #1

I started using the explain API with query_string and, in the process, I think I found
a bug (I don't know whether it really is a bug or the intended behaviour of
query_string). This is going to be a long post, so please bear with me.

I'm using the doc: {name:"new delhi to goa", st:"goa"}
Running the index analyzer through the analyze API gives these tokens:

{
"tokens" : [ {
"token" : "new",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
}, {
"token" : "new",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
}, {
"token" : "new ",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
}, {
"token" : "new d",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
}, {
"token" : "new de",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
}, {
"token" : "new del",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
}, {
"token" : "new delh",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
}, {
"token" : "new",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new ",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new d",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new de",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new del",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new delh",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi ",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi t",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi to",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new ",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new d",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new de",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new del",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new delh",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi ",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi t",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi to",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi to ",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi to g",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi to go",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi to goa",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "del",
"start_offset" : 4,
"end_offset" : 9,
"type" : "word",
"position" : 2
}, {
"token" : "delh",
"start_offset" : 4,
"end_offset" : 9,
"type" : "word",
"position" : 2
}, {
"token" : "delhi",
"start_offset" : 4,
"end_offset" : 9,
"type" : "word",
"position" : 2
}, {
"token" : "del",
"start_offset" : 4,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "delh",
"start_offset" : 4,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "delhi",
"start_offset" : 4,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "delhi ",
"start_offset" : 4,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "delhi t",
"start_offset" : 4,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "delhi to",
"start_offset" : 4,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "del",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "delh",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "delhi",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "delhi ",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "delhi t",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "delhi to",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "delhi to ",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "delhi to g",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "delhi to go",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "delhi to goa",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "to ",
"start_offset" : 10,
"end_offset" : 16,
"type" : "word",
"position" : 3
}, {
"token" : "to g",
"start_offset" : 10,
"end_offset" : 16,
"type" : "word",
"position" : 3
}, {
"token" : "to go",
"start_offset" : 10,
"end_offset" : 16,
"type" : "word",
"position" : 3
}, {
"token" : "to goa",
"start_offset" : 10,
"end_offset" : 16,
"type" : "word",
"position" : 3
}, {
"token" : "goa",
"start_offset" : 13,
"end_offset" : 16,
"type" : "word",
"position" : 4
} ]
}
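
The analyzer definition isn't included in the post, but as a rough sketch (all names and parameter values hypothetical, inferred only from the token shapes above), a shingle filter followed by an edge-ngram filter could produce whitespace-spanning prefix tokens like "new delhi to g":

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingles": {
          "type": "shingle",
          "max_shingle_size": 4,
          "output_unigrams": true
        },
        "my_edge_ngrams": {
          "type": "edgeNGram",
          "min_gram": 3,
          "max_gram": 20
        }
      },
      "analyzer": {
        "shingle_prefix": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "my_shingles", "my_edge_ngrams"]
        }
      }
    }
  }
}
```

The shingle filter emits multi-word tokens ("new delhi", "new delhi to", ...), and the edge-ngram filter then emits every prefix of each shingle, which matches the positions and offsets in the dump above.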

Now, if I query for "delhi to goa", the search_analyzer gives me this:

{
"tokens" : [ {
"token" : "del",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 1
}, {
"token" : "delh",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 1
}, {
"token" : "delhi",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 1
}, {
"token" : "del",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "delh",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "delhi",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "delhi ",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "delhi t",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "delhi to",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "del",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "delh",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "delhi",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "delhi ",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "delhi t",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "delhi to",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "delhi to ",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "delhi to g",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "delhi to go",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "delhi to goa",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "to ",
"start_offset" : 6,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "to g",
"start_offset" : 6,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "to go",
"start_offset" : 6,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "to goa",
"start_offset" : 6,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "goa",
"start_offset" : 9,
"end_offset" : 12,
"type" : "word",
"position" : 3
} ]
}

The explain API gives me the following:

{text=new delhi to goa,boostFactor=9.820192307,po=9.82}
510.39673 = custom score, product of:
  510.39673 = script score function: composed of:
    510.39673 = sum of:
      371.12375 = max of:
        371.12375 = sum of:
          104.61707 = weight(text:del in 1003990) [PerFieldSimilarity], result of:
            104.61707 = score(doc=1003990,freq=5.0 = termFreq=5.0), product of:
              0.43576795 = queryWeight, product of:
                5.368244 = idf(docFreq=53067, maxDocs=4187328)
                0.08117513 = queryNorm
              240.0752 = fieldWeight in 1003990, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.368244 = idf(docFreq=53067, maxDocs=4187328)
                20.0 = fieldNorm(doc=1003990)
          133.24011 = weight(text:delh in 1003990) [PerFieldSimilarity], result of:
            133.24011 = score(doc=1003990,freq=5.0 = termFreq=5.0), product of:
              0.49178073 = queryWeight, product of:
                6.058268 = idf(docFreq=26616, maxDocs=4187328)
                0.08117513 = queryNorm
              270.934 = fieldWeight in 1003990, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                6.058268 = idf(docFreq=26616, maxDocs=4187328)
                20.0 = fieldNorm(doc=1003990)
          133.26657 = weight(text:delhi in 1003990) [PerFieldSimilarity], result of:
            133.26657 = score(doc=1003990,freq=5.0 = termFreq=5.0), product of:
              0.49182954 = queryWeight, product of:
                6.0588694 = idf(docFreq=26600, maxDocs=4187328)
                0.08117513 = queryNorm
              270.96088 = fieldWeight in 1003990, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                6.0588694 = idf(docFreq=26600, maxDocs=4187328)
                20.0 = fieldNorm(doc=1003990)
      139.27298 = max of:
        139.27298 = weight(text:goa^20.0 in 1003990) [PerFieldSimilarity], result of:
          139.27298 = score(doc=1003990,freq=3.0 = termFreq=3.0), product of:
            0.5712808 = queryWeight, product of:
              20.0 = boost
              7.037633 = idf(docFreq=9995, maxDocs=4187328)
              0.004058757 = queryNorm
            243.79076 = fieldWeight in 1003990, product of:
              1.7320508 = tf(freq=3.0), with freq of:
                3.0 = termFreq=3.0
              7.037633 = idf(docFreq=9995, maxDocs=4187328)
              20.0 = fieldNorm(doc=1003990)
  1.0 = queryBoost

The explain output above shows scores for:
del
delh
delhi
goa

but not for the other tokens generated by my search analyzer. Why is that?

I have read that query_string uses a query parser based on Lucene's by
default. So my guess is that query_string applies a whitespace tokenizer
on top of the tokens generated by my search analyzer; am I correct? How can
I make query_string calculate a score for all of the tokens generated by the
search_analyzer? Please correct me if I am wrong.
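
For reference, the kind of query I am describing is roughly of this shape (the field name "text" is taken from the explain output; everything else is illustrative):

```json
{
  "query": {
    "query_string": {
      "query": "delhi to goa",
      "default_field": "text"
    }
  }
}
```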

There is one more thing I noticed.
I'm using a query-time boost on one of my doc fields, but it is not working
the way I thought it would. In the explain output above you can see there is
a boost associated with goa but not with delhi, even though both goa and delhi
are present in the original doc. My guess is that
query_string applies the boost only to terms that come through from the
user-typed string without being changed by any analyzer, because in the above
example goa is kept as-is while delhi is analyzed. Am I correct?

Looking forward to a reply!

Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/10dd24df-fe87-430d-8433-73df1acb1d0c%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Adrien Grand) #2

Hi,

Indeed, query_string splits on whitespace before applying the analyzer.
You could try the match query [1], which doesn't have this flaw, or the new
simple_query_string query [2], which has the ability to disable the whitespace
operator (just provide a list of flags that doesn't contain WHITESPACE).

However, I didn't understand your boosting issue. What query did you send to
Elasticsearch?

[1]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
[2]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html#_simple_query_string_syntax
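
As a rough sketch (field name taken from the explain output in the thread, everything else illustrative), the two alternatives could look like:

```json
{
  "query": {
    "match": {
      "text": "delhi to goa"
    }
  }
}
```

and, with simple_query_string, an explicit flags list that omits WHITESPACE so the whole input reaches the analyzer:

```json
{
  "query": {
    "simple_query_string": {
      "query": "delhi to goa",
      "fields": ["text"],
      "flags": "AND|OR|NOT|PREFIX|PHRASE"
    }
  }
}
```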

On Wed, Feb 5, 2014 at 4:47 AM, coder <mukulnitkkr@gmail.com> wrote:


--
Adrien Grand



(dark_shadow) #3

Hi Adrien,

Great to hear from you!

You wrote in your post that query_string splits on whitespace before
applying the analyzer. Are you sure the split happens before the analyzer?
If it worked that way, I wouldn't be seeing the issue above: the results show
that the analyzer works the way I expect, yet query_string is not taking the
tokens that contain spaces into consideration. Why is that? Or is it that
after the analyzer hands its tokens to the query parser, they get split again
on whitespace, so the effect of those tokens never shows up in my
explain output?

I tried the multi_match query, but then all the terms need to be present
in at least one field, and in my scenario I have multiple fields and a user
can search with terms that exist in different fields. That way my
multi_match will fail. Is there any way I can get the functionality of
query_string? Also, the spelling correction I had with query_string is gone:
if I mistype a single character, multi_match won't find the existing doc.

Is there any way I can tackle these issues?
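
For the spelling issue specifically, a sketch of what I have been trying (field names guessed from the thread; the fuzziness value is illustrative):

```json
{
  "query": {
    "multi_match": {
      "query": "delhi to goa",
      "fields": ["text", "st"],
      "fuzziness": 1
    }
  }
}
```

The idea would be that fuzziness on multi_match tolerates a one-character typo per term, but it still requires the terms to match within the listed fields.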

Thanks

On Wed, Feb 5, 2014 at 9:26 PM, Adrien Grand <adrien.grand@elasticsearch.com> wrote:

Hi,

Indeed, query_string splits on whitespaces before applying the analyzer.
You could try the match query[1] which doesn't have this flaw or the new
simple_query_parser[2] which has the ability to disable the whitespace
operator (just provide a list of flags that doesn't contain WHITESPACE).

However I didn't understand your boosting issue, what query did you send
to Elasticsearch?

[1]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
[2]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html#_simple_query_string_syntax

On Wed, Feb 5, 2014 at 4:47 AM, coder mukulnitkkr@gmail.com wrote:

I started using explain api for query_string but I guess in process I
found a bug (don't know if it really is a bug or intended behaviour of
query_string). This is going to be a long post, please be patient with me.

I'm using a doc:{name:"new delhi to goa",st:"goa"}
On using analyzer api for indexing I got these tokens:

{
"tokens" : [ {
"token" : "new",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
}, {
"token" : "new",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
}, {
"token" : "new ",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
}, {
"token" : "new d",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
}, {
"token" : "new de",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
}, {
"token" : "new del",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
}, {
"token" : "new delh",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
}, {
"token" : "new",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new ",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new d",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new de",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new del",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new delh",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi ",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi t",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi to",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "new",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new ",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new d",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new de",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new del",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new delh",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi ",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi t",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi to",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi to ",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi to g",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi to go",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "new delhi to goa",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "del",
"start_offset" : 4,
"end_offset" : 9,
"type" : "word",
"position" : 2
}, {
"token" : "delh",
"start_offset" : 4,
"end_offset" : 9,
"type" : "word",
"position" : 2
}, {
"token" : "delhi",
"start_offset" : 4,
"end_offset" : 9,
"type" : "word",
"position" : 2
}, {
"token" : "del",
"start_offset" : 4,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "delh",
"start_offset" : 4,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "delhi",
"start_offset" : 4,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "delhi ",
"start_offset" : 4,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "delhi t",
"start_offset" : 4,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "delhi to",
"start_offset" : 4,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "del",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "delh",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "delhi",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "delhi ",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "delhi t",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "delhi to",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "delhi to ",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "delhi to g",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "delhi to go",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "delhi to goa",
"start_offset" : 4,
"end_offset" : 16,
"type" : "word",
"position" : 2
}, {
"token" : "to ",
"start_offset" : 10,
"end_offset" : 16,
"type" : "word",
"position" : 3
}, {
"token" : "to g",
"start_offset" : 10,
"end_offset" : 16,
"type" : "word",
"position" : 3
}, {
"token" : "to go",
"start_offset" : 10,
"end_offset" : 16,
"type" : "word",
"position" : 3
}, {
"token" : "to goa",
"start_offset" : 10,
"end_offset" : 16,
"type" : "word",
"position" : 3
}, {
"token" : "goa",
"start_offset" : 13,
"end_offset" : 16,
"type" : "word",
"position" : 4
} ]
}

Now, if I query for "delhi to goa", I get this from the search_analyzer:

{
"tokens" : [ {
"token" : "del",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 1
}, {
"token" : "delh",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 1
}, {
"token" : "delhi",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 1
}, {
"token" : "del",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "delh",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "delhi",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "delhi ",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "delhi t",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "delhi to",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "del",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "delh",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "delhi",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "delhi ",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "delhi t",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "delhi to",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "delhi to ",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "delhi to g",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "delhi to go",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "delhi to goa",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "to ",
"start_offset" : 6,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "to g",
"start_offset" : 6,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "to go",
"start_offset" : 6,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "to goa",
"start_offset" : 6,
"end_offset" : 12,
"type" : "word",
"position" : 2
}, {
"token" : "goa",
"start_offset" : 9,
"end_offset" : 12,
"type" : "word",
"position" : 3
} ]
}

Using the explain API, I get the following:

{text=new delhi to goa,boostFactor=9.820192307,po=9.82}
510.39673 = custom score, product of:
  510.39673 = script score function: composed of:
    510.39673 = sum of:
      371.12375 = max of:
        371.12375 = sum of:
          104.61707 = weight(text:del in 1003990) [PerFieldSimilarity], result of:
            104.61707 = score(doc=1003990,freq=5.0 = termFreq=5.0), product of:
              0.43576795 = queryWeight, product of:
                5.368244 = idf(docFreq=53067, maxDocs=4187328)
                0.08117513 = queryNorm
              240.0752 = fieldWeight in 1003990, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.368244 = idf(docFreq=53067, maxDocs=4187328)
                20.0 = fieldNorm(doc=1003990)
          133.24011 = weight(text:delh in 1003990) [PerFieldSimilarity], result of:
            133.24011 = score(doc=1003990,freq=5.0 = termFreq=5.0), product of:
              0.49178073 = queryWeight, product of:
                6.058268 = idf(docFreq=26616, maxDocs=4187328)
                0.08117513 = queryNorm
              270.934 = fieldWeight in 1003990, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                6.058268 = idf(docFreq=26616, maxDocs=4187328)
                20.0 = fieldNorm(doc=1003990)
          133.26657 = weight(text:delhi in 1003990) [PerFieldSimilarity], result of:
            133.26657 = score(doc=1003990,freq=5.0 = termFreq=5.0), product of:
              0.49182954 = queryWeight, product of:
                6.0588694 = idf(docFreq=26600, maxDocs=4187328)
                0.08117513 = queryNorm
              270.96088 = fieldWeight in 1003990, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                6.0588694 = idf(docFreq=26600, maxDocs=4187328)
                20.0 = fieldNorm(doc=1003990)
      139.27298 = max of:
        139.27298 = weight(text:goa^20.0 in 1003990) [PerFieldSimilarity], result of:
          139.27298 = score(doc=1003990,freq=3.0 = termFreq=3.0), product of:
            0.5712808 = queryWeight, product of:
              20.0 = boost
              7.037633 = idf(docFreq=9995, maxDocs=4187328)
              0.004058757 = queryNorm
            243.79076 = fieldWeight in 1003990, product of:
              1.7320508 = tf(freq=3.0), with freq of:
                3.0 = termFreq=3.0
              7.037633 = idf(docFreq=9995, maxDocs=4187328)
              20.0 = fieldNorm(doc=1003990)
  1.0 = queryBoost

The explain output above shows scores for:
del
delh
delhi
goa

but I am not getting scores for the other tokens that my search analyzer generated. Why is that?

I have read that query_string by default uses a query parser based on Lucene's. So my guess is that query_string applies a whitespace tokenizer after my tokens are generated by the search analyzer, am I correct? How can I make query_string calculate a score for all the tokens generated by the search_analyzer? Please correct me if I am wrong.

There is one more thing I noticed.
I'm using a query-time boost on one of my doc fields, but it is not working the way I thought it would. In the explain output above you can see there is a boost associated with goa but not with delhi, though both goa and delhi are present in the original doc. My guess is that query_string applies the boost only to terms of the user-typed string that are not changed by any analyzer: in the above example, goa is kept as-is, but delhi gets analyzed. Am I correct?

Awaiting a reply!

Thanks

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/10dd24df-fe87-430d-8433-73df1acb1d0c%40googlegroups.com
.
For more options, visit https://groups.google.com/groups/opt_out.



(dark_shadow) #4

Adrien,

Regarding the boosting issue:
I have a field "text" and I'm using a query-time boost like
fields=["text^30"].
Assume I have a doc like {text:"new delhi to goa"}. Now if I query for
"delhi to goa", only the score for the term goa is boosted, like goa^30 (as
you can see in the explain output above), but what I expected is that it
should also boost delhi, like "delhi^30", which is not happening here. Is
it that goa is not analyzed, so it is treated as a term, while delhi,
since it is changed by the analyzer, is not?
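Roughly, the request looks like this (just a sketch of its shape, not the exact query I ran):

```json
{
  "query": {
    "query_string": {
      "query": "delhi to goa",
      "fields": ["text^30"]
    }
  }
}
```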

Thanks

On Wed, Feb 5, 2014 at 9:26 PM, Adrien Grand <adrien.grand@elasticsearch.com> wrote:

Hi,

Indeed, query_string splits on whitespace before applying the analyzer.
You could try the match query [1], which doesn't have this flaw, or the new
simple_query_string query [2], which has the ability to disable the
whitespace operator (just provide a list of flags that doesn't contain
WHITESPACE).
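For example, sketches of both alternatives (the field name is taken from your example; the flag list is just one combination that leaves out WHITESPACE):

```json
{
  "query": {
    "match": {
      "text": "delhi to goa"
    }
  }
}
```

```json
{
  "query": {
    "simple_query_string": {
      "query": "delhi to goa",
      "fields": ["text"],
      "flags": "AND|OR|PREFIX"
    }
  }
}
```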

However, I didn't understand your boosting issue; what query did you send
to Elasticsearch?

[1]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
[2]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html#_simple_query_string_syntax



(Adrien Grand) #5

On Wed, Feb 5, 2014 at 6:01 PM, Mukul Gupta mukulnitkkr@gmail.com wrote:

Actually, you have written in your post that query_string splits on
whitespace before applying the analyzer. Are you sure it applies the split
before the analyzer? Because if it worked that way, then I wouldn't be
getting the above issue. The results above show that the analyzer works the
way I expect it to, but somehow query_string is not taking the tokens that
contain spaces into consideration. Why is that? Is it that after the
analyzer gives the tokens to the query parser, it splits them again on
whitespace, so that I won't see the effect of those tokens in my explain
output?

This is precisely because query_string splits on whitespace. :slight_smile: For
example, if the query is "dehli to goa", query_string would split on
whitespace: ["dehli", "to", "goa"] and then apply the analyzer to each
individual word: [["deh", "dehl", "dehli"], [], ["goa"]] before using
these tokens to generate a query.
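As a rough illustration (this is not Elasticsearch code; the edge-ngram parameters and the stopword handling are assumptions based on the token listings above), the split-then-analyze pipeline behaves like:

```python
def edge_ngrams(word, min_gram=3):
    """Edge n-grams of a single word, e.g. 'dehli' -> ['deh', 'dehl', 'dehli']."""
    return [word[:i] for i in range(min_gram, len(word) + 1)]

def query_string_analyze(query, stopwords=frozenset({"to"})):
    # query_string splits on whitespace FIRST, so the analyzer only ever
    # sees single words -- multi-word tokens like "dehli to" can never
    # be produced at query time.
    return [edge_ngrams(w) if w not in stopwords else [] for w in query.split()]

print(query_string_analyze("dehli to goa"))
# -> [['deh', 'dehl', 'dehli'], [], ['goa']]
```

This is why the multi-word tokens seen from the analyze API (which receives the whole string) never show up in the explain output.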

I tried using the multi_match query, but then all the terms need to be
present in at least one field, and in my scenario I have multiple fields
and the user can search using terms which exist in different fields. That
way my multi_match will fail. Is there any way I can get the functionality
of query_string? Also, the spelling corrector is gone with query_string.
If by mistake I type a single wrong character, multi_match won't find the
existing doc.

Why would multi_match fail, and what do you mean by the spelling corrector?

--
Adrien Grand



(Adrien Grand) #6

On Wed, Feb 5, 2014 at 6:21 PM, Mukul Gupta mukulnitkkr@gmail.com wrote:


Can you copy here the exact query that you ran?

--
Adrien Grand



(system) #7