Newbie, basic search help


(MArk Williams) #1

Hi, I am new to ES and overwhelmed by all the options and syntax.
I (currently) only have 2 fields, company name and company number. I want
to search company names.
I have a really simple search to do and cannot get the order I want (or
would expect).
I set up like:-
curl -XPOST 'http://localhost:9200/playcompany'
then run a bunch of:-
curl -XPOST 'http://localhost:9200/playcompany/companies/' -d '{ "number" : "06026916", "name" : "FRASERS VENTURES LIMITED" }'
curl -XPOST 'http://localhost:9200/playcompany/companies/' -d '{ "number" : "01366799", "name" : "SUPPORT SERVICES LIMITED" }'
curl -XPOST 'http://localhost:9200/playcompany/companies/' -d '{ "number" : "01349558", "name" : "MCGINLEY SUPPORT SERVICES LIMITED" }'
curl -XPOST 'http://localhost:9200/playcompany/companies/' -d '{ "number" : "01409241", "name" : "SUPPORT SERVICES (FILMS) LIMITED" }'
curl -XPOST 'http://localhost:9200/playcompany/companies/' -d '{ "number" : "01470672", "name" : "A.C.S. (CONSULTANCY AND SUPPORT SERVICES) LIMITED" }'
curl -XPOST 'http://localhost:9200/playcompany/companies/' -d '{ "number" : "01475234", "name" : "GENERAL SUPPORT AND HANDLING SERVICES LIMITED" }'
curl -XPOST 'http://localhost:9200/playcompany/companies/' -d '{ "number" : "02795677", "name" : "SUPPORT SERVICES LIMITED" }'
etc
to load up only 127 for testing. Mapping shows:-

curl -XGET 'http://localhost:9200/playcompany/companies/_mapping?pretty'
{
  "companies" : {
    "properties" : {
      "name" : {
        "type" : "string"
      },
      "number" : {
        "type" : "string"
      }
    }
  }
}

ALL I need is sensible matches, matches that humans would expect, in an
order that humans would expect.
When I do a simple:-
curl -XGET 'http://localhost:9200/playcompany/companies/_search?q=name:support%20services%20limited&pretty=true'
I would hope to get "SUPPORT SERVICES LIMITED" as the first hit, followed
by other 'relevant' results, in some sort of explainable order.
'Relevant' to me (or human searchers) means that the more words that
match, the more relevant. So a name where 3 out of 3 words match should be
at the top, 3 out of 4 is also pretty relevant, and 3 words matched out of
6, for example, is deemed less relevant.
I would also like to match plurals and common endings, so I would
like a search for "SUPPORT SERVICES LIMITED" to also match "SUPPORT SERVICE
LIMITED" (singular), but not as high as the exact match "SUPPORT SERVICES
LIMITED".
Hope this makes sense. With only 10 or so names loaded, I get mostly what I
want, but as soon as I load up over 100, the order goes out the window and
the _score values are all the same after the first 2 or 3 matches.

How do I do this?
Sorry this is such a long post, it is my first one and I am not sure where
to start.
Any help (more importantly examples that I can copy and paste to try out)
would be invaluable.
Thanks for your time, appreciated.
MArk Williams

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Luca Cavanna) #2

Hi,
elasticsearch uses the TF-IDF similarity by default. To keep it simple, the
score depends on the following main factors:

  • term frequency: how many times the term you are searching for occurs in
    the document (same field), the more the better
  • inverse document frequency: how many documents contain the term you are
    searching for throughout the whole index (same field), the fewer the better
    (rare terms get rewarded)
  • length of the field, the shorter the better

It should now be clearer why the score changes if the index changes. Just
to mention an example, it can happen that one of the terms is rare and has
more importance than another term that appears more times in the same
document (higher term frequency)...

You can definitely tune and influence the way scoring works, but the first
thing you need to do is define a mapping upfront for your index
instead of using the default one, which by the way applies stopwords but
not stemming. If you want the behaviour you described with plurals, you
need stemming, so you have to create a custom analyzer in your
index settings and refer to it in your mapping.
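
A minimal sketch of what such an index definition could look like, assuming the 0.90-era API used elsewhere in this thread (the "string" field type and a mapping type called "companies"). The analyzer name "company_name" is made up for illustration. It chains the standard tokenizer, lowercasing and the snowball stemming filter, and leaves out the stop filter so no stopwords are removed:

```shell
# Start from a clean index (this deletes the test data loaded so far).
curl -XDELETE 'http://localhost:9200/playcompany'

# "company_name" is a hypothetical analyzer name, not something built in.
# Filters run in order: lowercase first, then Snowball stemming.
curl -XPUT 'http://localhost:9200/playcompany' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "company_name": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "companies": {
      "properties": {
        "name":   { "type": "string", "analyzer": "company_name" },
        "number": { "type": "string" }
      }
    }
  }
}'
```

The documents have to be indexed again afterwards, because the analyzer is applied when a document is indexed.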

You can also check how your documents are being indexed through the analyze
API, and check the reason behind a certain score using the explain=true
parameter when running searches (do this only to debug queries, it slows
down searches!).
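
Both checks can be tried from the command line. This is only a sketch, assuming the index and type names from the first post and an elasticsearch version of the same era as this thread; "standard" is the analyzer the default mapping falls back to:

```shell
# Show the tokens the standard analyzer produces for a company name.
curl -XGET 'http://localhost:9200/playcompany/_analyze?analyzer=standard&pretty' -d 'SUPPORT SERVICES LIMITED'

# Run the search from the first post with a score breakdown attached to every hit.
curl -XGET 'http://localhost:9200/playcompany/companies/_search?q=name:support%20services%20limited&explain=true&pretty=true'
```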

Also, if I were you I would look at the match query and use that one
instead of the query_string, which is more error-prone because its input
has to be parsed and its syntax lets a single query do many different things.
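
As a rough illustration, the search from the first post could be written as a match query like this. The index, type and field names come from that post; the "operator" setting is an optional assumption that makes all three words required instead of any one of them:

```shell
# Match query on the "name" field; the query text is analyzed but not parsed for syntax.
curl -XPOST 'http://localhost:9200/playcompany/companies/_search?pretty' -d '{
  "query": {
    "match": {
      "name": {
        "query": "support services limited",
        "operator": "and"
      }
    }
  }
}'
```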

After that, if the scoring is still not satisfying and you can identify
custom requirements, like a certain set of documents that need to be
boosted etc., you can use proper queries to achieve that (e.g. the custom
score query or the custom filters score query).

Hope this helps
Luca


(Ivan Brusic) #3

Besides the excellent write-up that Luca provided, elasticsearch's
distributed nature comes into play. The TF and IDF values are shard-based,
not index-based, so ordering can be skewed when you have a small number of
documents and more than one shard. If you are testing on a small number of
documents, try with only one shard. You can also change your search type to
use distributed frequencies using the dfs_query_then_fetch type.

http://www.elasticsearch.org/guide/reference/api/search/search-type/
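
Either option can be tried roughly as follows. This is a sketch that assumes the index and type names from the first post; the first option deletes the test index, so the documents need to be loaded again afterwards:

```shell
# Option 1: recreate the test index with a single shard, so term statistics
# are computed over all documents at once.
curl -XDELETE 'http://localhost:9200/playcompany'
curl -XPUT 'http://localhost:9200/playcompany' -d '{
  "settings": { "number_of_shards": 1 }
}'

# Option 2: keep the shards as they are and ask for distributed term
# frequencies to be gathered before scoring.
curl -XGET 'http://localhost:9200/playcompany/companies/_search?q=name:support%20services%20limited&search_type=dfs_query_then_fetch&pretty=true'
```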

Cheers,

Ivan


(MArk Williams) #4

Thanks for that info, I see what that would mean.
On our production system, we will have around 8 million entries on 3 to 5
shards.
Small differences in the order will be acceptable, but the top 5 or 10 must
contain what we expect.
We cannot test until the new hardware arrives, but will keep what you
have said in mind.
Thanks again.


(MArk Williams) #5

Thanks so much for your reply.
My search result order makes more sense after reading your reply and I
actually understood what you were saying! Great plain talking, much
appreciated.
I have applied a mapping upfront. Snowball is the one for us; like you said,
we really want stemming.
If we wanted to use snowball BUT turn off stopwords, is that possible?
I now have:-
curl -XGET 'http://localhost:9200/playcompany/companies/_mapping?pretty'

{
  "companies" : {
    "properties" : {
      "name" : {
        "type" : "string",
        "analyzer" : "snowball"
      },
      "number" : {
        "type" : "string"
      }
    }
  }
}
and I am now finding what I want, including plurals etc. Great.
BUT the order is now the problem.
I have just started to use your suggestion of explain=true and will start
to examine the scores.
Hopefully that will give me some clues. I will post the results.
Again, thanks so much for your help.
MArk


(MArk Williams) #6

Taking advice from Luca and Ivan...
Using the snowball analyzer I am getting the matches I want, but not the order.
I have used explain, but I do not understand the results :cry:
When I run:-
curl -XPOST 'http://localhost:9200/playcompany/companies/_search?pretty&explain=true' -d '{ "query": { "match": { "name" : { "query": "support services limited", "type" : "phrase" } } } }'
I get 17 results. The ones in 3rd and 6th place are the ones I want in 1st
and 2nd place. Why does the first match score higher than the 3rd and 6th?
Here are the explain sections for 1st, 3rd and 6th, if anyone can explain
what they mean and how to increase the scores of 3rd and 6th place.
Thanks so much.
1st is:-
"_score" : 2.5553496,
"_source" : { "number" : "02733484", "name" : "COUNSELLING SUPPORT SERVICES LIMITED" },
"_explanation" : {
  "value" : 2.5553496,
  "description" : "weight(name:\"support servic limit\" in 0) [PerFieldSimilarity], result of:",
  "details" : [ {
    "value" : 2.5553496,
    "description" : "fieldWeight in 0, product of:",
    "details" : [ {
      "value" : 1.0,
      "description" : "tf(freq=1.0), with freq of:",
      "details" : [ {
        "value" : 1.0,
        "description" : "phraseFreq=1.0"
      } ]
    }, {
      "value" : 5.110699,
      "description" : "idf(), sum of:",
      "details" : [ {
        "value" : 2.9459102,
        "description" : "idf(docFreq=2, maxDocs=21)"
      }, {
        "value" : 0.95348,
        "description" : "idf(docFreq=21, maxDocs=21)"
      }, {
        "value" : 1.2113091,
        "description" : "idf(docFreq=16, maxDocs=21)"
      } ]
    }, {
      "value" : 0.5,
      "description" : "fieldNorm(doc=0)"
    } ]
  } ]

3rd is:-
"_score" : 2.2324283,
"_source" : { "number" : "01366799", "name" : "SUPPORT SERVICES LIMITED" },
"_explanation" : {
  "value" : 2.2324283,
  "description" : "weight(name:\"support servic limit\" in 0) [PerFieldSimilarity], result of:",
  "details" : [ {
    "value" : 2.2324283,
    "description" : "fieldWeight in 0, product of:",
    "details" : [ {
      "value" : 1.0,
      "description" : "tf(freq=1.0), with freq of:",
      "details" : [ {
        "value" : 1.0,
        "description" : "phraseFreq=1.0"
      } ]
    }, {
      "value" : 4.4648566,
      "description" : "idf(), sum of:",
      "details" : [ {
        "value" : 2.321756,
        "description" : "idf(docFreq=7, maxDocs=30)"
      }, {
        "value" : 1.0,
        "description" : "idf(docFreq=29, maxDocs=30)"
      }, {
        "value" : 1.1431009,
        "description" : "idf(docFreq=25, maxDocs=30)"
      } ]
    }, {
      "value" : 0.5,
      "description" : "fieldNorm(doc=0)"
    } ]
  } ]
and 6th is:-
"_score" : 2.1750708,
"_source" : { "number" : "01234567", "name" : "SUPPORT SERVICE LIMITED" },
"_explanation" : {
  "value" : 2.1750708,
  "description" : "weight(name:\"support servic limit\" in 0) [PerFieldSimilarity], result of:",
  "details" : [ {
    "value" : 2.1750708,
    "description" : "fieldWeight in 0, product of:",
    "details" : [ {
      "value" : 1.0,
      "description" : "tf(freq=1.0), with freq of:",
      "details" : [ {
        "value" : 1.0,
        "description" : "phraseFreq=1.0"
      } ]
    }, {
      "value" : 4.3501415,
      "description" : "idf(), sum of:",
      "details" : [ {
        "value" : 2.299283,
        "description" : "idf(docFreq=5, maxDocs=22)"
      }, {
        "value" : 0.9555482,
        "description" : "idf(docFreq=22, maxDocs=22)"
      }, {
        "value" : 1.0953102,
        "description" : "idf(docFreq=19, maxDocs=22)"
      } ]
    }, {
      "value" : 0.5,
      "description" : "fieldNorm(doc=0)"
    } ]
  } ]
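
One way to read these three explanations is to redo the arithmetic by hand. The sketch below assumes Lucene's classic TF-IDF formula, idf = 1 + ln(maxDocs / (docFreq + 1)), and multiplies tf (1.0 for one phrase match), the summed idf of the three stemmed terms, and the fieldNorm. The helper name phrase_score is made up for illustration. The inputs are copied from the explain output: maxDocs is 21, 30 and 22 for the three hits, which suggests each hit was scored against a different shard's statistics, as Ivan described, and fieldNorm is 0.5 for all three, so field length does not separate them here.

```shell
# phrase_score is a hypothetical helper, not part of elasticsearch.
# Arguments: maxDocs, docFreq of each of the three terms, fieldNorm.
phrase_score() {
  awk -v n="$1" -v a="$2" -v b="$3" -v c="$4" -v norm="$5" 'BEGIN {
    # Classic Lucene idf per term: 1 + ln(maxDocs / (docFreq + 1)); log() is the natural log.
    idf = (1 + log(n / (a + 1))) + (1 + log(n / (b + 1))) + (1 + log(n / (c + 1)))
    tf = 1.0   # phraseFreq=1.0 in all three explanations
    printf "%.4f\n", tf * idf * norm
  }'
}

phrase_score 21 2 21 16 0.5   # COUNSELLING SUPPORT SERVICES LIMITED, prints 2.5553
phrase_score 30 7 29 25 0.5   # SUPPORT SERVICES LIMITED, prints 2.2324
phrase_score 22 5 22 19 0.5   # SUPPORT SERVICE LIMITED, prints 2.1751
```

The printed values agree with the reported _score values to four decimal places. The only inputs that differ between the hits are docFreq and maxDocs, so the ordering here appears to come from per-shard term statistics and not from how well each name matches the phrase.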


--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Luca Cavanna) #7

Hi Mark,
indeed in my explanation I didn't mention sharding, which is what Ivan
kindly added. In fact that's the problem you are hitting right now :)

As you can see from the explain output, the inverse document frequency
(idf) of each term is different for each of those three docs, which means
they come from different shards. Inverse document frequency measures how
rare your search terms are throughout the index, but here I mean the Lucene
index, which is the Elasticsearch shard. Each shard computes idf from its
own documents only, so scores from different shards are hard to compare,
but these differences tend to even out as soon as you have enough data.
These are the options you have to improve the current situation:

  • add more data to have better scoring
  • index into a single shard (but given the number of docs you are going to
    index this doesn't seem like the solution for you)
  • use the dfs_query_then_fetch search type for better scoring, at the cost
    of slightly worse performance. Elasticsearch will collect term
    statistics from each shard first, in order to compute a more accurate
    score for the documents. Have a look here to know
    more: http://www.elasticsearch.org/guide/reference/api/search/search-type/ .
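
The idf figures in the explain output can be reproduced by hand, which makes the shard effect easier to see. The sketch below assumes the classic Lucene TF-IDF formula idf = 1 + ln(maxDocs / (docFreq + 1)); that formula is not stated in this thread, but it matches the posted numbers. The (docFreq, maxDocs) pairs are copied from the three explain sections. Pooling just those three shards is only an illustration of what dfs_query_then_fetch is meant to achieve, not what Elasticsearch literally computes.

```python
import math


def idf(doc_freq, max_docs):
    """Classic Lucene TF-IDF idf (assumed formula, see the note above)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# (docFreq, maxDocs) for the three query terms, in the order each
# explain section lists them. Each hit came from a different shard.
shard_stats = {
    "hit 1 (COUNSELLING SUPPORT SERVICES LIMITED)": [(2, 21), (21, 21), (16, 21)],
    "hit 3 (SUPPORT SERVICES LIMITED)": [(7, 30), (29, 30), (25, 30)],
    "hit 6 (SUPPORT SERVICE LIMITED)": [(5, 22), (22, 22), (19, 22)],
}


def phrase_idf(stats):
    """A phrase query sums the idf of its terms ("idf(), sum of:")."""
    return sum(idf(doc_freq, max_docs) for doc_freq, max_docs in stats)


def pooled_stats(all_stats):
    """Add up docFreq and maxDocs per term across shards (illustration only)."""
    return [
        (sum(df for df, _ in term), sum(n for _, n in term))
        for term in zip(*all_stats)
    ]


for label, stats in shard_stats.items():
    print(f"{label}: idf sum = {phrase_idf(stats):.4f}")

pooled = pooled_stats(list(shard_stats.values()))
print(f"pooled over the three shards: idf sum = {phrase_idf(pooled):.4f}")
```

With per-shard statistics the same three terms get three different idf sums, which is why a four-word name on one shard can outscore the exact three-word name on another. With pooled statistics every hit would share one idf sum.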

Cheers
Luca



(MArk Williams) #8

Thanks for that.
I can now see that effect with an index with only 120 entries. I have now
loaded up 4 million entries.
Just as you say, the scoring sorts itself out with large numbers of
documents. I tried search_type=dfs_query_then_fetch, which made a massive
difference in my small dataset but little or no difference in my 4 million
entry dataset. Given the very small performance hit, we may still use
dfs_query_then_fetch anyway to ensure the order is always consistent.
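
For anyone who wants to try the same thing, here is a minimal sketch of how such a request could be put together. The index, type, and phrase query are the ones used earlier in this thread; the helper name build_search_request is made up for illustration, and the Content-Type header is an assumption that newer Elasticsearch versions expect. The request is only built here, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(
    text,
    base_url="http://localhost:9200/playcompany/companies/_search",
):
    """Build (but do not send) a phrase search using dfs_query_then_fetch."""
    params = urlencode({"search_type": "dfs_query_then_fetch", "pretty": "true"})
    # Same phrase query body as the curl example earlier in the thread.
    body = {"query": {"match": {"name": {"query": text, "type": "phrase"}}}}
    return Request(
        f"{base_url}?{params}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("support services limited")
print(request.full_url)
print(request.data.decode("utf-8"))
# To run it against a live node: urllib.request.urlopen(request)
```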
We now only have to work on the similarity score so that 'SUPPORT SERVICES
LIMITED' is at the top when that is the search term and not 'A.I.M.S
SUPPORT SERVICES LIMITED', (followed by 78 other hits) which is what we
have at the moment.
I think we need to configure similarities now, but I have no idea how, or
which one to choose.
I have read
http://www.elasticsearch.org/guide/reference/index-modules/similarity/ but
do not understand the differences between default similarity, BM25
similarity, DFR similarity and IB similarity. Which one (if any) would give
me the order I require?
Thanks for your help so far, very much appreciated as it has saved me a lot
of time.
MArk



(Luca Cavanna) #9

I don't think you need a custom similarity to achieve that. I would just
look at the explain output and try to understand the reason behind the
different scores that you get back. TF-IDF should work well for you.
Once you know the reason behind the current scoring, you can tune and
influence it to achieve what you want, I guess.
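
As a starting point for reading the explain output, each score posted earlier in the thread is the "fieldWeight" product tf × (sum of idf) × fieldNorm. The sketch below just redoes that arithmetic with the posted values; nothing in it talks to Elasticsearch, and the function name field_weight is made up for illustration.

```python
def field_weight(tf, idfs, field_norm):
    """fieldWeight as shown in the explain output: tf * sum(idf) * fieldNorm."""
    return tf * sum(idfs) * field_norm


# (name, tf, per-term idf values, fieldNorm, reported _score),
# copied from the explain sections for the 1st, 3rd and 6th hits.
hits = [
    ("COUNSELLING SUPPORT SERVICES LIMITED", 1.0,
     [2.9459102, 0.95348, 1.2113091], 0.5, 2.5553496),
    ("SUPPORT SERVICES LIMITED", 1.0,
     [2.321756, 1.0, 1.1431009], 0.5, 2.2324283),
    ("SUPPORT SERVICE LIMITED", 1.0,
     [2.299283, 0.9555482, 1.0953102], 0.5, 2.1750708),
]

for name, tf, idfs, norm, reported in hits:
    computed = field_weight(tf, idfs, norm)
    print(f"{name}: computed {computed:.6f}, reported {reported}")
```

Two things stand out. tf is 1.0 for all three hits, so the only factor that separates them is the idf sum, which depends on the shard. And fieldNorm is 0.5 for both the three-word and the four-word name, so field length is not separating them either. A length norm of 1/sqrt(number of terms) would give about 0.577 and 0.5, so the identical values suggest the stored norm is rounded quite coarsely; treat that reading as an assumption rather than something confirmed in this thread.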



(MArk Williams) #10

My problem is that I do not know how to 'tune and influence it' :(
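
To make the target concrete, here is a small sketch of the ordering described in the first post: the larger the share of a name's words that match the query, the higher it ranks, with an exact match ahead of a singular/plural variant. This is plain client-side Python, not an Elasticsearch feature or setting; the function names and the crude "strip a trailing s" stemmer are made up for illustration. It could serve as a reference to compare real search results against while tuning.

```python
import re


def tokens(name):
    """Lowercase a name and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", name.lower())


def crude_stem(token):
    """Very rough plural stripping, standing in for a real stemmer."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def rank_key(query, name):
    """Sort key: (share of the name's words found in the query, exact match)."""
    query_tokens = tokens(query)
    name_tokens = tokens(name)
    query_stems = {crude_stem(t) for t in query_tokens}
    matched = sum(1 for t in name_tokens if crude_stem(t) in query_stems)
    coverage = matched / len(name_tokens) if name_tokens else 0.0
    return (coverage, name_tokens == query_tokens)


def rerank(query, names):
    """Order names from most to least relevant under the rule above."""
    return sorted(names, key=lambda name: rank_key(query, name), reverse=True)


names = [
    "COUNSELLING SUPPORT SERVICES LIMITED",
    "GENERAL SUPPORT AND HANDLING SERVICES LIMITED",
    "SUPPORT SERVICE LIMITED",
    "SUPPORT SERVICES LIMITED",
]
for name in rerank("support services limited", names):
    print(name)
```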

On Monday, 9 September 2013 17:40:39 UTC+1, Luca Cavanna wrote:

I don't think you need a custom similarity to achieve that, I would just
look at the explain output and try to understand the reason behind the
different scores that you get back. TF-IDF should work well for you.
Knowing the reason behind the current scoring you can tune and influence it
to achieve what you want I guess.

On Monday, September 9, 2013 12:33:43 PM UTC+2, MArk Williams wrote:

Thanks for that.
I can now see that effect with an index with only 120 entries. I have now
loaded up 4 million entries.
Just as you say, the scoring is fixed for large numbers. I tried
the search_type=dfs_query_then_fetch which made a massive difference in my
small dataset, but made no (or very little) difference in my 4 million
dataset. For the very small performance hit, we may still use
the dfs_query_then_fetch anyway to ensure order is always consistent.
We now only have to work on the similarity score so that 'SUPPORT
SERVICES LIMITED' is at the top when that is the search term and not
'A.I.M.S SUPPORT SERVICES LIMITED', (followed by 78 other hits) which is
what we have at the moment.
I think we need to configure similarities now, but have no idea how to or
what to.
I have read
http://www.elasticsearch.org/guide/reference/index-modules/similarity/ but
do understand the differences between default similarity, bm25 similarity,
drf similarity and ib similarity. Which one (if any) would give me the
order I require?
Thanks for you help so far, very much appreciated as it has saved me a
lot of time.
MArk

On Thursday, 5 September 2013 16:02:22 UTC+1, Luca Cavanna wrote:

Hi Mark,
indeed in my explanation I didn't mention sharding, which is what Ivan
kindly added. In fact that's the problem you are hitting right now :slight_smile:

As you can see from the explain the inverted document frequency (idf)
for each term is always different for those three docs, which means they
come from different shards. Inverted document frequency defines how rare
your search terms are throughout the index, but here I mean the lucene
index, which is the elasticsearch shard. It's harder to compare idf's
between different shards, but these problems tend to even out as soon as
you have enough data. These are the options you have to somehow improve the
current situation:

  • add more data to have better scoring
  • index into a single shard (but given the number of docs you are going
    to index this doesn't seem like the solution for you)
  • use the dfs_query_then_fetch search type for better scoring, at the
    cost of little worse performance. Elasticsearch will try to collect more
    information from each shard in order to compute a more accurate score for
    the documents. Have a look here to know more:
    http://www.elasticsearch.org/guide/reference/api/search/search-type/ .

Cheers
Luca

On Thursday, September 5, 2013 12:34:31 PM UTC+2, MArk Williams wrote:

Taking advice from Luca and Ivan...
using snowball analyzer I am getting the matches I want, but not the
order.
I have used explain, but I do not understand the results :cry:
when I :-
curl -XPOST '
http://localhost:9200/playcompany/companies/_search?pretty&explain=true'
-d '{ "query": { "match": { "name" : { "query": "support services limited",
"type" : "phrase" } } } }'
I get 17 results. 3rd and 6th place I want in 1st and 2nd place. Why
does first match score higher than 3rd and 6th place.
Here are the explain section for 1st, 3rd and 6th if anyone can explain
what it means, and how to increase the score of 3rd place and 6th place.
Thanks so much.
1st is :-
"_score" : 2.5553496, "_source" : { "number" : "02733484", "name" : "COUNSELLING SUPPORT SERVICES LIMITED" },
"_explanation" : {
  "value" : 2.5553496,
  "description" : "weight(name:\"support servic limit\" in 0) [PerFieldSimilarity], result of:",
  "details" : [ {
    "value" : 2.5553496,
    "description" : "fieldWeight in 0, product of:",
    "details" : [ {
      "value" : 1.0,
      "description" : "tf(freq=1.0), with freq of:",
      "details" : [ {
        "value" : 1.0,
        "description" : "phraseFreq=1.0"
      } ]
    }, {
      "value" : 5.110699,
      "description" : "idf(), sum of:",
      "details" : [ {
        "value" : 2.9459102,
        "description" : "idf(docFreq=2, maxDocs=21)"
      }, {
        "value" : 0.95348,
        "description" : "idf(docFreq=21, maxDocs=21)"
      }, {
        "value" : 1.2113091,
        "description" : "idf(docFreq=16, maxDocs=21)"
      } ]
    }, {
      "value" : 0.5,
      "description" : "fieldNorm(doc=0)"
    } ]
  } ]

3rd is :-
"_score" : 2.2324283, "_source" : { "number" : "01366799", "name" : "SUPPORT SERVICES LIMITED" },
"_explanation" : {
  "value" : 2.2324283,
  "description" : "weight(name:\"support servic limit\" in 0) [PerFieldSimilarity], result of:",
  "details" : [ {
    "value" : 2.2324283,
    "description" : "fieldWeight in 0, product of:",
    "details" : [ {
      "value" : 1.0,
      "description" : "tf(freq=1.0), with freq of:",
      "details" : [ {
        "value" : 1.0,
        "description" : "phraseFreq=1.0"
      } ]
    }, {
      "value" : 4.4648566,
      "description" : "idf(), sum of:",
      "details" : [ {
        "value" : 2.321756,
        "description" : "idf(docFreq=7, maxDocs=30)"
      }, {
        "value" : 1.0,
        "description" : "idf(docFreq=29, maxDocs=30)"
      }, {
        "value" : 1.1431009,
        "description" : "idf(docFreq=25, maxDocs=30)"
      } ]
    }, {
      "value" : 0.5,
      "description" : "fieldNorm(doc=0)"
    } ]
  } ]
and 6th is :-
"_score" : 2.1750708, "_source" : { "number" : "01234567", "name" : "SUPPORT SERVICE LIMITED" },
"_explanation" : {
  "value" : 2.1750708,
  "description" : "weight(name:\"support servic limit\" in 0) [PerFieldSimilarity], result of:",
  "details" : [ {
    "value" : 2.1750708,
    "description" : "fieldWeight in 0, product of:",
    "details" : [ {
      "value" : 1.0,
      "description" : "tf(freq=1.0), with freq of:",
      "details" : [ {
        "value" : 1.0,
        "description" : "phraseFreq=1.0"
      } ]
    }, {
      "value" : 4.3501415,
      "description" : "idf(), sum of:",
      "details" : [ {
        "value" : 2.299283,
        "description" : "idf(docFreq=5, maxDocs=22)"
      }, {
        "value" : 0.9555482,
        "description" : "idf(docFreq=22, maxDocs=22)"
      }, {
        "value" : 1.0953102,
        "description" : "idf(docFreq=19, maxDocs=22)"
      } ]
    }, {
      "value" : 0.5,
      "description" : "fieldNorm(doc=0)"
    } ]
  } ]
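
For anyone trying to read these explain sections: the numbers are consistent with idf = 1 + ln(maxDocs / (docFreq + 1)) and score = tf × (sum of idfs) × fieldNorm. Since tf (1.0) and fieldNorm (0.5) are identical for all three hits, the ordering here comes only from the differing docFreq/maxDocs values. The sketch below just redoes that arithmetic; `idf` and `phrase_score` are helper names invented for this example, and the formula is inferred from the values shown rather than taken from the thread.

```shell
# idf as it appears in the explain output: 1 + ln(maxDocs / (docFreq + 1)).
# Arguments: docFreq, maxDocs.
idf() {
  awk -v df="$1" -v n="$2" 'BEGIN { printf "%.4f\n", 1 + log(n / (df + 1)) }'
}

# Phrase score = tf * (sum of the three term idfs) * fieldNorm.
# tf and fieldNorm are hard-coded to the values in the explain output.
# Arguments: maxDocs, then docFreq for each of the three terms.
phrase_score() {
  awk -v n="$1" -v a="$2" -v b="$3" -v c="$4" 'BEGIN {
    tf = 1.0; norm = 0.5
    sum = (1 + log(n / (a + 1))) + (1 + log(n / (b + 1))) + (1 + log(n / (c + 1)))
    printf "%.4f\n", tf * sum * norm
  }'
}

idf 2 21                  # 2.9459 (first term, 1st hit)
phrase_score 21 2 21 16   # 2.5553 (1st hit)
phrase_score 30 7 29 25   # 2.2324 (3rd hit)
phrase_score 22 5 22 19   # 2.1751 (6th hit)
```

Feeding the same maxDocs and docFreq values into `phrase_score` for all three hits gives three identical scores, so these particular documents only differ because each was scored with a different set of term statistics.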


(Luca Cavanna) #11

No worries, we can work on that :wink:

Can you have a look at your explain output on the index with more
documents, where you don't have the idf problem anymore?
Just post it if you need help with it. I would like to better understand
what you get now and why, and what you would like to achieve.

Thanks

On Tuesday, September 10, 2013 3:59:52 PM UTC+2, MArk Williams wrote:

My problem is that I do not know how to 'tune and influence it' :frowning:

On Monday, 9 September 2013 17:40:39 UTC+1, Luca Cavanna wrote:

I don't think you need a custom similarity to achieve that. I would just
look at the explain output and try to understand the reason behind the
different scores that you get back. TF-IDF should work well for you.
Once you know the reason behind the current scoring, you can tune and
influence it to achieve what you want, I guess.

On Monday, September 9, 2013 12:33:43 PM UTC+2, MArk Williams wrote:

Thanks for that.
I can now see that effect with an index with only 120 entries. I have
now loaded up 4 million entries.
Just as you say, the scoring problem goes away for large numbers. I tried
search_type=dfs_query_then_fetch, which made a massive difference in my
small dataset but made no (or very little) difference in my 4 million
dataset. For the very small performance hit, we may still use
dfs_query_then_fetch anyway to ensure the order is always consistent.
We now only have to work on the similarity score so that 'SUPPORT
SERVICES LIMITED' is at the top when that is the search term, and not
'A.I.M.S SUPPORT SERVICES LIMITED' (followed by 78 other hits), which is
what we have at the moment.
I think we need to configure similarities now, but have no idea how to
or what to.
I have read
http://www.elasticsearch.org/guide/reference/index-modules/similarity/ but
do not understand the differences between default similarity, bm25 similarity,
dfr similarity and ib similarity. Which one (if any) would give me the
order I require?
Thanks for your help so far, very much appreciated as it has saved me a
lot of time.
MArk

On Thursday, 5 September 2013 16:02:22 UTC+1, Luca Cavanna wrote:

Hi Mark,
indeed in my explanation I didn't mention sharding, which is what Ivan
kindly added. In fact that's the problem you are hitting right now :slight_smile:

As you can see from the explain, the inverse document frequency (idf)
for each term is different for each of those three docs, which means they
come from different shards. Inverse document frequency defines how rare
your search terms are throughout the index, but here I mean the Lucene
index, which is the Elasticsearch shard. It's harder to compare idf values
between different shards, but these differences tend to even out as soon as
you have enough data. These are the options you have to improve the
current situation:

  • add more data to have better scoring
  • index into a single shard (but given the number of docs you are going
    to index this doesn't seem like the solution for you)
  • use the dfs_query_then_fetch search type for better scoring, at the
    cost of slightly worse performance. Elasticsearch will collect more
    information from each shard in order to compute a more accurate score for
    the documents. See here for more:
    http://www.elasticsearch.org/guide/reference/api/search/search-type/ .

Cheers
Luca



(Kruti_Shukla) #12

I have a problem with plurals and want more relevant search results.
Please look at my question
here: https://groups.google.com/d/msg/elasticsearch/8yjfx2HLelc/2bEuar6NT9YJ.
Please help or suggest. Thank you for your time.



(system) #13