Performance of multi_match

Hi!

We're using Elasticsearch for an open source geocoder called Photon. We
were using Solr previously, but we switched to Elasticsearch some time ago
and I am now using multi_match's cross_fields
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html#type-cross-fields
query (which is great, by the way, as it sorts out most of the problems we
had before).

I compared the performance of both implementations and it turned out that
the Elasticsearch version is about 5 times slower than its Solr
counterpart. The dataset (100,000,000 documents) is identical and so is the
size of both indices. On the Solr side I am using an edismax
https://github.com/komoot/photon/blob/deprecated-solr-version/solrconfig/collection1/conf/solrconfig.xml#L122
query, whilst on Elasticsearch it is a cross_fields
https://github.com/christophlingg/photon/blob/komoot/website/photon/app.py#L25
query. Average query time is 120ms vs. 1000s.

I raised the limit of open file descriptors to 64k. During the benchmark
there is (almost) no IO, whilst CPU usage is very high (> 75%, 12 cores). As
cross_fields is a very recent feature I tried out best_fields
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html#type-best-fields
as well, but the benchmark results weren't better.

Do you have any ideas on how I can dig more into performance issues like
this in elasticsearch? Do you have experience with both queries you can
share with me?

Thanks for your help!
Christoph

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5bff0274-ea12-4f28-a304-3f0ad691880c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Hello,

It seems your Elasticsearch query is doing a lot more: there is custom
scoring, some filtering with OR on missing fields, sub queries, more
fields, etc.

Were you doing exactly the same filtering/scoring with Solr?

Can you incrementally test and compare the performance of your queries,
starting with just the multi_match vs. edismax? Also compare the number
of results. Make sure the cross_fields type is acting as you want,
as you have a lot of fields with maybe different analyzers.

Cédric Hourcade
ced@wal.fr

--

Hello,

It seems to me that cross_fields does more than the Solr dismax query.
To compare the same thing in both ES and Solr, you could run the dis_max
query with ES and start from there.
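
A minimal sketch of what such a dis_max query could look like, written as a Python dict of the kind one would send as the search body. The field names and boosts are borrowed from the query posted further down in this thread; the helper name `build_dis_max` and the `tie_breaker` default are assumptions made for illustration, not Photon's actual code.

```python
import json


def build_dis_max(query_text, boosted_fields, tie_breaker=0.0):
    """Build a dis_max query with one match clause per field.

    boosted_fields maps a field name to its boost (hypothetical helper).
    """
    clauses = [
        {"match": {field: {"query": query_text, "boost": boost}}}
        for field, boost in boosted_fields.items()
    ]
    return {"query": {"dis_max": {"queries": clauses, "tie_breaker": tie_breaker}}}


# Two fields picked arbitrarily; extend the dict to compare like with like.
body = build_dis_max("salzburg", {"name.default": 2.5, "city.default": 2.0})
print(json.dumps(body, indent=2))
```

Starting with two or three fields and adding more step by step should show where the query time begins to grow.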

Hope it helps
Stéphane

--

Hello Christoph,

Just wanted to add that it would be great if you could report back your
findings (good or bad) to the group. We're especially interested in this
because we're going to install Photon and would love it to work as fast as
possible ;)

Stéphane Bastian

--

Hi Cedric and Stephane,

Thanks for your feedback! Following your ideas I removed all filtering and
custom scoring from the query. I do get better results, but multi_match is
still not as fast as edismax (3 or 4 times slower).

I do not understand how multi_match is more complex than edismax. AFAIK the
only difference is that it evens out the idf across multiple fields for the
final scoring.

Is there any tool to trace performance in Elasticsearch?
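
One low-tech baseline, sketched under assumptions: every Elasticsearch search response carries a `took` field (server-side time in milliseconds), so collecting it over a benchmark run gives numbers that exclude network and client overhead. The helper name `summarize_took` and the sample responses are made up for illustration.

```python
import statistics


def summarize_took(responses):
    """Return (mean, median, max) of the `took` values, in milliseconds."""
    times = [r["took"] for r in responses]
    return statistics.mean(times), statistics.median(times), max(times)


# Made-up responses, trimmed to the one field used here.
sample = [{"took": 110}, {"took": 130}, {"took": 900}]
mean_ms, median_ms, max_ms = summarize_took(sample)
print(mean_ms, median_ms, max_ms)
```

Comparing the median with the maximum helps to tell a general slowdown apart from a few outliers.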

Christoph

--

I guess you already know this tool, but just in case you don't: I usually
use BigDesk (https://github.com/lukas-vlcek/bigdesk) to check whether
something is wrong with the heap size or any of the other metrics it
provides (cache size, etc.).

--

What about your sharding? Is it the same as with Solr?

Did you identify some particular queries being slow? Can you compare
the number of results returned by Elasticsearch and Solr?

Cédric Hourcade
ced@wal.fr

--

Hi Cedric,

> What about your sharding? Is it the same as with solr?

I have 5 shards without replication (one node). Would it be faster if it
were only one shard?

> Did you identify some particular queries being slow?

There is a general trend of all queries being slower, not only some
outliers.

> Can you compare the number of results returned between elasticsearch and
> solr?

Do you mean the limits I give?

Cheers,
Christoph

--

>> What about your sharding? Is it the same as with solr?
>
> I have 5 shards without replication (one node). Would it be faster if it
> were only one shard?

Same with Solr?

>> Did you identify some particular queries being slow?
>
> there is a general trend of all queries being slower, not only some
> outliers.

I mean if you can isolate a single query with a huge performance
difference, it would be easier to test/tweak it.

>> Can you compare the number of results returned between elasticsearch and
>> solr?
>
> Do you mean the limits I give?

For Elasticsearch I mean the ["hits"]["total"] returned in the
response, the total number of documents that matched your query.
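
A small sketch of that comparison, assuming both responses have already been parsed from JSON. On the Solr side the matching counter is `response.numFound`. The sample responses and the helper names are made up for illustration.

```python
def es_total_hits(response):
    """Total matches from an Elasticsearch search response.

    Older versions return an integer; newer ones return an object with a
    "value" key, so both shapes are handled.
    """
    total = response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


def solr_total_hits(response):
    """Total matches from a Solr JSON response (wt=json)."""
    return response["response"]["numFound"]


# Made-up responses, trimmed to the fields used here.
es_response = {"took": 12, "hits": {"total": 4200, "hits": []}}
solr_response = {"response": {"numFound": 1300, "docs": []}}
ratio = es_total_hits(es_response) / solr_total_hits(solr_response)
print(f"Elasticsearch matches {ratio:.1f}x as many documents")
```

If the ratio is far from 1 for the same input, the two queries are not doing the same work and the timings are not directly comparable.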

--

>>> What about your sharding? Is it the same as with solr?
>>
>> I have 5 shards without replication (one node). Would it be faster if it
>> were only one shard?
>
> Same with solr?

I didn't use sharding with Solr. Does disabling sharding improve
performance significantly, at least if you only plan to use one node?

>>> Did you identify some particular queries being slow?
>>
>> there is a general trend of all queries being slower, not only some
>> outliers.
>
> I mean if you can isolate a single query with a huge performance
> difference, it would be easier to test/tweak it.

It would take some work to isolate these queries. However, I managed to
find out why the query takes much longer: the number of queried
fields increased from 9 (Solr) to 25 (ES). I thought this had no impact:
the number of tokens in the index did not change, they are just
distributed over more fields. In other words: it turned out that the
number of fields you query has a greater impact on performance than the
number of tokens stored in an indexed field. So I know what to do: I will
try to merge fields where possible. Thanks for your help!

Anyway, the cross_fields query is still a little bit slower than Solr's
edismax, but given the higher complexity this is understandable.
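
One way to merge fields at index time is the `copy_to` mapping parameter, which copies a field's value into another field. The sketch below only builds such a mapping as a Python dict. The flat field names and the helper `merged_field_mapping` are hypothetical, and the field type is an assumption too (`"text"` here; Elasticsearch versions from the time of this thread call it `"string"`).

```python
def merged_field_mapping(source_fields, target_field, field_type="text"):
    """Build mapping properties that copy several fields into one.

    Each source field gets `copy_to: target_field`, so a query can hit the
    single merged field instead of every source field.
    """
    properties = {
        name: {"type": field_type, "copy_to": target_field}
        for name in source_fields
    }
    properties[target_field] = {"type": field_type}
    return {"properties": properties}


# Hypothetical flat field names, not Photon's actual mapping.
mapping = merged_field_mapping(["city", "street", "country"], "address_all")
print(mapping["properties"]["city"])
```

The trade-off is that a merged field can no longer be boosted per source field, so it only fits fields that share the same boost.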

--

It may not be because you have more fields, but because your
Elasticsearch query matches a lot more documents than the Solr one.
That's worth checking.

Cédric Hourcade
ced@wal.fr

--

> It may not be because you have more fields but because your
> elasticsearch query matches a lot more documents than the solr one,
> that's worth checking.

Thanks for that tip, but that's not the case here.

--

> I didn't use sharding with solr. Does disabling sharding improve the
> performance significantly, at least if you only plan to use it on one
> node?

Disabling sharding shouldn't make any significant difference.

> It would demand some work to isolate these queries. However, I managed
> to find out the reason why the query takes much longer: the number of
> queried fields increased from 9 (solr) to 25 (es). I thought this had
> no impact: the number of tokens in the index did not change but is
> now more distributed in different fields. In other words: it turned
> out that the number of fields you query has a greater impact on
> performance than the number of tokens stored in an indexed field. So I
> know what to do and try union fields where possible. Thanks for your help!

To me this means that the documents indexed in Solr and ES are
different. Is this correct?
Would you mind sharing the schema of the documents you are indexing in
Solr and ES? We may be able to provide tips / ideas to improve search
performance.

All the best,
Stéphane Bastian

--

> Disabling sharding shouldn't make any significant difference

Thanks!

>> It would demand some work to isolate these queries. However, I managed
>> to find out the reason why the query takes much longer: the number of
>> queried fields increased from 9 (solr) to 25 (es). I thought this had
>> no impact: the number of tokens in the index did not change but is
>> now more distributed in different fields. In other words: it turned
>> out that the number of fields you query has a greater impact on
>> performance than the number of tokens stored in an indexed field. So I
>> know what to do and try union fields where possible. Thanks for your
>> help!
>
> To me this means that the documents that are indexed in solr and Es are
> different. Is this correct?

They are identical, but in the newer Elasticsearch version the edge n-grams
and the raw tokens are stored in separate fields, which allows boosting
entire words. We are using the fields
https://github.com/christophlingg/photon/blob/komoot/es_config/mappings.json#L47
feature for that.

> Would you mind sharing the schema of the document you are indexing in
> Solr and ES ? We may be able to provide tips / ideas to improve search
> performance

Sure!

Cheers!

--

Thanks. I'm starting to get a better idea of the whole picture ;)

Could you also share the query you are running? Do you run the cross_fields
query against the default field or the 'raw' field?

Stéphane Bastian

On 06/25/2014 05:07 PM, Christoph Lingg wrote:

>> Would you mind sharing the schema of the document you are indexing in
>> Solr and ES ? We may be able to provide tips / ideas to improve search
>> performance
>
> sure!
>
> https://github.com/christophlingg/photon/blob/603c3991f19c969a7c80d601cabd9367136ca809/es_config/mappings.json
> https://github.com/komoot/photon/blob/deprecated-solr-version/solrconfig/collection1/conf/schema.xml
>
> Cheers!

--

> Could you also share the query you are running? do run the cross_field
> query against the default field or the 'raw' field?

It looks like this:

{
  "function_score": {
    "functions": [
      {"script_score": {"script": "1. + 50. * doc['importance'].value"}}
    ],
    "boost_mode": "sum",
    "score_mode": "sum",
    "query": {
      "multi_match": {
        "analyzer": "search",
        "type": "cross_fields",
        "fields": [
          "name.default.raw^18", "name.default^2.5", "name.${lang}.raw^18", "name.${lang}^2.5",
          "name.alternatives.raw^14", "name.alternatives^1.5",
          "city.default.raw^8", "city.default^2", "city.${lang}.raw^8", "city.${lang}^2",
          "street.default.raw^8", "street.default^2", "street.${lang}.raw^8", "street.${lang}^2",
          "housenumber.raw^6", "housenumber",
          "postcode^5",
          "country.default.raw^3", "country.default", "country.${lang}.raw^3", "country.${lang}",
          "context.default.raw^3", "context.default", "context.${lang}.raw^3", "context.${lang}"
        ],
        "minimum_should_match": ${should_match},
        "query": "${query}"
      }
    }
  }
}
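
The `${lang}`, `${should_match}` and `${query}` placeholders happen to match the syntax of Python's `string.Template`. Whether Photon renders the query this way is not stated in this thread, so the following is only a sketch of how such a template could be filled in safely, using a shortened copy of the query. The function name `render` and the sample values are assumptions.

```python
import json
from string import Template

# Shortened copy of the query above; the placeholders are unchanged.
QUERY_TEMPLATE = Template("""
{
  "multi_match": {
    "type": "cross_fields",
    "fields": ["name.default.raw^18", "name.${lang}.raw^18", "postcode^5"],
    "minimum_should_match": ${should_match},
    "query": "${query}"
  }
}
""")


def render(lang, should_match, user_query):
    """Fill in the template and parse the result to prove it is valid JSON."""
    rendered = QUERY_TEMPLATE.substitute(
        lang=lang,  # assumed to be a trusted language code such as "de"
        should_match=json.dumps(should_match),
        # json.dumps escapes quotes and backslashes; [1:-1] drops the outer
        # quotes because the template already supplies them.
        query=json.dumps(user_query)[1:-1],
    )
    return json.loads(rendered)


query = render("de", "100%", 'say "hi"')
print(json.dumps(query, indent=2))
```

Escaping the user input matters here because the query text is pasted into a JSON string; an unescaped double quote would break the request body.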

--

Out of curiosity, what kind of performance do you get when you only run
the search on the '.raw' fields and not on the regular fields (with edge
n-grams)? Obviously the results of the query will not be the same as before,
as the whole word has to match if the edge n-grams are out of the picture.
I had some pretty weird results in the past where, under specific
circumstances, I got better performance with prefix queries than with
edge n-grams on a huge volume of data.

This reminds me of a project I worked on, indexing data from GeoNames. One
thing we did with alternate names and support for multiple languages was
to remove the field for the default language. A default language, after
all, is a language that exists (either 'en', 'fr', etc.). This will make
your index smaller and make you run the query on fewer fields (15 instead
of 25).
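
Both suggestions boil down to filtering the field list of the multi_match query. A rough sketch, with the 25 fields copied from the query in this thread; the helper names are made up for illustration.

```python
# The 25 fields of the multi_match query from this thread, boosts included.
FIELDS = [
    "name.default.raw^18", "name.default^2.5", "name.${lang}.raw^18", "name.${lang}^2.5",
    "name.alternatives.raw^14", "name.alternatives^1.5",
    "city.default.raw^8", "city.default^2", "city.${lang}.raw^8", "city.${lang}^2",
    "street.default.raw^8", "street.default^2", "street.${lang}.raw^8", "street.${lang}^2",
    "housenumber.raw^6", "housenumber",
    "postcode^5",
    "country.default.raw^3", "country.default", "country.${lang}.raw^3", "country.${lang}",
    "context.default.raw^3", "context.default", "context.${lang}.raw^3", "context.${lang}",
]


def field_name(entry):
    """Strip the ^boost suffix from a field entry."""
    return entry.split("^", 1)[0]


def raw_only(fields):
    """Keep only the whole-word '.raw' sub-fields."""
    return [f for f in fields if field_name(f).endswith(".raw")]


def without_default(fields):
    """Drop every field that belongs to the 'default' language."""
    return [f for f in fields if ".default" not in field_name(f)]


print(len(FIELDS), len(raw_only(FIELDS)), len(without_default(FIELDS)))
```

Note that `raw_only` also drops `postcode`, which has no '.raw' sub-field in this query, so it may need to be added back by hand.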

Also I noticed that there is no edgengram on the postcode. Any reason
for that? It might be useful to also do a partial match.

Stéphane

--

> Out of curiosity, what kind of performance do you get when you only run
> the search on '.raw' fields and not regular fields (with edgengram).
> Obviously the result of the query will not be the same as before as the
> whole word should match if the edgengram are out of the picture.
> I had some pretty weird result in the past where under specific
> circumstances I had better performance results with prefix queries than
> edgengram with a huge volume of data.

Thanks for that tip, we'll try that out!

> This reminds me of a project I worked on indexing data from geonames. One
> thing we did with alternate names and support for multiple languages was to
> remove the field for the default language. A default language after all is
> a language that exists (either 'en', 'fr', etc.). This will make your index
> smaller and make you run the query on fewer fields (15 instead of 25).

That's true, but then you lose the ability to search for the
local/official name, which might differ from the name in your language. A
German can search for Strasbourg (local) and Straßburg (German) and get the
same result. As long as performance allows it I will keep it this way.

--

Hm, I am getting strange scoring results that I do not understand. I tracked
down the scoring and it seems like the 'queryWeight' is missing sometimes.
This is what explain gives me for one document:

{
  "value": 8.252264,
  "description": "weight(collector_1.default.raw:salzburg^18.0 in 11412869) [PerFieldSimilarity], result of:",
  "details": [
    {
      "value": 8.252264,
      "description": "fieldWeight in 11412869, product of:",
      "details": [
        {
          "value": 1,
          "description": "tf(freq=1.0), with freq of:",
          "details": [
            {
              "value": 1,
              "description": "termFreq=1.0"
            }
          ]
        },
        {
          "value": 8.252264,
          "description": "idf(docFreq=13182, maxDocs=18605118)"
        },
        {
          "value": 1,
          "description": "fieldNorm(doc=11412869)"
        }
      ]
    }
  ]
}

The score of 8 is quite high and the document comes out as the first result.
However, this document (the one that should be in first position) gets a
significantly lower score because of the queryWeight that pops up:

{
  "value": 3.8485851,
  "description": "weight(collector_1.default.raw:salzburg^18.0 in 8149365) [PerFieldSimilarity], result of:",
  "details": [
    {
      "value": 3.8485851,
      "description": "score(doc=8149365,freq=1.0 = termFreq=1.0\n), product of:",
      "details": [
        {
          "value": 0.46578622,
          "description": "queryWeight, product of:",
          "details": [
            {
              "value": 18,
              "description": "boost"
            },
            {
              "value": 8.262557,
              "description": "idf(docFreq=13047, maxDocs=18605118)"
            },
            {
              "value": 0.0031318406,
              "description": "queryNorm"
            }
          ]
        },
        {
          "value": 8.262557,
          "description": "fieldWeight in 8149365, product of:",
          "details": [
            {
              "value": 1,
              "description": "tf(freq=1.0), with freq of:",
              "details": [
                {
                  "value": 1,
                  "description": "termFreq=1.0"
                }
              ]
            },
            {
              "value": 8.262557,
              "description": "idf(docFreq=13047, maxDocs=18605118)"
            },
            {
              "value": 1,
              "description": "fieldNorm(doc=8149365)"
            }
          ]
        }
      ]
    }
  ]
},

I expected both scores to be equal, but because the queryWeight is missing
for the first document the order of the results gets messed up. I expected
the queryWeight to appear in both cases. Am I getting something wrong? Could
it even be a bug?
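
For reference, the numbers in the second explain output can be reproduced by hand: the score is queryWeight times fieldWeight, whereas the first document's score (8.252264) is its fieldWeight alone. The sketch below only redoes that arithmetic with the values copied from the output; it does not explain why the queryWeight is absent in the first case.

```python
# Values copied from the second explain output above.
boost = 18
idf = 8.262557
query_norm = 0.0031318406
tf = 1.0
field_norm = 1.0

query_weight = boost * idf * query_norm  # "queryWeight, product of:"
field_weight = tf * idf * field_norm     # "fieldWeight ..., product of:"
score = query_weight * field_weight      # "score(...), product of:"

print(query_weight, field_weight, score)
```

So with a queryWeight below 1, any document scored with it ends up well below a document scored by fieldWeight alone, even though the term statistics are almost identical.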

Would be great if you can help!

--

Other unexpected results arise due to different queryNorms.

For the first result I get this queryNorm:

{
  "value": 0.0059806756,
  "description": "queryNorm"
}

For some other documents it is:

{
  "value": 0.0031318406,
  "description": "queryNorm"
}

The queryNorm is multiplied into the score, so it pushes some documents
by a factor of two, leading to unexpected results I do not understand. I
dug into queryNorm and, as far as I could understand, it should stay
constant for all docs! The documentation
http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/search/Similarity.html
states:

> queryNorm(q) is a normalizing factor used to make scores between queries
> comparable. This factor does not affect document ranking (since all ranked
> documents are multiplied by the same factor), but rather just attempts to
> make scores from different queries (or even different indexes) comparable.

Is it OK that queryNorm differs?
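
In Lucene's classic TF-IDF similarity, queryNorm is 1 / sqrt(sumOfSquaredWeights), so each reported value implies a particular sum of squared weights. The sketch below inverts the formula for the two values above. One possible reading, not verified here, is that the two documents were scored with different term statistics (for example on different shards), which would give different sums and therefore different norms.

```python
import math


def query_norm(sum_of_squared_weights):
    """queryNorm as defined by Lucene's classic TF-IDF similarity."""
    return 1.0 / math.sqrt(sum_of_squared_weights)


def implied_sum_of_squared_weights(norm):
    """Invert query_norm to see which sum a reported value implies."""
    return 1.0 / (norm * norm)


# The two queryNorm values reported above.
first, other = 0.0059806756, 0.0031318406
print(implied_sum_of_squared_weights(first))
print(implied_sum_of_squared_weights(other))
print(first / other)  # how much the first result is pushed relative to the others
```

If the shard explanation holds, comparing the explain output of documents from the same shard should show identical queryNorm values.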
