Detail-questions on ES features

Erik_Fassler_2 · June 7, 2013, 7:47am

Hello all,

I've been using Solr for quite some time now in the project I'm working on.
When I started this project, ES was still in its early days, so I chose
Solr, which was more advanced at the time.

Now I'm re-evaluating which technology would be most appropriate for me,
especially since I'm hitting a point in data growth where I'd like to - or
even must - shard / scale out my Lucene index.

I read through the ES docs to find out whether the features my
applications require are available. I still have a few questions:

1. Faceting method and performance: I have a LOT of possible facet
   values, i.e. more than 1.5M, but I already know which terms I want
   counts for. With Solr I can pass a list of terms and get back the
   counts for exactly those terms. Can I do this with ES? I would need a
   few hundred facet counts in one request. ES has quite a few options
   for term faceting, especially with the script fields, and I'm not sure
   how powerful these are. I saw that you can explicitly EXCLUDE a list of
   terms; I'd like to do the opposite. Is that possible? With Solr's
   fieldCache facet method I get facet values in approx. 1-2 seconds on a
   NON-sharded index of 22M documents; I hope sharding will speed this up
   enough to stay under one second of query time with a few hundred
   specified facet terms. If I could ask ES for specific term counts,
   could I expect an answer in 1-2 seconds without using the scalability
   features (just for comparison)?

2. Additional facet term information: I have a use case where I want to
   sort the term facet values by a kind of TF/IDF measure, i.e. I need the
   facet count as well as the (total) document frequency of each facet
   term. How could I get this information? It should not take too long,
   of course, but a few seconds would be okay.

3. Follow-up to the point above: If I had to use e.g. a plugin to get the
   term document frequency values, how easy would it be to use this
   plugin with ElasticSearch's scale-out capabilities? I have such a
   plugin for Solr, but it only works with a single instance / node; I
   would have to write code for the distributed case when sharding. Is it
   the same with ES, or easier? (I would hope that, because ES was
   distributed from the beginning, it wouldn't be so hard.)

4. PreAnalyzed field values: I have quite a few fields with pre-analyzed
   values, i.e. I already know the exact sequence of terms together with
   their position increments, start and end offsets, etc. For Solr
   there's the PreAnalyzed field type; is there something similar in ES?
   I believe a custom analyzer could do the trick, but I haven't tried
   yet.
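
A minimal sketch of the kind of ranking described in the second question, assuming the facet counts and the index-wide document frequencies are already available as plain dictionaries. The function name, the sample numbers, and the exact weighting (count times log(N / df)) are illustrative assumptions, not something Solr or ES is known to provide in this form.

```python
import math

def rank_facet_terms(facet_counts, doc_freqs, total_docs):
    """Sort facet terms by a TF/IDF-like score (hypothetical weighting).

    facet_counts: term -> count within the current result set
    doc_freqs:    term -> number of documents in the whole index with that term
    total_docs:   number of documents in the whole index
    """
    scored = []
    for term, count in facet_counts.items():
        df = doc_freqs[term]
        # Terms frequent in the result set but rare in the index score highest.
        score = count * math.log(total_docs / df)
        scored.append((term, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Illustrative numbers only.
facet_counts = {"wood": 40, "table": 40, "furniture": 50}
doc_freqs = {"wood": 1000, "table": 100, "furniture": 10000}
ranking = rank_facet_terms(facet_counts, doc_freqs, total_docs=100000)
print([term for term, _ in ranking])
```

The point of the sketch is only that both numbers are needed per term: the facet count alone would put "furniture" first, while the combined score does not.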

I hope I'm not bothering you too much with these questions. I'm trying to
get an overview of what Solr and ES can and can't do (easily) for me. Since
I'm currently on Solr, I don't want to switch without being properly
informed first.

Thank you!

Best regards,

Erik

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · June 7, 2013, 8:21am

Nice idea. Do you want local (per node) or global (per index or cluster)
tf/idf? Or selectable per parameter?

You have two options: let a new facet plugin do the work in the
background by embedding it directly into the facet JSON format, or add a
new plugin, which may derive from the Skywalker plugin, that can ask the
shards for the tf/idf information for a given list of terms, after you
have executed facet queries and extracted the terms (two API calls in
total).

Yes, of course it will scale over the shards, no problem at all, because
this is the default in ES.

Jörg
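
The local versus global distinction can be sketched as follows, assuming each shard reports its own document frequency for a term together with its document count. The shard dictionaries and function names are hypothetical and do not correspond to an actual ES or Skywalker plugin API.

```python
import math

def local_idf(shard, term):
    """IDF computed from a single shard's statistics only."""
    return math.log(shard["doc_count"] / shard["doc_freq"][term])

def global_idf(shards, term):
    """IDF computed from statistics summed over all shards.

    Assumes the term occurs in at least one shard.
    """
    total_docs = sum(s["doc_count"] for s in shards)
    total_df = sum(s["doc_freq"].get(term, 0) for s in shards)
    return math.log(total_docs / total_df)

# Two hypothetical shards with a skewed distribution of the term "wood".
shards = [
    {"doc_count": 1000, "doc_freq": {"wood": 10}},
    {"doc_count": 1000, "doc_freq": {"wood": 190}},
]
print(local_idf(shards[0], "wood"), local_idf(shards[1], "wood"))
print(global_idf(shards, "wood"))
```

When terms are unevenly distributed, the per-shard values diverge from the global one, which is why the choice matters for sorting facet terms.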

Erik_Fassler_2 · June 7, 2013, 8:37am

Thank you for the reply!

I'd like to have global tf/idf; I only plan to shard for performance
reasons.
The Skywalker plugin sounds interesting. But I'd like to know my options,
so let me ask, how would I do "let a new facet plugin do the work in the
background by embedding it directly into the facet JSON format"? Could you
give me a hint where to look?

It sounds great that there seem to be plugin points where I wouldn't have
to worry about handling the shards; this could save me some work (not that
changing from Solr to ES would save work ;-))

Could you please also give me a hint on how I can ask ES for specific term
counts? To be honest, I currently wouldn't know how to approach that
other than with a very ugly boolean script like "term == || term ==
|| .......". There is this nice "exclude" method where you can
specify a list; isn't there an "include" method?

Thank you!

Erik

jprante · June 7, 2013, 9:04am

Igor Motov wrote a script facet plugin (imotov/elasticsearch-facet-script
on GitHub, "Fully Scriptable Facets for ElasticSearch") that might be
helpful. I don't know whether tf/idf is available through this plugin, but
it shows how facets can be extended. Also note that for ES versions after
0.90, a rework of the facet framework is in progress, with better API
extension points.

Asking ES to perform an action on term lists would have to be implemented
according to the use case, in a new plugin, preferably via a parameter
list, something like "terms=term1,term2,...".

Jörg

Erik_Fassler_2 · June 7, 2013, 9:21am

Thanks again, this should get me going on this topic.

As for the term lists, there is already the list format
"exclude" : ["term1", "term2"] (i.e. just JSON); I guess I would stick to
that.

Can anyone comment on my other points, most importantly whether I can
have 1.6M facet values and still get sane request times, and how I could
get my pre-analyzed tokens into ES?

Best,

Erik

roytmana · June 8, 2013, 2:55am

To facet on a desired list of values, you could use facet_filter with, say, a terms filter, since you have a
list of the facet values you want to facet on.
http://www.elasticsearch.org/guide/reference/api/search/facets/
Or you can facet on values stored in another field; I forgot its name.
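
A minimal sketch of what such a request body might look like, assuming the 0.90-era facets API in which facet_filter sits next to the facet type inside the named facet. The facet name "wanted", the field "tags", and the term list are illustrative assumptions.

```python
import json

def facet_filter_request(field, wanted_terms, query=None):
    """Build a search body with a terms facet restricted by a facet_filter."""
    return {
        "query": query or {"match_all": {}},
        "facets": {
            "wanted": {  # hypothetical facet name
                "terms": {"field": field},
                # The terms filter narrows the documents this facet is computed on.
                "facet_filter": {"terms": {field: list(wanted_terms)}},
            }
        },
    }

body = facet_filter_request("tags", ["wood", "table"])
print(json.dumps(body, indent=2))
```

Note that the filter acts on documents, not on individual facet values, so what it returns depends on whether the field is single-valued.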

Erik_Fassler_2 · June 8, 2013, 5:39am

Thanks for the reply.

But if I understand facet filters correctly, they only reduce the set of
documents the facet values are computed on. They do not decide at the term
level whether a term should be included or not.
In my case, I will have a normal query which already restricts the document
set. Then I want term facet counts only for specified terms, relative to
this restricted document set.
Can facet filters do that for me?

roytmana · June 8, 2013, 3:13pm

If I understand them correctly, they may, if you filter on the faceted field. I think they will not affect the main document set but will affect the set used by that facet. If the facet field is single-valued, I think it will be the same as taking the original set and calculating the facet for specific values.

Erik_Fassler_2 · June 8, 2013, 3:40pm

Let's do an example: Say I have a document with some tags:
"tags":["furniture", "wood", "table",""].
Then what I want would be to be able to get the counts for the tags "wood"
and "table", without the other tags. I can't use the exclude feature
because there are too many tags. I think the facet filter would allow me to
restrict the set of documents I get the tags from, but I would still get
counts for all tags found in these documents.
But perhaps I'm wrong, which would be great, since I want this functionality.
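
The concern can be simulated in plain Python, without ES, under the assumption that a facet filter selects whole documents and the terms facet then counts every value of the field in those documents. The sample documents are made up for illustration.

```python
from collections import Counter

docs = [
    {"tags": ["furniture", "wood", "table"]},
    {"tags": ["furniture", "metal", "chair"]},
    {"tags": ["wood", "floor"]},
]
wanted = {"wood", "table"}

# A document-level filter keeps every document carrying at least one wanted tag...
filtered = [d for d in docs if wanted & set(d["tags"])]
# ...but counting terms over those documents still includes their other tags.
facet_after_filter = Counter(t for d in filtered for t in d["tags"])
# What is actually wanted: counts for the listed tags only.
wanted_counts = {t: c for t, c in facet_after_filter.items() if t in wanted}

print(dict(facet_after_filter))
print(wanted_counts)
```

In this simulation "furniture" and "floor" still show up after filtering, which is the behaviour described above for multi-valued fields.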

roytmana · June 8, 2013, 4:31pm

As I said, it works when faceting on a single-valued field, say person.age, because the filter, while acting on the documents, effectively restricts the facet values as well.
It won't work for multi-valued fields such as your tags.

mattweber · June 8, 2013, 5:10pm

Erik, if you know the terms you can use the regex option on the terms
facet. I think something like this would work:

{
  "query" : {
    "match_all" : { }
  },
  "facets" : {
    "tag" : {
      "terms" : {
        "field" : "tags",
        "regex" : "(furniture|wood|table)"
      }
    }
  }
}
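
For a few hundred known terms, the pattern could be generated rather than typed by hand. This is a sketch only: the helper name is hypothetical, Python's re.escape is used as a stand-in for whatever escaping the regex engine on the ES side needs, and setting "size" to the number of terms is an assumption made so that the facet is not cut off at its default number of entries.

```python
import json
import re

def terms_facet_with_regex(field, wanted_terms):
    """Build a terms facet body whose regex is an alternation of literal terms."""
    # re.escape neutralises characters such as '.' or '+' inside a term.
    pattern = "(" + "|".join(re.escape(t) for t in wanted_terms) + ")"
    return {
        "query": {"match_all": {}},
        "facets": {
            "tag": {
                "terms": {
                    "field": field,
                    "regex": pattern,
                    # Assumption: ask for as many entries as there are terms.
                    "size": len(wanted_terms),
                }
            }
        },
    }

body = terms_facet_with_regex("tags", ["furniture", "wood", "table"])
print(json.dumps(body))
```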

Erik_Fassler_2 · June 8, 2013, 6:07pm

Hi Matt,

thank you for the reply.

This would definitely work, but with the regex I'm a bit worried about
performance. That is mainly due to my lack of knowledge about the
performance of regular expressions. You would think that matching such
verbatim expressions wouldn't be more expensive than String.equals(),
wouldn't you? But I honestly have no idea.
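
In terms of results, an alternation of literal terms selects the same values as an equality check, provided the whole term has to match. The sketch below shows this with Python's re module as a stand-in; ES would evaluate the pattern with Java's regex engine, and nothing here measures actual facet performance. The candidate list is made up.

```python
import re

wanted = ["furniture", "wood", "table"]
candidates = ["wood", "wooden", "table", "tablet", "chair"]

# Alternation of escaped literals, checked against the whole term.
pattern = re.compile("(" + "|".join(re.escape(t) for t in wanted) + ")")
by_regex = [c for c in candidates if pattern.fullmatch(c)]

# Plain equality against a set of the same terms.
wanted_set = set(wanted)
by_equality = [c for c in candidates if c in wanted_set]

print(by_regex, by_equality)
```

Whether the regex route is as cheap as the equality route for 1.5M distinct values is a separate question that only a measurement on the real index could answer.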
