Detail-questions on ES features

Hello all,

I've been using Solr for quite some time now in the project I'm working on.
When I started this project, ES was still in its early days, so I chose
Solr, which was more advanced at the time.

Now I'm re-evaluating which technology would be most appropriate for me,
especially since I'm hitting a point in data growth where I'd like to - or
even must - shard / scale out my Lucene index.

I read through the ES docs to find out whether ES provides the features
my applications require. I still have a few questions:

  1. Faceting method and performance: I have a LOT of possible facet
    values, i.e. more than 1.5M, but I already know which terms I want
    facet counts for. With Solr I can pass a list of terms and get the
    counts for exactly those terms. Can I do this with ES? I would need a
    few hundred facet counts in one request. ES offers quite a few options
    for term faceting, especially with the script fields, and I'm not sure
    how powerful they are. I saw you can explicitly EXCLUDE a list of terms;
    I'd like to do the opposite. Can I do this? With Solr's fieldCache facet
    method I get facet values in approx. 1-2 seconds on an unsharded index
    of 22M documents; I hope sharding will speed this up enough to stay under
    one second of query time with a few hundred specified facet terms. If I
    could ask ES for specific term counts, could I expect an answer in 1-2
    seconds without using the scalability features (just for comparison)?
  2. Additional facet term information: I have a use case where I want to
    sort the term facet values by a kind of TF/IDF measure; i.e. I need the
    facet count as well as the (total) document frequency of the facet term.
    How could I get this information? It shouldn't take too long, of course,
    but a few seconds would be okay.
  3. Follow-up to the point above: If I had to use e.g. a plugin to get
    the term document frequency values, how easy would it be to use this plugin
    with ElasticSearch's scale-out capabilities? I have such a plugin for Solr,
    but it only works with a single instance / node; I would have to write
    extra code for the distributed case when sharding. Is it the same with ES,
    or easier? (I would hope that, since ES was distributed from the beginning,
    it wouldn't be so hard.)
  4. PreAnalyzed field values: I have quite a few fields with pre-analyzed
    values, i.e. I already know the exact sequence of terms together with their
    position increments, start and end offsets, etc. For Solr there's the
    PreAnalyzed field type; is there something similar in ES? I believe a
    custom analyzer could do the trick, but I haven't tried it yet.
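
As a point of reference for question 2: once both numbers are available, the ranking itself could be computed on the client. The following is a minimal Python sketch, not tied to Solr or ES. The function name `rank_facets`, the sample numbers, and the particular weighting (facet count times log(N / df)) are assumptions chosen for illustration, since the question only asks for "a kind of TF/IDF measure".

```python
import math

def rank_facets(facet_counts, doc_freqs, total_docs):
    """Sort facet terms by facet_count * log(total_docs / doc_freq), descending.

    facet_counts: term -> count within the current result set
    doc_freqs:    term -> document frequency in the whole index (must be > 0)
    total_docs:   number of documents in the whole index
    """
    scored = []
    for term, count in facet_counts.items():
        df = doc_freqs[term]
        # Terms that are frequent in the whole index get a small weight.
        score = count * math.log(total_docs / df)
        scored.append((term, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical numbers, for illustration only.
facet_counts = {"wood": 40, "table": 25, "furniture": 60}
doc_freqs = {"wood": 1000, "table": 200, "furniture": 5_000_000}
ranking = rank_facets(facet_counts, doc_freqs, total_docs=22_000_000)
print([term for term, _ in ranking])
```

In this made-up data, "furniture" has the highest facet count but ranks last, because it occurs in a large share of all documents.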

I hope I'm not bothering you too much with these questions. I'm trying to get
an overview of what Solr and ES can and can't (easily) do for me. Since
I'm currently on Solr, I don't want to switch without being properly
informed first.

Thank you!

Best regards,

Erik

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Nice idea. Do you want local (per node) or global (per index or cluster)
tf/idf? Or selectable per parameter?

You have two options: let a new facet plugin do the work in the
background by embedding it directly into the facet JSON format, or add a
new plugin, which may derive from the Skywalker plugin, that can ask the
shards for the tf/idf information for a given list of terms, after you
have executed facet queries and extracted the terms (two API calls in
total).

Yes, of course it will scale across the shards; that is no problem at all,
because this is the default in ES :)
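
For the global variant, the per-shard numbers have to be combined before the weight is computed. Here is a minimal Python sketch of that merge step, assuming each shard can report its document count and a document frequency per requested term. The `shard_stats` structure and the function name `global_idf` are hypothetical and not an ES API.

```python
import math
from collections import Counter

def global_idf(shard_stats, terms):
    """Combine per-shard statistics into index-wide idf values.

    shard_stats: list of (doc_count, {term: doc_freq}) tuples, one per shard
    terms:       the terms to compute idf for
    """
    total_docs = sum(doc_count for doc_count, _ in shard_stats)
    total_df = Counter()
    for _, doc_freqs in shard_stats:
        for term in terms:
            total_df[term] += doc_freqs.get(term, 0)
    # Terms that occur in no shard are skipped to avoid dividing by zero.
    return {t: math.log(total_docs / total_df[t]) for t in terms if total_df[t] > 0}

# Two hypothetical shards with different local statistics.
shards = [
    (1000, {"wood": 10, "table": 5}),
    (3000, {"wood": 90}),
]
idf = global_idf(shards, ["wood", "table", "chair"])
```

Computed locally, "wood" would get log(1000 / 10) on the first shard and log(3000 / 90) on the second; the merged value is log(4000 / 100).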

Jörg

On 07.06.2013 09:47, Erik Fäßler wrote:

Additional facet term information: I have a use case where I want to
sort the term facet values by a kind of TF/IDF measure; i.e. I need
the facet count as well as the (total) document frequency of the facet
term.


Thank you for the reply!

I'd like to have global tf/idf; I only plan to shard for performance
reasons.
The Skywalker plugin sounds interesting. But I'd like to know my options,
so let me ask: how would I "let a new facet plugin do the work in the
background by embedding it directly into the facet JSON format"? Could you
give me a hint where to look?

It sounds great that there seem to be plugin points where I wouldn't have
to worry about handling the shards; this could save me some work (not that
changing from Solr to ES would save work ;-))

Could you please also give me a hint on how I can ask ES for specific term
counts? To be honest, I currently wouldn't know how to approach that
other than with a very ugly boolean script like "term == term1 || term ==
term2 || ...". There is this nice "exclude" option where you can
specify a list; isn't there an "include" option?

Thank you!

Erik

On Friday, 7 June 2013 10:21:44 UTC+2, Jörg Prante wrote:

Nice idea. Do you want local (per node) or global (per index or cluster)
tf/idf? Or selectable per parameter?

You have two options: let a new facet plugin do the work in the
background by embedding it directly into the facet JSON format, or add a
new plugin, which may derive from the Skywalker plugin, that can ask the
shards for the tf/idf information for a given list of terms, after you
have executed facet queries and extracted the terms (two API calls in
total).

Yes, of course it will scale across the shards; that is no problem at all,
because this is the default in ES :)

Jörg


Igor Motov wrote a script facet plugin, imotov/elasticsearch-facet-script
on GitHub ("Fully Scriptable Facets for ElasticSearch"); this might be
helpful. I don't know whether tf/idf is available through this plugin, but
it shows how facets can be extended. Also note that for ES > 0.90, a rework
of the facet framework is in progress, with better API extension points
for programming ES.

Asking ES to perform an action on term lists would have to be implemented
in a new plugin, according to the use case, preferably via a parameter
list, something like "terms=term1,term2,..."

Jörg

On 07.06.13 10:37, Erik Fäßler wrote:

Thank you for the reply!

I'd like to have global tf/idf; I only plan to shard for performance
reasons.
The Skywalker plugin sounds interesting. But I'd like to know my options,
so let me ask: how would I "let a new facet plugin do the work in the
background by embedding it directly into the facet JSON format"? Could you
give me a hint where to look?

It sounds great that there seem to be plugin points where I wouldn't have
to worry about handling the shards; this could save me some work (not that
changing from Solr to ES would save work ;-))

Could you please also give me a hint on how I can ask ES for specific term
counts? To be honest, I currently wouldn't know how to approach that
other than with a very ugly boolean script like "term == term1 || term ==
term2 || ...". There is this nice "exclude" option where you can
specify a list; isn't there an "include" option?


Thanks again, this should get me going on this topic.

As for the term lists, there is already the list format
"exclude" : ["term1", "term2"] (i.e. just JSON); I guess I would stick to that.

Can anyone comment on my other points, most importantly whether it is possible
to have 1.6M facet values and still get sane request times, and how I could get
my pre-analyzed tokens into ES?

Best,

Erik

On Friday, 7 June 2013 11:04:36 UTC+2, Jörg Prante wrote:

Igor Motov wrote a script facet plugin, imotov/elasticsearch-facet-script
on GitHub ("Fully Scriptable Facets for ElasticSearch"); this might be
helpful. I don't know whether tf/idf is available through this plugin, but
it shows how facets can be extended. Also note that for ES > 0.90, a rework
of the facet framework is in progress, with better API extension points
for programming ES.

Asking ES to perform an action on term lists would have to be implemented
in a new plugin, according to the use case, preferably via a parameter
list, something like "terms=term1,term2,..."

Jörg


To facet on a desired list of values, you could use a facet_filter, say a terms
filter, since you have a list of facet values you want to facet on.
http://www.elasticsearch.org/guide/reference/api/search/facets/
Or you can facet on values stored in another field; I forgot its name.


Thanks for the reply.

But if I understand facet filters correctly, they only reduce the set of
documents the facet values are computed on. They do not decide at the term
level whether a term should be included or not.
In my case, I will have a normal query which already restricts the document
set. Then I want term facet counts only for specified terms, relative to
this restricted document set.
Can facet filters do that for me?

On Saturday, 8 June 2013 04:55:34 UTC+2, AlexR wrote:

To facet on a desired list of values, you could use a facet_filter, say a terms
filter, since you have a list of facet values you want to facet on.
http://www.elasticsearch.org/guide/reference/api/search/facets/
Or you can facet on values stored in another field; I forgot its name.


If I understand them correctly, I think they may, if you filter on the faceted
field. I think they will not affect the main document set but will affect the
set used by that facet. If the facet field is single-valued, I think it will be
the same as taking the original set and calculating the facet for specific
values.


Let's do an example: Say I have a document with some tags:
"tags":["furniture", "wood", "table",""].
What I want is to get the counts for the tags "wood" and "table" only,
without the other tags. I can't use the exclude feature because there are
too many tags. I think the facet filter would allow me to restrict the set
of documents I get the tags from, but I would still get counts for all tags
found in these documents.
But perhaps I'm wrong, which would be great since I want this functionality
:)

On Saturday, 8 June 2013 17:13:14 UTC+2, AlexR wrote:

If I understand them correctly, I think they may, if you filter on the faceted
field. I think they will not affect the main document set but will affect the
set used by that facet. If the facet field is single-valued, I think it will be
the same as taking the original set and calculating the facet for specific
values.


As I said, it works when faceting on a single-valued field, say person.age,
because the filter, while acting on the documents, effectively restricts the
facet values as well.
It won't work for multi-valued fields such as your tags.
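
The difference can be illustrated without ES at all. The following Python sketch simulates the behaviour described above on in-memory documents: a filter that selects documents, followed by counting every value of the faceted field in the selected documents. The sample documents and the helper name `filtered_facet` are made up for illustration; this is not how ES implements facets internally.

```python
from collections import Counter

def filtered_facet(docs, field, wanted):
    """Count all values of `field` in documents that have at least one wanted value."""
    counts = Counter()
    for doc in docs:
        values = doc[field] if isinstance(doc[field], list) else [doc[field]]
        if any(v in wanted for v in values):  # document-level filter
            counts.update(values)             # every value of a matching doc is counted
    return dict(counts)

# Single-valued field: filtering documents also restricts the facet values.
people = [{"age": 30}, {"age": 40}, {"age": 30}, {"age": 50}]
single = filtered_facet(people, "age", {30, 40})

# Multi-valued field: unwanted values of matching documents still show up.
items = [
    {"tags": ["furniture", "wood", "table"]},
    {"tags": ["wood", "floor"]},
    {"tags": ["metal", "chair"]},
]
multi = filtered_facet(items, "tags", {"wood", "table"})
print(single)
print(multi)
```

For the single-valued field only the ages 30 and 40 are counted, while for the tags the unwanted values "furniture" and "floor" appear next to "wood" and "table".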


Erik, if you know the terms you can use the regex option on the terms
facet. I think something like this would work:

{
  "query" : {
    "match_all" : { }
  },
  "facets" : {
    "tag" : {
      "terms" : {
        "field" : "tags",
        "regex" : "(furniture|wood|table)"
      }
    }
  }
}
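
With a few hundred terms, the pattern would normally be generated rather than typed. Here is a minimal Python sketch that builds the request body above from a term list. The helper name `terms_facet_body` is made up, and the sketch assumes that the escapes produced by Python's `re.escape` are also accepted by the regex engine on the ES side, which should be verified for terms containing special characters.

```python
import json
import re

def terms_facet_body(field, terms, facet_name="tag"):
    """Build a terms-facet request restricted to `terms` via the regex option."""
    # Escape each term so characters like '.' or '|' are matched literally.
    pattern = "(" + "|".join(re.escape(t) for t in terms) + ")"
    return {
        "query": {"match_all": {}},
        "facets": {
            facet_name: {
                "terms": {"field": field, "regex": pattern}
            }
        },
    }

body = terms_facet_body("tags", ["furniture", "wood", "table"])
print(json.dumps(body, indent=2))
```

Note that the terms facet returns only a limited number of entries unless its `size` option is raised, so a list of a few hundred terms would probably need a matching `size` as well.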

On Sat, Jun 8, 2013 at 9:31 AM, AlexR roytmana@gmail.com wrote:

As I said, it works when faceting on a single-valued field, say person.age,
because the filter, while acting on the documents, effectively restricts the
facet values as well.
It won't work for multi-valued fields such as your tags.


Hi Matt,

thank you for the reply.

This would definitely work, but with the regex I'm a bit worried about
performance. That is mainly due to my lack of knowledge about the
performance of regular expressions. You would think that such literal
alternations wouldn't be more expensive than String.equals(), wouldn't you?
But I honestly have no idea :)

On Saturday, 8 June 2013 19:10:24 UTC+2, Matt Weber wrote:

Erik, if you know the terms you can use the regex option on the terms
facet. I think something like this would work:

{
  "query" : {
    "match_all" : { }
  },
  "facets" : {
    "tag" : {
      "terms" : {
        "field" : "tags",
        "regex" : "(furniture|wood|table)"
      }
    }
  }
}
