Creating new facet types

I am currently attempting to convert a custom Lucene collector to an
Elasticsearch facet. The facet is similar to a RangeFacet, but with
dynamic ranges. I have tried to use the existing RangeFacet and
HistogramFacet, but have not been able to create a facet with a
fixed number of buckets and no predefined ranges.

From what I have gathered from viewing the source, these appear to be
the main steps (and questions) when creating a new facet:

  1. Create a plugin that adds a new FacetProcessor (see below) to
    the FacetModule.

  2. Create a new FacetProcessor class. The types() and parse(...)
    methods are pretty simple to understand. The reduce method is a bit
    trickier, but I think I understand its usage. When/why do facets need
    to be merged? Also, what is the purpose of the various registerStreams
    calls in the FacetProcessor constructors? (A rough skeleton of points
    1 and 2 follows this list.)

  3. Create a new FacetCollector class. The meat of the faceting
    process. Contains the logic that creates the actual facets.

  4. Create a new Facet class. Primarily a POJO.
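
For reference, here is the rough skeleton I have put together so far for
points 1 and 2. Class names like DynamicRangeFacetProcessor are my own,
and the signatures are just my reading of the 0.19.x source, so treat
this as a sketch rather than a working example:

    import java.io.IOException;
    import java.util.List;

    import org.elasticsearch.common.inject.Inject;
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.common.xcontent.XContentParser;
    import org.elasticsearch.plugins.AbstractPlugin;
    import org.elasticsearch.search.facet.Facet;
    import org.elasticsearch.search.facet.FacetCollector;
    import org.elasticsearch.search.facet.FacetModule;
    import org.elasticsearch.search.facet.FacetProcessor;
    import org.elasticsearch.search.internal.SearchContext;

    public class DynamicRangeFacetPlugin extends AbstractPlugin {

        public String name() { return "dynamic-range-facet"; }

        public String description() { return "Range facet with a fixed bucket count"; }

        // Point 1: the FacetModule picks this up via the onModule(...) convention.
        public void onModule(FacetModule module) {
            module.addFacetProcessor(DynamicRangeFacetProcessor.class);
        }
    }

    // Point 2 (separate file): parses the request and reduces shard results.
    public class DynamicRangeFacetProcessor implements FacetProcessor {

        @Inject
        public DynamicRangeFacetProcessor(Settings settings) {
            // see the discussion of registerStreams further down the thread
            InternalDynamicRangeFacet.registerStreams();
        }

        public String[] types() {
            return new String[]{"dynamic_range"};
        }

        public FacetCollector parse(String facetName, XContentParser parser, SearchContext context) throws IOException {
            // read field name, bucket count, etc. from the request,
            // then build and return the FacetCollector (point 3)
            throw new UnsupportedOperationException("sketch only");
        }

        public Facet reduce(String name, List<Facet> facets) {
            // merge the one-facet-per-shard list into a single result
            throw new UnsupportedOperationException("sketch only");
        }
    }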

Am I missing anything else? The process seems daunting, but hopefully
it will be eased since I should be able to build upon the existing
classes (skipping points 3 + 4). What the dynamic range algorithm
actually does is execute a range facet over pre-established static
ranges (based on historical data) and then combine those ranges evenly
among a defined number of buckets.
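
To make that combining step concrete, a plain-Java sketch of it (my own
simplification; real code would also have to track the bucket
boundaries, which I have left out):

    // Fold the counts collected for fine-grained static ranges into
    // numBuckets buckets, each holding roughly an equal share of the total.
    static long[] combineRanges(long[] staticRangeCounts, int numBuckets) {
        long total = 0;
        for (long count : staticRangeCounts) {
            total += count;
        }
        long target = Math.max(1, total / numBuckets); // ideal docs per bucket

        long[] buckets = new long[numBuckets];
        int bucket = 0;
        for (long count : staticRangeCounts) {
            buckets[bucket] += count;
            // move on once this bucket has reached its share
            if (buckets[bucket] >= target && bucket < numBuckets - 1) {
                bucket++;
            }
        }
        return buckets;
    }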

Cheers,

Ivan

Any insights on creating a new facet type? Especially the purpose of
registerStreams().

--
Ivan

On Fri, Feb 24, 2012 at 5:26 AM, Shay Banon <kimchy@gmail.com> wrote:

registerStreams is aimed at registering how the facets get serialized
between nodes. The reduce part is important because facets execute on
each shard, and then they need to be reduced into a single result.

Note, this API is quite low level and expected to change; it is very
rough currently.
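
For illustration, the pattern the built-in facets follow looks roughly
like this (adapted from InternalTermsFacet in 0.19.x; the class
InternalDynamicRangeFacet and its merge(...) helper are made-up names,
and most of the interface methods are omitted):

    public class InternalDynamicRangeFacet implements InternalFacet {

        private static final String STREAM_TYPE = "dynamicRange";

        // Called once from the FacetProcessor constructor: it maps the
        // stream type id to a reader, so a node receiving serialized
        // shard-level facets knows how to rebuild this concrete class.
        public static void registerStreams() {
            InternalFacet.Streams.registerStream(STREAM, STREAM_TYPE);
        }

        static InternalFacet.Stream STREAM = new InternalFacet.Stream() {
            public Facet readFacet(String type, StreamInput in) throws IOException {
                InternalDynamicRangeFacet facet = new InternalDynamicRangeFacet();
                facet.readFrom(in);
                return facet;
            }
        };

        // name(), type(), streamType(), readFrom(), writeTo(),
        // toXContent() etc. omitted here.
    }

    // The reduce side then lives in the FacetProcessor: one Facet per
    // shard in, one merged Facet out (merge(...) is hypothetical).
    public Facet reduce(String name, List<Facet> facets) {
        InternalDynamicRangeFacet merged = (InternalDynamicRangeFacet) facets.get(0);
        for (int i = 1; i < facets.size(); i++) {
            merged.merge((InternalDynamicRangeFacet) facets.get(i));
        }
        return merged;
    }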

On Tuesday, June 26, 2012 12:21:08 PM UTC-7, Ivan Brusic wrote:

Out of curiosity, are there any plans for revisiting the facet API in
the near future? After months of delaying the creation of a new facet,
the time has finally come for me to write it!

--
Ivan

On Wed, Jun 27, 2012 at 1:10 AM, Eric Jain <eric.jain@gmail.com> wrote:

I also wanted to get working on that geo-clustering facet [1], so this
would be good to know!

[1] https://github.com/elastic/elasticsearch/issues/1689

On Wednesday, June 27, 2012 11:39:54 PM UTC+5:30, kimchy wrote:

Yes, there is a plan to revisit the facet execution model, so there
will probably be changes.
@eric: Really want to get the geo clustering as a facet; I want to
tackle it after the refactoring.

On 7/18/2012 6:35 AM, Sujoy Sett wrote:

Hi,

As far as I understand from the structure of custom facet plugin
development, the code within the Collector class is executed as many
times as the number of shards, one thread per shard, whereas the code
within the Processor class is executed once, aggregating the results
available from each Collector response. Please correct me if I am wrong
about this basic structure.

Assuming the above, is there any way to access the Lucene-based file
system for a shard directly, in its raw format, from the Collector
class? What I am trying to do is bypass the doCollect(docId) and
forEachValueInDoc(aggregator) methods, which deal with the tokens as
available from the index, and access the raw Lucene file system
directly for some analysis purpose. Is this easily possible?
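
To be concrete, what I am imagining is something like this (a sketch
based on my reading of AbstractFacetCollector in 0.19.x; I may well be
misreading the API):

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.elasticsearch.search.facet.AbstractFacetCollector;
    import org.elasticsearch.search.facet.Facet;

    public class RawAccessFacetCollector extends AbstractFacetCollector {

        private IndexReader currentReader;

        public RawAccessFacetCollector(String facetName) {
            super(facetName);
        }

        @Override
        protected void doSetNextReader(IndexReader reader, int docBase) throws IOException {
            this.currentReader = reader; // the raw segment-level Lucene reader
        }

        @Override
        protected void doCollect(int doc) throws IOException {
            // 'doc' is segment-relative here; currentReader.document(doc),
            // term vectors, etc. would all be reachable, bypassing field data
        }

        @Override
        public Facet facet() {
            return null; // would build the shard-level facet from the analysis
        }
    }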

Thanks,
Sujoy.

On Thursday, July 19, 2012 5:53:28 AM UTC+5:30, P Hill wrote:

I'm not sure this applies in your case, but in a regular Lucene
Collector you can access the field cache for each segment within a
Lucene index.
I believe this is the recommended way to work with a few fields within
a doc without trying to load the whole document. See the book "Lucene
in Action" for how this is done.
If you can put just what you need for collection into a nice small
field, then any one field cache per segment can be small.
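
A bare-bones example of what I mean, against the Lucene 3.x API (the
"category" field name is just for illustration):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.Scorer;

    public class CategoryCountCollector extends Collector {

        private String[] values; // field cache entry for the current segment
        private final Map<String, Integer> counts = new HashMap<String, Integer>();

        @Override
        public void setNextReader(IndexReader reader, int docBase) throws IOException {
            // one small array per segment, cached by Lucene across queries
            values = FieldCache.DEFAULT.getStrings(reader, "category");
        }

        @Override
        public void collect(int doc) {
            String value = values[doc]; // 'doc' is segment-relative
            if (value != null) {
                Integer count = counts.get(value);
                counts.put(value, count == null ? 1 : count + 1);
            }
        }

        @Override
        public void setScorer(Scorer scorer) {
            // scores are not needed for counting
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true;
        }

        public Map<String, Integer> counts() {
            return counts;
        }
    }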

-Paul


Hi Paul,

Thanks for your response.

My question was more directed towards the Elasticsearch facet module
structure, though it eventually comes back to using Lucene fields in
their raw form. Let me illustrate my requirement.

  1. First of all, search is not my primary target, but faceting is.
  2. A custom analyzer parses (text parsing, using NLP libraries) a
    particular field of the document and indexes it with some associated
    information available from the parsing.
  3. Now, developing custom faceting logic, I want to aggregate (using
    the distributed map-reduce of the Elasticsearch facet architecture)
    the elements of the said field from all docs (returned by a search on
    some other field), as per another text-aggregating logic.
  4. While doing the above, I have the elements of the said field as
    Java Strings (or Java basic types) in the faceting module
    (FacetCollector classes and all). What I am currently doing is
    parsing the elements again in this phase and applying the
    text-aggregating logic. (I am partially redoing here what the
    analyzer has already done.)
  5. If I can access the elements in raw form with details (I mean as
    Lucene term attributes, etc., instead of Java basic types), I will
    save a huge percentage of the time needed here by cutting out the
    redone text parsing (see the sketch after this list).
  6. A bit more explanation: getting the indexed field content back as
    it was is not my requirement. I am not doing search/highlight. I want
    some information retrieval to run on a field and give me a composite
    result. The Elasticsearch facet module provides the perfect
    distributed map-reduce framework for the purpose. I am trying to add
    information retrieval logic to it.
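
As a rough illustration of point 5, something like this is what I am
after (assuming term vectors were enabled on the field at index time;
Lucene 3.x API, with aggregate(...) standing in for my text-aggregation
logic):

    // Inside the collector, once the segment-level IndexReader is at hand:
    void collectFromTermVector(IndexReader reader, int docId) throws IOException {
        TermFreqVector tfv = reader.getTermFreqVector(docId, "my_analyzed_field");
        if (tfv != null) {
            String[] terms = tfv.getTerms();        // tokens exactly as the analyzer emitted them
            int[] freqs = tfv.getTermFrequencies(); // per-document term frequencies
            for (int i = 0; i < terms.length; i++) {
                aggregate(terms[i], freqs[i]);      // hypothetical: the text-aggregation step
            }
        }
    }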

I hope I have explained my requirement well. I can provide more
clarification if needed.
Ivan's starting post, giving a gist of developing custom facets, was a
great motivation towards using Elasticsearch for this task.
However, I am not yet quite familiar with all the internal code details
of Elasticsearch, so a suggestion / guideline on the requirement stated
above would be really helpful.

Thanks,
Sujoy.
