Facet configuration & showing conditions

Hi,

We power our multi-tenant application with Elasticsearch. The documents may
have different fields for each tenant (one tenant may have legal documents
while another has media articles), so we can't hardcode the facets but
need to make them configurable per tenant. I understand that Elasticsearch
leaves the facet configuration to the application by requiring it to be
sent with each search request.
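
To make the idea concrete, here is a minimal sketch of how a per-tenant
configuration could be turned into the facet part of a search request body.
The tenant names, field names and the TENANT_FACETS table are hypothetical;
the body layout assumes the terms facet format of the facets API.

```python
# Hypothetical per-tenant facet configuration; tenant and field names are
# made up for illustration.
TENANT_FACETS = {
    "legal_tenant": ["country", "city", "company_type"],
    "media_tenant": ["media_type", "publisher"],
}


def build_search_body(tenant, query_text, size=10):
    """Build a search request body with one terms facet per configured field."""
    facets = {
        field: {"terms": {"field": field, "size": size}}
        for field in TENANT_FACETS.get(tenant, [])
    }
    return {
        "query": {"query_string": {"query": query_text}},
        "facets": facets,
    }
```

The returned dict would be serialized to JSON and sent as the search body; a
tenant without configuration simply gets an empty facets section.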

Just to make sure that we don't reinvent the wheel: Are there any
libraries/projects that already implement facet configuration on top of
elasticsearch?

We're especially wondering how best to implement the conditions for when to
show a facet, because some conditions can only be evaluated after part of
the search has been executed. Some example conditions:

  • Facet drill: Show a facet if another facet was selected, e.g. show the
    "city" facet if the user chose a value from the "country" facet -> simple
  • Auto-drill: Auto-select the dominant value of a higher-level facet if
    most hits fall into it, e.g. directly show the "city" facet if 95% of
    results are in country=Netherlands
  • #hits threshold: Show a facet if the number of hits is larger/smaller
    than a threshold
  • Coverage: Show a facet if at least XX% of hits have a value for it,
    e.g. only show media type if at least 30% of hits would be in the facet
    (this avoids showing local facets for inhomogeneous search results)
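
As a rough sketch, the drill, hit-threshold and coverage conditions could be
checked like this once the facet counts are back. The rule keys
(requires_selected, min_hits, max_hits, min_coverage) are hypothetical names,
and the code assumes each facet result reports a "missing" count (hits without
a value for the field), as the terms facet does.

```python
def coverage(facet_result, total_hits):
    """Fraction of hits that have a value for the faceted field."""
    if total_hits == 0:
        return 0.0
    return (total_hits - facet_result["missing"]) / total_hits


def should_show(rule, facet_result, total_hits, selected_facets):
    """Return True if a facet passes all conditions configured in `rule`."""
    # Facet drill: only show once the parent facet has a selected value.
    parent = rule.get("requires_selected")
    if parent is not None and parent not in selected_facets:
        return False
    # #hits threshold: only show within the configured range of total hits.
    if total_hits < rule.get("min_hits", 0):
        return False
    if total_hits > rule.get("max_hits", float("inf")):
        return False
    # Coverage: only show if enough hits have a value for this field.
    if coverage(facet_result, total_hits) < rule.get("min_coverage", 0.0):
        return False
    return True
```

Note that only the drill condition can be decided before the search runs; the
other two need the total hit count and the facet result.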

The only solution we see so far is to request all possible facets from
Elasticsearch and to filter them afterwards. This doesn't sound very
efficient, as we may have 100+ possible facets of which we would show
6-7 at most. Any advice on how to implement this better?

Many thanks!

Ron

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

It's better to know which fields are used for faceting, otherwise you will
run into all kinds of trouble (heap busting is just one of them).

Consider adding a configuration for each tenant that lists which fields are
registered for faceting. Then a faceting framework can pick up the field
list more efficiently.

On Open Data Day, I used https://github.com/okfn/facetview/ with success
for a quick demo of Elasticsearch. "Facetview" uses preconfigured field
name lists.

Just as a side note: you are right, the conditions you describe are very
hard for a generic framework to compute in reasonable time. They depend
strongly on the application domain. It is more reasonable to add some simple
logic to the application-specific, tenant-aware indexing API that checks the
sources, keeps track of field coverage, and auto-selects tenant-only facets
based on this. Looking up all fields in all mappings of all indexes is
expensive. Also, because ES is schemaless, I suggest configuring rules that
constrain the fields of a tenant to candidate fields for facet checks, or
sooner or later you will waste resources checking all fields of all tenants,
many of which will never get faceted. One easy rule would be "all facet
field names must start with 'facet_'".
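
A minimal sketch of what such indexing-side logic might look like, assuming
the 'facet_' naming rule from above. The class and method names are
hypothetical, and the counters live in memory here; a real setup would have to
persist them somewhere.

```python
from collections import Counter

FACET_PREFIX = "facet_"  # naming rule: only these fields are facet candidates


class TenantFieldStats:
    """Tracks, per tenant, how many indexed documents carry each facet field."""

    def __init__(self):
        self.doc_counts = Counter()  # tenant -> number of documents seen
        self.field_counts = {}       # tenant -> Counter(field -> docs with a value)

    def observe(self, tenant, source):
        """Call once per document, just before handing it to the index."""
        self.doc_counts[tenant] += 1
        counts = self.field_counts.setdefault(tenant, Counter())
        for field, value in source.items():
            # Count only candidate fields that actually carry a value.
            if field.startswith(FACET_PREFIX) and value not in (None, "", []):
                counts[field] += 1

    def candidate_facets(self, tenant, min_coverage=0.0):
        """Facet fields whose share of the tenant's documents reaches min_coverage."""
        total = self.doc_counts[tenant]
        if total == 0:
            return []
        counts = self.field_counts.get(tenant, {})
        return sorted(f for f, n in counts.items() if n / total >= min_coverage)
```

This only gives coverage over all documents of a tenant, not over the hits of
a particular query, but it keeps the list of facets to request short.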

Jörg

On 09.03.13 12:03, Ron wrote:

Hi Joerg,

Thank you so much for your input. Very helpful!

Let me clarify one point a bit further: We have around 50-100 fields per
tenant (of which 10-20 are the same across all tenants), i.e. we have
highly structured data. The superset of all fields across all tenants may
run into the 100,000s. In line with your comment, we assumed that a
tenant-specific index and facet configuration is a must.

Could you elaborate further on the "tenant-aware indexing API that can
check the sources and manage field coverage"?

To give an example for auto-drill: A tenant stores legal documents whose
metadata includes, among other things, the country and the city where the
document was created. We want to have facets over the country and the city
(the latter shown once a country is chosen). If a user searches for "Dutch
compliance regulation" (in Dutch), the vast majority (but not necessarily
all) of the hits for this particular query will be clustered in the facet
value "Netherlands". Showing the country facet wouldn't really help the user
refine their search results because nearly all hits are under the value
"Netherlands". Instead, we would want to directly show a facet with all
cities, from which the user could refine more effectively.

And another example for coverage: The tenant stores the 'company type'
(e.g. joint stock company, limited) to which a document relates. This field
is only relevant for documents in company law (e.g. not for penal law). A
search for a generic term, such as the place where the document was
created, would lead to an inhomogeneous result set consisting of e.g.
company and penal law documents. In this case, showing the 'company type'
facet wouldn't help the user, as just a small percentage of the documents
would have a value for the facet. In contrast, a search for the corporate
lawyer who created the document (such lawyers are typically specialized in
one field of law) would lead to a more homogeneous result set, and showing
the 'company type' facet would be of great value to the user.

Are these use cases that we could somehow handle via the indexing API?
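
To illustrate the auto-drill case, this is roughly the decision we have in
mind, evaluated at query time on the counts of the parent facet. The function
name, its parameters and the 95% default are assumptions for illustration; the
entries follow the term/count layout of a terms facet result.

```python
def pick_facet(parent_terms, total_hits, parent="country", child="city",
               dominance=0.95):
    """Decide whether to show the parent facet or drill straight to the child.

    parent_terms is a list of {"term": ..., "count": ...} entries for the
    parent facet. Returns (facet_to_show, auto_selected_parent_value).
    """
    if total_hits > 0 and parent_terms:
        # Find the parent value that holds the most hits.
        top = max(parent_terms, key=lambda entry: entry["count"])
        if top["count"] / total_hits >= dominance:
            return child, top["term"]
    return parent, None
```

With 96 of 100 hits under "Netherlands" this would return the "city" facet
with "Netherlands" pre-selected; with a 60/40 split it would stay on
"country".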

Ron

On Saturday, March 9, 2013 12:43:04 PM UTC+1, Jörg Prante wrote:
