Greetings!

Hi Shay et al.

Awesome website you've got there - really gets me interested in
trying out this project. Your documentation gives a great feel for
how ElasticSearch can be used. Many of my questions can be answered
by digging into the source code, but I was wondering if you could give
a little overview (or point to the relevant docs) on how you do a) the
faceted search part, and b) in what way is the search "real-time". As
someone who's worked quite a bit on getting distributed faceted real-
time search on top of Lucene to perform well ( see: http://zoie.googlecode.com
and http://bobo-browse.googlecode.com ), I'm interested to see what
ElasticSearch's approach was!

-jake

http://www.linkedin.com/in/jakemannix
http://www.twitter.com/pbrane

Hi Jake,

Thanks for the compliments! I invested quite a bit of time on the site,
and it's really great to get positive feedback on it. Regarding your
questions:

a) Facets in ES currently only support facet queries. It basically revolves
around filters that represent the facet query (cached at the index reader
level). It's the most straightforward solution that I wanted to implement to
get something out there (and I think the most flexible facet solution of
them all). The nice bit about caching is the fact that the index is
sharded, so memory is (potentially) not a problem since you can simply fire
up more nodes. I plan to add support for other types of facets (and make the
search process completely pluggable).
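
A minimal sketch of the idea in (a), in Python rather than ES code: each facet query is resolved to the set of doc ids it matches, that set is cached per reader, and a facet count is the size of its intersection with the search hits. The names FacetFilterCache, facet_counts and reader_id, and the toy documents, are assumptions made up for this illustration.

```python
from typing import Callable, Dict, Set, Tuple

Doc = Dict[str, str]


class FacetFilterCache:
    """Caches the doc id set of each facet query, keyed by (reader, facet name)."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[str, str], Set[int]] = {}
        self.misses = 0  # number of times a filter had to be computed

    def doc_ids(self, reader_id: str, docs: Dict[int, Doc],
                name: str, predicate: Callable[[Doc], bool]) -> Set[int]:
        key = (reader_id, name)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = {i for i, d in docs.items() if predicate(d)}
        return self._cache[key]


def facet_counts(cache, reader_id, docs, hits, facets):
    """Count, for each facet query, how many of the search hits it matches."""
    return {name: len(hits & cache.doc_ids(reader_id, docs, name, pred))
            for name, pred in facets.items()}


docs = {
    0: {"color": "red", "month": "jan"},
    1: {"color": "blue", "month": "jan"},
    2: {"color": "red", "month": "feb"},
}
facets = {
    "color:red": lambda d: d["color"] == "red",
    "month:jan": lambda d: d["month"] == "jan",
}
cache = FacetFilterCache()
first = facet_counts(cache, "reader-1", docs, {0, 1, 2}, facets)
second = facet_counts(cache, "reader-1", docs, {0, 1}, facets)  # served from cache
```

The second search reuses the cached sets, so only the intersection with the new hits is recomputed.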

b) The search is near real time with the current (only) implementation of
the Engine (
http://www.elasticsearch.com/docs/elasticsearch/index_modules/engine/robin/)
and it uses Lucene NRT. I got good results with a near real time factor of
1 second (you see changes at most 1 second after they were indexed), and
Lucene 3.1 should be even better. There are other ways to implement real
time. One of them, which I did in Compass ages ago, is to have an in-memory
index and a "more persistent" index, and do the ops on the in-memory one.
NRT might still be used to get the changes from the in-memory index, but
it's a smaller index, so the sync points there will potentially be shorter
(in terms of time spent). The nice thing about ES is that it has a
transaction log, so you don't need to commit in any case, which makes the
in-memory index an even better solution since you don't risk losing
operations. I have planned for such solution(s) upfront with the Engine
abstraction I have in ES.
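
A rough sketch of the in-memory plus "more persistent" index idea from (b), assuming a toy term search over plain dictionaries. It does not reflect how ES or Compass implement it; TwoTierIndex and its methods are hypothetical names for illustration only.

```python
from typing import Dict, List, Tuple


class TwoTierIndex:
    def __init__(self) -> None:
        self.translog: List[Tuple[str, str]] = []   # append-only (doc_id, text)
        self.memory: Dict[str, str] = {}            # small tier, searchable at once
        self.persistent: Dict[str, str] = {}        # large tier, updated on flush

    def index(self, doc_id: str, text: str) -> None:
        self.translog.append((doc_id, text))  # log first so the op can be replayed
        self.memory[doc_id] = text

    def search(self, term: str) -> List[str]:
        merged = {**self.persistent, **self.memory}  # memory overrides older copies
        return sorted(i for i, t in merged.items() if term in t.split())

    def flush(self) -> None:
        """Move the in-memory tier into the persistent one and trim the log."""
        self.persistent.update(self.memory)
        self.memory.clear()
        self.translog.clear()

    def recover(self) -> None:
        """Rebuild the in-memory tier after a crash by replaying the log."""
        self.memory = {}
        for doc_id, text in self.translog:
            self.memory[doc_id] = text


idx = TwoTierIndex()
idx.index("1", "hello world")
visible_before_flush = idx.search("hello")
idx.memory.clear()   # simulate losing the in-memory tier
idx.recover()
visible_after_recover = idx.search("hello")
```

Because every operation is logged before it is applied, losing the in-memory tier does not lose operations, which is the property the transaction log is credited with above.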

Hope this answers the question a bit. It's such a broad area...

-shay.banon

On Tue, Feb 9, 2010 at 1:22 AM, jake.mannix jake.mannix@gmail.com wrote:

On Mon, Feb 8, 2010 at 4:12 PM, Shay Banon shay.banon@elasticsearch.com wrote:

Hi Jake,

Thanks for the compliments! I invested quite a bit of time on the site,
and it's really great to get positive feedback on it. Regarding your
questions:

a) Facets in ES currently only support facet queries. It basically revolves
around filters that represent the facet query (cached at the index reader
level). It's the most straightforward solution that I wanted to implement to
get something out there (and I think the most flexible facet solution of
them all). The nice bit about caching is the fact that the index is
sharded, so memory is (potentially) not a problem since you can simply fire
up more nodes. I plan to add support for other types of facets (and make the
search process completely pluggable).

Filters keyed on indexreader, ok, fairly straightforward (although if you
want to do multi-select, this will get tricky: if the user selects
"color:red" AND "month:Jan", then you want to filter by both of them for the
search results, but also collect the number of hits on the other colors (as
long as month:Jan matches), and the number of hits on the other months (as
long as the color:red matches), etc...).

How do you expect the caching to work if you are indexing in real time,
though? The DocIdSetIterator is cached per IndexReader, not down at
the SegmentReader, right? Then when you do a
reopen/IndexWriter.getReader(), does this stay up to date?
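
To make the question concrete, here is a small sketch of what keying the cache per segment (rather than per top-level reader) would buy on a reopen. It is an illustration under assumptions, not a description of what ES does: segments are modelled as immutable lists, deletions are ignored, and PerSegmentFilterCache is a made-up name.

```python
from typing import Dict, List, Set, Tuple

Segment = Tuple[str, List[str]]  # (segment name, colour of each doc in the segment)


class PerSegmentFilterCache:
    def __init__(self) -> None:
        self._cache: Dict[Tuple[str, str], Set[int]] = {}
        self.computed: List[str] = []  # segments whose filter had to be built

    def matching(self, segment: Segment, color: str) -> Set[int]:
        name, colors = segment
        key = (name, color)
        if key not in self._cache:
            self.computed.append(name)
            self._cache[key] = {i for i, c in enumerate(colors) if c == color}
        return self._cache[key]

    def count(self, segments: List[Segment], color: str) -> int:
        return sum(len(self.matching(s, color)) for s in segments)


cache = PerSegmentFilterCache()
reader_v1 = [("seg0", ["red", "blue"]), ("seg1", ["red"])]
count_v1 = cache.count(reader_v1, "red")

# A reopen after more indexing adds a segment; seg0 and seg1 are unchanged.
reader_v2 = reader_v1 + [("seg2", ["red", "green"])]
count_v2 = cache.count(reader_v2, "red")
```

With a cache keyed on the top-level reader, the reopened reader would be a new key and every filter would be rebuilt; keyed per segment, only seg2 is computed after the reopen.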

b) The search is near real time with the current (only) implementation of
the Engine (
http://www.elasticsearch.com/docs/elasticsearch/index_modules/engine/robin/)
and it uses Lucene NRT. I got good results with a near real time factor of
1 second (you see changes at most 1 second after they were indexed), and
Lucene 3.1 should be even better. There are other ways to implement real
time. One of them, which I did in Compass ages ago, is to have an in-memory
index and a "more persistent" index, and do the ops on the in-memory one.
NRT might still be used to get the changes from the in-memory index, but
it's a smaller index, so the sync points there will potentially be shorter
(in terms of time spent). The nice thing about ES is that it has a
transaction log, so you don't need to commit in any case, which makes the
in-memory index an even better solution since you don't risk losing
operations. I have planned for such solution(s) upfront with the Engine
abstraction I have in ES.

Ok, NRT with 1 second turnaround is pretty good, esp since it's distributed
(have you done much performance analysis under load?). If you want to do
the partial RAMDir / FSDir thing, you should check out zoie, it's also
apache licensed, and takes care of all of that stuff as its core focus
(including optimized segment mergers for the realtime case, and a
docid<->uid mapping), and works best in cases where there is a transaction
log. The indexing paradigm is one of StreamDataProvider / DataConsumer -
you hook in your data provider (fed by, e.g., your txlog), and zoie provides a
DataConsumer which indexes in real time, exposing an IndexReaderFactory
whose getReaders() gives you a handle on a List of readers which
is real-time up to the couple-of-milliseconds level. Should be pretty easy
to plug in if you wanted to use it.
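
The provider/consumer paradigm described above can be sketched roughly as follows. These are not zoie's actual interfaces: LogProvider, InMemoryConsumer, pump and the event tuple layout are hypothetical stand-ins, and the "index" is just a dictionary.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str, str]  # (version, doc id, text)


class InMemoryConsumer:
    """Consumes batches of events and makes them searchable straight away."""

    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}
        self.version = 0  # highest event version applied so far

    def consume(self, batch: Iterable[Event]) -> None:
        for version, doc_id, text in batch:
            self.docs[doc_id] = text
            self.version = max(self.version, version)

    def search(self, term: str) -> List[str]:
        return sorted(i for i, t in self.docs.items() if term in t.split())


class LogProvider:
    """Pulls events from a transaction log and pushes them to a consumer."""

    def __init__(self, log: List[Event], consumer: InMemoryConsumer,
                 batch_size: int = 2) -> None:
        self.log, self.consumer, self.batch_size = log, consumer, batch_size
        self.position = 0

    def pump(self) -> int:
        """Deliver the next batch; returns how many events were delivered."""
        batch = self.log[self.position:self.position + self.batch_size]
        self.consumer.consume(batch)
        self.position += len(batch)
        return len(batch)


txlog = [(1, "a", "red shoes"), (2, "b", "blue shoes"), (3, "a", "green shoes")]
consumer = InMemoryConsumer()
provider = LogProvider(txlog, consumer)
while provider.pump():
    pass
```

The consumer tracks the highest version it has applied, so a reader can tell how far behind the log it is.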

Hope this answers the question a bit. It's such a broad area...

It is, I'm interested to try out ES and see how you got it working! Very
cool stuff!

-jake

I see what you mean now. In simple facet usage, each time a facet is
clicked, the facet is added as a filter to the query (later on becoming a
boolean filter). But in this case, the results that you get will always be
narrowed down to the query with the filters, so getting facet counts on just
the query (stuff) and not the filter (color:red) is not possible.

So, in order to do that, you will need, for the facets that go outside of the
faceted filtering, to execute another count search for the facets you want with
the original query, which results in unnecessary calls.

This is a nice scenario, and can be solved quite easily actually by adding
to the facet query the ability to override which query it facets on (so some
facets will run on the "master" query, which is "stuff", and others will run
on the filtered query). This solution is heavily based on the fact that
filters are easily cached, so you have the docidsets in memory already.

I can have a look at bobo browse to see what you are doing; I wouldn't mind
trying to use its facet support instead of reimplementing it myself. There are
some important ground features that I don't want to lose with facets, and
the most important one is to be able to define them dynamically (i.e. per
request there can be different facets) and not define them upfront.

Cheers,
Shay

On Tue, Feb 9, 2010 at 11:28 PM, Jake Mannix jake.mannix@gmail.com wrote:

On Tue, Feb 9, 2010 at 12:47 PM, Shay Banon shay.banon@elasticsearch.com wrote:

Filters keyed on indexreader, ok, fairly straightforward (although if you
want to do multi-select, this will get tricky: if the user selects
"color:red" AND "month:Jan", then you want to filter by both of them for the
search results, but also collect the number of hits on the other colors (as
long as month:Jan matches), and the number of hits on the other months (as
long as the color:red matches), etc...).

Not sure I understand, you can wrap a query with a filter, and then use
that. You will get the count (restricted to the query you ran) of "color:red
AND month:Jan". Unless you mean that you want to get counts for color:red
and also counts for month:Jan, in this case you simply have two facet
queries.

Here's what I mean: if you are displaying facet information for both color
and month, you can let people select from both, so that the results returned
are filtered, as you say, by "color:red AND month:Jan", that is great. But
let's look at what pieces of info the user should have: At first, they have
added no facet filters to query "stuff", and we return all matches for
"stuff", the total count("stuff") as well as some facet data:

{color :
    {red : count("stuff AND color:red") },
    {blue : count("stuff AND color:blue") },
    {green : count("stuff AND color:green") }
},
{month :
    {jan : count("stuff AND month:jan") },
    {feb : count("stuff AND month:feb") },
    {mar : count("stuff AND month:mar") }
}

Now they click on color:red, and we return all the matches for "stuff AND
color:red", along with count("stuff AND color:red"), and facet data:

{color :
    {red : count("stuff AND color:red") /* this link won't be clickable
        because we're here already */},
    {blue : count("stuff AND color:blue") /* this link _is_ clickable, and
        applies the filter "color:blue OR color:red" */},
    {green : count("stuff AND color:green") /* as with color:blue above */}
},
{month :
    {jan : count("stuff AND color:red AND month:jan") },
    {feb : count("stuff AND color:red AND month:feb") },
    {mar : count("stuff AND color:red AND month:mar") }
}

The counts for color without red being applied should be returned because
we may want to allow users to be able to select a couple of facet values
OR'ed together (within a field - filters across fields are AND'ed, as
usual).

Now comes the tricky part: the user clicks on "month:jan", and we return
results filtered by "stuff AND color:red AND month:jan", along with
count("stuff AND color:red AND month:jan"), and facet data:

{color :
    {red : count("stuff AND color:red AND month:jan") /* this link won't be
        clickable because we're here already */},
    {blue : count("stuff AND color:blue AND month:jan") /* _is_ clickable, and
        switches the filter to "month:jan AND (color:blue OR color:red)" */},
    {green : count("stuff AND color:green AND month:jan") /* as with
        color:blue above */}
},
{month :
    {jan : count("stuff AND color:red AND month:jan") /* no longer clickable,
        we're here already */},
    {feb : count("stuff AND color:red AND month:feb") /* is clickable, and
        switches the filter to "(month:jan OR month:feb) AND color:red" */},
    {mar : count("stuff AND color:red AND month:mar") /* similar to month:feb
        above */}
}

This is what the user expects from faceted search in the UI. I'm
pretty sure that the way Solr computes this is as you say, by executing
multiple facet queries, but that is horribly inefficient (esp. as the number
of fields to facet on grows). It's much nicer if you can return all of
these counts in one request; it just requires some work to do it
efficiently (this is what we do in bobo-browse).
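
The counts in the walkthrough above follow one rule: the counts for a field are taken over the docs that match the query and the selections on every other field, ignoring the selection on the field itself. Here is a small reference sketch of that rule in Python. It is a plain specification, not how bobo-browse or Solr compute it, and multi_select_counts and the sample docs are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Set

Doc = Dict[str, str]


def multi_select_counts(docs: List[Doc], selected: Dict[str, Set[str]],
                        facet_fields: List[str]) -> Dict[str, Dict[str, int]]:
    """For each field, count values over docs passing the selections on all OTHER fields."""
    counts: Dict[str, Counter] = {f: Counter() for f in facet_fields}
    for doc in docs:
        for field in facet_fields:
            others_ok = all(doc[f] in values
                            for f, values in selected.items() if f != field)
            if others_ok:
                counts[field][doc[field]] += 1
    return {f: dict(c) for f, c in counts.items()}


# Docs that already match the query "stuff".
stuff = [
    {"color": "red", "month": "jan"},
    {"color": "red", "month": "feb"},
    {"color": "blue", "month": "jan"},
    {"color": "green", "month": "mar"},
]
after_red = multi_select_counts(stuff, {"color": {"red"}}, ["color", "month"])
after_red_jan = multi_select_counts(
    stuff, {"color": {"red"}, "month": {"jan"}}, ["color", "month"])
```

After selecting color:red, the color counts still cover all of "stuff" while the month counts are restricted to red docs; after also selecting month:jan, the color counts become restricted to January docs.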

-jake

On Wed, Feb 10, 2010 at 3:00 AM, Shay Banon shay.banon@elasticsearch.com wrote:

I see what you mean now. In simple facet usage, each time a facet is
clicked, the facet is added as a filter to the query (later on becoming a
boolean filter). But in this case, the results that you get will always be
narrowed down to the query with the filters, so getting facet counts on just
the query (stuff) and not the filter (color:red) is not possible.

So, in order to do that, you will need, for the facets that go outside of the
faceted filtering, to execute another count search for the facets you want with
the original query, which results in unnecessary calls.

Exactly. I'm pretty sure this is what Solr does too, and it's not scalable
to large numbers of facets.

This is a nice scenario, and can be solved quite easily actually by adding
to the facet query the ability to override which query it facets on (so some
facets will run on the "master" query, which is "stuff", and others will run
on the filtered query). This solution is heavily based on the fact that
filters are easily cached, so you have the docidsets in memory already.

Even having the docIdSets in memory, it's tricky to be able to do all the
counting you need in one traversal of the master query's hit list (well, you
don't need to traverse the whole thing if you have facets selected on two
or more fields, but still). It's not rocket science, but yeah, you need to
keep track of the largest set of docIds which could contribute to a count as
you walk. If you've got color:red and date:feb both selected, then you need
to walk the docs which match (stuff AND (color:red OR date:feb)), and on
each doc, determine whether it matches (color:red AND date:feb), so it's an
actual hit to be collected, or whether it only matches one of them. In the
latter case, if it matches color:red but date:jan instead of date:feb, you
must not do a real "collect()", but you do want to increment the counter for
date:jan.
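
The single-traversal bookkeeping described above can be sketched like this, assuming the docs passed in already match the master query. A doc that satisfies every selection is collected and counts towards every field; a doc that misses on exactly one selected field counts only towards that field; anything else contributes nothing. This is an illustrative Python sketch, not bobo-browse's implementation, and collect_with_facets is a made-up name.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Doc = Dict[str, str]


def collect_with_facets(docs: List[Doc], selected: Dict[str, Set[str]],
                        facet_fields: List[str]) -> Tuple[List[int], Dict[str, Counter]]:
    hits: List[int] = []
    counts = {f: Counter() for f in facet_fields}
    for doc_id, doc in enumerate(docs):
        missed = [f for f, values in selected.items() if doc[f] not in values]
        if not missed:
            hits.append(doc_id)                 # a real hit: collect it
            for f in facet_fields:
                counts[f][doc[f]] += 1
        elif len(missed) == 1:
            f = missed[0]                       # near miss on exactly one field
            if f in counts:
                counts[f][doc[f]] += 1          # counts only towards that field
        # two or more misses: the doc cannot contribute to any count
    return hits, counts


# Docs that already match the query "stuff".
stuff = [
    {"color": "red", "date": "feb"},
    {"color": "red", "date": "jan"},
    {"color": "blue", "date": "feb"},
    {"color": "green", "date": "mar"},
]
hits, counts = collect_with_facets(
    stuff, {"color": {"red"}, "date": {"feb"}}, ["color", "date"])
```

With two selected fields, the docs that can contribute are exactly those matching at least one selection, which is the (color:red OR date:feb) set mentioned above.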

I can have a look at bobo browse to see what you are doing; I wouldn't mind
trying to use its facet support instead of reimplementing it myself. There are
some important ground features that I don't want to lose with facets, and
the most important one is to be able to define them dynamically (i.e. per
request there can be different facets) and not define them upfront.

Dynamic facets in bobo-browse are built out of what we call a
RuntimeFacetHandler, which can be built on top of other FacetHandlers. A
dynamic facet could, for example, be faceting based on intersection with a
generic query (QueryWrapperFilter). It won't be as efficient as a static
facet field, because it would need to set itself up at query time (caching
would help, of course) instead of at IndexReader load time. That is what
static facets do: a full forward lookup of the facet values for the static
fields is completely loaded in the background at load time, so that
everything is available in memory at query time.
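
To illustrate the trade-off between the two kinds of facet, here is a small Python sketch. StaticFacet and RuntimeFacet are hypothetical classes written for this example and are not bobo-browse's FacetHandler API; the static one pays its cost once when the docs are loaded, the runtime one evaluates its predicate on every request.

```python
from typing import Callable, Dict, List

Doc = Dict[str, str]


class StaticFacet:
    """Builds a doc id -> value ordinal array once, when the reader is loaded."""

    def __init__(self, docs: List[Doc], field: str) -> None:
        self.values = sorted({d[field] for d in docs})
        ordinal = {v: i for i, v in enumerate(self.values)}
        self.forward = [ordinal[d[field]] for d in docs]  # forward lookup

    def count(self, hits: List[int]) -> Dict[str, int]:
        tally = [0] * len(self.values)
        for doc_id in hits:
            tally[self.forward[doc_id]] += 1   # one array read per hit
        return {v: n for v, n in zip(self.values, tally) if n}


class RuntimeFacet:
    """Defined per request; matching docs are worked out at query time."""

    def __init__(self, name: str, matches: Callable[[Doc], bool]) -> None:
        self.name, self.matches = name, matches

    def count(self, docs: List[Doc], hits: List[int]) -> Dict[str, int]:
        return {self.name: sum(1 for i in hits if self.matches(docs[i]))}


docs = [
    {"color": "red", "price": "5"},
    {"color": "blue", "price": "50"},
    {"color": "red", "price": "500"},
]
color_facet = StaticFacet(docs, "color")            # done once at load time
cheap = RuntimeFacet("price<100", lambda d: int(d["price"]) < 100)
hits = [0, 1, 2]
static_counts = color_facet.count(hits)
runtime_counts = cheap.count(docs, hits)
```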

Is that the kind of feature you wanted to make sure was there, or is it
something else you were referring to?

-jake

On Thu, Feb 11, 2010 at 12:43 AM, Jake Mannix jake.mannix@gmail.com wrote:

Is that the kind of feature you wanted to make sure was there, or is it
something else you were referring to?

Yep, that's what I was talking about. How embeddable is Bobo browse (and is
there a chance to get Spring out of it :) )?


On Wed, Feb 10, 2010 at 2:47 PM, Shay Banon shay.banon@elasticsearch.com wrote:

Is that the kind of feature you wanted to make sure was there, or is it
something else you were referring to?

Yep, that's what I was talking about. How embeddable is Bobo browse (and is
there a chance to get Spring out of it :) )?

Excellent. Bobo is easily embeddable - it's what it's for! Spring is a
completely optional dependency, you can instantiate your
FacetHandlerFactories directly in code, or contribute a patch which gets
Guice in there (we'd love to give our users that option too!). Spring was
just for convenience, and because many of us use it (and it was all there
was 3 years ago!).

I guess I could be wrong: Spring might be required at build time, but you
don't need to use it... we should fix that, because it's not integral in any
way.

-jake

On Thu, Feb 11, 2010 at 12:53 AM, Jake Mannix jake.mannix@gmail.com wrote:

I guess I could be wrong: Spring might be required at build time, but you
don't need to use it... we should fix that, because it's not integral in any
way.

Cool. I will have a look at it and see how it goes once I get some other
major features that I want to add to 0.5.0 out of the way :)
