Extracting fields in bulk - using ES as a data store

Hi guys,

We're experimenting with ES as the primary data store for a new project.
During our trials we have seen blazing performance which, of course, is
great news.

Our main challenge is finding an efficient way of extracting the ID field
from potentially millions of records and processing those further down the
pipeline. The extraction happens in a user-facing flow where perceived
performance is important, so we can't push it off to an asynchronous job
iterating the full ES hits list. This of course becomes a huge issue when
dealing with 1,000,000 records.

It would be interesting to hear from the community how issues like this one
have been handled previously. We have looked into multi GET, scroll searches
and similar, but none seem to offer a fast-enough experience. We're open to
thoughts on custom plugins or any other alternatives and ideas you may have.
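
For reference, this is roughly the scroll-based extraction we benchmarked
(a minimal sketch against the Java client API; the "products" index name
and the process() helper are placeholders):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

// Scan/scroll over every matching document, keeping only the _id of each hit.
public void extractIds(Client client) {
    SearchResponse resp = client.prepareSearch("products")
            .setSearchType(SearchType.SCAN)        // iterate without scoring
            .setScroll(new TimeValue(60000))
            .setQuery(QueryBuilders.matchAllQuery())
            .setNoFields()                         // skip _source, we only need _id
            .setSize(500)                          // hits per shard per round trip
            .execute().actionGet();

    while (true) {
        resp = client.prepareSearchScroll(resp.getScrollId())
                .setScroll(new TimeValue(60000))
                .execute().actionGet();
        if (resp.getHits().getHits().length == 0) {
            break;                                 // scroll exhausted
        }
        for (SearchHit hit : resp.getHits()) {
            process(hit.getId());                  // hypothetical downstream step
        }
    }
}

Even with a large page size, the per-batch round trips add up, which is
where our delay comes from.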

Thanks.

--

Can you please elaborate on "processing the ID field from potentially
millions of records" and "fast-enough experience"? What is your use case
like?

I am afraid I do not fully understand because I know of no other method
than retrieving the documents by get, multi get, scrolling over a result
set after a scan query, or a simple query.

Best regards,

Jörg

--

Of course, sorry. I should have described our setup in more detail from the
start.

We are basically building a traditional booking system, backed by ES. We
query it and rely on facets to allow the user to select a bunch of
"products" (one product == one document) anywhere in the range from 1 to
1,000,000 or so.

Once the booking has been finalized, we need to ensure the ordered products
are unavailable to other users on the date specified in the order. To avoid
writing availability data back into the index, we keep it in a Redis store
and leverage a custom (native) scorer which dynamically sets a score
reflecting each product's availability on the queried date. Using min_score
we can then filter out the unavailable documents. So we basically keep
availability in Redis and filter ES's hits based on that data.
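
On the query side it looks roughly like this (a sketch; the index name, the
script name and the date parameter are stand-ins for our actual ones):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// Wrap the user's query in our native availability scorer; min_score
// then drops products the script scores as unavailable (below 1.0).
public SearchResponse searchAvailable(Client client, QueryBuilder userQuery,
                                      String date) {
    return client.prepareSearch("products")
            .setQuery(QueryBuilders.customScoreQuery(userQuery)
                    .script("availability_scorer") // native script from our plugin
                    .lang("native")
                    .param("date", date))          // the queried booking date
            .setMinScore(1.0f)
            .execute().actionGet();
}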

To speed things up, our ES plugin keeps an in-memory bitmap/bitset
containing the availability data. That way we don't have to hit Redis for
every document while scoring.
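
Conceptually the cache is nothing more than this (a simplified sketch; the
real plugin maps product IDs to bit positions and keeps everything in sync
with Redis):

import java.util.BitSet;
import java.util.concurrent.ConcurrentHashMap;

// One BitSet per booking date; a set bit means "already booked".
// Lets the scorer answer availability without a Redis call per document.
public class AvailabilityCache {

    private final ConcurrentHashMap<String, BitSet> bookedByDate =
            new ConcurrentHashMap<String, BitSet>();

    public boolean isAvailable(String date, int productBit) {
        BitSet booked = bookedByDate.get(date);
        return booked == null || !booked.get(productBit);
    }

    // Called after Redis has been updated for a finalized order.
    // (Synchronization is kept naive here for the sake of the sketch.)
    public synchronized void markBooked(String date, int productBit) {
        BitSet booked = bookedByDate.get(date);
        if (booked == null) {
            booked = new BitSet();
            bookedByDate.put(date, booked);
        }
        booked.set(productBit);
    }
}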

Our challenge is to write the purchased document IDs into Redis as quickly
as possible upon order finalization. For that to happen, we need to iterate
the hits and gradually add them to the appropriate list in Redis. That's
currently our one bottleneck, and it causes a big delay.
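
For illustration, batching those writes with Jedis pipelining looks roughly
like this (the key layout and the use of a set are stand-ins for our actual
structure):

import java.util.List;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

// Push all purchased IDs for one date in a single pipelined batch
// instead of paying one network round trip per ID.
public void storeBookedIds(Jedis jedis, String date, List<String> ids) {
    Pipeline pipe = jedis.pipelined();
    for (String id : ids) {
        pipe.sadd("booked:" + date, id);   // hypothetical key layout
    }
    pipe.sync();                           // one flush for the whole batch
}

This helps on the Redis side, but we still have to pull the millions of IDs
out of ES first, which is the part we can't make fast enough.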

I am not necessarily looking for "the one true solution" but would love to
hear more about how you have solved challenges like this one before, if
ever. I would really appreciate your thoughts and ideas regarding
availability data on fairly large data sets.

Thanks again.

--

You are probably already aware of this older thread:

https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/zaadE1FVOKQ

So, Elasticsearch versioning could serve as an instrument for atomic
updates. But the optimistic locking makes two-phase commits, which are
typical for a booking system, hard.
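
For a single document, versioning gives you a compare-and-set style update,
something like this sketch (the index/type names and the markUnavailable()
helper are placeholders):

import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.engine.VersionConflictEngineException;

// Optimistic locking: the index request fails if another writer bumped
// the document's version since we read it, so we re-read and retry.
public void markUnavailableWithRetry(Client client, String id) {
    while (true) {
        GetResponse current = client.prepareGet("products", "product", id)
                .execute().actionGet();
        try {
            client.prepareIndex("products", "product", id)
                    .setSource(markUnavailable(current.getSourceAsString()))
                    .setVersion(current.getVersion()) // fail if doc changed meanwhile
                    .execute().actionGet();
            return;                                   // our write won the race
        } catch (VersionConflictEngineException e) {
            // somebody else updated the document first; loop and retry
        }
    }
}

But coordinating such an update across Elasticsearch and Redis at the same
time is exactly where the two-phase commit pain starts.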

I feel the usual method is to update Redis first (within the transactional
environment) and then hand the data over to Elasticsearch for search. So
I'm a little bit confused, because it seems data is updated in
Elasticsearch first.

To me, it is not easy to understand the extra plugin with an in-memory
bitset alongside Redis and Elasticsearch, since both already behave
similarly to in-memory bitsets. Are you simulating DocValues? Note that in
Lucene 4 there are DocValues for fast in-place value updates.

http://searchhub.org/2011/05/31/simon-willnauer-column-stride-fields-or-docvalues-and-improving-on-fieldcache/

This means that for certain use cases Elasticsearch may compete better with
key/value stores. The price to pay is a new type of un-inverted document
which may be harder to organize for filter and facet caching. Let's hope
DocValues are exposed in Elasticsearch soon: Lucene 4.1 with improved
DocValues is knocking at the door, and Elasticsearch 0.21 is not out yet.

https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/TTgV1_nJiA0

Jörg

--