Hi Adam and Derry,
I'm still getting up to speed on ES, but I have quite a bit of Lucene
experience, and I'm wondering whether the ES sort parameter will meet your
needs. In Lucene, a sort specification pulls only the specified fields into
the priority queue (the result set in its accumulating form). 3.x appears
to have added the option to keep the score available when sorting this way.
But you are still stuck with a pretty heavyweight object flowing through
the system.
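To make the priority-queue point concrete, here is a rough sketch in plain
Java. It is not Lucene's collector code; the Hit class, the createdAt field
and the topN helper are all made up for illustration. The idea is just that
only the doc id and the sort value travel through a bounded queue, and the
full document is not touched until later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopNBySortField {
    /** Hypothetical slim hit: just a doc id plus the single sort value. */
    static final class Hit {
        final int docId;
        final long createdAt;

        Hit(int docId, long createdAt) {
            this.docId = docId;
            this.createdAt = createdAt;
        }
    }

    /** Returns the n hits with the smallest createdAt, ascending. Requires n >= 1. */
    static List<Hit> topN(Iterable<Hit> hits, int n) {
        Comparator<Hit> byCreatedAt = Comparator.comparingLong((Hit h) -> h.createdAt);
        // Max-heap on createdAt: the head is the worst hit currently retained.
        PriorityQueue<Hit> queue = new PriorityQueue<>(n, byCreatedAt.reversed());
        for (Hit h : hits) {
            if (queue.size() < n) {
                queue.add(h);
            } else if (h.createdAt < queue.peek().createdAt) {
                queue.poll();   // evict the current worst
                queue.add(h);
            }
        }
        List<Hit> result = new ArrayList<>(queue);
        result.sort(byCreatedAt);
        return result;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit(1, 50L), new Hit(2, 10L), new Hit(3, 40L), new Hit(4, 20L));
        for (Hit h : topN(hits, 2)) {
            System.out.println(h.docId + " " + h.createdAt);
        }
    }
}
```

The queue never holds more than n entries, so memory depends on the page
size and the width of the sort key rather than on the size of the documents.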
Many of the shops I've worked at in fact keep the canonical version of the
document source in a CMS, precisely so they can reindex or retrieve the
full record as fast as possible based on a docid. Obviously this requires a
field in the indexed view that uniquely identifies the CMS version. It's
all kind of convoluted, but not inconsistent with the traditional notion of
a search engine as an index more than a document store.
If you can afford to use MMapDirectory for the index on each shard, the
cost of at least creating that full sort field result object might be
acceptable. And of course, it's not really necessary to use a CMS: if you
use the filesystem and explicitly mmap the collections of original
documents, you might get better performance than your current "fetch"
operation, which instantiates the document from Lucene segments.
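As a rough illustration of the mmap idea, here is a minimal sketch using
only the JDK. The MmapDocStore class, the single concatenated file and the
in-memory docid-to-offset map are my own assumptions for the example, not
anything ES or Lucene provides, and a single mapping like this is limited
to files under 2 GB.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class MmapDocStore {
    private final MappedByteBuffer buffer;
    private final Map<String, long[]> index; // docid -> {offset, length}

    MmapDocStore(Path file, Map<String, long[]> index) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            this.buffer = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
        this.index = index;
    }

    /** Writes the documents back to back and records where each one starts. */
    static MmapDocStore build(Path file, Map<String, String> docs) throws IOException {
        Map<String, long[]> index = new HashMap<>();
        long offset = 0;
        try (OutputStream out = Files.newOutputStream(file)) {
            for (Map.Entry<String, String> e : docs.entrySet()) {
                byte[] bytes = e.getValue().getBytes(StandardCharsets.UTF_8);
                out.write(bytes);
                index.put(e.getKey(), new long[] {offset, bytes.length});
                offset += bytes.length;
            }
        }
        return new MmapDocStore(file, index);
    }

    /** Returns the original document for a docid, or null if it is unknown. */
    String fetch(String docId) {
        long[] loc = index.get(docId);
        if (loc == null) {
            return null;
        }
        byte[] bytes = new byte[(int) loc[1]];
        ByteBuffer view = buffer.duplicate(); // independent position per call
        view.position((int) loc[0]);
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("docs", ".bin");
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc-1", "{\"title\":\"first\"}");
        docs.put("doc-2", "{\"title\":\"second\"}");
        MmapDocStore store = build(file, docs);
        System.out.println(store.fetch("doc-2"));
    }
}
```

The search side would then only need to return the unique id field, and the
record itself comes straight out of the page cache.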
I wonder if you can use the "logical" (as I call it) field _source to get
better performance than requesting the document itself, though they may
just be aliases for each other.
On Tuesday, September 25, 2012 3:59:16 AM UTC-4, Derry O' Sullivan wrote:
Hi Adam,
We use Scan/scroll to run through a number of indexes and pull back all
the values for some client side processing (not the same volume as you
though!). We do it via the Java API with code similar to:
Elasticsearch Platform — Find real-time answers at scale | Elastic
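For anyone wondering how scan/scroll hands back unique batches without a
sort: the scroll keeps a cursor on the server side, and each follow-up
request continues from where the previous one stopped until a request
returns no hits. Below is a toy model of that loop in plain Java. It does
not use the ES client at all; the Scroll class and drain helper are invented
for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ScrollCursorModel {
    /** Toy stand-in for a scroll context: it remembers its position between calls. */
    static final class Scroll<T> {
        private final Iterator<T> cursor;
        private final int batchSize;

        Scroll(List<T> docs, int batchSize) {
            // Copy the list so later changes do not affect this scroll's view.
            this.cursor = new ArrayList<>(docs).iterator();
            this.batchSize = batchSize;
        }

        /** Returns the next batch; an empty list means the scroll is exhausted. */
        List<T> next() {
            List<T> batch = new ArrayList<>();
            while (batch.size() < batchSize && cursor.hasNext()) {
                batch.add(cursor.next());
            }
            return batch;
        }
    }

    /** Client-side loop: keep asking for the next batch until none comes back. */
    static <T> List<List<T>> drain(Scroll<T> scroll) {
        List<List<T>> batches = new ArrayList<>();
        for (List<T> batch = scroll.next(); !batch.isEmpty(); batch = scroll.next()) {
            batches.add(batch);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> docs = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        for (List<Integer> batch : drain(new Scroll<>(docs, 3))) {
            System.out.println(batch);
        }
    }
}
```

Because the position lives in the cursor, no batch overlaps the previous
one, and no from/size offset has to be recomputed for each page.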
I noticed that you are able to add sort/searchType criteria in the call
e.g.:
client.prepareSearch().setIndices()
.setQuery(matchAllQuery()).addSort("created_at",SortOrder.ASC).setSize().execute().actionGet();
Maybe worth testing that out? (If you have a date range, you could change
the query to return only the required rows, which may be a much smaller set
than the full 100m records.)
Also worth noting that sorting pushes the data into memory (bottom of
page):
Elasticsearch Platform — Find real-time answers at scale | Elastic
On Monday, 24 September 2012 03:43:31 UTC+1, Adam Estrada wrote:
I have read a lot of posts in this group about the best method of getting
data back out of ES, but no one method seems to be definitive. Scan seems
to be the fastest way, but I have not tried it. I have been using
query_then_fetch to grab data out of my index. I am grabbing 1000 records
at a time, but oh man is it slow. My index has 100 million records in it
and I need to write them to a file based on a date range query.
At a high level, it looks like this:
http://.../_search?sort=created_at:asc&search_type=query_then_fetch&from=0&size=1000
I have the records returning in ascending order, which is what makes the
paging work, right? The results are then written to a flat text file that I
use in another process. I would like to try scan, but according to the
documentation it doesn't sort, so I am confused about how it knows how to
grab unique data in each batch of 1000.
Thoughts?
Adam