We primarily use Elasticsearch as a search index, but we now have an internal use case where we use it as a key-value store: we fetch large numbers of documents at once by the `_id` field through the multi get API (`_mget`). However, we have doubts about whether ES is suitable for this, given how Lucene stores data internally. The fear is that throwing 1K keys at it in each bulk request just produces a lot of random lookups under the hood. Somebody has probably asked something similar before, but I could not find a clear-cut answer, so I would really appreciate it if someone could clarify this for me. Thanks.
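For context, a batch lookup like the one described is just an `_mget` request whose body lists the ids. A minimal sketch of building such a request body in Python (the index name `my-index` and the `doc-N` ids are placeholders, not from the original post):

```python
import json

def build_mget_body(ids):
    """Build the JSON body for Elasticsearch's multi get API (_mget).

    The simple form lists only document ids; ES resolves each one
    against the index named in the request URL.
    """
    return {"ids": list(ids)}

# A batch of 1K synthetic ids, standing in for the real keys.
batch = [f"doc-{i}" for i in range(1000)]
body = build_mget_body(batch)

# This payload would be sent as e.g.:  GET /my-index/_mget
payload = json.dumps(body)
```

Each id in the batch becomes one entry in the response's `docs` array, found or not.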
We do a ton of primary-key lookups during indexing ourselves. I think 1K keys should work just fine, though that's a gut feeling rather than a measured result.
We've been experimenting with the same idea. We want to process roughly 10M records an hour. Records are not unique on the fields we're interested in, so the workflow is: batch records, hit ES with `_mget`, build a CDC-like (change data capture) structure, and index only new records or records whose fields changed. The batch size is 5K. As the index grows, `_mget` performance degrades seriously; it seems roughly linear in the size of the index.
Index at ~3M records and ~1GB: 5K-id `_mget` takes ~0.2s
Index at ~130M records and ~40GB: 5K-id `_mget` takes ~10s
P.S. This is a 'default' setup: a single EC2 instance with an attached EBS SSD volume, and one index with 5 shards.
We've just started with ES, so any resource or idea is appreciated.
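The CDC-style filtering step described above can be sketched as pure logic: given what's already indexed (e.g. built from an `_mget` response's `docs` array) and an incoming batch, keep only new or changed records. The field names and helper below are illustrative assumptions, not from the original post:

```python
def changed_records(batch, existing):
    """Return only records that are new, or whose fields differ
    from what is already indexed.

    batch    -- list of dicts, each with an 'id' key plus data fields
    existing -- dict mapping id -> previously indexed fields
    """
    out = []
    for rec in batch:
        prev = existing.get(rec["id"])
        if prev is None or any(
            prev.get(k) != v for k, v in rec.items() if k != "id"
        ):
            out.append(rec)
    return out

# Toy data: 'a' is unchanged, 'b' changed, 'c' is new.
old = {"a": {"price": 10}, "b": {"price": 20}}
new_batch = [
    {"id": "a", "price": 10},
    {"id": "b", "price": 25},
    {"id": "c", "price": 30},
]
to_index = changed_records(new_batch, old)  # only 'b' and 'c' survive
```

Only the surviving records would then go into a bulk index request, which is what keeps the write volume down in this workflow.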
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.