I am very often confronted with the following pattern when doing batch
indexing in update mode. Let me illustrate it with a short example:
Let's pretend I am indexing social bookmarking data: a very crude document
would be something like:
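To make the rest of the question concrete, here is a made-up example of the shape I have in mind (the field names are purely illustrative):

    doc = {
        "url": "http://example.org/some/page",
        "usertag": ["elasticsearch", "search", "nosql"],
    }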
My indexer processes a very large list of user bookmarks and batch
updates/upserts the corresponding documents in Elasticsearch. My problem is
that if I simply use concatenation in the update script, I may end up with
lots of duplicate values in my usertag array, as many users potentially apply
the same tag over and over again to a given url. Instead I would like set
semantics on the array values.
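This is roughly the kind of scripted upsert I mean (a minimal sketch with the Python client and a Groovy-style inline script; index, type and parameter names are just assumptions). The naive "+=" concatenation is what piles up the duplicates:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    def upsert_bookmark(url, tags):
        # Naive concatenation: duplicate tags accumulate in the array over time.
        es.update(
            index="bookmarks",
            doc_type="bookmark",
            id=url,
            body={
                "script": "ctx._source.usertag += new_tags",
                "params": {"new_tags": tags},
                "upsert": {"url": url, "usertag": tags},
            },
        )

What I would like instead is set semantics on usertag, e.g. a script that only adds a tag when ctx._source.usertag does not already contain it.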
Currently I have this pattern in a bunch of use cases, and I generally handle
it within the batch program by deduplicating the values myself, using a
BerkeleyDB to keep as much data as I can in memory. However the performance
cost becomes prohibitive when I have to perform set logic over millions of
records: below 5M records I manage to keep the cost acceptable, but past that
point insertion time into my BDB becomes unacceptable.
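For reference, the client-side deduplication looks roughly like this (only a sketch: the stdlib shelve module stands in for the BerkeleyDB store, and the (url, tag) pair format is an assumption):

    import shelve

    def build_tag_sets(bookmark_pairs, store_path="seen_tags.db"):
        # bookmark_pairs: iterable of (url, tag) pairs.
        # Builds url -> set of tags on disk; every insert rewrites the whole
        # value for that url, which is where the cost explodes past ~5M records.
        with shelve.open(store_path) as store:
            for url, tag in bookmark_pairs:
                tags = store.get(url, set())
                if tag not in tags:
                    tags.add(tag)
                    store[url] = tags

The deduplicated arrays are then read back from the store and sent to Elasticsearch in bulk.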
Another way would be not to deduplicate at all and to use a terms facet at
query time to obtain the deduplicated values, but the index size would
potentially grow out of control.
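For what it's worth, that query-time alternative would look something like this (again a sketch, reusing the es client from above; field and facet names are illustrative), with the distinct tags coming out of the facet entries rather than the stored array:

    res = es.search(
        index="bookmarks",
        body={
            "query": {"term": {"url": "http://example.org/some/page"}},
            "facets": {"tags": {"terms": {"field": "usertag", "size": 1000}}},
        },
    )
    distinct_tags = [entry["term"] for entry in res["facets"]["tags"]["terms"]]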
Lastly, one could put in place a post-processing batch to deduplicate the
values, but that amounts to reindexing everything. Using batch treatment and
parallel execution this could probably scale pretty well, but it would remain
a heavy operation to run regularly.
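Sketched with scan/scroll and the bulk helper, that post-processing pass would look something like this (only a sketch, assuming the bookmarks index from the example above):

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan, bulk

    es = Elasticsearch()

    def dedup_actions():
        # Scroll over every document, deduplicate the array client-side and
        # reindex the cleaned document in place.
        for hit in scan(es, index="bookmarks", query={"query": {"match_all": {}}}):
            source = hit["_source"]
            source["usertag"] = sorted(set(source.get("usertag", [])))
            yield {"_index": hit["_index"], "_type": hit["_type"],
                   "_id": hit["_id"], "_source": source}

    bulk(es, dedup_actions())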
This is probably a very common pattern, so I'd very much appreciate some
pointers on how other Elasticsearch users have dealt with it.