Implementing set logic on an array field


I am very often confronted with the following pattern when doing batch
indexing in update mode. Let me illustrate it with a short example:

Let's pretend I am indexing social bookmarking data: a very crude document
would be something like:

"url": "",
"usertags": [ "tag1", "tag2", ..., "tagN" ]

My indexer processes a very large list of user bookmarks, and batch
updates/upserts the document in Elasticsearch. My problem is that if I
simply use concatenation in the update script, I may end up with lots of
duplicate values in my /usertags/ array, as many users potentially used the
same tag over and over again on a given url. Instead I would like to have
set logic on the array values.
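For what it's worth, one way to express that set logic server-side is a scripted upsert whose script only appends values that are not already present. The sketch below assumes a recent Elasticsearch with Painless scripting and the official Python client; the index name and document id are made up for illustration:

```python
# Sketch of a scripted upsert enforcing set semantics on "usertags".
# Assumes Painless scripting; index/field names follow the example above.

def build_set_union_upsert(url, new_tags):
    """Build an _update request body that unions new_tags into usertags."""
    return {
        "scripted_upsert": True,
        "script": {
            "lang": "painless",
            "source": (
                "for (t in params.tags) {"
                "  if (!ctx._source.usertags.contains(t)) {"
                "    ctx._source.usertags.add(t);"
                "  }"
                "}"
            ),
            "params": {"tags": new_tags},
        },
        # With scripted_upsert, this initial doc is also run through the
        # script, so a fresh document ends up with the deduplicated tags.
        "upsert": {"url": url, "usertags": []},
    }

body = build_set_union_upsert("http://example.com", ["tag1", "tag2"])
# The actual call would then be something like:
# es.update(index="bookmarks", id=doc_id, body=body)
```

The script runs once per update on the document's current source, so duplicates never reach the stored array, at the cost of a script execution per bulk item.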

Currently I have this pattern in a bunch of use cases, and I generally
handle it within the batch program by deduplicating values, using a
BerkeleyDB to keep as much data as I can in memory. However the performance
cost becomes prohibitive when I have to perform set logic over millions of
records. Below 5M records the cost stays acceptable, but past 5M the
insertion time in my BDB becomes unacceptable.
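For comparison, the client-side variant of that deduplication boils down to one set of tags per url; here a plain dict stands in for the BerkeleyDB store, purely as a sketch (with millions of urls this is exactly where the memory and insertion cost bites):

```python
from collections import defaultdict

# In-memory dedup before indexing: accumulate one set of tags per url,
# then emit each url's tags as a sorted list ready for upserting.

def dedup_bookmarks(bookmarks):
    """bookmarks: iterable of (url, tag) pairs -> {url: sorted tag list}."""
    tags_by_url = defaultdict(set)
    for url, tag in bookmarks:
        tags_by_url[url].add(tag)
    return {url: sorted(tags) for url, tags in tags_by_url.items()}

docs = dedup_bookmarks([
    ("http://example.com", "tag1"),
    ("http://example.com", "tag1"),  # duplicate collapses into the set
    ("http://example.com", "tag2"),
])
# docs["http://example.com"] == ["tag1", "tag2"]
```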

Another way would be not to deduplicate, and to use a terms facet at query
time to obtain the deduplicated values, but the index size will potentially
grow.
Lastly, one could put in place a post-processing batch to deduplicate the
values, but that amounts to reindexing everything. Using batch treatment and
parallel execution this could probably scale pretty well, but would be
time-consuming.

This is probably a very common pattern, so I'd very much appreciate some
pointers on how other Elasticsearch users have dealt with it.

Best regards,


Sent from the ElasticSearch Users mailing list archive at
