I am very often confronted with the following pattern when doing batch
indexing in update mode. Let me illustrate it with a short example:
Let's pretend I am indexing social bookmarking data: a very crude document
would be something like:
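To make the rest of the question concrete, here is a made-up example of the shape I have in mind (the field names are purely illustrative):

    doc = {
        "url": "http://example.org/some/page",
        "usertag": ["elasticsearch", "search", "nosql"],
    }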
My indexer processes a very large list of user bookmarks and batch
updates/upserts the corresponding documents in Elasticsearch. My problem is
that if I simply use concatenation in the update script, I may end up with
lots of duplicate values in my usertag array, as many users potentially apply
the same tag over and over again to a given url. Instead I would like set
semantics on the array values.
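This is roughly the kind of scripted upsert I mean (a minimal sketch with the Python client and a Groovy-style inline script; index, type and parameter names are just assumptions). The naive "+=" concatenation is what piles up the duplicates:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    def upsert_bookmark(url, tags):
        # Naive concatenation: duplicate tags accumulate in the array over time.
        es.update(
            index="bookmarks",
            doc_type="bookmark",
            id=url,
            body={
                "script": "ctx._source.usertag += new_tags",
                "params": {"new_tags": tags},
                "upsert": {"url": url, "usertag": tags},
            },
        )

What I would like instead is set semantics on usertag, e.g. a script that only adds a tag when ctx._source.usertag does not already contain it.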
Currently I have this pattern in a bunch of use cases, and I generally handle
it within the batch program by deduplicating the values myself, using a
BerkeleyDB to keep as much data as I can in memory. However the performance
cost becomes prohibitive when I have to perform set logic over millions of
records: below 5M records I manage to keep the cost acceptable, but past that
point insertion time into my BDB becomes unacceptable.
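For reference, the client-side deduplication looks roughly like this (only a sketch: the stdlib shelve module stands in for the BerkeleyDB store, and the (url, tag) pair format is an assumption):

    import shelve

    def build_tag_sets(bookmark_pairs, store_path="seen_tags.db"):
        # bookmark_pairs: iterable of (url, tag) pairs.
        # Builds url -> set of tags on disk; every insert rewrites the whole
        # value for that url, which is where the cost explodes past ~5M records.
        with shelve.open(store_path) as store:
            for url, tag in bookmark_pairs:
                tags = store.get(url, set())
                if tag not in tags:
                    tags.add(tag)
                    store[url] = tags

The deduplicated arrays are then read back from the store and sent to Elasticsearch in bulk.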
Another way would be not to deduplicate at all and to use a terms facet at
query time to obtain the deduplicated values, but the index size would
potentially grow out of control.
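For what it's worth, that query-time alternative would look something like this (again a sketch, reusing the es client from above; field and facet names are illustrative), with the distinct tags coming out of the facet entries rather than the stored array:

    res = es.search(
        index="bookmarks",
        body={
            "query": {"term": {"url": "http://example.org/some/page"}},
            "facets": {"tags": {"terms": {"field": "usertag", "size": 1000}}},
        },
    )
    distinct_tags = [entry["term"] for entry in res["facets"]["tags"]["terms"]]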
Lastly, one could put in place a post-processing batch to deduplicate the
values, but that amounts to reindexing everything. Using batch treatment and
parallel execution this could probably scale pretty well, but it would remain
a heavy operation to run regularly.
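Sketched with scan/scroll and the bulk helper, that post-processing pass would look something like this (only a sketch, assuming the bookmarks index from the example above):

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan, bulk

    es = Elasticsearch()

    def dedup_actions():
        # Scroll over every document, deduplicate the array client-side and
        # reindex the cleaned document in place.
        for hit in scan(es, index="bookmarks", query={"query": {"match_all": {}}}):
            source = hit["_source"]
            source["usertag"] = sorted(set(source.get("usertag", [])))
            yield {"_index": hit["_index"], "_type": hit["_type"],
                   "_id": hit["_id"], "_source": source}

    bulk(es, dedup_actions())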
This is probably a very common pattern, so I'd very much appreciate some
pointers on how other Elasticsearch users have dealt with it.