I have a large set of documents, each with an item_ids array of integers, like so:
{
"item_ids" : [ 1, 2, 3, 3, 4 5, 5 ]
}
What I would like to do is filter a search to find documents where the same item ID appears more than once. From what I understand, this is not possible because of the way ES inverts array values when it indexes them (please correct me if that's not the case).
To get around this, my idea was to create a separate array called primary_item_ids, containing only the IDs that appear in item_ids more than once.
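For the sample document above, 3 and 5 each appear twice, so after the update the document would look something like:

```json
{
  "item_ids" : [ 1, 2, 3, 3, 4, 5, 5 ],
  "primary_item_ids" : [ 3, 5 ]
}
```

A query for documents with duplicates then becomes a simple exists query on primary_item_ids.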
You could reindex the items as nested objects that carry an item_count per ID. Then, when you want to find documents that have no duplicate item_id, you just need a nested query with a range query inside that says item_count is not greater than 1.
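If I follow the suggestion, the mapping and query would look roughly like this (index and field names here are illustrative, not from the thread). Note that a nested query matches if any single nested object matches, so a plain range of lte 1 would match any document with at least one non-duplicated item; the "no duplicates" filter therefore needs a must_not around the gt 1 case:

```json
PUT /my_index
{
  "mappings": {
    "properties": {
      "items": {
        "type": "nested",
        "properties": {
          "item_id":    { "type": "integer" },
          "item_count": { "type": "integer" }
        }
      }
    }
  }
}

GET /my_index/_search
{
  "query": {
    "bool": {
      "must_not": {
        "nested": {
          "path": "items",
          "query": {
            "range": { "items.item_count": { "gt": 1 } }
          }
        }
      }
    }
  }
}
```

Dropping the bool/must_not wrapper and keeping just the nested query gives you the opposite set: documents that do contain a duplicated item_id.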
That's definitely an option, but our dataset is quite large, and re-indexing every document like this would take a very long time and a lot of resources. I was hoping to use _update_by_query so I could track progress with the Task API.
I figured out a way to do this using an inline Painless script with the _update_by_query endpoint. Here's how I did it, in case it helps others:
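A sketch of the request, assuming a hypothetical index name my_index and the field names from the original post. The script counts occurrences of each ID in item_ids and writes the duplicated ones into primary_item_ids; wait_for_completion=false makes the endpoint return a task ID you can poll through the Task API (GET _tasks/&lt;task_id&gt;):

```json
POST /my_index/_update_by_query?wait_for_completion=false&conflicts=proceed
{
  "script": {
    "lang": "painless",
    "source": "Map counts = new HashMap(); for (def id : ctx._source.item_ids) { counts.put(id, counts.containsKey(id) ? counts.get(id) + 1 : 1); } List dups = new ArrayList(); for (def entry : counts.entrySet()) { if (entry.getValue() > 1) { dups.add(entry.getKey()); } } ctx._source.primary_item_ids = dups;"
  }
}
```

The conflicts=proceed parameter keeps the task running past version conflicts, which is usually what you want on a large, live index.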