I have a large set of documents, each with an item_ids array of integers, like so:
{
"item_ids" : [ 1, 2, 3, 3, 4 5, 5 ]
}
What I would like to do is filter a search to find documents where the same item ID appears more than once. From what I understand, this is not possible because of the way ES inverts array values when it indexes them (please correct me if that's not the case).
To get around this, my idea was to create a separate array called primary_item_ids, containing only the IDs that appear in item_ids more than once.
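For the sample document above, 3 and 5 each appear twice, so after the update the document would look something like:

```json
{
  "item_ids" : [ 1, 2, 3, 3, 4, 5, 5 ],
  "primary_item_ids" : [ 3, 5 ]
}
```

A query for documents with duplicates then becomes a simple exists query on primary_item_ids.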
You could reindex the items as nested objects that carry an item_count per ID. Then, when you want to find documents that have no duplicate item_id, you just need a nested query with a range query inside that says item_count is not greater than 1.
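If I follow the suggestion, the mapping and query would look roughly like this (index and field names here are illustrative, not from the thread). Note that a nested query matches if any single nested object matches, so a plain range of lte 1 would match any document with at least one non-duplicated item; the "no duplicates" filter therefore needs a must_not around the gt 1 case:

```json
PUT /my_index
{
  "mappings": {
    "properties": {
      "items": {
        "type": "nested",
        "properties": {
          "item_id":    { "type": "integer" },
          "item_count": { "type": "integer" }
        }
      }
    }
  }
}

GET /my_index/_search
{
  "query": {
    "bool": {
      "must_not": {
        "nested": {
          "path": "items",
          "query": {
            "range": { "items.item_count": { "gt": 1 } }
          }
        }
      }
    }
  }
}
```

Dropping the bool/must_not wrapper and keeping just the nested query gives you the opposite set: documents that do contain a duplicated item_id.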
That's definitely an option, but our dataset is quite large, and re-indexing every document like this would take a very long time and a lot of resources. I was hoping to use _update_by_query so I could track progress with the Task API.
I figured out a way to do this using an inline Painless script with the _update_by_query endpoint. Here's how I did it, in case it helps others:
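A sketch of the request, assuming a hypothetical index name my_index and the field names from the original post. The script counts occurrences of each ID in item_ids and writes the duplicated ones into primary_item_ids; wait_for_completion=false makes the endpoint return a task ID you can poll through the Task API (GET _tasks/&lt;task_id&gt;):

```json
POST /my_index/_update_by_query?wait_for_completion=false&conflicts=proceed
{
  "script": {
    "lang": "painless",
    "source": "Map counts = new HashMap(); for (def id : ctx._source.item_ids) { counts.put(id, counts.containsKey(id) ? counts.get(id) + 1 : 1); } List dups = new ArrayList(); for (def entry : counts.entrySet()) { if (entry.getValue() > 1) { dups.add(entry.getKey()); } } ctx._source.primary_item_ids = dups;"
  }
}
```

The conflicts=proceed parameter keeps the task running past version conflicts, which is usually what you want on a large, live index.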