Find & Separately Store Array Duplicates

Hello,

I have a large set of documents, each with an item_ids array of integers, like so:

{
   "item_ids" : [ 1, 2, 3, 3, 4 5, 5 ]
}

What I would like to do is filter a search by finding documents where the same item ID appears more than once. From what I understand, this isn't possible directly, because of the way Elasticsearch flattens array values into the inverted index (please correct me if that's not the case).

To get around this, my idea was to create a separate array called primary_item_ids, which would contain the IDs that appear in item_ids more than once. For my example above, the updated document would look like this:

{
   "item_ids" : [ 1, 2, 3, 3, 4 5, 5 ],
   "primary_item_ids" : [ 3, 5 ]
}

I don't really care whether primary_item_ids itself contains duplicates, since I'm only using it to filter searches.
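For context, once the field is populated, the filter itself should be straightforward, something like an exists query, since Elasticsearch treats an empty array the same as a missing field (my_index here is just a placeholder for our index name):

GET my_index/_search
{
  "query": {
    "bool": {
      "filter": { "exists": { "field": "primary_item_ids" } }
    }
  }
}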

My initial thought was to use update_by_query with a Painless script that would:

  1. Identify duplicate values in item_ids.
  2. Save those duplicates in the new primary_item_ids field.

I have no idea how to approach this in Painless. Any suggestions or ideas?

I think the best way to do this is to store each item together with its occurrence count at indexing time. You should look at nested documents: https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html

Your documents would then look like this:

{
  "item_id" : [
    {
      "item_value" : 1,
      "item_count" : 1
    },
    {
      "item_value" : 2,
      "item_count" : 1
    },
    {
      "item_value" : 3,
      "item_count" : 2
    },
    {
      "item_value" : 4,
      "item_count" : 1
    },
    {
      "item_value" : 5,
      "item_count" : 2
    }
  ]
}
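Keep in mind that item_id has to be declared as a nested field in the mapping for this to work. Roughly something like the following (the exact syntax depends on your Elasticsearch version, my_index is a placeholder, and the integer types are an assumption based on your sample data):

PUT my_index
{
  "mappings": {
    "properties": {
      "item_id": {
        "type": "nested",
        "properties": {
          "item_value": { "type": "integer" },
          "item_count": { "type": "integer" }
        }
      }
    }
  }
}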

When you want to find documents that have no duplicate item_id, you just need a nested query inside a must_not clause, with a range query matching item_count greater than 1:

"query": {
    "bool": {
      "must_not": [
        {
          "nested": {
            "path": "item_id",
            "query": {
              "range": {
                "item_id.item_count": {
                  "gt": 1
                }
              }
            }
          }
        }
      ]
    }
  }
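And since your original goal is the opposite, matching documents that do contain a duplicate, the same nested query should work without the must_not, for example in a filter clause:

"query": {
    "bool": {
      "filter": [
        {
          "nested": {
            "path": "item_id",
            "query": {
              "range": {
                "item_id.item_count": {
                  "gt": 1
                }
              }
            }
          }
        }
      ]
    }
  }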

Let me know if that helps.

That's definitely an option, but our dataset is quite large, and re-indexing every document like this would take a very long time and a lot of resources. I was hoping to use update_by_query so I could track progress with the Task API.

Any other suggestions on this?

I figured out a way to do this using an inline Painless script, coupled with the _update_by_query endpoint. Here's how I did it. Hopefully this will help others:

{
  "script": {
    "inline": "<script>...",  
    "lang": "painless",
    "params" : { }
  },
  "slice": {
    "id": 1,
    "max": 6
  }
}
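A note on the slice section: with manual slicing like this, you send the request once per slice, with id running from 0 to max - 1. I ran each one with wait_for_completion=false so I could follow the tasks through the Task API. Roughly (my_index is a placeholder for our index):

POST my_index/_update_by_query?wait_for_completion=false
{
  "script": {
    "inline": "<script>...",
    "lang": "painless"
  },
  "slice": { "id": 0, "max": 6 }
}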

And here is the script itself:

HashMap items = new HashMap();          // IDs we have already seen
ArrayList primaries = new ArrayList();  // IDs that appear more than once
for (def id : ctx._source.item_ids) {
    if (items.containsKey(id)) {
        primaries.add(id);
    }
    items.put(id, true);
}
// Write the duplicates to the new field (an empty list if there are none).
ctx._source.primary_item_ids = primaries;
