Avoiding duplication whilst ingesting events


(Kai Hendry) #1

Hi guys,

I'm new to ES & Big data pipelines so please go easy.

Imagine you're ingesting a video play progress event. e.g. { video: 1234, user: 98755, event: "progress", t: 5 }

Though we'd like to search these events by the associated metadata, such as details of the video watched, details of who the user is, etc. So now we have: { video: { id: 1234, title: "Bladerunner" ... and so on }, user: { id: 98755, firstname: "Colin" ... etc etc v large }, event: "progress", t: 6 }

Now the issue I have is that these video and user objects are huge & usually static (it's conceivable that some data will change, but only on a daily basis, really). Every new progress event has largely the same data as before, except for the time t value. Is there a way to reference them so that I can still query by video title or by a user with first name "Colin"?

Kind regards,


(Peter Dyson) #2

Hi Kai,

There are a few ways to go about avoiding duplication generally.
One of them: if you use your own document ids rather than letting Elasticsearch autogenerate them for you, you can update an existing doc using its document id.

So:

POST /testing-updates/doc/
{
  "user": "user1",
  "enabled": true
}

vs

POST /testing-updates/doc/user2
{
  "user": "user2",
  "enabled": true
}

If we take a look at the _id for each of those docs:

GET /testing-updates/_search

resulting in:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "testing-updates",
        "_type": "doc",
        "_id": "AV1kA52WMIr8LISA3BqI",
        "_score": 1,
        "_source": {
          "user": "user1",
          "enabled": true
        }
      },
      {
        "_index": "testing-updates",
        "_type": "doc",
        "_id": "user2",
        "_score": 1,
        "_source": {
          "user": "user2",
          "enabled": true
        }
      }
    ]
  }
}

You'll see that user2 has an id of user2 whereas user1 has an autogenerated doc id, because we specified the id of user2 but didn't for user1.

So if we try to update both of those users, setting that "enabled" field to false for example, we see that user2 is successfully updated, but there are now two user1 docs, because a new one was created with a new autogenerated document id. The new one has the updated value; the old one still has the original value for "enabled":

POST /testing-updates/doc/
{
  "user": "user1",
  "enabled": false
}

POST /testing-updates/doc/user2
{
  "user": "user2",
  "enabled": false
}

Running the same search again results in:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "testing-updates",
        "_type": "doc",
        "_id": "AV1kBhENMIr8LISA3ByZ",
        "_score": 1,
        "_source": {
          "user": "user1",
          "enabled": false
        }
      },
      {
        "_index": "testing-updates",
        "_type": "doc",
        "_id": "AV1kA52WMIr8LISA3BqI",
        "_score": 1,
        "_source": {
          "user": "user1",
          "enabled": true
        }
      },
      {
        "_index": "testing-updates",
        "_type": "doc",
        "_id": "user2",
        "_score": 1,
        "_source": {
          "user": "user2",
          "enabled": false
        }
      }
    ]
  }
}
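
If it helps to see the same behaviour outside of Elasticsearch, here is a toy in-memory model in Python. It is only a sketch of the semantics shown above, not the Elasticsearch client: `ToyIndex` and its `index` method are made-up names, and the random ids merely stand in for autogenerated ones.

```python
import uuid


class ToyIndex:
    """Hypothetical stand-in for an index: maps document id -> source."""

    def __init__(self):
        self.docs = {}

    def index(self, source, doc_id=None):
        # No id supplied: generate a fresh one, so a new document is always created.
        if doc_id is None:
            doc_id = uuid.uuid4().hex
        # Id supplied: whatever was stored under that id is replaced.
        self.docs[doc_id] = dict(source)
        return doc_id


idx = ToyIndex()

# First round: user1 without an id, user2 with an explicit id.
idx.index({"user": "user1", "enabled": True})
idx.index({"user": "user2", "enabled": True}, doc_id="user2")

# Second round: the same two requests with "enabled" set to false.
idx.index({"user": "user1", "enabled": False})
idx.index({"user": "user2", "enabled": False}, doc_id="user2")

user1_docs = [doc for doc in idx.docs.values() if doc["user"] == "user1"]
print(len(idx.docs), len(user1_docs), idx.docs["user2"])
```

As in the search result above, the model ends up with three documents: two for user1 and a single, updated one for user2.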

So in your case, if each document represents a video then you might use the video id as the document id.
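
For the progress events themselves, one possible way to pick your own id is to derive it from the fields that identify "the same" event. The sketch below is an assumption on my part rather than anything Elasticsearch prescribes: `progress_doc_id` is a hypothetical helper, and it assumes you only want to keep the latest progress per user and video, since every later event with the same id would overwrite the earlier one.

```python
import hashlib


def progress_doc_id(event):
    """Build a deterministic document id from the identifying fields.

    Assumption: one document per (user, video, event type); "t" is left
    out on purpose so that a newer progress value replaces the older one.
    """
    raw = f"{event['user']}:{event['video']}:{event['event']}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


first = {"video": 1234, "user": 98755, "event": "progress", "t": 5}
later = {"video": 1234, "user": 98755, "event": "progress", "t": 6}
other = {"video": 1234, "user": 11111, "event": "progress", "t": 6}

# Same user and video give the same id; a different user gives a different id.
print(progress_doc_id(first) == progress_doc_id(later))
print(progress_doc_id(first) == progress_doc_id(other))
```

If you need the full history of progress events rather than just the latest position, this scheme would not fit as is, because the id would also have to include something that distinguishes individual events.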

