How to force entries to be unique

Dear experts,

I have a simple Filebeat-Elasticsearch-Kibana configuration. In my logs every entry has a uniqueID field. Everything seems to work fine until, for some reason, all entries become duplicated.

I've noticed that the _id field is actually different in each of the duplicates, so I've tried using an ingest pipeline with the set processor to overwrite _id with the value of my uniqueID. I was hoping ES would simply update/overwrite an entry when that _id already exists, but no luck...

I've also tried to overwrite the field _uid, but it seems I am not allowed.

Do you have any suggestion?

Regards

Can you share your pipeline and what the documents look like before/after transformation?

Here is a simple test using _simulate:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "date": {
          "field": "CompletionDate",
          "target_field": "@timestamp",
          "formats": [
            "UNIX_MS"
          ],
          "timezone": "Europe/Amsterdam"
        }
      },
      {
        "set": {
          "field": "_id",
          "value": "{{GlobalJobId}}"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "type",
      "_id": "1",
      "_version": 4,
      "found": true,
      "_source": {
        "Owner": "owner1",
        "GlobalJobId": "halley#2213.99#1532617316",
        "JobCurrentStartDate": "1532617485000",
        "CompletionDate": "1532617496000",
        "UsedTime": 11,
        "UsedCpu": 0,
        "MemoryUsage": 1
      }
    }
  ]
}

and the result:

{
  "docs": [
    {
      "doc": {
        "_index": "index",
        "_type": "type",
        "_id": "halley#2213.99#1532617316",
        "_source": {
          "Owner": "owner1",
          "GlobalJobId": "halley#2213.99#1532617316",
          "@timestamp": "2018-07-26T17:04:56.000+02:00",
          "UsedCpu": 0,
          "JobCurrentStartDate": "1532617485000",
          "UsedTime": 11,
          "CompletionDate": "1532617496000",
          "MemoryUsage": 1
        },
        "_ingest": {
          "timestamp": "2018-07-27T12:37:29.904Z"
        }
      }
    }
  ]
}

Regards

That looks like it's working: _id is 1 going in and halley#2213.99#1532617316 coming out. Are you saying you're seeing duplicate _id values of halley#2213.99#1532617316?

Exactly. It works fine for a few days but at some point I start seeing duplicate _id values.

Are you maybe seeing duplicate _ids across different indices?

As far as I understand, I only have one index. I tried to change the least possible from the default configuration. How can I rule that out?
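
One quick way to check, for example, is to list the indices that actually exist in the cluster (just an illustration; adjust the pattern to whatever your Filebeat output writes to):

GET _cat/indices?v

If the same stream of data shows up as several indices (for example one per day), then the duplicates are most likely the same document indexed into different indices.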

You said in the beginning:

all entries become duplicated

Can you show a few examples of these duplicates and a search that produces them?

Here is an example. The only difference is indeed the "_index" field. I thought including the date in the index name was recommended and that different dates did not count as different indices. I'll test using a static name.

GET condor-*/_search
{
  "query": {
    "match": {
      "_id": {
        "query": "halley#2213.98#1532617316"
      }
    }
  }
}

Then, the result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 20,
    "successful": 20,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "condor-2018.07.26",
        "_type": "doc",
        "_id": "halley#2213.98#1532617316",
        "_score": 1,
        "_source": {
          "GlobalJobId": "halley#2213.98#1532617316",
          "Owner": "owner1",
          "offset": 1786491,
          "UsedCpu": 0,
          "JobCurrentStartDate": "1532617484000",
          "input_type": "log",
          "source": "/var/log/condor_history.log",
          "type": "log",
          "MemoryUsage": 1,
          "@timestamp": "2018-07-26T17:04:56.000+02:00",
          "UsedTime": 12,
          "beat": {
            "hostname": "halley.fisica.unimi.it",
            "name": "halley.fisica.unimi.it",
            "version": "5.6.9"
          },
          "CompletionDate": "1532617496000"
        }
      },
      {
        "_index": "condor-2018.07.27",
        "_type": "doc",
        "_id": "halley#2213.98#1532617316",
        "_score": 1,
        "_source": {
          "GlobalJobId": "halley#2213.98#1532617316",
          "Owner": "owner1",
          "offset": 1786491,
          "UsedCpu": 0,
          "JobCurrentStartDate": "1532617484000",
          "input_type": "log",
          "source": "/var/log/condor_history.log",
          "type": "log",
          "MemoryUsage": 1,
          "@timestamp": "2018-07-26T17:04:56.000+02:00",
          "UsedTime": 12,
          "beat": {
            "hostname": "halley.fisica.unimi.it",
            "name": "halley.fisica.unimi.it",
            "version": "5.6.9"
          },
          "CompletionDate": "1532617496000"
        }
      },
      {
        "_index": "condor-2018.07.28",
        "_type": "doc",
        "_id": "halley#2213.98#1532617316",
        "_score": 1,
        "_source": {
          "GlobalJobId": "halley#2213.98#1532617316",
          "Owner": "owner1",
          "offset": 1786491,
          "UsedCpu": 0,
          "JobCurrentStartDate": "1532617484000",
          "input_type": "log",
          "source": "/var/log/condor_history.log",
          "type": "log",
          "MemoryUsage": 1,
          "@timestamp": "2018-07-26T17:04:56.000+02:00",
          "UsedTime": 12,
          "beat": {
            "hostname": "halley.fisica.unimi.it",
            "name": "halley.fisica.unimi.it",
            "version": "5.6.9"
          },
          "CompletionDate": "1532617496000"
        }
      }
    ]
  }
}

I thought including the date in the name of the index was recommended and different dates did not count as different indices.

Generally, including the date in the name of the index for timeseries data is recommended (or using rollover with its numeric postfix). One of the reasons for that is that you're probably going to want to delete some data at some point and Elasticsearch is way more efficient at deleting entire indices than "delete by query." So when the data from 2018-07-26 is no longer useful to your business, you just delete the entire index. Also, if your queries have times in them, Elasticsearch can quickly rule out indices/shards that can't possibly match the date of the query.
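
For example (these requests are not from the original thread and simply reuse the condor-YYYY.MM.DD names shown above), retiring a whole day of data is a single index deletion:

DELETE condor-2018.07.26

And a search bounded by @timestamp only has to consult the daily indices that can contain matching documents:

GET condor-*/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2018-07-27T00:00:00+02:00",
        "lt": "2018-07-28T00:00:00+02:00"
      }
    }
  }
}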

Anyway, Elasticsearch IDs are only unique within a single index. You're talking about moving to a single index to solve this, but another option is to make sure the same ID always goes into the same index. That is, rather than deriving the index name from the current date at ingest time, you could derive it from something like CompletionDate. There's even an ingest node processor that does this for you.
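
A minimal sketch of that approach, assuming the field names from the documents above and a made-up pipeline name of condor (check the exact processor options against your Elasticsearch version): the date_index_name processor rewrites _index from CompletionDate, so a document that gets re-shipped later still lands in the same daily index, where the _id set from GlobalJobId makes it overwrite the earlier copy instead of creating a duplicate.

PUT _ingest/pipeline/condor
{
  "processors": [
    {
      "date": {
        "field": "CompletionDate",
        "target_field": "@timestamp",
        "formats": ["UNIX_MS"],
        "timezone": "Europe/Amsterdam"
      }
    },
    {
      "date_index_name": {
        "field": "CompletionDate",
        "index_name_prefix": "condor-",
        "date_rounding": "d",
        "index_name_format": "yyyy.MM.dd",
        "date_formats": ["UNIX_MS"],
        "timezone": "Europe/Amsterdam"
      }
    },
    {
      "set": {
        "field": "_id",
        "value": "{{GlobalJobId}}"
      }
    }
  ]
}

Filebeat would then need to reference this pipeline in its Elasticsearch output (the pipeline option); whatever index name Filebeat itself puts on the bulk request no longer determines the final destination, because the processor overrides _index.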
