How to force entries to be unique

Dear experts,

I have a simple Filebeat-Elasticsearch-Kibana configuration. In my logs every entry has a uniqueID field. Everything seems to work fine until, for some reason, all entries become duplicated.

I've noticed that the _id field is actually different in each of the duplicates, so I've tried using an ingest pipeline with the set processor to overwrite _id with the value of my uniqueID. I was hoping ES would simply update/overwrite an entry when that _id already exists, but no luck...

I've also tried to overwrite the field _uid, but it seems I am not allowed.

Do you have any suggestion?

Regards

Can you share your pipeline and what the documents look like before/after transformation?

Here is a simple test using _simulate:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "date": {
          "field": "CompletionDate",
          "target_field": "@timestamp",
          "formats": [
            "UNIX_MS"
          ],
          "timezone": "Europe/Amsterdam"
        }
      },
      {
        "set": {
          "field": "_id",
          "value": "{{GlobalJobId}}"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "type",
      "_id": "1",
      "_version": 4,
      "found": true,
      "_source": {
        "Owner": "owner1",
        "GlobalJobId": "halley#2213.99#1532617316",
        "JobCurrentStartDate": "1532617485000",
        "CompletionDate": "1532617496000",
        "UsedTime": 11,
        "UsedCpu": 0,
        "MemoryUsage": 1
      }
    }
  ]
}

and the result:

{
  "docs": [
    {
      "doc": {
        "_index": "index",
        "_type": "type",
        "_id": "halley#2213.99#1532617316",
        "_source": {
          "Owner": "owner1",
          "GlobalJobId": "halley#2213.99#1532617316",
          "@timestamp": "2018-07-26T17:04:56.000+02:00",
          "UsedCpu": 0,
          "JobCurrentStartDate": "1532617485000",
          "UsedTime": 11,
          "CompletionDate": "1532617496000",
          "MemoryUsage": 1
        },
        "_ingest": {
          "timestamp": "2018-07-27T12:37:29.904Z"
        }
      }
    }
  ]
}

Regards

That looks like it's working: _id is 1 going in and halley#2213.99#1532617316 coming out. Are you saying you're seeing duplicate _id values of halley#2213.99#1532617316?

Exactly. It works fine for a few days but at some point I start seeing duplicate _id values.

Are you maybe seeing duplicate _ids across different indices?

As far as I understand, I only have one index. I tried to change the least possible from the default configuration. How can I rule that out?
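
One quick way to check, for example, is to list the indices that actually exist in the cluster (just an illustration; adjust the pattern to whatever your Filebeat output writes to):

GET _cat/indices?v

If the same stream of data shows up as several indices (for example one per day), then the duplicates are most likely the same document indexed into different indices.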

You said in the beginning:

all entries become duplicated

Can you show a few examples of these duplicates and a search that produces them?

Here is an example. The only difference is indeed the "_index" field. I thought including the date in the index name was recommended and that different dates did not count as different indices. I'll test using a static name.

GET condor-*/_search
{
  "query": {
    "match": {
      "_id": {
        "query": "halley#2213.98#1532617316"
      }
    }
  }
}

Then, the result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 20,
    "successful": 20,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "condor-2018.07.26",
        "_type": "doc",
        "_id": "halley#2213.98#1532617316",
        "_score": 1,
        "_source": {
          "GlobalJobId": "halley#2213.98#1532617316",
          "Owner": "owner1",
          "offset": 1786491,
          "UsedCpu": 0,
          "JobCurrentStartDate": "1532617484000",
          "input_type": "log",
          "source": "/var/log/condor_history.log",
          "type": "log",
          "MemoryUsage": 1,
          "@timestamp": "2018-07-26T17:04:56.000+02:00",
          "UsedTime": 12,
          "beat": {
            "hostname": "halley.fisica.unimi.it",
            "name": "halley.fisica.unimi.it",
            "version": "5.6.9"
          },
          "CompletionDate": "1532617496000"
        }
      },
      {
        "_index": "condor-2018.07.27",
        "_type": "doc",
        "_id": "halley#2213.98#1532617316",
        "_score": 1,
        "_source": {
          "GlobalJobId": "halley#2213.98#1532617316",
          "Owner": "owner1",
          "offset": 1786491,
          "UsedCpu": 0,
          "JobCurrentStartDate": "1532617484000",
          "input_type": "log",
          "source": "/var/log/condor_history.log",
          "type": "log",
          "MemoryUsage": 1,
          "@timestamp": "2018-07-26T17:04:56.000+02:00",
          "UsedTime": 12,
          "beat": {
            "hostname": "halley.fisica.unimi.it",
            "name": "halley.fisica.unimi.it",
            "version": "5.6.9"
          },
          "CompletionDate": "1532617496000"
        }
      },
      {
        "_index": "condor-2018.07.28",
        "_type": "doc",
        "_id": "halley#2213.98#1532617316",
        "_score": 1,
        "_source": {
          "GlobalJobId": "halley#2213.98#1532617316",
          "Owner": "owner1",
          "offset": 1786491,
          "UsedCpu": 0,
          "JobCurrentStartDate": "1532617484000",
          "input_type": "log",
          "source": "/var/log/condor_history.log",
          "type": "log",
          "MemoryUsage": 1,
          "@timestamp": "2018-07-26T17:04:56.000+02:00",
          "UsedTime": 12,
          "beat": {
            "hostname": "halley.fisica.unimi.it",
            "name": "halley.fisica.unimi.it",
            "version": "5.6.9"
          },
          "CompletionDate": "1532617496000"
        }
      }
    ]
  }
}

I thought including the date in the name of the index was recommended and different dates did not count as different indices.

Generally, including the date in the name of the index for timeseries data is recommended (or using rollover with its numeric postfix). One of the reasons for that is that you're probably going to want to delete some data at some point and Elasticsearch is way more efficient at deleting entire indices than "delete by query." So when the data from 2018-07-26 is no longer useful to your business, you just delete the entire index. Also, if your queries have times in them, Elasticsearch can quickly rule out indices/shards that can't possibly match the date of the query.
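
For example (these requests are not from the original thread and simply reuse the condor-YYYY.MM.DD names shown above), retiring a whole day of data is a single index deletion:

DELETE condor-2018.07.26

And a search bounded by @timestamp only has to consult the daily indices that can contain matching documents:

GET condor-*/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2018-07-27T00:00:00+02:00",
        "lt": "2018-07-28T00:00:00+02:00"
      }
    }
  }
}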

Anyway, Elasticsearch IDs are only unique within a single index. You're talking about moving to a single index to solve this, but another option is to make sure the same ID always goes into the same index. That is, rather than deriving the index name from the current date at ingest time, you could derive it from something like CompletionDate. There's even an ingest node processor that does this for you.
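
A minimal sketch of that approach, assuming the field names from the documents above and a made-up pipeline name of condor (check the exact processor options against your Elasticsearch version): the date_index_name processor rewrites _index from CompletionDate, so a document that gets re-shipped later still lands in the same daily index, where the _id set from GlobalJobId makes it overwrite the earlier copy instead of creating a duplicate.

PUT _ingest/pipeline/condor
{
  "processors": [
    {
      "date": {
        "field": "CompletionDate",
        "target_field": "@timestamp",
        "formats": ["UNIX_MS"],
        "timezone": "Europe/Amsterdam"
      }
    },
    {
      "date_index_name": {
        "field": "CompletionDate",
        "index_name_prefix": "condor-",
        "date_rounding": "d",
        "index_name_format": "yyyy.MM.dd",
        "date_formats": ["UNIX_MS"],
        "timezone": "Europe/Amsterdam"
      }
    },
    {
      "set": {
        "field": "_id",
        "value": "{{GlobalJobId}}"
      }
    }
  ]
}

Filebeat would then need to reference this pipeline in its Elasticsearch output (the pipeline option); whatever index name Filebeat itself puts on the bulk request no longer determines the final destination, because the processor overrides _index.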
