Too large data for _id

(Kevin) #1

Hey Guys,

I'm developing my own centralized log management system and using Elasticsearch with daily indices, which has matched my requirements perfectly so far. Now I have a problem with my search queries because of the amount of data and the _id field, which I mainly use for sorting:

Caused by: org.elasticsearch.ElasticsearchException: java.util.concurrent.ExecutionException:
CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [7960997201/7.4gb], which is larger than the limit of [7699562496/7.1gb]]

I spent a lot of time reading the forum and the documentation, and also read this part:

The value of the _id field is also accessible in aggregations or for sorting, but doing so is
discouraged as it requires to load a lot of data in memory. In case sorting or aggregating on the
_id field is required, it is advised to duplicate the content of the _id field in another field that
has doc_values enabled.

After this I have a few questions:

  1. Could duplicating the content of the meta field into a custom document field solve my issue? If yes, how can I achieve this at indexing time, or maybe as a mapping in the index template? Or do I have to update every document myself?
  2. Would using more nodes be a solution? (I'm currently using one node for development.)
  3. Would generating my own unique ID for documents be a better way to go? (I would rather not do that :sweat_smile:)

Hoping to get some help.
Thanks in advance.

(Luiz Santos) #3

Hi @Trucke!

The Elasticsearch auto-generated IDs are like this:

hNICo2oBMjVKqXSiguIu
gtICo2oBMjVKqXSiduJz
g9ICo2oBMjVKqXSifeIM

They carry no notion of order. For example, consider that we created the following documents:

POST log/_doc
{
  "field": 1
}

POST log/_doc
{
  "field": 2
}

POST log/_doc
{
  "field": 3
}

When I sort by _id, the result can have the document with field = 3 in the first position:

GET log/_search
{
  "sort": [
    {
      "_id": {
        "order": "desc"
      }
    }
  ]
}

{
  "took" : 26,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : null,
    "hits" : [
      {
        "_index" : "log",
        "_type" : "_doc",
        "_id" : "hNICo2oBMjVKqXSiguIu",
        "_score" : null,
        "_source" : {
          "field" : 3
        },
        "sort" : [
          "hNICo2oBMjVKqXSiguIu"
        ]
      },
      {
        "_index" : "log",
        "_type" : "_doc",
        "_id" : "gtICo2oBMjVKqXSiduJz",
        "_score" : null,
        "_source" : {
          "field" : 1
        },
        "sort" : [
          "gtICo2oBMjVKqXSiduJz"
        ]
      },
      {
        "_index" : "log",
        "_type" : "_doc",
        "_id" : "g9ICo2oBMjVKqXSifeIM",
        "_score" : null,
        "_source" : {
          "field" : 2
        },
        "sort" : [
          "g9ICo2oBMjVKqXSifeIM"
        ]
      }
    ]
  }
}
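The same effect can be reproduced outside Elasticsearch. This is only a small sketch: it takes the three IDs from the response above, lists them in creation order, and sorts them as plain strings. It assumes that ordinary string comparison is a fair stand-in for how these ASCII IDs are compared when sorting on _id.

```python
# Auto-generated IDs copied from the example above, listed in the order
# in which the documents were created, paired with each document's "field".
created = [
    ("gtICo2oBMjVKqXSiduJz", 1),
    ("g9ICo2oBMjVKqXSifeIM", 2),
    ("hNICo2oBMjVKqXSiguIu", 3),
]

# Sort by ID as a plain string, descending (assumption: a stand-in for
# the "_id" sort in the request above).
by_id_desc = sorted(created, key=lambda pair: pair[0], reverse=True)

ids_desc = [doc_id for doc_id, _ in by_id_desc]
fields_desc = [field for _, field in by_id_desc]

print(fields_desc)  # creation order was 1, 2, 3; the sorted order differs
```

The sorted order matches the hits above (field 3, then 1, then 2), which is neither ascending nor descending creation order.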

So my question is: could you describe the use case that requires sorting on auto-generated IDs?

Trying to answer your questions directly:

  1. You cannot duplicate the content of an auto-generated ID at indexing time because it doesn't exist yet. For IDs that you supply yourself, you can set up an ingest processor. [1] [2]
  2. I don't think adding more nodes is a solution, because this looks like a modeling issue rather than a resource issue.
  3. Ingesting documents is much faster when Elasticsearch uses auto-generated IDs, because it can assume that the generated IDs do not exist in the index yet. I think the decision here depends on my first question: what is your use case?
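
For point 1, a rough sketch of what such an ingest pipeline and mapping could look like, written as Python dictionaries that would be sent as JSON request bodies. The field name `doc_id` and both helper functions are made up for illustration, and the sketch assumes the `{{_id}}` template resolves to the ID supplied by the client. The exact nesting of the mapping inside an index template differs between major versions, so treat it as a starting point only.

```python
import json


def id_copy_pipeline(target_field="doc_id"):
    """Body for an ingest pipeline that copies a client-supplied _id.

    `target_field` is a hypothetical field name. This only helps when the
    client sets the ID itself; an auto-generated ID does not exist yet
    when the pipeline runs.
    """
    return {
        "description": "Copy a client-supplied _id into a regular field",
        "processors": [
            {"set": {"field": target_field, "value": "{{_id}}"}}
        ],
    }


def id_copy_mapping(target_field="doc_id"):
    """Mapping fragment for the copy; keyword fields have doc_values by default."""
    return {"properties": {target_field: {"type": "keyword"}}}


pipeline_body = id_copy_pipeline()
mapping_body = id_copy_mapping()
print(json.dumps(pipeline_body, indent=2))
print(json.dumps(mapping_body, indent=2))
```

Sorting or aggregating would then target `doc_id` instead of _id, which avoids loading fielddata for _id into memory.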

Please also take a look at index sorting; it might be related to your question.

I hope it helps!

Cheers,
Luiz Santos

(Kevin) #4

Hi @luiz.santos

Thanks for your reply.
First of all, your post is really helpful to me, because I think I had misunderstood auto-generated IDs a little.

For example: a user interface to search and view log data stored in Elasticsearch, similar to Kibana's Discover page. Scrolling past the end of the results triggers a request for the next documents using a search_after query. To keep the log data in the correct order, I sort the documents by a timestamp and by the _id field, because with this amount of data many documents share the same timestamp. I thought adding the _id field to the sort would give me, on every request within the same time range, the same documents in the same order, even if they have the same timestamp. This is important to me for log analysis.
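
The paging scheme described here can be sketched roughly as follows. The code only builds request bodies and reads the cursor from a response; it does not talk to a cluster. The helper names and the field names `timestamp` and `tiebreaker` are placeholders, not part of any client library, and the second sort field stands for whatever tie-breaker ends up being used.

```python
def build_page_request(size, sort_fields, search_after=None):
    """Build a search body sorted on `sort_fields` (ascending).

    `search_after` is the "sort" array of the last hit of the previous
    page, or None for the first page.
    """
    body = {
        "size": size,
        "sort": [{field: {"order": "asc"}} for field in sort_fields],
    }
    if search_after is not None:
        body["search_after"] = list(search_after)
    return body


def next_cursor(response):
    """Return the sort values of the last hit, or None if the page is empty."""
    hits = response["hits"]["hits"]
    return hits[-1]["sort"] if hits else None


# Hypothetical field names: a timestamp plus a tie-breaker.
fields = ["timestamp", "tiebreaker"]
first_request = build_page_request(2, fields)

# Shape of a response page, reduced to what the cursor needs.
fake_response = {"hits": {"hits": [{"sort": [1557942270000, "a"]},
                                   {"sort": [1557942270000, "b"]}]}}
second_request = build_page_request(2, fields, next_cursor(fake_response))
```

The order is only repeatable across requests if the combination of sort fields is unique per document, which is why the choice of tie-breaker matters.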

Thanks to your reply I now realize that using the _id field is not the right way for such a use case :sweat_smile:
I would really appreciate it if you had some tips for me.

Best regards,
Kevin

(Luiz Santos) #5

Hello @Trucke,

Thank you for explaining your use case. Can you clarify what the timestamp format is? If it's something like 2019-05-15T17:44:30 and you can't change the format to include milliseconds, you can create an ingest processor that adds the time at which the document was indexed, including milliseconds:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "_description",
    "processors": [
      {
        "set": {
          "field": "_source.indexed_at",
          "value": "{{_ingest.timestamp}}"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "foo": "bar"
      }
    }
  ]
}

{
  "docs" : [
    {
      "doc" : {
        "_index" : "index",
        "_type" : "_type",
        "_id" : "id",
        "_source" : {
          "indexed_at" : "2019-05-15T17:50:59.650388Z",
          "foo" : "bar"
        },
        "_ingest" : {
          "timestamp" : "2019-05-15T17:50:59.650388Z"
        }
      }
    }
  ]
}

Hopefully, you won't have documents ingested in the same millisecond.

If your use case requires more precision, you may want to check the date_nanos field type available in Elasticsearch 7.
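
Whether a given precision is enough can be checked on a sample of documents before relying on it. A minimal sketch, assuming the documents are available as plain dictionaries; the helper name and the sample values are invented for illustration, with `indexed_at` taken from the pipeline example above:

```python
from collections import Counter


def duplicate_sort_keys(docs, fields):
    """Return every combination of `fields` values shared by more than one doc."""
    counts = Counter(tuple(doc[field] for field in fields) for doc in docs)
    return {key: n for key, n in counts.items() if n > 1}


# Invented sample: two events share the same second-resolution timestamp
# but were indexed at different times.
sample = [
    {"timestamp": "2019-05-15T17:44:30", "indexed_at": "2019-05-15T17:50:59.650Z"},
    {"timestamp": "2019-05-15T17:44:30", "indexed_at": "2019-05-15T17:50:59.651Z"},
    {"timestamp": "2019-05-15T17:44:31", "indexed_at": "2019-05-15T17:50:59.652Z"},
]

ties_on_timestamp = duplicate_sort_keys(sample, ["timestamp"])
ties_with_tiebreaker = duplicate_sort_keys(sample, ["timestamp", "indexed_at"])
```

If the second result is not empty on real data, the two fields together are still not a unique sort key, and a finer-grained or additional tie-breaker would be needed.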