Bulk API & date index name ingest processor

Hi,

Is there a way to send a bulk request (using the bulk API) to a specific ingest pipeline without having to specify an index name?

As it stands, the current behaviour can lead to surprising results, especially when using the date index name processor (or any processor that re-routes documents).

My current cluster configuration is as follows (though it doesn't really matter here):

ip        heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
127.0.0.1           23          87   2    0.24    0.23     0.21 i         -      ingest1
127.0.0.1           39          87   2    0.24    0.23     0.21 m         *      master1
127.0.0.1           27          87   2    0.24    0.23     0.21 d         -      warm1
127.0.0.1           26          87   2    0.24    0.23     0.21 d         -      warm2
127.0.0.1           27          87   2    0.24    0.23     0.21 d         -      hot1
127.0.0.1           23          87   2    0.24    0.23     0.21 m         -      master2
127.0.0.1           23          87   2    0.24    0.23     0.21 m         -      master3
127.0.0.1           28          87   2    0.24    0.23     0.21 d         -      hot2
127.0.0.1           21          87   2    0.24    0.23     0.21 -         -      client

The existing indices on this cluster are as follows (but again, it doesn't really matter):

health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .kibana Ciaa8qCeSR-hp_BrCSmMNg   1   1          1            0      6.3kb          3.1kb

First, let's define a pipeline that computes the index name from the '@timestamp' field that we expect to find in our documents:

PUT _ingest/pipeline/weeklyindex
{
  "description": "weekly date-time index naming",
  "processors" : [
    {
      "date_index_name" : {
        "field" : "@timestamp",
        "index_name_prefix" : "myindex-",
        "date_rounding" : "w"
      }
    }
  ]
}

With that done, let's send a bulk request through the "weeklyindex" pipeline. The response follows the request:

PUT _bulk?pipeline=weeklyindex
{ "create": {"_index" : "foo", "_type" : "bar", "_id": 1}}
{ "text": "Some log message", "@timestamp": "2016-04-25T12:02:01.789Z" }
{ "create": {"_index" : "foo", "_type" : "bar", "_id": 2}}
{ "text": "Some log message", "@timestamp": "2016-04-26T12:02:01.789Z" }
{ "create": {"_index" : "foo", "_type" : "bar", "_id": 3}}
{ "text": "Some log message", "@timestamp": "2016-04-27T12:02:01.789Z" }
{ "create": {"_index" : "foo", "_type" : "bar", "_id": 4}}
{ "text": "Some log message", "@timestamp": "2016-04-28T12:02:01.789Z" }
{ "create": {"_index" : "foo", "_type" : "bar", "_id": 5}}
{ "text": "Some log message", "@timestamp": "2016-04-29T12:02:01.789Z" }
{ "create": {"_index" : "foo", "_type" : "bar", "_id": 6}}
{ "text": "Some log message", "@timestamp": "2016-04-30T12:02:01.789Z" }
{ "create": {"_index" : "foo", "_type" : "bar", "_id": 7}}
{ "text": "Some log message", "@timestamp": "2016-05-01T12:02:01.789Z" }
{ "create": {"_index" : "foo", "_type" : "bar", "_id": 8}}
{ "text": "Some log message", "@timestamp": "2016-05-02T12:02:01.789Z" }

{
  "took": 494,
  "ingest_took": 1,
  "errors": false,
  "items": [ ... ]
}

By the looks of it, one might expect our documents to be indexed into the "foo" index, but that's not what happens: the bulk request is processed by the "weeklyindex" ingest pipeline, which re-routes every single document to indices that have nothing to do with "foo", as a new call to _cat/indices shows:

health status index              uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   myindex-2016-04-25 JscqJWrGSySIMzEJDhWFvw   2   1          7            0     15.9kb          7.9kb
green  open   myindex-2016-05-02 WUNcbUDDQJO3hXHnyyqq8g   2   1          1            0      7.8kb          3.9kb
green  open   .kibana            Ciaa8qCeSR-hp_BrCSmMNg   1   1          1            0      6.3kb          3.1kb
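
To make the 7 / 1 split above easier to follow, here is a minimal Python sketch of what the "w" rounding appears to do. It is only an approximation of the processor, not its implementation: it assumes weeks start on Monday and that the default index name format is yyyy-MM-dd, which is consistent with the index names listed above. The function name weekly_index_name is made up for this illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def weekly_index_name(timestamp: str, prefix: str = "myindex-") -> str:
    """Approximate date_index_name with date_rounding "w".

    Assumption: the date is floored to the Monday of its week and
    formatted as yyyy-MM-dd, then appended to the prefix.
    """
    day = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ").date()
    monday = day - timedelta(days=day.weekday())  # weekday() is 0 on Monday
    return prefix + monday.isoformat()


# The eight @timestamp values from the bulk request above.
timestamps = [f"2016-04-{d}T12:02:01.789Z" for d in range(25, 31)] + [
    "2016-05-01T12:02:01.789Z",
    "2016-05-02T12:02:01.789Z",
]

# Count how many documents would land in each computed index.
counts = Counter(weekly_index_name(ts) for ts in timestamps)
for name, count in sorted(counts.items()):
    print(name, count)
```

2016-04-25 is a Monday, so the documents dated 2016-04-25 through 2016-05-01 share one weekly index and only the last document starts a new one. The "foo" from the action lines never appears in the result.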

=> Is there a cleaner solution, i.e. a way to send a bulk request (using the bulk API) to a specific ingest pipeline without having to specify an index name?

Best regards,

Charles.w

On a side note, I am fully aware that this processor overrides the _index property of any given document it handles, but I can't help thinking that's not a very clean solution according to the principle of least surprise / astonishment.

Hey,

Currently there is not. You can specify the index in the URL instead of in each action line, but this does not solve the core problem. You might want to open an Elasticsearch GitHub issue and discuss it there.
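
For illustration, specifying the index in the URL could look roughly like the Python sketch below. It only builds the request path and the newline-delimited body, without sending anything. The helper name build_bulk_body is hypothetical, and the /foo/bar/_bulk path assumes the typed index/type form of the bulk endpoint that matches the "_type" usage in the original request. The placeholder index is still required, so the core problem remains.

```python
import json


def build_bulk_body(docs, action="create"):
    """Build an NDJSON bulk body whose action lines carry only the _id.

    docs is an iterable of (doc_id, source_dict) pairs. The index and type
    are expected to come from the request URL instead of each action line.
    """
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({action: {"_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Assumption: index "foo" and type "bar" are given once, in the URL.
path = "/foo/bar/_bulk?pipeline=weeklyindex"
body = build_bulk_body([
    (1, {"text": "Some log message", "@timestamp": "2016-04-25T12:02:01.789Z"}),
    (2, {"text": "Some log message", "@timestamp": "2016-04-26T12:02:01.789Z"}),
])

print("PUT", path)
print(body, end="")
```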

--Alex

Hi,

You're totally right, but I wanted to check whether there was another way and whether I was the only one to find it quite annoying. So I'll file a GitHub issue on the subject this evening.

Anyhow, thanks for your answer :wink:

Best regards,

Charles.w
