Ingest pipeline seems to freeze during bulk import


#1

Hi all,

I am using filebeat with ingest pipelines and I suspect that one of them stalls the ingest node from time to time.

When the pipeline runs under load, the bulk threadpool fills up and starts rejecting requests.
I see EsRejectedExecutionException in the log, the queue is maxed out (500), and the "completed" value of the bulk thread pool no longer changes:

GET /_nodes/stats/thread_pool
(..)
  "thread_pool": {
	"bulk": {
	  "threads": 12,
	  "queue": 500,
	  "active": 12,
	  "rejected": 6063,
	  "largest": 12,
	  "completed": 34109
	},
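For continuous monitoring, the same counters are also available in tabular form via the _cat API (the column selection below is just one possible choice):

GET _cat/thread_pool/bulk?v&h=name,active,queue,rejected,completed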

All 12 CPU cores are stuck at 100% and the tasks never finish. The value of "running_time_in_nanos" keeps going up:

GET _tasks?pretty
(..)
  "tasks": {
    "nBBzTCUCR3SmV-NRfhGmvQ:57344": {
      "node": "nBBzTCUCR3SmV-NRfhGmvQ",
      "id": 57344,
      "type": "transport",
      "action": "indices:data/write/bulk",
      "start_time_in_millis": 1515672022666,
      "running_time_in_nanos": 12585319138455,
      "cancellable": false
    },

As far as I understand it, all threads are used by the same pipeline:

GET /_nodes/stats/ingest
(..)
  "ingest": {
    "total": {
      "count": ..,
      "time_in_millis": 6046,
      "current": 12, <========== total
      "failed": 0
    },
    ..
    "ait_appng-log-pipeline": {
      "count": ..,
      "time_in_millis": 438,
      "current": 12, <========== the only pipeline that has a "current" value greater zero
      "failed": 0
    },

No more messages are indexed until I restart the node.
My Elasticsearch version is 5.6.5, with Lucene 6.6.1.

I'm not sure whether my assumption that the problem is caused by the pipeline is correct. But if so, is there a way to dig deeper and find out what the ingest node is actually doing and why it blocks?
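Would something like the hot threads API show what the threads are busy with? For example:

GET _nodes/hot_threads?threads=12&type=cpu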

This is the pipeline:

GET _ingest/pipeline/ait_appng-log-pipeline
{
  "ait_appng-log-pipeline": {
    "processors": [
      {
        "grok": {
          "field": "message",
          "ignore_missing": true,
          "pattern_definitions": {
            "SESSION_ID": "[0-9A-F]{32}",
            "UNIQUE_ID": "(?:[0-9A-Fa-f_]+:){3}[0-9A-Fa-f_]+",
            "APPNG_USER": "(?:%{EMAILADDRESS}|%{USERNAME})",
            "GREEDYMULTILINE": "(.|\n)*"
          },
          "patterns": [
            """^%{TIMESTAMP_ISO8601:appng.core.log.timestamp}%{SPACE}%{LOGLEVEL:appng.core.log.loglevel}%{SPACE}\[%{DATA:appng.core.log.thread}\] \[(?:%{UNIQUE_ID:haproxy.unique_id})? (?:%{SESSION_ID:appng.core.log.session_id})? (?:%{APPNG_USER:appng.core.log.username})?@(?:%{WORD:appng.core.log.locale})? (?:%{URIPROTO:appng.core.log.ffor_proto})? (?:%{PATH:appng.core.log.path})?\] %{DATA:appng.core.log.category}: %{GREEDYMULTILINE:appng.core.log.message1}"""
          ]
        }
      },
      {
        "grok": {
          "field": "appng.core.log.message1",
          "ignore_missing": true,
          "ignore_failure": true,
          "pattern_definitions": {
            "JAVA_EXCEPTION": """(?:[a-zA-Z0-9_-]+\.)+[a-zA-Z0-9_-]*Exception\b"""
          },
          "patterns": [
            "\\n%{JAVA_EXCEPTION:appng.core.log.exception}"
          ]
        }
      },
      {
        "remove": {
          "field": "appng.core.log.message1",
          "ignore_failure": true
        }
      },
      {
        "date": {
          "field": "appng.core.log.timestamp",
          "target_field": "@timestamp",
          "formats": [
            "YYYY-MM-dd HH:mm:ss,SSS"
          ],
          "timezone": "Europe/Berlin"
        }
      },
      {
        "remove": {
          "field": "appng.core.log.timestamp",
          "ignore_failure": true
        }
      }
    ],
    "on_failure": [
      {
        "set": {
          "field": "error",
          "value": "{{ _ingest.on_failure_message }}"
        }
      }
    ]
  }
}
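To reproduce the behavior outside of filebeat, the pipeline can be fed a single document through the simulate API (the log line below is just a made-up example, not one of my real messages):

POST _ingest/pipeline/ait_appng-log-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "2018-01-11 12:00:00,123 INFO [main] example: hello"
      }
    }
  ]
}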

Best regards,
Dirk


(David Pilato) #2

Maybe you should then try to dedicate some nodes to be ingest-only (data: false, master: false)?
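In elasticsearch.yml, an ingest-only node would look something like this (5.x node role settings):

node.master: false
node.data: false
node.ingest: true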

What does GET _cat/nodes?v give?


#3

To be honest, there is only this one node, but it doesn't seem to have problems when about the same quantity of logs is fed through other pipelines.

This is a POC and we will certainly add more nodes before going to production. However, I'm afraid that adding capacity at this point won't help if this is a software or configuration issue.

The point is that even if I stop feeding logs in the described situation, the running tasks never finish. Shouldn't the cluster recover after some time if this were just a load problem?

ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.x.x.x           70          89  16    2.15    2.33     2.49 mdi       *      logsrv0

(David Pilato) #4

It probably should. Unless memory usage goes crazy and the old GC can't keep up...
But you should see signs of that in the logs...
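Old GC activity can also be checked in the node stats, e.g.:

GET _nodes/stats/jvm
(look at jvm.gc.collectors.old.collection_count and collection_time_in_millis)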

