In studying https://www.elastic.co/guide/en/elasticsearch/reference/current/handling-failure-in-pipelines.html, I see that "a pipeline defines a list of processors that are executed sequentially, and processing halts at the first exception". This appears to be a hard stop: the stream simply stops.
Consider this pipeline:
PUT _ingest/pipeline/rename-log4j-timestamp
{
  "description": "rename the timestamp field in a log4j2 row",
  "processors": [
    {
      "convert": {
        "field": "timeMillis",
        "type": "string",
        "ignore_failure": true
      }
    },
    {
      "date": {
        "field": "timeMillis",
        "formats": ["UNIX_MS"],
        "ignore_failure": true
      }
    }
  ]
}
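For context, here is how I have been testing this behavior with the _simulate API (the sample documents and field values below are made up for illustration):

```
POST _ingest/pipeline/rename-log4j-timestamp/_simulate
{
  "docs": [
    { "_source": { "timeMillis": 1588888888888, "message": "a row that fits the schema" } },
    { "_source": { "message": "a row with no timeMillis field" } }
  ]
}
```

The first document converts and parses cleanly; the second exercises the failure path.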
If the pipeline gets a row that fits the schema, everything is great, but if a row causes a problem, the pipeline appears to stall. The semantics of this do not seem to be documented very well, but it is convenient that no data seems to be lost until the pipeline is fixed. Do I understand this correctly?
My bigger problem is how to handle data that does not have timeMillis as a field. In this case, I just want to ignore the record (in my case, accepting the @timestamp that is already there). The definition I cited above states that the pipeline halts at the first exception. My preference would be for the whole pipeline to be skipped if there is a failure anywhere in it.
In my example here, I added ignore_failure to each processor. That works, but in turn every processor continues to be evaluated. What if continuing after a failed processor would be harmful? In other words, if later stages depend on earlier stages, there seems to be no way to exit the pipeline early and leave the record unmodified, without halting the pipeline altogether.
One solution would be an on_success element that complemented the on_failure. In that case, pipeline elements could cascade. A second solution would be for the pipeline definition itself to have a standard failure behavior; in that case, the ignore_failure could be removed from each processor. This should lead to more robust code, especially in large pipelines maintained by several developers.
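The docs do describe a pipeline-level on_failure, which seems close to the second idea: any processor failure stops the remaining processors and runs the handlers instead (the document is still indexed rather than the pipeline being skipped outright, if I read it correctly). A sketch, with the error-tag field name being my own invention:

```
PUT _ingest/pipeline/rename-log4j-timestamp
{
  "description": "rename the timestamp field in a log4j2 row",
  "processors": [
    { "convert": { "field": "timeMillis", "type": "string" } },
    { "date": { "field": "timeMillis", "formats": ["UNIX_MS"] } }
  ],
  "on_failure": [
    {
      "set": {
        "field": "pipeline_error",
        "value": "{{ _ingest.on_failure_message }}"
      }
    }
  ]
}
```

This drops the per-processor ignore_failure flags, and a failed record arrives tagged with the error message instead of half-processed.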
Thoughts? I'm new to pipelines and not sure if I am looking at them right.
Thanks! Brian