Hi,
I am currently using ingest pipelines to enrich my documents before they are written into the index.
I am wondering if this process could consume a lot of resources (heap, RAM, or CPU) on my Elasticsearch nodes behind the scenes.
Below are the processors that compose my ingest pipeline.
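The original pipeline definition did not come through, but based on the processors mentioned (enrich, script, set), a pipeline along these lines is assumed; the pipeline name, policy name, and field names below are made up for illustration:

```json
PUT _ingest/pipeline/my-enrich-pipeline
{
  "processors": [
    {
      "enrich": {
        "policy_name": "users-policy",
        "field": "email",
        "target_field": "user_info"
      }
    },
    {
      "script": {
        "source": "ctx.processed = true"
      }
    },
    {
      "set": {
        "field": "pipeline_version",
        "value": "1"
      }
    }
  ]
}
```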
The reason I ask is that Elasticsearch has to loop through every document to see whether it matches my enrich policy, execute the script processor, and run the other set processors. That sounds like a lot of work for Elasticsearch, since I am using this ingest pipeline for a data stream. So I am wondering if this could consume a lot of resources, and if it does, is there any way I can optimize it?
Ingest pipelines consume varying amounts of CPU and RAM depending on the complexity of the pipeline and what it does.
Things like a poorly written regex can be very expensive.
From a quick look at your pipeline, it is fairly low complexity.
This is not the correct way to look at it... Elasticsearch does not "loop" through the enrich index to enrich the source document; it does a term lookup, which is extremely fast and efficient in Elasticsearch...
So I would say your pipeline is fairly low in resource intensity... run it... test it.
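One cheap way to test is the simulate API, which runs a pipeline against sample documents without indexing anything; a sketch, assuming a pipeline named my-enrich-pipeline and a made-up sample document (add `?verbose` to see per-processor results):

```json
POST _ingest/pipeline/my-enrich-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "email": "jane@example.com"
      }
    }
  ]
}
```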
I am currently on version 8.8. Good to know that my ingest pipeline is low complexity, and thanks for pointing out that a poor regex can consume a lot of resources; I will pay attention to that when writing queries in the future.
And when I run GET /_nodes/hot_threads, the output looks very normal:

```
1.4% [cpu=1.4%, idle=98.6%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[instance-0000000004][transport_worker][T#1]'
```
I am a bit interested in how the term lookup works when Elastic enriches the document. I am not sure I understand the flow correctly.
I found this picture in the official documentation.
For example, my incoming document has a field called email, and my source index also has an email field. As long as the incoming document's email value matches an email value in the source index, that document will be written into an index called the "enrich index". I assume it would look something like the example below inside the enrich index.
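The example itself seems to be missing, so here is a guess at what a document inside the enrich index might look like; the field names and values are made up for illustration (the match field plus whatever enrich fields the policy copies over):

```json
{
  "email": "jane@example.com",
  "first_name": "Jane",
  "last_name": "Doe",
  "department": "Engineering"
}
```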
Then Elastic performs a term lookup, which jumps straight to the document's location, for example, to the document just indexed into the enrich index, and Elastic takes it out for the rest of the processor execution. Because of this, it is extremely fast.
The second picture shows doc values, which are not really used for this part.
The enrich index is compacted and optimized for term lookups.
It pretty much depends on how deep you want to get, but a term lookup uses the inverted index to find the record extremely quickly. Doc values are used for aggregations and sorting, not really for lookups.
So yeah... it's fast and efficient, especially for reasonably sized enrich indices (say, hundreds of thousands to low millions of documents).
There are lots of data structures in Elasticsearch; it may take a while to figure them all out.
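The lookup the enrich processor does is roughly equivalent to running a term query against the lookup data yourself, something like the sketch below (the index name is illustrative; the real enrich indices are hidden system indices managed by Elasticsearch):

```json
GET my-lookup-index/_search
{
  "query": {
    "term": {
      "email": "jane@example.com"
    }
  }
}
```

The inverted index maps each term directly to the documents containing it, so this resolves in effectively constant time per term rather than scanning the index.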
Sorry, I have one more question I am curious about. Let's say we have 500,000 different email addresses in the source index, and every incoming document contains an email address. Does Elasticsearch have to loop through the source index and compare each entry one by one with the incoming document to find the matching email, so that it can then add the enrich data to the incoming document?
If that is the case, would it become resource intensive as my source index keeps growing?
I am not following... Elasticsearch does not loop through anything.
If by source index you mean the source of the enrich index, i.e. the lookup index, then no: it uses the term query / inverted index as discussed to find the matching enrich data very efficiently.
As the picture shows, the enrich / lookup happens as each new document comes in... very fast.
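For reference, a match-type enrich policy over a source index is defined and then executed to build the enrich index; a sketch with hypothetical index, policy, and field names:

```json
PUT _enrich/policy/users-policy
{
  "match": {
    "indices": "users",
    "match_field": "email",
    "enrich_fields": ["first_name", "last_name", "department"]
  }
}

POST _enrich/policy/users-policy/_execute
```

Note that executing the policy is what snapshots the source index into the optimized enrich index, so the policy needs to be re-executed when the source data changes.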