Updating enrich index for pipeline

veryelastic · July 31, 2023, 8:24pm

Hello,

I have an 8.8.1 cluster, and am running documents through a series of ingest pipelines. One of these pipelines is an enrich stage.

This data which is used to enrich the documents is sourced from an index via an enrich policy.

This source data needs to be refreshed periodically (ideally, every 30 minutes) and I was hoping to get advice from people here on what the recommended method / steps are for doing that.

My process so far involves:

retrieving the data from an API via a Lambda
deleting the source index to remove the now-stale data
recreating the source index
inserting the data
triggering the enrich policy
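
As a rough sketch, the steps above map onto the following Elasticsearch REST calls. The index name, policy name, and document shape are hypothetical placeholders. The explicit refresh step is an added assumption (so that every document is searchable before the policy executes) and is not something stated in this thread. The function only builds the request plan; sending it is left to whatever HTTP client the Lambda uses.

```python
import json


def build_refresh_plan(index, policy, docs):
    """Return the ordered (method, path, body) calls for one refresh cycle.

    `index`, `policy`, and the shape of `docs` are placeholders for
    illustration; nothing is sent to a cluster here.
    """
    # The bulk API expects NDJSON: an action line, then the document,
    # with a trailing newline at the end of the payload.
    bulk_lines = []
    for doc in docs:
        bulk_lines.append(json.dumps({"index": {"_index": index}}))
        bulk_lines.append(json.dumps(doc))
    bulk_body = "\n".join(bulk_lines) + "\n"

    return [
        ("DELETE", f"/{index}", None),   # drop the stale source index
        ("PUT", f"/{index}", None),      # recreate it (mappings omitted)
        ("POST", "/_bulk", bulk_body),   # insert the fresh data
        # Assumption: refresh so all documents are searchable before the
        # policy execution copies them into a new .enrich-* index.
        ("POST", f"/{index}/_refresh", None),
        ("PUT", f"/_enrich/policy/{policy}/_execute", None),
    ]


plan = build_refresh_plan(
    "geo-source", "geo-policy", [{"ip": "10.0.0.1", "site": "lab"}]
)
for method, path, _ in plan:
    print(method, path)
```

The bulk request needs a `Content-Type: application/x-ndjson` header when it is actually sent.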

Is there a better / more Elastic way of performing this refresh?

Thanks!

stephenb · July 31, 2023, 11:37pm

Hi @veryelastic, that seems about right... it's pretty much straight out of the docs.

veryelastic · August 1, 2023, 8:33am

Thanks Stephen!

Are there any recommendations on which data tier the source index should be at? I have hot and warm tiers available. Would this affect where the cached enriched index would be?

I'm also seeing some discrepancy between the number of documents in the source index vs. the number of documents in the processed ".enrich..." index created by the policy execution, with the latter being lower than the former. This suggests that there is data missing from the cached index.

Additionally, I have dedicated ingest nodes in the cluster, and I'm seeing a marked performance difference between the two - identical - instances, with one performing ~10-20% better than the other. Do the cached .enrich indices get distributed to the ingest nodes, or do those nodes have to make a round-trip to wherever the index lives for each document?
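
One way to put numbers on the difference between the two ingest nodes is the node stats API (`GET /_nodes/stats/ingest`), which reports per-node ingest counts and cumulative time. The sketch below assumes the response has already been parsed into a dict; the node names and values in the sample are made up.

```python
def ingest_ms_per_doc(node_stats):
    """Map node name -> average ingest milliseconds per document."""
    result = {}
    for node in node_stats.get("nodes", {}).values():
        total = node.get("ingest", {}).get("total", {})
        count = total.get("count", 0)
        if count:  # skip nodes that have not ingested anything
            result[node["name"]] = total.get("time_in_millis", 0) / count
    return result


# Made-up sample in the shape returned by GET /_nodes/stats/ingest
sample = {
    "nodes": {
        "a1": {
            "name": "ingest-1",
            "ingest": {"total": {"count": 1000, "time_in_millis": 2000}},
        },
        "b2": {
            "name": "ingest-2",
            "ingest": {"total": {"count": 1000, "time_in_millis": 2500}},
        },
    }
}
print(ingest_ms_per_doc(sample))
```

These counters are cumulative since node start, so comparing two snapshots taken some minutes apart gives a fairer picture than a single reading.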

stephenb · August 2, 2023, 5:12am

Wow, lots of questions...

I'll do my best; there is some subtlety here.

Enrich index: on the hot tier.
Just for clarity, enrich indices live on data nodes, not on the ingest-only nodes where pipelines run. So be aware that the ingest node actually has to query the data node where the enrich index lives. What this means is that it may or may not be more efficient to have the ingest and data nodes dual-roled... I'll let that sink in a bit.
As for why your cached index has fewer documents: unclear. How did you determine that? Which command? You could test by running the full set of docs through the pipeline and saving only the ones that show no match.
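
A minimal sketch of that test, assuming the documents are run through the simulate pipeline API (`POST /_ingest/pipeline/<pipeline>/_simulate`) and that the enrich processor writes to a hypothetical top-level target field named `enriched`. It keeps only the documents that came back without a match.

```python
def unmatched_docs(simulate_response, target_field="enriched"):
    """Return the _source of every simulated doc lacking the enrich target field.

    `target_field` is a hypothetical name; use the value configured on the
    enrich processor. Only top-level fields are checked.
    """
    missing = []
    for entry in simulate_response.get("docs", []):
        if "doc" not in entry:
            continue  # entries that failed in the pipeline carry no "doc"
        source = entry["doc"].get("_source", {})
        if target_field not in source:
            missing.append(source)
    return missing


# Made-up sample in the shape of a _simulate response
sample = {
    "docs": [
        {"doc": {"_source": {"ip": "10.0.0.1", "enriched": {"site": "lab"}}}},
        {"doc": {"_source": {"ip": "10.0.0.2"}}},
    ]
}
print(unmatched_docs(sample))
```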

leandrojmp · August 2, 2023, 5:38am

The .enrich-* indices have a _tier_preference of data_content, so they will be on the nodes with this role.

Normally you would put this role on the hot nodes with the data_hot role as well.
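
One way to check where a given index is meant to live is to read its `index.routing.allocation.include._tier_preference` setting from `GET /<index>/_settings`. A small sketch, assuming the response has already been parsed into a dict; the index names and values in the sample are invented for illustration.

```python
def tier_preference(settings_response):
    """Map index name -> its _tier_preference value (None if unset)."""
    prefs = {}
    for index, body in settings_response.items():
        include = (
            body.get("settings", {})
            .get("index", {})
            .get("routing", {})
            .get("allocation", {})
            .get("include", {})
        )
        prefs[index] = include.get("_tier_preference")
    return prefs


# Invented sample in the shape returned by GET /<index>/_settings
sample = {
    ".enrich-geo-policy-1690800000000": {
        "settings": {"index": {"routing": {"allocation": {
            "include": {"_tier_preference": "data_content"}}}}}
    },
    "geo-source": {
        "settings": {"index": {"routing": {"allocation": {
            "include": {"_tier_preference": "data_hot"}}}}}
    },
}
print(tier_preference(sample))
```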

veryelastic · August 2, 2023, 9:06am

Thanks again for the reply.

Taking in @leandrojmp's response, below, I have confirmed that while the source index is on the data_hot tier, the .enrich-* index is on data_content.

@stephenb, your point about the enrich indices being on the data nodes makes sense. I was hoping, though, to get confirmation that the enrichment data would be cached in memory on the ingest nodes to avoid having to perform a lookup against the .enrich index for each document.

I have checked the read iops on the data_content tier, and nothing is being adversely impacted, so I suspect that the data is cached in memory there. It would be nice to eliminate the round-trip latency and the implications for bandwidth utilisation, given that this cluster is spread across multiple AZs in AWS.

I currently have separate ingest and data_hot nodes, and am reluctant to merge those roles as the cluster is very ingest-heavy, and both the data_hot and ingest nodes are running about as warm as I am comfortable with.

The docs also suggest that it could be prudent to have distinct ingest-only nodes for heavy loads.

As for how I determined the docs count, I have stack monitoring set up for the cluster and noted log messages about the enrich policy execution which indicated that there was a suspiciously-round number of documents processed. I went into Stack Management -> Index Management -> Indices, found the source index, and viewed the stats. I then did the same for the relevant .enrich index and found the discrepancy.

It has occurred fairly often, to the extent that my data has been negatively affected, and I have had to manually trigger a refresh of the source data, wait until the counts matched, and then disable the periodic refresh of the source index data to prevent the discrepancy returning.
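
The same comparison can be made from the API rather than the UI with the `_count` endpoint. A minimal sketch: the index and policy names are hypothetical, and it assumes the latest enrich index is reachable through an alias named `.enrich-<policy>`, which should be verified on the cluster in question.

```python
def count_paths(source_index, policy):
    """Return the _count paths for the source index and the enrich alias."""
    # Assumption: the newest .enrich-<policy>-<timestamp> index sits behind
    # an alias called .enrich-<policy>.
    return f"/{source_index}/_count", f"/.enrich-{policy}/_count"


def count_gap(source_count_response, enrich_count_response):
    """Return how many documents the enrich index is short by."""
    return source_count_response["count"] - enrich_count_response["count"]


print(count_paths("geo-source", "geo-policy"))
# Made-up _count responses; a positive gap means missing enrich documents.
print(count_gap({"count": 12500}, {"count": 10000}))
```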

Information overload again, sorry!

stephenb · August 2, 2023, 1:24pm

Hi @veryelastic
Unless something has changed very recently: no, the enrich index is not cached on ingest-only nodes.

By definition, ingest-only nodes do not hold data. The enrich index is data, and therefore it, and its cache, can only live on a data node.

If everything is working that's fine... I was just pointing that out.

I do think at some point they may try to figure out how to get those cached over to ingest-only nodes.

veryelastic · August 12, 2023, 7:50pm

Hello,

To follow up on this, I have periodically kept an eye on the request rate on the .enrich-xxx index produced by my enrich policy, and it most definitely is not getting hit on each pipeline execution.

I can only assume that this points towards the processed enrich index data being cached in-memory on the ingest-only nodes, unless the query stats don't get produced for those .enrich-xxx indices as they do for all others.
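
The enrich stats API (`GET /_enrich/_stats`) may help test this assumption. The sketch below assumes the response carries a `cache_stats` list with per-node `hits` and `misses` for the enrich processor's search cache (sized by the `enrich.cache_size` setting). Those field names are taken on trust from the 8.x documentation and should be checked against an actual response; the numbers in the sample are made up.

```python
def cache_hit_ratios(enrich_stats):
    """Map node_id -> fraction of enrich lookups served from the cache."""
    ratios = {}
    for node in enrich_stats.get("cache_stats", []):
        hits = node.get("hits", 0)
        lookups = hits + node.get("misses", 0)
        if lookups:  # avoid dividing by zero on idle nodes
            ratios[node["node_id"]] = hits / lookups
    return ratios


# Made-up sample; field names are an assumption about GET /_enrich/_stats
sample = {
    "cache_stats": [
        {"node_id": "node-a", "count": 500, "hits": 900, "misses": 100, "evictions": 0},
        {"node_id": "node-b", "count": 0, "hits": 0, "misses": 0, "evictions": 0},
    ]
}
print(cache_hit_ratios(sample))
```

A high ratio on the ingest nodes would be consistent with lookups being answered from memory rather than by searching the .enrich index each time.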

My original question (what is the recommended process for updating source index for an enrich policy) has been answered, so I'm happy to close this now.

Thanks all for the help.

system · September 9, 2023, 7:51pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
