Updating enrich index for pipeline

Hello,

I have an 8.8.1 cluster, and am running documents through a series of ingest pipelines. One of these pipelines is an enrich stage.

The data used to enrich the documents is sourced from an index via an enrich policy.

This source data needs to be refreshed periodically (ideally, every 30 minutes) and I was hoping to get advice from people here on what the recommended method / steps are for doing that.

My process so far involves:

  • retrieving the data from an API via a Lambda
  • deleting the source index to remove the now-stale data
  • recreating the source index
  • inserting the data
  • triggering the enrich policy
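
The Elasticsearch side of the steps above can be sketched as an ordered list of REST calls. This is only a minimal sketch under stated assumptions: the index name `enrich-source`, the policy name `my-enrich-policy`, the sample documents, and the helper `build_refresh_requests` are all hypothetical. The function builds the requests but does not send them, so any HTTP client can replay them in order.

```python
import json

def build_refresh_requests(index, policy, docs, mappings=None):
    """Return the ordered (method, path, body) calls for one refresh cycle.

    Nothing is sent here; the caller hands each tuple to its HTTP client.
    """
    # _bulk takes newline-delimited JSON: an action line, then the document.
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    bulk_body = "\n".join(lines) + "\n"  # _bulk needs a trailing newline

    create_body = json.dumps({"mappings": mappings}) if mappings else None
    return [
        ("DELETE", f"/{index}", None),                        # drop stale data
        ("PUT", f"/{index}", create_body),                    # recreate the index
        ("POST", f"/{index}/_bulk", bulk_body),               # insert fresh data
        ("PUT", f"/_enrich/policy/{policy}/_execute", None),  # rebuild enrich index
    ]

# Hypothetical names and documents, for illustration only.
requests_to_send = build_refresh_requests(
    "enrich-source",
    "my-enrich-policy",
    [{"ip": "10.0.0.1", "owner": "team-a"}, {"ip": "10.0.0.2", "owner": "team-b"}],
)
for method, path, _ in requests_to_send:
    print(method, path)
```

Keeping the request building separate from the sending makes the order of operations easy to check before it runs on a 30-minute schedule.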

Is there a better / more Elastic way of performing this refresh?

Thanks!

Hi @veryelastic, that seems about right... pretty much straight out of the docs.

Thanks Stephen!

Are there any recommendations on which data tier the source index should be at? I have hot and warm tiers available. Would this affect where the cached enriched index would be?

I'm also seeing some discrepancy between the number of documents in the source index vs. the number of documents in the processed ".enrich..." index created by the policy execution, with the latter being lower than the former. This suggests that there is data missing from the cached index.

Additionally, I have dedicated ingest nodes in the cluster, and I'm seeing a marked performance difference between the two - identical - instances, with one performing ~10-20% better than the other. Do the cached .enrich indices get distributed to the ingest nodes, or do those nodes have to make a round-trip to wherever the index lives for each document?

Wow, lots of questions...

I will do my best; there is some subtlety here.

  1. Enrich index on the hot tier.

  2. Just for clarity, enrich indices live on data nodes, not on the ingest-only nodes where pipelines run. So be aware that the ingest node actually has to query the data node where the enrich index lives. This means it may or may not be more efficient to have the ingest and data nodes dual-roled... I'll let that sink in a bit :slight_smile:

  3. Why your cached index has fewer documents is unclear. How did you determine that... which command? You could test by running the full set of docs through the pipeline and only saving the ones that show no matches.

The .enrich-* indices have a _tier_preference of data_content, so they will be on the nodes with this role.

Normally you would put this role on the hot nodes with the data_hot role as well.
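
One way to confirm where an index is allowed to live is to read its `_tier_preference` from the index settings, for example with `GET /.enrich-*/_settings`. The sketch below only parses such a response. The helper name `tier_preferences` and the sample payload (including both index names) are hypothetical; the sample just mirrors the nested shape of a settings response.

```python
def tier_preferences(settings_response):
    """Map each index name to its _tier_preference list (empty if unset)."""
    result = {}
    for index_name, body in settings_response.items():
        node = body.get("settings", {}).get("index", {})
        # Walk index.routing.allocation.include._tier_preference safely.
        for key in ("routing", "allocation", "include"):
            node = node.get(key, {})
        raw = node.get("_tier_preference", "")
        result[index_name] = [tier for tier in raw.split(",") if tier]
    return result

# Illustrative payload only; real index names and values will differ.
sample = {
    ".enrich-example-policy": {
        "settings": {"index": {"routing": {"allocation": {"include": {
            "_tier_preference": "data_content"}}}}}
    },
    "enrich-source": {
        "settings": {"index": {"routing": {"allocation": {"include": {
            "_tier_preference": "data_hot"}}}}}
    },
}
print(tier_preferences(sample))
```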

Thanks again for the reply.

Taking in @leandrojmp's response, below, I have confirmed that while the source index is on the data_hot tier, the .enrich-* index is on data_content.

@stephenb , your point about the enrich indices being on the data nodes makes sense. I was hoping, though, to get confirmation that the enrichment data would be cached in memory on the ingest nodes to avoid having to perform a lookup against the .enrich index for each document.

I have checked the read iops on the data_content tier, and nothing is being adversely impacted, so I suspect that the data is cached in memory there. It would be nice to eliminate the round-trip latency and the implications for bandwidth utilisation, given that this cluster is spread across multiple AZs in AWS.

I currently have separate ingest and data_hot nodes, and am reluctant to merge those roles as the cluster is very ingest-heavy, and both the data_hot and ingest nodes are running about as warm as I am comfortable with.

The docs also suggest that it could be prudent to have distinct ingest-only nodes for heavy loads.

As for how I determined the docs count, I have stack monitoring set up for the cluster and noted log messages about the enrich policy execution which indicated that there was a suspiciously-round number of documents processed. I went into Stack Management -> Index Management -> Indices, found the source index, and viewed the stats. I then did the same for the relevant .enrich index and found the discrepancy.
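
The same comparison can be scripted rather than read off the UI, for example by calling `GET /<index>/_count` on both indices and comparing the results. This is a minimal sketch: the helper `compare_counts` and the sample numbers are hypothetical, and it assumes each argument is the parsed JSON body of a `_count` response. If the enrich policy defines a `query`, the enrich index is expected to hold only the matching documents, so a gap is not necessarily a fault.

```python
def compare_counts(source_response, enrich_response):
    """Summarise the document-count gap between source and enrich index."""
    for name, resp in (("source", source_response), ("enrich", enrich_response)):
        # A count is only trustworthy if every shard answered.
        if resp.get("_shards", {}).get("failed", 0) > 0:
            raise ValueError(f"{name} count had failed shards")
    source = source_response["count"]
    enrich = enrich_response["count"]
    missing = source - enrich
    return {
        "source": source,
        "enrich": enrich,
        "missing": missing,
        "complete": missing == 0,
    }

# Made-up numbers, for illustration only.
report = compare_counts(
    {"count": 10437, "_shards": {"total": 1, "successful": 1, "failed": 0}},
    {"count": 10000, "_shards": {"total": 1, "successful": 1, "failed": 0}},
)
print(report)
```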

It has occurred fairly often, to the extent that my data has been negatively affected, and I have had to manually trigger a refresh of the source data, wait until the counts matched, and then disable the periodic refresh of the source index data to prevent the discrepancy returning.

Information overload again, sorry!

Hi @veryelastic
Unless something has changed very recently, no, the enrich index is not cached on ingest-only nodes.

By definition, ingest-only nodes do not hold data. The enrich index is data, so it and its cache can only live on a data node.

If everything is working that's fine... I was just pointing that out.

I do think at some point they may try to figure out how to get those cached over to ingest-only nodes.

Hello,

To follow up on this, I have periodically kept an eye on the request rate on the .enrich-xxx index produced by my enrich policy, and it most definitely is not getting hit on each pipeline execution.

I can only assume that this points towards the processed enrich index data being cached in-memory on the ingest-only nodes, unless the query stats don't get produced for those .enrich-xxx indices as they do for all others.
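
That observation can be put into numbers by taking two snapshots of `GET /.enrich-*/_stats/search` and dividing the growth in `query_total` by the number of documents ingested in between. The sketch below assumes each snapshot is the parsed JSON of such a response; the helper names, the index name, and the figures are hypothetical. A ratio near 1 would mean roughly one search per document, while a much lower ratio would be consistent with lookups being answered without touching the index, although this alone cannot show where.

```python
def query_totals(stats_response):
    """Extract query_total per index from an index stats response."""
    return {
        name: body["total"]["search"]["query_total"]
        for name, body in stats_response.get("indices", {}).items()
    }

def queries_per_document(before, after, docs_ingested):
    """Searches per ingested document for each index between two snapshots."""
    if docs_ingested <= 0:
        raise ValueError("docs_ingested must be positive")
    start = query_totals(before)
    end = query_totals(after)
    # Only compare indices present in both snapshots.
    return {
        name: (end[name] - start[name]) / docs_ingested
        for name in end
        if name in start
    }

# Made-up snapshots, for illustration only.
before = {"indices": {".enrich-example": {"total": {"search": {"query_total": 100}}}}}
after = {"indices": {".enrich-example": {"total": {"search": {"query_total": 350}}}}}
print(queries_per_document(before, after, docs_ingested=1000))
```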

My original question (what is the recommended process for updating source index for an enrich policy) has been answered, so I'm happy to close this now.

Thanks all for the help.
