Usually data are not enriched

Hi,

Are there situations when data are not enriched? (I'm excluding the scenario where there is no matching data.)
Example: can it be skipped for performance reasons?

Is it possible for the enrich processor in a pipeline to be skipped?
By adding tags, can I rule out the scenario where the enrich processor is skipped?

Flow :
Logs -> Logstash -> Elasticsearch pipeline (where enrich processor exists) -> Elasticsearch Index

Indexing speed (enriched index): approx. 80k in 45 minutes

Elasticsearch 7.17

You can configure your processor to use a conditional, and it will only run if the conditional is true.

Check this part of the documentation.
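As a sketch (the field names, policy name, and conditional here are assumptions, not your actual configuration), an enrich processor with an `if` conditional could look like this; the processor only runs when the conditional script returns true:

        {
          "enrich" : {
            "if" : "ctx.product?.id != null",
            "policy_name" : "pricing-policy",
            "field" : "product.id",
            "target_field" : "pricing",
            "max_matches" : "1"
          }
        }

Here the null-safe `?.` operator makes the conditional false (and skips the processor) when the field is missing.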

Also, since you are using Logstash, and depending on what you are enriching, it may be much easier and faster to enrich the data in Logstash.
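For example (a sketch only; the hosts, index, and field names are assumptions), the Logstash elasticsearch filter can look up the matching document and copy a field onto the event while it is still in Logstash:

    filter {
      elasticsearch {
        hosts  => ["localhost:9200"]
        index  => "pricing"
        query  => "ID:%{[product][id]}"
        fields => { "Price" => "price" }
      }
    }

This avoids the ingest-pipeline round trip, at the cost of one query per event against the lookup index.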


Thank you for reply.

OK, I understand that using if statements can skip the enrich processor in an ingest pipeline.

What do I want to enrich?
Example:

Index "products" contains: name of product, type of product, ID, and color
Index "pricing" contains: price, tax, ID

I want to enrich "products" with price, using the enrich processor in an Elasticsearch ingest pipeline.
Additionally, I understand that I can do it in Logstash.
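As a sketch of what I have in mind (the policy name is just a placeholder), the match enrich policy would look roughly like:

    PUT /_enrich/policy/pricing-policy
    {
      "match" : {
        "indices" : "pricing",
        "match_field" : "ID",
        "enrich_fields" : [ "Price", "tax" ]
      }
    }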

When can the enrich processor in an Elasticsearch pipeline be skipped? (Excluding: if statements and cases where there is no matching data; for enrichment I want to use an Elasticsearch pipeline with an enrich processor.)

The ingest pipeline will run the processors in the order they are configured. If you have an enrich processor in the ingest pipeline for your products index, it will be executed for every event.

If you want to skip a processor, you need a conditional based on some kind of data; there is no other way to skip the execution of a processor.

Also, the enrich processor is recommended for static data. If you need to constantly update your source index, you will need to manually call the _execute API to update the data of the enrich index every time the source index is updated.
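As a sketch, assuming a policy named pricing-policy, re-running the policy after the source index changes is a single call:

    POST /_enrich/policy/pricing-policy/_execute

Until this runs, the enrich processor keeps reading from the enrich index built by the previous execution.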

OK,

so... what will happen if I don't update the data of the enrich index after adding some data (to the source index)?
Will the enrich processor run but produce no results?
By updating the data of the enrich index, do you mean calling /_enrich/policy/name_of_policy/_execute?

so... if I exclude if statements and the enrich processor is defined in the pipeline, will it ALWAYS execute? Can it never be skipped?

or...

When can data end up not enriched? (Excluding not updating the enrich index after adding some data.)

If you add new data to the source index of your enrich policy, this new data will only be available after you run a new _execute on the policy; this will create a new enrich index.

If you do not have an if conditional on your enrich processor, it will always be executed.

As I said in the previous answer, the processors are executed in the order they are configured in the ingest pipeline, and they are always executed. If you want to skip some processors, you need a conditional on each of those processors to check whether it should run; in that case the conditional itself will always be evaluated.

If there is no match in the enrich processor, the data will not be enriched, but the processor will still be executed.
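One way to verify this behaviour (the pipeline and field names here are assumptions) is the simulate API, which runs the pipeline against a test document without indexing anything, so you can see whether the target field gets populated:

    POST /_ingest/pipeline/my-pipeline/_simulate
    {
      "docs" : [
        { "_source" : { "product" : { "id" : "123" } } }
      ]
    }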

How many replicas should the .enrich-INDEXNAME index have?

Should it be the sum of hot + warm + ingest nodes? Or maybe only the ingest nodes?
What roles should the hot/warm/ingest nodes have? (3x hot (drt), 3x warm (drt), 6x ingest (di))

I'm trying to deal with a situation where, usually, data are not enriched, but not because there is no matching data...

You should leave it using the default configuration. If I'm not wrong, it will auto-expand to every data node, or at least every node with the data_content role.
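If you want to confirm what the enrich indices are actually configured with (for example the auto_expand_replicas setting), you can inspect their settings directly:

    GET /.enrich-*/_settings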

What situation? Please provide more context.

There are two independent sources gathered by Logstash (two indices: index1 and index2).
Each source has its own Logstash pipeline.

Index1 looks simple.
Logstash crawls CSV files; there are some filter rules like csv, translate, dissect, and fingerprint.
The fingerprint is calculated from two fields, with:

    concatenate_sources => true

In its output, Logstash has some interesting options:

    doc_as_upsert => true
    document_id => "%{fingerprint}"

The output also defines the ingest pipeline the document will be sent to.

The ingest pipeline has a few processors, like date, grok, and date.
The _enrich/policy is executed 3x per day. An excerpt from the policy:

          "match_field" : "index1.number",
          "enrich_fields" : [
            "geo.region_iso_code_old",
            "index1.commune.id",
            "index1.company.id",
            "index1.company.name",
            "index1.service"
          ]

The CSV file is created once per day (around 00:00).


The image above shows that ~98% of records are updated.

Index2 is more complex.
Logstash crawls CSV files; there are some filter rules like csv, dissect, and mutate, and the data is sent to an ingest pipeline.

That pipeline has many sub-pipelines, but one of them contains an enrich processor which depends on index1:

        "enrich" : {
          "tag" : "index1 b",
          "ignore_missing" : true,
          "policy_name" : "index1",
          "field" : "tmp.enrich_pl.value",
          "target_field" : "tmp.enrich_pl.b",
          "max_matches" : "1"
        }

Final effect

I'm sorry, but I don't get what the issue is from just that information.

You didn't share what your data looks like or what the data in your enrich index looks like. For example, in your last screenshot you shared a field named b.company.name; where does this come from? It is not possible to know what the issue may be from what you shared.

You need to share some sample data from both your index and your enrich index, so it is possible for someone to try to replicate your issue.

Can you share an example of a document that should've been enriched but wasn't, and also the data from the enrich index?
