I have daily log indexes with ~500 million records per index (logs-20230320, logs-20230321, ...).
And I have a list of malicious IP addresses (~150,000 records) in another index (blacklist-202303), which is rebuilt every day.
I need to create a report of all malicious IP addresses seen in the logs over a week.
How can I join the log and blacklist indexes, or is there any option for using an index as a lookup?
I found this solution:
Here is my example:
But I'm not sure this solution is suitable for ~150,000 records.
I saw this topic, but it's too expensive for my data:
The best approach is to enrich your data during indexing. If you are using Logstash, it is pretty easy to do what you want. If you are sending data directly to Elasticsearch, you will need to make some changes to your blacklist indices, and you can try to use an enrich processor in an ingest pipeline.
I'm indexing data with my own software.
Blacklist data is dynamic (rebuilt every day). For example, 99.86.38.68 is in the blacklist today but may not be in it tomorrow, so the enrich method does not seem right for me.
I need a join at search time.
Elasticsearch does not support query-time joins, so I do not think there is any efficient way to do what you are looking for. I would recommend the approach of enriching at index time that Leandro suggested. If, together with this, you monitor changes to the blacklist and update the indexed logs through update-by-query whenever the blacklist is modified, you have a solution that could work as long as the blacklist is not frequently updated and each blacklist item matches relatively few log entries.
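A minimal sketch of that update-by-query step, assuming the 8.x Python client and hypothetical names (a `client_ip` field in the logs and a `blacklisted` flag) that are not from this thread:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical example: after the blacklist changes, flag log documents whose
# client_ip matches an IP that was just added to the blacklist.
newly_added_ips = ["99.86.38.68"]  # assumed to come from your blacklist diff

es.update_by_query(
    index="logs-*",
    query={"terms": {"client_ip": newly_added_ips}},
    script={
        "source": "ctx._source.blacklisted = true",
        "lang": "painless",
    },
    conflicts="proceed",  # skip version conflicts instead of aborting
)
```

A second update-by-query with `blacklisted = false` would be needed for IPs removed from the blacklist, which is why this only stays cheap if the list does not churn too much.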
As already explained, there is no join in Elasticsearch, so you need to add the information at ingest time.
I'm not sure why you think enrich is not correct; you can change the enrich data.
In your case, since you are not using Logstash, you would need an ingest pipeline that runs while you are indexing your data from your own software. In this ingest pipeline you would have an enrich processor; this enrich processor runs an enrich policy that adds the information from your blacklist index to the current document if there is a match.
But you would basically need to change the structure of the blacklist index to have one document per IP address instead of an array with multiple IP addresses.
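A rough sketch of that restructuring, assuming the 8.x Python client and a hypothetical `ip` field (any field name would do, as long as the enrich policy matches on it):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Assumed list of blacklisted IPs, e.g. loaded from your daily feed.
blacklisted_ips = ["99.86.38.68", "203.0.113.7"]

# Map the field as type "ip" so it can also be queried as an IP later.
es.indices.create(
    index="blacklist-202303",
    mappings={"properties": {"ip": {"type": "ip"}}},
)

# One document per IP address, so the enrich processor can match on "ip".
actions = (
    {"_index": "blacklist-202303", "_id": ip, "_source": {"ip": ip}}
    for ip in blacklisted_ips
)
helpers.bulk(es, actions)
```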
To do what you want in Elasticsearch you need to enrich your data while indexing, so you need the following steps (a sketch of these steps is shown after the list):

1. Create your source index with your blacklisted IPs.
2. Create an enrich policy using the blacklisted IPs index as the source index, and execute it.
3. Create an ingest pipeline with an enrich processor that uses this policy.
4. Tell Elasticsearch to run this ingest pipeline while indexing your data; this can be done by adding the setting index.final_pipeline to your index settings/templates.

When you need to update the data, you just need to recreate the source index from step 1 and execute the enrich policy again from step 2; this will update the enrich index used by the enrich processor.
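A minimal sketch of these steps, assuming the 8.x Python client and hypothetical names (the policy `blacklist-policy`, the pipeline `blacklist-enrich`, and a `client_ip` field in the logs are my own placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Step 2: create and execute an enrich policy that matches on the blacklist's "ip" field
# (the source index from step 1 is assumed to already exist with one doc per IP).
es.enrich.put_policy(
    name="blacklist-policy",
    match={
        "indices": "blacklist-202303",
        "match_field": "ip",
        "enrich_fields": ["ip"],  # add more fields here if your blacklist has them
    },
)
es.enrich.execute_policy(name="blacklist-policy")

# Step 3: ingest pipeline with an enrich processor; the logs are assumed to carry
# the source address in a "client_ip" field.
es.ingest.put_pipeline(
    id="blacklist-enrich",
    processors=[
        {
            "enrich": {
                "policy_name": "blacklist-policy",
                "field": "client_ip",
                "target_field": "blacklist",
                "ignore_missing": True,
            }
        }
    ],
)

# Step 4: run the pipeline on every document indexed into the log indices.
# For future daily indices you would set this in an index template instead.
es.indices.put_settings(
    index="logs-*",
    settings={"index.final_pipeline": "blacklist-enrich"},
)
```

With that in place, the weekly report could be a simple search over the week's log indices filtered on documents where the `blacklist` target field exists.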
Just one thing: the enrich processor may impact indexing performance.
I do a similar thing: I have a couple of IP lists with blacklisted or known IP addresses that I use to enrich my index, but I use a combination of Logstash + Memcached.
Fields that are retrieved by runtime fields of type lookup can be used to enrich the hits in a search response. It’s not possible to query or aggregate on these fields.
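A hedged example of such a lookup runtime field at search time (available in recent 8.x versions), again with assumed index and field names:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical search-time lookup: for each hit, fetch fields from the blacklist
# document whose "ip" equals the log's "client_ip". Lookup fields can only be
# retrieved via "fields"; they cannot be queried or aggregated on.
resp = es.search(
    index="logs-2023032*",
    runtime_mappings={
        "blacklist_entry": {
            "type": "lookup",
            "target_index": "blacklist-202303",
            "input_field": "client_ip",
            "target_field": "ip",
            "fetch_fields": ["ip"],
        }
    },
    fields=["blacklist_entry"],
    source=False,
)
```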