How to join two indexes or use an index as a lookup

alissan · March 20, 2023, 12:32pm

I have daily log indexes (logs-20230320, logs-20230321, ...), each holding about 500 million records.
I also have a list of malicious IP addresses (~150,000 records) in another index (blacklist-202303), which is rebuilt every day.

I need to create a report of all malicious IP addresses that appear in a week of logs.

How can I join the log and blacklist indexes, or is there an option for using an index as a lookup?

I found this solution:

Here is my example:

github.com

alissan/logs/blob/main/elasticsearch_lookup_test

DELETE blacklist_ip_lookup
DELETE traffic_logs

//--------------------------------------------------------------

POST /blacklist_ip_lookup/_doc/1
{ 
  "ip_list" : [ 
{  "ip":"99.86.38.68","desc":"known attacker", "location":"USA" }, 
{  "ip":"99.9.12.173", "desc":"SSH Dictionary Attack", "location":"India" },
{  "ip":"99.90.54.53", "desc":"stamparm ipsum", "location":"Indonesia" }]
}

//--------------------------------------------------------------

PUT /traffic_logs/_doc/1
{ "src_ip":"10.10.1.5", "dst_ip":"99.86.38.68", "action":"allow" }

PUT /traffic_logs/_doc/2
{ "src_ip":"10.10.1.6", "dst_ip":"99.9.12.173", "action":"allow" }

(This file has been truncated.)

But I'm not sure this solution is suitable for ~150,000 records.

I saw this topic, but it's too expensive for my data:

leandrojmp · March 20, 2023, 12:48pm

How are you indexing your data?

The best approach is to enrich it during indexing. If you are using Logstash, then it is pretty easy to do what you want. If you are sending it directly to Elasticsearch, you will need to make some changes to your blacklist indices, and you can try using an enrich processor in an ingest pipeline.

alissan · March 20, 2023, 1:12pm

Thanks @leandrojmp,

I'm indexing data with my own software.
The blacklist data is dynamic (rebuilt every day). For example, the address 99.86.38.68 is in the blacklist today, but tomorrow it may not be. So the enrich method is not right for me.
I need a join at search time.

Christian_Dahlqvist · March 20, 2023, 1:59pm

Elasticsearch does not support query-time joins, so I do not think there is any efficient way to do what you are looking for. I would recommend the index-time enrichment approach that Leandro suggested. If, in addition, you monitor changes to the blacklist and update already-indexed logs through update-by-query whenever the blacklist is modified, you have a solution that could work as long as the blacklist is not frequently updated and each blacklist item matches relatively few log entries.
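
A minimal sketch of that update-by-query idea, assuming the logs carry the address in `dst_ip` and that a boolean field named `blacklisted` (a hypothetical name, not from this thread) marks enriched documents. It only builds the request bodies you would send to `POST /<log-index>/_update_by_query`; it does not talk to a cluster.

```python
import json


def blacklist_diff(old_ips, new_ips):
    """Return (added, removed) IP sets between two blacklist snapshots."""
    old, new = set(old_ips), set(new_ips)
    return new - old, old - new


def build_update_by_query(ips, flag, field="dst_ip"):
    """Build an _update_by_query body that sets the flag on logs matching the IPs."""
    return {
        "query": {"terms": {field: sorted(ips)}},
        "script": {
            "lang": "painless",
            # "blacklisted" is an illustrative field name (assumption).
            "source": "ctx._source.blacklisted = params.flag",
            "params": {"flag": flag},
        },
    }


# Yesterday's and today's blacklist snapshots (sample addresses from the question).
added, removed = blacklist_diff(
    ["99.86.38.68", "99.9.12.173"],
    ["99.9.12.173", "99.90.54.53"],
)
flag_body = build_update_by_query(added, True)       # newly blacklisted addresses
unflag_body = build_update_by_query(removed, False)  # addresses dropped from the list

print(json.dumps(flag_body, indent=2))
print(json.dumps(unflag_body, indent=2))
```

Only the addresses that changed between snapshots are sent, which keeps each request small. With a large diff you would probably need to split the address list into batches, since a `terms` query is capped by the `index.max_terms_count` index setting.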

leandrojmp · March 20, 2023, 4:02pm

As already explained, there is no join in Elasticsearch, so you need to add the information at ingestion time.

Not sure why you think enrich is not correct; you can change the enrich data.

In your case, since you are not using Logstash, you would need an ingest pipeline that runs while you index data from your own software. This pipeline would contain an enrich processor, which uses an enrich policy to add the matching information from your blacklist index to the document being indexed.

But you would basically need to change the structure of the blacklist index to have one document per IP address instead of an array with multiple IP addresses.
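
As a rough sketch of that restructuring, assuming the array layout from the example in the question: the snippet below turns the single `ip_list` document into one document per address, formatted as a `_bulk` request body. Using the address itself as `_id` is my own choice for illustration, not something the thread prescribes.

```python
import json

# The single-document layout from the question (two sample entries).
blacklist_doc = {
    "ip_list": [
        {"ip": "99.86.38.68", "desc": "known attacker", "location": "USA"},
        {"ip": "99.90.54.53", "desc": "stamparm ipsum", "location": "Indonesia"},
    ]
}


def to_bulk_body(doc, index="blacklist_ip_lookup"):
    """Return an NDJSON _bulk body with one index action per blacklisted IP."""
    lines = []
    for entry in doc["ip_list"]:
        # Action line: using the IP as _id is an assumption made for this sketch.
        lines.append(json.dumps({"index": {"_index": index, "_id": entry["ip"]}}))
        # Source line: the blacklist entry itself.
        lines.append(json.dumps(entry))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


bulk_body = to_bulk_body(blacklist_doc)
print(bulk_body, end="")
```

The resulting text would be sent to `POST /_bulk` with the `application/x-ndjson` content type. With the address as `_id`, re-sending an entry overwrites the existing document instead of creating a duplicate.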

To do what you want in Elasticsearch, you need to enrich your data while indexing, which takes the following steps:

1. Create your source index with your blacklisted IPs.
2. Create an enrich policy using the blacklisted IPs as the source index.
3. Create an ingest pipeline with the enrich processor.
4. Tell Elasticsearch to run this ingest pipeline while indexing your data. This can be done by adding the setting index.final_pipeline to your index settings or templates.

When you need to update the data, you just need to recreate the source index from step 1 and execute the enrich policy from step 2 again. This updates the enrich index used by the enrich processor.
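
The steps above can be sketched roughly as follows. This only builds the request bodies and prints them in the order they would be sent; the policy name, pipeline name, and `blacklist` target field are hypothetical, and it assumes the blacklist index holds one document per address with `ip` mapped as an exact-match field (for example `keyword`).

```python
import json

POLICY_NAME = "blacklist-ip-policy"    # hypothetical name
PIPELINE_NAME = "blacklist-ip-enrich"  # hypothetical name

# Step 2: enrich policy that matches on the blacklist's "ip" field.
enrich_policy = {
    "match": {
        "indices": "blacklist_ip_lookup",
        "match_field": "ip",
        "enrich_fields": ["desc", "location"],
    }
}

# Step 3: ingest pipeline with an enrich processor keyed on the log's dst_ip.
ingest_pipeline = {
    "description": "Add blacklist details when dst_ip is blacklisted",
    "processors": [
        {
            "enrich": {
                "policy_name": POLICY_NAME,
                "field": "dst_ip",
                "target_field": "blacklist",  # hypothetical target field
                "ignore_missing": True,
            }
        }
    ],
}

# Step 4: make the pipeline run for every document indexed into the log index.
index_settings = {"index.final_pipeline": PIPELINE_NAME}

# The policy is executed before the pipeline that references it is created.
requests_in_order = [
    ("PUT", f"/_enrich/policy/{POLICY_NAME}", enrich_policy),
    ("POST", f"/_enrich/policy/{POLICY_NAME}/_execute", None),
    ("PUT", f"/_ingest/pipeline/{PIPELINE_NAME}", ingest_pipeline),
    ("PUT", "/traffic_logs/_settings", index_settings),
]

for method, path, body in requests_in_order:
    print(method, path)
    if body is not None:
        print(json.dumps(body, indent=2))
```

After the blacklist index is rebuilt, only the `_execute` request needs to be repeated. Documents indexed before the refresh keep whatever enrichment they received at the time.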

One caveat: the enrich processor may impact indexing performance.

I do something similar to you. I have a couple of IP lists with blacklisted or known IP addresses that I use to enrich my index, but I use a combination of Logstash + Memcached.

warkolm · March 21, 2023, 1:19am

You can do this with a runtime field, kinda. "Retrieve a runtime field" in the Elasticsearch Guide [8.6] goes into it, but has this caveat:

Fields that are retrieved by runtime fields of type lookup can be used to enrich the hits in a search response. It’s not possible to query or aggregate on these fields.
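
For illustration, a search body using a `lookup` runtime field might look like the sketch below. It assumes the blacklist index holds one document per address with an `ip` field, and the runtime field name `blacklist_info` is made up for the example. The body would be sent to `POST /traffic_logs/_search`.

```python
import json

search_body = {
    "runtime_mappings": {
        "blacklist_info": {  # hypothetical runtime field name
            "type": "lookup",
            "target_index": "blacklist_ip_lookup",
            "input_field": "dst_ip",   # field in the log document
            "target_field": "ip",      # field matched in the blacklist index
            "fetch_fields": ["desc", "location"],
        }
    },
    # The lookup result is only returned when requested through "fields".
    "fields": ["src_ip", "dst_ip", "action", "blacklist_info"],
    "_source": False,
}

print(json.dumps(search_body, indent=2))
```

Hits whose `dst_ip` has a matching blacklist document come back with `blacklist_info` filled in. Because of the caveat quoted above, the report would still have to filter out the non-matching hits on the client side.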

