Best strategy to "JOIN" data from diferent .json files (Data enrichment? Rollup? Transform? Dictionary?)

Let's say I have three .json files containing this information:
employees.json:

{
"employee_id":1,
"name":"John"
},
{
"employee_id":2,
"name":"Elton"
}

companies.json:

{
"company_id":1001,
"company_name":"ACME"
},
{
"company_id":1002,
"company_name":"Capsule"
}

where_they_work.json:

{
"company_id":1001,
"employee_id":1
},
{
"company_id":1001,
"employee_id":"2
}

What I want to have, at the end, is one document in Elasticsearch with this information:
Expected result:

{
"employee_id":1,
"name":"John",
"company_name": "ACME"
},
{
"employee_id":2,
"name":"Elton",
"company_name": "ACME"
}

What's the best approach to achieve it? I can imagine at least these options:

  1. Dictionaries (Logstash; not a good idea, because I'd need to manually update the dictionary all the time)
  2. Data enrichment: could work, but I don't know if it will work across the three .json files/indexes; maybe it works fine with two.
  3. "Fancier" options: Rollups? Transforms? Something else?

Please help!

Are you already using Logstash? How are you indexing your data?

This can be done in an easy way using the translate filter in Logstash, but you would need to change the format of your json files.

But how exactly depends on which information is the source and which one you want to enrich it with.

For example, do your documents have the company_id and employee_id fields with their numeric IDs, and you want to enrich them with the company and employee names?

Yes, I'm using Logstash to ship the data from the .json files to Elasticsearch.
The translate filter plugin is what I meant when I said using dictionaries. I can do that, but the problem is that this data will be dynamic, so I'd be obliged to update the dictionary too often.
So far, the only data in ES is the data from employees.json in my example. I have to decide how to get the other information into ES so that the company_name ends up in the same index as the original info (employee_id and name in my example).

I would say that it is way easier to enrich the data before indexing it.

Rollups are deprecated; they were replaced by downsampling, but that only works for metrics and is not used to enrich data.

Transforms are also not used to enrich data, but to summarize it, like getting the last login of a user and things like that.

To enrich the data you may use some filters in Logstash, like the translate filter, or the enrich processor in an Elasticsearch ingest pipeline.

Personally, I do not like using the enrich processor; it is less flexible than the translate filter in Logstash, and I only use it when I'm not using Logstash.
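
For reference, this is roughly what the enrich processor route would involve. This is only a minimal sketch, assuming the company data is already indexed into an index called companies; the policy and pipeline names here are hypothetical, and you would still need a second policy for the where_they_work data:

PUT /_enrich/policy/company-policy
{
  "match": {
    "indices": "companies",
    "match_field": "company_id",
    "enrich_fields": ["company_name"]
  }
}

POST /_enrich/policy/company-policy/_execute

PUT /_ingest/pipeline/add-company-name
{
  "processors": [
    {
      "enrich": {
        "policy_name": "company-policy",
        "field": "company_id",
        "target_field": "company"
      }
    }
  ]
}

The enrich processor copies the whole matching document under target_field, so you would still need rename/remove processors to end up with just company_name at the top level, and you have to execute the policy again every time the source index changes, which is part of why I find it less flexible.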

So the best way in my opinion would be to use the translate filter.

If the events arriving in Logstash look like this:

{ "employee_id":1, "name":"John" }

You would need at least two translate filters: one to get the company_id for the employee, and another to get the company_name for that company_id.

The first dictionary could look like this, where the key is the employee id and the value is the company id (this is the company_id.yml file referenced below):

"1": "1001"
"2": "1001"

And in the second one (company_name.yml), the key would be the company id and the value the company name:

"1001": "ACME"
"1002": "Capsule"

Your translate filters could look like this:

translate {
	# look up the employee's company_id, using employee_id as the dictionary key
	source => "employee_id"
	target => "company_id"
	dictionary_path => "/path/to/the/file/company_id.yml"
	# re-read the dictionary file every 60 seconds
	refresh_interval => 60
	fallback => "unknown company id"
}

translate {
	# look up the company_name, using the company_id set by the previous filter
	source => "company_id"
	target => "company_name"
	dictionary_path => "/path/to/the/file/company_name.yml"
	refresh_interval => 60
	fallback => "unknown company name"
	# drop the intermediate company_id so the event matches the expected result
	remove_field => ["company_id"]
}

This will give you your expected result.

Can you not automate this? The dictionaries can be external files, and the translate filter refreshes them automatically, so the process of updating them can be automated.

I have multiple scenarios like this, with automation scripts in Bash or Python updating the dictionary files.
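
As an illustration, here is a minimal sketch of such a script in Python, assuming the source files hold JSON arrays of records and that PyYAML is installed; all paths are placeholders to adjust:

import json

import yaml  # PyYAML, assumed to be installed

# Placeholder paths -- adjust to your environment.
WHERE_THEY_WORK = "/path/to/where_they_work.json"
COMPANIES = "/path/to/companies.json"
COMPANY_ID_DICT = "/path/to/the/file/company_id.yml"
COMPANY_NAME_DICT = "/path/to/the/file/company_name.yml"

def load_records(path):
    # Assumes the file holds a JSON array of objects.
    with open(path) as f:
        return json.load(f)

# First dictionary: employee_id -> company_id
company_id_map = {
    str(rec["employee_id"]): str(rec["company_id"])
    for rec in load_records(WHERE_THEY_WORK)
}

# Second dictionary: company_id -> company_name
company_name_map = {
    str(rec["company_id"]): rec["company_name"]
    for rec in load_records(COMPANIES)
}

with open(COMPANY_ID_DICT, "w") as f:
    yaml.safe_dump(company_id_map, f, default_flow_style=False)

with open(COMPANY_NAME_DICT, "w") as f:
    yaml.safe_dump(company_name_map, f, default_flow_style=False)

Run it from cron, or whenever the source files change; with refresh_interval => 60, Logstash will pick up the new dictionaries within a minute.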

I guess you're right. I'll try to figure out how to automate the refreshing of those dictionaries. Thanks for your help.