Best way to store 2 data sources with 1 linked field (document_type OR new index)?

nandrik · November 28, 2018, 10:57pm

Hi, I've read the blog post on Index Vs. Type but I find it hard to understand what's better in my case.

I have a certain application's log files, and the great majority of the log lines mention a few remote IP addresses that my server communicates with on a regular basis.
I also have a second data set from network packet analysis, which includes source/destination IP addresses along with network statistics.

The two data sets are linked by the destination IP address field.
The second data set (network stats) has much higher data frequency and volume.

Can the relationship between the datasets be called a parent/child relationship?
Should I store them as separate indices, or in the same index with different document_type values?
If in the same index, should I use the same field name for both IP address fields?

Thanks in advance.

warkolm · November 28, 2018, 11:56pm

You could, but it doesn't really make sense in this use case because which one is the parent?
Are the structures very different, similar or the same?
If you want to be able to query across both easily, using the same field name makes sense.

nandrik · November 29, 2018, 12:07am

Thanks @warkolm for the quick reply.

The parent should be the application log and the child would be the network statistics under each IP address mentioned in the application's log.

The data structures are totally different with the only common field being the IP address.

The causal link is that the remote IP interacting with the application/server performs some action (mentioned in the app's log files), which generates network traffic and therefore network statistics, and those are brought in with the second data set.

I definitely want to be able to query as this is the reason for bringing these two related data sets together.

warkolm · November 29, 2018, 12:23am

But if you have multiple application logs, how do you figure out which network statistics are related to that log entry?

nandrik · November 29, 2018, 1:03am

I only have 1 application and all network statistics with matching remote IP addresses should be related.

warkolm · November 29, 2018, 1:33am

Are you saying that there is only ever one application log per IP, and all network logs for that IP relate to that log? Is that per minute/hour/day, or forever?

nandrik · November 29, 2018, 1:52am

That’s right. The network stats logs are per second as the interactions at the application layer happen.

nandrik · November 29, 2018, 4:52am

There’s a causal relationship. Each reported event at the application layer should have corresponding network statistics logs.

warkolm · November 29, 2018, 5:51am

But what I am getting at is that it doesn't seem plausible that each unique IP using the application should have all the network logs against that IP associated with it.

I mean, I use a VPN pretty much 24/7, so would others. What if you used a CDN, or a load balancer? Hard linking one application event to all network events from that IP means you would have hundreds or thousands over time.
If you are only linking them per second, what if the events go past that specific second?

It just seems a little strict and unforgiving.

nandrik · November 29, 2018, 6:03am

The IPs in the network stats mention a specific port which is linked to that app. So with 100% certainty I can attribute the network stats to the application log events involving that IP.

As you mentioned, the app interactions with that IP will keep repeating and so will the network stats. And the network stats would be dozens of log lines of packet analysis per app interaction.

I would only care about the network stats involving that specific IP, most likely those immediately before each app log event.

So, ideally, I would want all stats for that IP within a certain time window tied to specific app log events. This would provide me with a sense of network performance related to the app-level interaction.

Is that possible?
@warkolm, what's the right structure for this?
Is there a way of defining that parent-child relationship within a certain time window?
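To make the time-window idea concrete, here is a rough sketch of doing the link at query time instead of through a parent/child mapping: for each app log event, build a filter on the network stats data for the same IP and the few seconds before the event. The field names `remote_ip` and `@timestamp`, the 5-second window, and the sample event are all assumptions for illustration, not anything from my actual data.

```python
from datetime import datetime, timedelta, timezone


def stats_window_query(app_event, window_seconds=5):
    """Build a search body for the network stats that precede one app log event.

    Assumes (hypothetically) that both data sets carry a `remote_ip` field
    and an `@timestamp` field.
    """
    event_time = app_event["@timestamp"]
    window_start = event_time - timedelta(seconds=window_seconds)
    return {
        "query": {
            "bool": {
                "filter": [
                    # Same remote IP as the app log event.
                    {"term": {"remote_ip": app_event["remote_ip"]}},
                    # Only stats from the window ending at the event time.
                    {"range": {"@timestamp": {
                        "gte": window_start.isoformat(),
                        "lte": event_time.isoformat(),
                    }}},
                ]
            }
        }
    }


# Made-up app log event, for illustration only.
event = {
    "@timestamp": datetime(2018, 11, 29, 6, 3, 0, tzinfo=timezone.utc),
    "remote_ip": "203.0.113.7",
}
body = stats_window_query(event, window_seconds=5)
```

The resulting `body` would be sent as a search request against wherever the network stats live. Since the port also identifies the app, a further `term` filter on a port field could be added the same way.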

nandrik · November 30, 2018, 3:37am

The recommendation from an ELK expert outside this forum was to use two separate indices sharing a common field name for the IP address.

This should allow faster filtering using just the common (IP) field because of the index separation.
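As I understand the suggestion, it would look roughly like the sketch below: two indices whose mappings differ except for the shared IP (and timestamp) fields, so the same filter body works against either one. The index names, field names, and the extra per-index fields are hypothetical; only the `ip`, `date`, `text`, `integer`, and `long` field types are Elasticsearch built-ins. How the `properties` block is wrapped in a create-index request depends on the Elasticsearch version.

```python
# Hypothetical index names, for illustration only.
APP_LOG_INDEX = "app-logs"
NET_STATS_INDEX = "net-stats"

# Fields that both indices share under the same name and type.
SHARED_FIELDS = {
    "remote_ip": {"type": "ip"},
    "@timestamp": {"type": "date"},
}


def build_mappings(extra_fields):
    """Return mapping properties: the shared fields plus index-specific ones."""
    properties = dict(SHARED_FIELDS)
    properties.update(extra_fields)
    return {"properties": properties}


# Application log documents: free-text log line (assumed field).
app_log_mappings = build_mappings({"message": {"type": "text"}})

# Network stats documents: port and byte counters (assumed fields).
net_stats_mappings = build_mappings({
    "remote_port": {"type": "integer"},
    "bytes": {"type": "long"},
})


def ip_filter(ip):
    """Search body that works against either index because the field name is shared."""
    return {"query": {"bool": {"filter": [{"term": {"remote_ip": ip}}]}}}
```

With this layout, `ip_filter("203.0.113.7")` can be run against `app-logs`, `net-stats`, or both at once, while each index keeps its own unrelated fields.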

Do you concur, @warkolm?

warkolm · November 30, 2018, 7:25am

That would make sense.

system · December 28, 2018, 7:25am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
