Best way to store 2 data sources with 1 linked field (document_type OR new index)?


#1

Hi, I've read the blog post on Index Vs. Type but I find it hard to understand what's better in my case.

  • I have log files from one application, and the great majority of the log lines mention a few remote IP addresses that my server communicates with on a regular basis.
  • I also have a second data set from network packet analysis, which includes source/destination IP addresses along with network statistics.

The two data sets are linked by the destination IP address field.
The second data set (network stats) has much higher data frequency and volume.

  1. Is the relationship between the datasets one that can be called a parent/child relationship?
  2. Should I store them in separate indexes, or in the same index under a different document_type?
  3. If in the same index, should I use the same field name for both IP address fields?

Thanks in advance.


(Mark Walkom) #2
  1. You could, but it doesn't really make sense in this use case because which one is the parent?
  2. Are the structures very different, similar or the same?
  3. If you want to be able to easily query, it makes sense.

#3

Thanks @warkolm for the quick reply.

The parent should be the application log and the child would be the network statistics under each IP address mentioned in the application's log.

The data structures are totally different with the only common field being the IP address.

The causal link is that a remote IP interacting with the application/server performs some action (recorded in the app's log files), which generates network traffic and therefore the network statistics that are brought in with the second data set.

I definitely want to be able to query as this is the reason for bringing these two related data sets together.


(Mark Walkom) #4

But if you have multiple application logs, how do you figure out which network statistics are related to a given log entry?


#5

I only have 1 application and all network statistics with matching remote IP addresses should be related.


(Mark Walkom) #6

Are you saying that there is only ever one application log per IP, and all network logs for that IP relate to that log? Is that per minute/hour/day, or forever?


#7

That’s right. The network stats logs are per second as the interactions at the application layer happen.


#8

There’s a causal relationship. Each reported event at the application layer should have corresponding network statistics logs.


(Mark Walkom) #9

But what I am getting at is that it doesn't seem plausible that each unique IP in the application log should have all the network logs for that IP associated with it.

I mean, I use a VPN pretty much 24/7, and so do others. What if you used a CDN, or a load balancer? Hard-linking one application event to all network events from that IP means you would have hundreds or thousands over time.
If you are only linking them per second, what happens if the events run past that specific second?

It just seems a little strict and unforgiving.


#10

The IPs in the network stats mention a specific port which is linked to that app. So with 100% certainty I can attribute the network stats to the application log events involving that IP.

As you mentioned, the app interactions with that IP will keep repeating and so will the network stats. And the network stats would be dozens of log lines of packet analysis per app interaction.

I would only care about the network stats from immediately before each app log event involving that specific IP.

So, ideally, I would want all stats for that IP within a certain time window tied to specific app log events. This would provide me with a sense of network performance related to the app-level interaction.

  • Is that possible?
  • @warkolm, what's the right structure for this?
  • Is there a way of defining that parent-child relationship within a certain time window?
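To make the time-window idea concrete, here is a minimal client-side sketch of the correlation I have in mind, independent of how the data ends up being indexed. The field names `dest_ip` and `ts` (epoch seconds) and the 5-second default window are assumptions for illustration only, not part of my real data.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def stats_before_events(app_events, net_stats, window_s=5.0):
    """Pair each app event with the network stats for the same IP whose
    timestamp falls in [event ts - window_s, event ts].

    Both inputs are lists of dicts with hypothetical keys:
      "dest_ip": remote IP address as a string
      "ts":      timestamp in epoch seconds
    """
    # Group the network stats by IP so each event only scans its own IP.
    by_ip = defaultdict(list)
    for stat in net_stats:
        by_ip[stat["dest_ip"]].append(stat)

    # Sort each group by time and keep a parallel list of timestamps
    # so the window bounds can be found with binary search.
    times_by_ip = {}
    for ip, stats in by_ip.items():
        stats.sort(key=lambda s: s["ts"])
        times_by_ip[ip] = [s["ts"] for s in stats]

    result = []
    for event in app_events:
        ip = event["dest_ip"]
        times = times_by_ip.get(ip, [])
        lo = bisect_left(times, event["ts"] - window_s)
        hi = bisect_right(times, event["ts"])
        result.append((event, by_ip.get(ip, [])[lo:hi]))
    return result
```

Each returned pair is `(app_event, [matching network stats])`, so an event with no stats in its window simply gets an empty list.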

#11

The recommendation from an ELK expert outside this forum was to use two separate indices sharing a common field name for the IP address.

This should allow faster filtering using just the common (IP) field because of the index separation.

Do you concur, @warkolm?


(Mark Walkom) #12

That would make sense.
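As a rough illustration of why the shared field name helps, the same filter body can be run against both indices at once. The sketch below only builds the query body as a Python dict using the standard `bool`/`filter`, `term`, and `range` query clauses. The field names `dest_ip` and `@timestamp`, and the index names `app-logs` and `net-stats` mentioned afterwards, are hypothetical and not taken from this thread.

```python
import json


def build_ip_window_query(ip, start_iso, end_iso,
                          ip_field="dest_ip", time_field="@timestamp"):
    """Build a search body that filters on one IP and a time range.

    Because both indices are assumed to use the same field names,
    the same body works unchanged against either index or both.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    # Exact match on the shared IP field.
                    {"term": {ip_field: ip}},
                    # Restrict to the time window of interest.
                    {"range": {time_field: {"gte": start_iso, "lte": end_iso}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    body = build_ip_window_query(
        "203.0.113.7", "2016-01-01T12:00:00Z", "2016-01-01T12:00:05Z"
    )
    print(json.dumps(body, indent=2))
```

Assuming those index names, the body could be sent to a multi-index search such as `GET /app-logs,net-stats/_search`, and the hits could then be split by their `_index` value.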

