Hi, I've read the blog post on Index Vs. Type but I find it hard to understand what's better in my case.
I have a certain application's log files, and the great majority of the log lines mention a few remote IP addresses that my server communicates with on a regular basis.
I also have a second data set from network packet analysis, which includes source/destination IP addresses along with network statistics.
The two data sets are linked by the destination IP address field.
The second data set (network stats) has much higher data frequency and volume.
Is the relationship between the data sets one that can be called a parent/child relationship?
Should I store them as separate indexes, or in the same index with a different document_type for each?
If in the same index, should I use the same field name for both IP address fields?
The parent should be the application log and the child would be the network statistics under each IP address mentioned in the application's log.
The data structures are totally different with the only common field being the IP address.
The causal link is this: a remote IP interacts with the application/server and performs some action (recorded in the app's log files). That interaction generates network traffic, and therefore the network statistics that come in with the second data set.
I definitely want to be able to query across both, as this is the reason for bringing these two related data sets together.
Are you saying that there is only ever one application log per IP, and all network logs for that IP relate to that log? Is that per minute/hour/day, or forever?
But what I am getting at is that it doesn't seem plausible that each unique IP the application sees should have all the network logs for that IP associated with it.
I mean, I use a VPN pretty much 24/7, so would others. What if you used a CDN, or a load balancer? Hard linking one application event to all network events from that IP means you would have hundreds or thousands over time.
If you are only linking them per second, what happens if the events run past that specific second?
The IPs in the network stats mention a specific port which is linked to that app. So with 100% certainty I can attribute the network stats to the application log events involving that IP.
As you mentioned, the app interactions with that IP will keep repeating and so will the network stats. And the network stats would be dozens of log lines of packet analysis per app interaction.
I would only care about the network stats involving that specific IP, most likely those immediately before each app log event.
So, ideally, I would want all stats for that IP within a certain time window tied to specific app log events. This would provide me with a sense of network performance related to the app-level interaction.
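To make the time-window idea concrete, here is a rough sketch of how the lookup for one app log event might be built. It only constructs a search body (a bool filter on IP, port and a time range) that could be run against whichever index holds the network stats. The field names `dest_ip`, `dest_port` and `@timestamp`, the shape of the event dict, the sample values and the 30-second window are all assumptions for illustration, not something taken from my actual data.

```python
from datetime import datetime, timedelta, timezone


def network_stats_query(event, window_seconds=30):
    """Build a search body for network stats in the window before one app log event.

    `event` is a hypothetical dict with `timestamp`, `remote_ip` and `app_port`.
    """
    end = event["timestamp"]
    start = end - timedelta(seconds=window_seconds)
    return {
        "query": {
            "bool": {
                "filter": [
                    # Same remote IP as the app log event (assumed field name).
                    {"term": {"dest_ip": event["remote_ip"]}},
                    # Port that identifies the application (assumed field name).
                    {"term": {"dest_port": event["app_port"]}},
                    # Only stats from the window leading up to the event.
                    {"range": {"@timestamp": {
                        "gte": start.isoformat(),
                        "lte": end.isoformat(),
                    }}},
                ]
            }
        },
        "sort": [{"@timestamp": "asc"}],
    }


# Made-up sample event, for illustration only.
app_event = {
    "timestamp": datetime(2017, 1, 1, 12, 0, 30, tzinfo=timezone.utc),
    "remote_ip": "203.0.113.7",
    "app_port": 8443,
}
body = network_stats_query(app_event, window_seconds=30)
```

With this approach the two data sets would not need a hard parent/child link: each app log event would be matched to its network stats at query time, and the window size could be tuned without reindexing.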