I have two real-time streams: one contains news articles and the other contains comments on those same articles. I'd like to create a parent-child relationship between each article and that article's comments. There is no common ID, so I'd like to match the two streams on the headline, which exists in both, every 15 minutes. I am assuming that 15 minutes is enough to absorb the delay between the two streams. How would you go about doing this? Any ideas would be appreciated.
A typical article message, with the fields entity_name, source_name, and headline, comes through Logstash and looks like this:
"Thomson Reuters Corp.","Japan Today","Trump claims victory after forcing NATO crisis talks"
Typical comment messages, with the fields comment and headline, come through Logstash in a separate pipeline and look like this:
"We applaud Trumps claim ...", "Trump claims victory after forcing NATO crisis talks"
"Nato crisis is important...", "Trump claims victory after forcing NATO crisis talks"
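To make the target concrete, here is a rough Python sketch of the parent-child result I am after for one 15-minute batch, using the sample records above. The function name `attach_comments` and the dictionary layout are only illustrative assumptions, not something Logstash or Elasticsearch produces as-is, and the match here is on the exact headline string.

```python
from collections import defaultdict

# Sample records taken from the messages shown above.
articles = [
    {
        "entity_name": "Thomson Reuters Corp.",
        "source_name": "Japan Today",
        "headline": "Trump claims victory after forcing NATO crisis talks",
    },
]
comments = [
    {
        "comment": "We applaud Trumps claim ...",
        "headline": "Trump claims victory after forcing NATO crisis talks",
    },
    {
        "comment": "Nato crisis is important...",
        "headline": "Trump claims victory after forcing NATO crisis talks",
    },
]


def attach_comments(articles, comments):
    """Return copies of the articles, each with a list of its comments.

    Hypothetical helper: matches on the exact headline string only.
    """
    by_headline = defaultdict(list)
    for c in comments:
        by_headline[c["headline"]].append(c["comment"])
    # Articles with no matching comments get an empty list.
    return [dict(a, comments=by_headline.get(a["headline"], [])) for a in articles]


linked = attach_comments(articles, comments)
```

Each entry in `linked` is the article plus a `comments` list, which is the shape I would like to end up with, however it is stored.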
- Should I keep the indexes separate, or create a third index from the first two?
- How do I run 15-minute refresh cycles?
- How can I conveniently create a hash of the headline to use for matching the two pieces?
- If there is a better way/tool/data store, please advise.
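For the hashing question, what I have in mind is roughly the following Python sketch. The normalization steps (trimming, collapsing whitespace, lowercasing) and the choice of SHA-1 are my own assumptions, and `headline_key` is a hypothetical helper name. The idea is that both pipelines would compute the same key from the headline, so it could serve as the missing common ID.

```python
import hashlib
import re


def headline_key(headline: str) -> str:
    """Derive a stable matching key from a headline.

    Assumption: headlines that differ only in case or spacing
    should be treated as the same article.
    """
    # Collapse runs of whitespace, trim the ends, and lowercase.
    normalized = re.sub(r"\s+", " ", headline).strip().lower()
    # SHA-1 is used here only as a fixed-length fingerprint, not for security.
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


article_key = headline_key("Trump claims victory after forcing NATO crisis talks")
comment_key = headline_key("  Trump claims victory after forcing  NATO crisis talks ")
```

With this normalization, `article_key` and `comment_key` are equal even though the second headline has stray whitespace. It would not help if the two streams word the headline differently.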
Update: There seems to be a big pivot from a denormalized to a normalized data structure, as explained in Removal of Types (https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html).
So my guess is that the answer to the first question is to keep the articles and comments in separate indexes.