Send data to multiple destinations - store data in multiple kafka topics?


#1

Hello,

I would like to send some events to more than one ES cluster. I have the following architecture:
LOG SOURCE -> LOGSTASH -> KAFKA -> LOGSTASH -> ES
The first Logstash processes the logs and sends them to Kafka. The second Logstash just sends them to Elasticsearch. I want to send some events to more than one ES cluster. I see two options:

  1. Store each event from the first LS into one of two Kafka topics (data_all, data_secondES). The second Logstash will have 2 pipelines: the first pipeline reads both topics and stores the data into the first ES, the second pipeline (with a different Kafka consumer group) reads topic data_secondES and stores the data into the second ES.
    BUT this is not good because it in effect arbitrarily prioritizes one of the topics. I do not know the number of events for each topic, so I give each topic, for example, 8 partitions. Let's say that at some point 99 percent of events are going to data_all. Then the second Logstash will finish reading all the events from data_secondES while data_all still has messages waiting to be processed. This is because the partitions of all topics are assigned equally, so the topic with fewer messages is drained first.

  2. Store events from the first LS into BOTH Kafka topics (data_all, data_secondES). Each pipeline in the second Logstash will read only its own topic, so there is no problem with prioritization.
    BUT I will have to store the events twice.
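
To make option 1 concrete, here is a rough sketch of the two pipelines on the second Logstash. The broker address, ES hosts, and consumer group names are placeholders I made up, and the json codec assumes the first Logstash writes JSON to Kafka. Each pipeline would live in its own config file and be registered separately in pipelines.yml.

```
# --- pipeline 1 (own config file): both topics -> first ES ---
input {
  kafka {
    bootstrap_servers => "kafka1:9092"                    # placeholder broker
    topics            => ["data_all", "data_secondES"]    # reads both topics
    group_id          => "ls2_first_es"                   # placeholder group name
    codec             => json                             # assumes LS1 writes JSON
  }
}
output {
  elasticsearch { hosts => ["http://es-first:9200"] }     # placeholder host
}

# --- pipeline 2 (own config file): data_secondES -> second ES ---
input {
  kafka {
    bootstrap_servers => "kafka1:9092"                    # placeholder broker
    topics            => ["data_secondES"]
    group_id          => "ls2_second_es"                  # different group, so it gets its own copy
    codec             => json                             # assumes LS1 writes JSON
  }
}
output {
  elasticsearch { hosts => ["http://es-second:9200"] }    # placeholder host
}
```

The different group_id is what lets both pipelines consume data_secondES independently; with the same group they would split the partitions between them instead.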

Which option is better, or is there something better that I am not seeing?

Thanks


#2

The first option seems to be to send events either to the first kafka topic, or the second. Clearly you have some conditional there. You could conditionalize the output so that everything goes to the first kafka topic, and only the events that satisfy the condition go to the second topic. Then the two pipelines would each copy everything from their topic to their elasticsearch instance.
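
A minimal sketch of what I mean for the output section of the first Logstash. The broker address and the "second_es" tag are made up for illustration, so substitute whatever condition you already use to pick the second destination; the json codec is an assumption about how you serialize events.

```
output {
  # Every event goes to the first topic.
  kafka {
    bootstrap_servers => "kafka1:9092"      # placeholder broker
    topic_id          => "data_all"
    codec             => json               # assumed serialization
  }

  # Only events that satisfy the condition are also written to the second topic.
  if "second_es" in [tags] {                # hypothetical tag, replace with your condition
    kafka {
      bootstrap_servers => "kafka1:9092"    # placeholder broker
      topic_id          => "data_secondES"
      codec             => json             # assumed serialization
    }
  }
}
```

Each pipeline in the second Logstash then needs a kafka input with only a single entry in `topics`, so neither pipeline ever mixes partitions from two topics.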

Not sure if that helps, since, to be honest, I do not understand your concern about prioritization of partitions. Why do you need multiple partitions? Is the data volume so high that you have problems keeping up?


#3

I mean partitions in Kafka terminology. So the first option would be, for example (P is a topic partition):

LS1 ----------------------else--
           |                   |
         if a                  |
           |                   |
       TOPIC1               TOPIC2
P P P P P P P P    P P P P P P P P

LS2 reads all partitions of both topics "at once". Because TOPIC1 receives 1000 msgs/sec and TOPIC2 receives 1 msg/sec, everything in TOPIC2 is processed immediately, which means messages in TOPIC1 are processed more slowly and later (unwanted).


#4

Yes, I took it to mean kafka partitions. Still not seeing the problem. If you cannot keep up with 1000 messages per second then you can't keep up. If you can keep up with 1000 messages per second then, yes, a message written to TOPIC2 may get indexed a fraction of a second earlier than a message written to TOPIC1 at the same time. Or then again, it might get indexed a fraction of a second later, depending on the batching.

