At our company we want to add analytics to our product so that we can show different kinds of reports to customers.
Currently we store this kind of data in Elasticsearch, in separate indexes:
- basic info about connected devices, such as MAC address, device OS, etc.
- attributes of logged-in users: gender, age, etc.
More data will probably go into other separate indexes.
Since the data is in separate indexes, we can't really use ES analytics capabilities out of the box. The options I see are:
- Run batch jobs (or do it in real time) that create more indexes consolidating data from the separate indexes, and query these new indexes. The problem is that not all data can go into one index, because it is not always a parent-child relationship; it can be quite unrelated data that is still useful for reports. So this would require some number of indexes with combined data, and who knows how many and of which kind, given how flexible the reports need to be.
- Use Spark to read data from the indexes and build reports using map-reduce.
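To make the first option more concrete, here is a minimal sketch of such a consolidation step in plain Python. Everything in it is an assumption for illustration: the function name, the field names (`mac`, `device_os`, `user_id`, `gender`, `age`), and the idea that session documents reference devices by `mac` and users by `user_id`. A real job would read from and write to ES instead of working on in-memory lists.

```python
from typing import Dict, Iterable, List


def consolidate_sessions(
    sessions: Iterable[dict],
    devices: Iterable[dict],
    users: Iterable[dict],
) -> List[dict]:
    """Denormalize sessions by copying device and user attributes into each one."""
    # Build lookup tables once so each session is enriched with a dict lookup.
    # The keys "mac" and "user_id" are hypothetical join fields.
    device_by_mac: Dict[str, dict] = {d["mac"]: d for d in devices}
    user_by_id: Dict[str, dict] = {u["user_id"]: u for u in users}

    combined = []
    for session in sessions:
        doc = dict(session)  # shallow copy, the input is not mutated
        device = device_by_mac.get(session.get("mac"), {})
        user = user_by_id.get(session.get("user_id"), {})
        # A missing device or user leaves the field as None
        # instead of dropping the session from the report.
        doc["device_os"] = device.get("device_os")
        doc["gender"] = user.get("gender")
        doc["age"] = user.get("age")
        combined.append(doc)
    return combined
```

The resulting flat documents could be indexed into one combined index and aggregated directly in ES. The downside described above still applies: each new report shape may need its own combined index.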
I have a feeling that it is a mistake not to use all the features of ES, but I don't really see any other way to support flexible reports with separate indexes! Any comments or advice will be greatly appreciated!
As for Spark, my impression is that it is a bit slow for fast, real-time analytics, because it takes a lot of time to read the raw data from ES into Spark.
Where ES needs only a fraction of a second to aggregate the data, Spark spends 5-10 seconds just reading the raw data! Obviously the map-reduce itself is very fast after that. But there is no way to pre-process the data in ES before reading it into Spark. As far as I can tell, JavaEsSpark.esRDD only allows basic filtering with Query String syntax, for example a start and end date and a few fields to filter on. Even with that, in my case the raw session data for a single day contains ~160K docs. I also can't specify that I need only a few fields instead of the whole document.
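One thing that may be worth checking: as far as I know, the elasticsearch-hadoop connector has an `es.query` setting that also accepts a full query DSL body (not only the `?q=` form), and an `es.read.field.include` setting that limits which fields are read. I am not sure this applies to every connector version, so treat the sketch below as an assumption to verify against the documentation for your version. The function name, the resource `sessions/session`, and the field names are hypothetical.

```python
import json
from typing import Dict, List


def build_es_read_conf(
    resource: str,
    date_field: str,
    start: str,
    end: str,
    fields: List[str],
) -> Dict[str, str]:
    """Build connector settings that filter by date range and limit the fields read."""
    # Range query in the ES query DSL: start <= date_field < end.
    query = {"query": {"range": {date_field: {"gte": start, "lt": end}}}}
    return {
        "es.resource": resource,                     # where to read from
        "es.query": json.dumps(query),               # filter executed on the ES side
        "es.read.field.include": ",".join(fields),   # assumed setting, verify for your version
    }


conf = build_es_read_conf(
    "sessions/session", "start_time", "2015-06-01", "2015-06-02",
    ["hotspot_id", "duration"],
)
```

These key/value pairs would go into the configuration passed to the connector. This only reduces how much data is transferred; aggregations themselves still run in Spark, not in ES.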
SparkSQL can push filters down to ES, which is exactly what I would like to use, but in my experience SparkSQL is not mature yet and translates to quite strange ES queries. For example, a simple expression like
"where hotspot_id in (234, 645, 534)" generates a completely invalid query. I wish I knew Scala; I would be happy to fix this kind of bug!
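For comparison, the ES query I would expect for that `IN` expression is a `terms` query. The sketch below builds it by hand; the helper name is hypothetical, and this is not what the SparkSQL integration generates, only what a correct translation could look like.

```python
from typing import Iterable


def in_filter_to_terms_query(field: str, values: Iterable[int]) -> dict:
    """Return a query DSL body equivalent to `field IN (values...)`."""
    # A `terms` query matches documents whose field equals any of the given values.
    return {"query": {"terms": {field: list(values)}}}


body = in_filter_to_terms_query("hotspot_id", [234, 645, 534])
```

Writing the filter manually like this, instead of relying on the SQL pushdown, might work as a stopgap until the translation is fixed.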
Sorry for the amount of information. I would really appreciate any recommendations on how to deal with all of this and how to do analytics in a better way.