For each time a song was played, there is a new entry, even is a song is played again.
With this data I want to :
Get total no. of time each user played each songs. For example, if user_id 102 played, the song_id 1 three times, as per above data. I want to have it grouped by the user_id with total count. Something like below:
Can this be done with ElasticSearch via a query instead of having to calculate and store it back to Elastic? I have the flexibility in indexing the data to elastic in any structure, if the structure I followed is wrong or not ideal.
do you need the song_id counts for all users at the same time (e.g. for doing some data anlysis) or is it okay for your application to just get the counts for one particular user (e.g. for a users webpage dashboard or sth. similar?). In the first case I think you might be able to do a histogram aggregation with interval 1 on the user_id (if it is a numeric id value, your data suggests that) and nest another histogram aggregation (interval 1) on the song_id inside. Alternatively terms aggregation might work but those are more expensive and typically restricted to the top N hits.
This might lead to a large number of buckets though if you run it on the whole dataset, so if you are able to restrict it to one user, I'd add a filter on the user_id first and then only do a histogram (or terms aggregation) for the song_id of this one user.
Hope this helps as a starter for research, if you need more information it would help if you provide more info on your use case and your actual Elasticsearch mapping.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.