Elastic hadoop count

(Cjuexuan) #1

In my code, I want to count the documents in an index, like the REST API does:

curl -XGET 'host:port/index/type/_count?q=xxx:xxx'

But I can't find a way to do this in the docs. If we cache the data and use rdd.count or dataframe.count, it is too slow, and when the data set is larger the count never returns a result. How can we make this run quickly?

In elastic4s, we can solve this by running a search on index/type with the query xxx and size 0, then reading the total hits from the response.
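For comparison, the same idea can be sketched without elastic4s by calling the _count endpoint directly from the driver, so that no documents are pulled into Spark at all. This is only a rough illustration in Python using the standard library: the function names (build_count_url, parse_count, count_documents) are made up for this example, and it assumes plain HTTP with no authentication and a query that fits the q= query-string syntax.

```python
import json
import urllib.parse
import urllib.request


def build_count_url(host, port, index, doc_type, query):
    """Build the _count URL for index/type with a query-string query.

    Hypothetical helper for illustration; assumes plain HTTP, no auth.
    """
    # Percent-encode each path segment so unusual index names stay valid.
    path = "/".join(urllib.parse.quote(part, safe="") for part in (index, doc_type))
    # urlencode escapes the query, e.g. "user:kimchy" becomes "user%3Akimchy".
    params = urllib.parse.urlencode({"q": query})
    return f"http://{host}:{port}/{path}/_count?{params}"


def parse_count(body):
    """Read the "count" field from a _count response body (a JSON string)."""
    return json.loads(body)["count"]


def count_documents(host, port, index, doc_type, query, timeout=10):
    """Ask Elasticsearch for the count instead of materializing an RDD."""
    url = build_count_url(host, port, index, doc_type, query)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_count(response.read().decode("utf-8"))
```

Calling count_documents("host", 9200, "index", "type", "xxx:xxx") on the driver would then return the number directly. The counting happens inside Elasticsearch, so the cost does not grow with the amount of data Spark would otherwise have to load.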

(Costin Leau) #2

Doing the query manually through elastic4s might remain the only option for the time being. Count was implemented some time ago to do just that; however, its semantics changed, since in Spark count actually instantiates all entries.
Going forward we might just add an esCount method; however, that implies the RDD in question is an ES one. Or we could potentially bind it to SparkContext/SQLContext.
