ES ignores queries through Spark

Hi all,

So I managed to get elasticsearch-spark_2.10 working and I can query a
database of tweets from Spark. The problem is that it seems to ignore the
specifics of my query, such as the size or the fields to return. For
example, this is my code:

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.elasticsearch.spark._

object myApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("esTest")
    conf.set("es.nodes", "myaddr:myport")
    conf.set("es.resource", "index/type")
    conf.set("es.endpoint", "_search")

    val query: String = "{\"size\":1000}"
    conf.set("es.query", query)

    val sc = new SparkContext(conf)
    val data = sc.esRDD()

    println(data.count())
  }
}

Even though I specify size to be 1000, it always returns the entire
database (~30,000 samples). I've also tried doing
sc.esRDD("index/type","?size=1000") and this doesn't work either. I could
be messing up my ES queries, but I'm pretty sure that is valid since it
seems to work with curl and DHC. Any help would be greatly appreciated
since I've been banging my head on my desk all day over this.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4ce793ff-ee4b-4296-aff7-10c728f6e83f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

You haven't specified the version of elasticsearch-spark. Either way, the issue is likely caused by the fact that each
query is executed locally on each target shard. In other words, your limit of 1000 entries is applied on each shard, so
at a maximum you will get 1000 x the number of shards. By default an index has 5 shards, resulting in a maximum of 5000
entries.
You could try dividing the size by the number of shards, though that does not guarantee an exact count; some shards
might have more documents than others.

Alternatively, you could create the index with only one shard, though that's not recommended.

This happens because everything in es-hadoop is parallelized and the aggregation happens on the
Hadoop/Spark side.
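To make the per-shard arithmetic concrete, here is a minimal sketch in plain Scala (the shard count of 5 is just the index default, nothing es-hadoop reports here):

```scala
// Each shard applies the query's "size" independently, so the upper
// bound on returned documents is the per-shard size times the shard count.
val requestedSize = 1000
val shards = 5 // default number of primary shards for an index

val maxReturned = requestedSize * shards   // up to 5000 documents, not 1000
val perShardSize = requestedSize / shards  // asking for 200 per shard gets close to 1000

println(s"max returned: $maxReturned, per-shard size: $perShardSize")
```

If an exact limit matters more than avoiding a full scan, you can also leave `es.query` unrestricted and truncate on the Spark side instead, e.g. `sc.esRDD("index/type").take(1000)`, since the aggregation happens on the Spark side anyway.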

On 4/7/15 10:38 PM, Michael Czerny wrote:


--
Costin


Sorry about that! I am using 2.1.0.Beta3.

I see. I tried using a date range instead, and that query seems to work, so
I believe your answer is correct. Thanks!
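In case anyone hits the same thing, the query that worked for me had roughly this shape (the field name `created_at` and the date bounds here are placeholders, not my actual values):

```scala
// Hypothetical sketch of an es.query body using a date range filter
// instead of "size". "created_at" and the bounds are placeholders.
val rangeQuery =
  """{"query":{"range":{"created_at":{"gte":"2015-01-01","lte":"2015-03-31"}}}}"""

// Set it the same way as before: conf.set("es.query", rangeQuery)
println(rangeQuery)
```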

On Tuesday, April 7, 2015 at 4:34:43 PM UTC-4, Costin Leau wrote:

