Hey,
I've recently started using Elasticsearch for Spark (in a Scala application).
I've added elasticsearch-spark_2.10 version 2.1.0.BUILD-SNAPSHOT to my
Spark application's pom file, and I use
org.apache.spark.rdd.RDD[String].saveJsonToEs() to send documents to
Elasticsearch.
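For context, the write path is roughly the following (a minimal, illustrative sketch - the app name and sample document are made up, and it assumes the implicit conversions that elasticsearch-spark adds to RDD[String]):

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveJsonToEs to RDD[String]

val sc = new SparkContext(
  new SparkConf().setAppName("es-json-sketch").setMaster("local[*]"))

// Each RDD element is one complete JSON document, serialized as a string.
val jsonDocs = sc.makeRDD(Seq("""{"_id": "XXXX", "params": {"events": []}}"""))
jsonDocs.saveJsonToEs("test/user")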
When the documents are loaded into Elasticsearch, my metadata fields (e.g.
_id, _index, etc.) end up as part of the _source field.
Is there a way to exclude them from the _source?
I've tried using the new "es.mapping.exclude" configuration property (added
in this commit: https://github.com/elasticsearch/elasticsearch-hadoop/commit/aae4f0460a23bac9567ea2ad335c74245a1ba069
- which is why I took the latest build rather than version 2.1.0.Beta3),
but it doesn't seem to have any effect (although I'm not sure it's even
possible to exclude fields that are also used for mapping, e.g.
"es.mapping.id").
A code snippet (I'm using a single-node Elasticsearch cluster for testing
purposes and running the Spark app from my desktop):

val conf = new SparkConf()...
conf.set("es.index.auto.create", "false")
conf.set("es.nodes.discovery", "false")
conf.set("es.nodes", "XXX:9200")
conf.set("es.update.script", "XXX")
conf.set("es.update.script.params", "param1:events")
conf.set("es.update.retry.on.conflict", "2")
conf.set("es.write.operation", "upsert")
conf.set("es.input.json", "true")

val documentsRdd = ...
documentsRdd.saveJsonToEs("test/user",
  scala.collection.Map("es.mapping.id" -> "_id", "es.mapping.exclude" -> "_id"))
The JSON looks like this:

{
  "_id": "XXXX",
  "_type": "user",
  "_index": "test",
  "params": {
    "events": [
      {
        ...
      }
    ]
  }
}
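To be explicit about the goal: after the upsert, I'd like the _source to contain only the payload, i.e. the same document without the metadata fields:

{
  "params": {
    "events": [
      {
        ...
      }
    ]
  }
}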
Hey,
Has anyone experienced such an issue?
Perhaps Costin can help here?
Thanks!
Hi Itai,
Sorry I missed your email. It's not clear from your post what your documents
look like - can you post a gist somewhere with the JSON input you are
sending to Elasticsearch?
Typically the metadata appears in the _source if it is declared that way.
You should be able to work around this in one of two ways:
1. by using es.mapping.exclude
2. if that doesn't seem to work, in the case of Spark, by specifying the
metadata through the saveToEsWithMeta methods, which keep it decoupled from
the document itself (see the sketch after this list).
Since you are using JSON, option 1 is likely your best shot. If it doesn't
work for you, can you please raise an issue with a quick/small sample so we
can reproduce it?
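For option 2, here's a minimal sketch of what I mean (it assumes the documents are available as Scala maps rather than raw JSON strings, and the id and values are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEsWithMeta to pair RDDs
import org.elasticsearch.spark.rdd.Metadata._ // metadata keys: ID, VERSION, ...

val sc = new SparkContext(
  new SparkConf().setAppName("es-meta-sketch").setMaster("local[*]"))

// Each element is a (metadata, document) pair; the id travels as metadata,
// so it becomes the document's _id without ever being part of _source.
val docs = sc.makeRDD(Seq(
  (Map(ID -> "user-1"), Map("params" -> Map("events" -> Seq("login"))))
))

docs.saveToEsWithMeta("test/user")

Since your input is raw JSON strings, you would need to restructure the documents (or extract the id yourself), but the metadata then stays out of the document body entirely.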
Thanks,