[Hadoop][Spark] Exclude metadata fields from _source

Hey,
I've recently started using Elasticsearch for Spark (Scala application).
I've added elasticsearch-spark_2.10 version 2.1.0.BUILD-SNAPSHOT to my
Spark application pom file, and used
org.apache.spark.rdd.RDD[String].saveJsonToEs() to send documents to
Elasticsearch.
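For reference, the dependency entry in my pom looks roughly like this (assuming the usual org.elasticsearch group id, and with the snapshot repository configured):

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-spark_2.10</artifactId>
  <version>2.1.0.BUILD-SNAPSHOT</version>
</dependency>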
When the documents are loaded into Elasticsearch, my metadata fields (e.g.
_id, _index, etc.) end up as part of the _source field.
Is there a way to exclude them from the _source?
I've tried using the new "es.mapping.exclude" configuration property, added
in this commit:
https://github.com/elasticsearch/elasticsearch-hadoop/commit/aae4f0460a23bac9567ea2ad335c74245a1ba069
(that's why I needed to take the latest build rather than using version
2.1.0.Beta3), but it doesn't seem to have any effect (although I'm not sure
it's even possible to exclude fields I'm also using for mapping, e.g.
"es.mapping.id").

A code snippet (I'm using a single-node Elasticsearch cluster for testing
purposes and running the Spark app from my desktop):

val conf = new SparkConf()...
conf.set("es.index.auto.create", "false")
conf.set("es.nodes.discovery", "false")
conf.set("es.nodes", "XXX:9200")
conf.set("es.update.script", "XXX")
conf.set("es.update.script.params", "param1:events")
conf.set("es.update.retry.on.conflict", "2")
conf.set("es.write.operation", "upsert")
conf.set("es.input.json", "true")
val documentsRdd = ...
// Use the _id field as the document id, and (try to) keep it out of _source
documentsRdd.saveJsonToEs("test/user",
  scala.collection.Map("es.mapping.id" -> "_id", "es.mapping.exclude" -> "_id"))

The JSON looks like this:

{
  "_id": "XXXX",
  "_type": "user",
  "_index": "test",
  "params": {
    "events": [
      {
        ...
      }
    ]
  }
}
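To be explicit: since "es.mapping.exclude" is set to "_id", I'd expect the
indexed _source to come back without the _id field, i.e. roughly:

{
  "_type": "user",
  "_index": "test",
  "params": {
    "events": [
      ...
    ]
  }
}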

Thanks!


Hey,
Has anyone experienced such an issue?
Perhaps Costin can help here?

Thanks!



Hi Itai,

Sorry I missed your email. I'm not clear from your post what your documents
look like - can you post a gist somewhere with the JSON input you are
sending to Elasticsearch?
Typically the metadata appears in the _source if it is declared that way.
You should be able to work around this by:

  1. using es.mapping.exclude - and, if that doesn't seem to be working,
  2. in the case of Spark, specifying the metadata through the saveWithMeta
     methods, which allows it to stay decoupled from the document itself
     (see the sketch below).

Since you are using JSON, option 1 is likely your best shot. If it doesn't
work for you, can you please raise an issue with a quick/small sample so we
can reproduce it?
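To illustrate option 2, a minimal sketch (based on the saveToEsWithMeta
variant in the 2.1 line; the index and documents below are made up for
illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._                // adds saveToEsWithMeta to pair RDDs
import org.elasticsearch.spark.rdd.Metadata._   // ID, VERSION, TTL, ...

val sc = new SparkContext(new SparkConf().setAppName("es-meta-sketch"))

// The documents themselves no longer carry _id - it travels as metadata,
// so nothing needs to be excluded from _source.
val doc1 = Map("params" -> Map("events" -> Seq("login")))
val doc2 = Map("params" -> Map("events" -> Seq("logout")))

// Pair RDD of (metadata, document); only the document ends up in _source.
val pairs = sc.makeRDD(Seq(
  (Map(ID -> "user-1"), doc1),
  (Map(ID -> "user-2"), doc2)
))

pairs.saveToEsWithMeta("test/user")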

Thanks,


Thanks for the response Costin!
As you mentioned, option 1, i.e. es.mapping.exclude, is more appropriate
when working with JSON.
Since it doesn't seem to work, I've followed your advice and raised a new
issue ("Excluding fields when writing JSON documents from Spark to
Elasticsearch doesn't work",
https://github.com/elastic/elasticsearch-hadoop/issues/381), including a
small test application to reproduce it.
I'd be happy to hear what you think of it.
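For reference, the repro boils down to something like this (a condensed,
illustrative version - the index name and document are placeholders, not
the exact code attached to the issue):

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

val conf = new SparkConf().setAppName("es-exclude-repro")
conf.set("es.nodes", "localhost:9200")
conf.set("es.input.json", "true")

val sc = new SparkContext(conf)
val docs = sc.makeRDD(Seq("""{"_id":"1","params":{"events":[]}}"""))

// Expected: "_id" becomes the document id and is stripped from _source.
// Observed: "_id" still shows up inside _source.
docs.saveJsonToEs("test/user",
  Map("es.mapping.id" -> "_id", "es.mapping.exclude" -> "_id"))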

Thanks again,
Itai
