[Hadoop][Spark] Exclude metadata fields from _source

Itai_Yaffe · February 12, 2015, 6:27am

Hey,
I've recently started using Elasticsearch for Spark (Scala application).
I've added elasticsearch-spark_2.10 version 2.1.0.BUILD-SNAPSHOT to my
Spark application pom file, and used
org.apache.spark.rdd.RDD[String].saveJsonToEs() to send documents to
Elasticsearch.
When the documents are loaded into Elasticsearch, my metadata fields (e.g. id,
index, etc.) are loaded as part of the _source field.
Is there a way to exclude them from the _source?
I've tried using the new "es.mapping.exclude" configuration property, added
in this commit:
https://github.com/elasticsearch/elasticsearch-hadoop/commit/aae4f0460a23bac9567ea2ad335c74245a1ba069
(that's why I needed to take the latest build rather than version
2.1.0.Beta3), but it doesn't seem to have any effect. I'm also not sure
it's even possible to exclude fields that I'm using for mapping, e.g. the
one set in "es.mapping.id".

A code snippet (I'm using a single-node Elasticsearch cluster for testing
purposes and running the Spark app from my desktop):

    val conf = new SparkConf()...
    conf.set("es.index.auto.create", "false")
    conf.set("es.nodes.discovery", "false")
    conf.set("es.nodes", "XXX:9200")
    conf.set("es.update.script", "XXX")
    conf.set("es.update.script.params", "param1:events")
    conf.set("es.update.retry.on.conflict", "2")
    conf.set("es.write.operation", "upsert")
    conf.set("es.input.json", "true")
    val documentsRdd = ...
    documentsRdd.saveJsonToEs("test/user",
      scala.collection.Map("es.mapping.id" -> "_id", "es.mapping.exclude" -> "_id"))

The JSON looks like this:

    {
      "_id": "XXXX",
      "_type": "user",
      "_index": "test",
      "params": {
        "events": [
          {
            ...
          }
        ]
      }
    }

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8055055f-8787-492b-97f4-144b2a7f7fce%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Itai_Yaffe · February 18, 2015, 8:26am

Hey,
Has anyone experienced such an issue?
Perhaps Costin can help here?

Thanks!

Itai_Yaffe · February 18, 2015, 8:27am

Hey,
Has anyone experienced such an issue?
Perhaps Costin can help here?

Thanks!

costin · February 18, 2015, 5:42pm

Hi Itai,

Sorry I missed your email. It's not clear to me from your post what your
documents look like - can you post a gist somewhere with the JSON input that
you are sending to Elasticsearch?
Typically the metadata appears in the _source if it is declared that way.
You should be able to work around this by using:

1. es.mapping.exclude - though you report that it doesn't seem to be working
2. in the case of Spark, specifying the metadata through the saveToEsWithMeta
   methods, which allow it to stay decoupled from the object itself.

Since you are using JSON, 1 is likely your best shot. If it doesn't work for
you, can you please raise an issue with a quick/small sample so we can
reproduce it?
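
To make the second option above more concrete (keeping metadata decoupled from the document), here is a minimal sketch in plain Scala with no Spark or connector dependency. It assumes the documents are available as maps rather than raw JSON strings, and the names MetadataSplit, split and toIdAndBody are hypothetical helpers invented for illustration, not part of elasticsearch-hadoop. The metadata keys are the ones shown in the JSON from the question.

```scala
// Sketch only: separates Elasticsearch metadata keys from the document body.
// MetadataSplit, split and toIdAndBody are hypothetical names, not es-hadoop API.
object MetadataSplit {
  // Top-level keys treated as metadata; taken from the JSON in the question.
  val metadataKeys: Set[String] = Set("_id", "_type", "_index")

  // Returns (metadata, body): metadata holds only the keys listed above,
  // body holds everything else and is what should end up in _source.
  def split(doc: Map[String, Any]): (Map[String, Any], Map[String, Any]) =
    doc.partition { case (key, _) => metadataKeys.contains(key) }

  // Pairs the document id with the cleaned body. Throws when "_id" is missing,
  // because a document without an id cannot be addressed for an upsert.
  def toIdAndBody(doc: Map[String, Any]): (String, Map[String, Any]) = {
    val (meta, body) = split(doc)
    val id = meta.getOrElse("_id", sys.error("document has no _id")).toString
    (id, body)
  }

  def main(args: Array[String]): Unit = {
    val doc = Map[String, Any](
      "_id" -> "XXXX",
      "_type" -> "user",
      "_index" -> "test",
      "params" -> Map("events" -> List.empty[Any])
    )
    val (id, body) = toIdAndBody(doc)
    println(s"id = $id, body keys = ${body.keySet}")
  }
}
```

With the connector on the classpath, an RDD of such (id, body) pairs is the shape that the pair-RDD method saveToEsWithMeta (from org.elasticsearch.spark._) takes. Whether that route also accepts raw JSON strings as values is not something this thread confirms, so treat it as an option to test rather than a known fix.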

Thanks,

Itai_Yaffe · February 19, 2015, 5:03pm

Thanks for the response, Costin!
As you mentioned, option 1, i.e. es.mapping.exclude, is more appropriate
when working with JSON.
Since it doesn't seem to work, I've followed your advice and raised a new
issue (elastic/elasticsearch-hadoop issue #381, "Excluding fields when writing
JSON documents from Spark to Elasticsearch doesn't work"), including a small
test application to reproduce it.
I'd be happy to hear what you think of it.

Thanks again,
Itai
