Elasticsearch losing data not in _source on _update


(Jordan Reiter) #1

Hi,

Every time I run a POST request using _update, I notice that any indexed
information I didn't put in _source appears to go missing.

Obviously, it would be ideal if I didn't have to store, for example, the
contents of a several-megabyte file in _source in order to keep it in my
record after calling the _update method on my index/mapping.

To start, here is the version info for Elasticsearch:

{
  "status" : 200,
  "name" : "Feron",
  "version" : {
    "number" : "1.3.1",
    "build_hash" : "2de6dc5268c32fb49b205233c138d93aaf772015",
    "build_timestamp" : "2014-07-28T14:45:15Z",
    "build_snapshot" : false,
    "lucene_version" : "4.9"
  },
  "tagline" : "You Know, for Search"
}

Here's my cluster health:

{
  "cluster_name" : "my-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 5,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}

A script for recreating the issue is attached. In it, I create a mapping and save a record using the attachment plugin. Before the update, the record correctly matches searches on a field in _source, on a field excluded from _source, and within the content (attachment) field, which is also excluded from _source.

As soon as I make the POST request to …/_update, searches against fields excluded from _source return 0 hits.

Is the only solution to store all fields in _source if I plan on calling _update on the record?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/341a24f2-aedf-4f5f-9a9e-1434b9ea1e62%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(David Pilato) #2

Yes, you need the complete source, so excluding fields won't work as expected.
In that case, I guess you need to send the attachment again.
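
To illustrate why the excluded fields stop matching, here is a minimal sketch in plain Python. It does not call Elasticsearch; it only models the behaviour described above, assuming that an update reads the stored _source, merges the partial document into it, and reindexes the result. The function names and the "title"/"content" fields are hypothetical.

```python
import copy


def index_document(doc, source_excludes):
    """Model indexing: every field is searchable, but only the
    non-excluded fields are kept in the stored _source."""
    indexed = copy.deepcopy(doc)
    source = {k: v for k, v in doc.items() if k not in source_excludes}
    return {"indexed": indexed, "_source": source}


def partial_update(stored, partial_doc, source_excludes):
    """Model _update: start from the stored _source (not from the
    original document), merge the partial doc, then reindex."""
    merged = dict(stored["_source"])
    merged.update(partial_doc)
    return index_document(merged, source_excludes)


def search(stored, field, term):
    """Very rough stand-in for a match query on one field."""
    return term in str(stored["indexed"].get(field, ""))


excludes = {"content"}  # hypothetical field excluded from _source
record = index_document(
    {"title": "Report", "content": "quarterly revenue figures"}, excludes
)
before = search(record, "content", "revenue")

record = partial_update(record, {"title": "Annual report"}, excludes)
after = search(record, "content", "revenue")

print(before, after)  # True False
```

In this model the excluded field is searchable after the first index call, but it is absent from the document that the update rebuilds, so it is no longer indexed afterwards.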

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr



(Jordan Reiter) #3

So I guess using updates is not a good idea for records with file
attachments.



(David Pilato) #4

In your case, it's not, because you excluded the attachment field.

If you are a Java developer, you could easily use Tika directly in your own code and send only the extracted content to Elasticsearch, not the binary file.
In that case, you could remove the mapper attachment plugin.

If not, I think you need to send the full JSON document again, including the binary file.
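
As a rough sketch of the second option (resending the whole document), the snippet below rebuilds the full JSON body, including the base64-encoded file that the attachment field expects. The field name "file", the metadata, and the sample bytes are assumptions for illustration only. The resulting body would be sent with a normal index request for the same document ID rather than to _update.

```python
import base64
import json


def build_full_document(metadata, file_bytes, attachment_field="file"):
    """Return the complete document: all metadata fields plus the
    attachment, base64-encoded so it can travel inside JSON."""
    doc = dict(metadata)
    doc[attachment_field] = base64.b64encode(file_bytes).decode("ascii")
    return doc


# Hypothetical metadata and file contents.
full_doc = build_full_document(
    {"title": "Annual report"}, b"%PDF-1.4 example bytes"
)
request_body = json.dumps(full_doc)
print(request_body)
```

The cost is that the binary file has to be available to the client every time any field of the record changes.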




(Jordan Reiter-2) #5

Is there any way to keep the field stored in _source without getting it
back when I pull in the _source record? Even the extracted text is going
to be huge when you're talking about documents of 20-30+ pages.


--
Jordan Reiter
AACE - Association for the Advancement of Computing in Education
Email: jordan@aace.org | Website: www.aace.org | +1.267.438.2388



(Jordan Reiter-2) #6

Never mind, a little googling answered that question:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/root-object.html#source-field
"In a search request, you can ask for only certain fields by
specifying the _source parameter in the request body".

That neatly resolves my issue!

It does mean I'm going to have to change my mapping and, probably,
re-index my entire collection.
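
For reference, a minimal sketch of what such a search request body might look like, built in Python. The "content" field name and the query are hypothetical, and the "exclude" key is the spelling I recall for the 1.x request body (later versions use "includes"/"excludes"), so check it against the documentation for your version.

```python
import json


def search_body(query_text, excluded_fields):
    """Build a search request body that still searches the large
    field but leaves it out of the _source returned with each hit."""
    return {
        "query": {"match": {"content": query_text}},
        "_source": {"exclude": list(excluded_fields)},
    }


body = search_body("revenue", ["content"])
print(json.dumps(body, indent=2))
```

With this approach the large field stays in the stored _source, so _update keeps working, while search responses stay small.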

Thanks for your help
Jordan


