It makes sense. I guess we'll write a new REST action later, as you suggest.
Short answer: modifying the source after a standard index or bulk
action has been executed is not possible.
Long answer: it depends. If you look
at https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/action/index/TransportIndexAction.java#L188
you can see how the index action (and the bulk action, which uses the
same code fragment) depends on the source as a byte reference that is
passed around through the whole mapping/analysis phase. It is
absolutely required that the _source field represents exactly the
document from which the index analysis was produced.
But it's not a serious limitation. In fact, it is a good thing that no
ES user has an easy way to tamper with _source data and open the box
to all kinds of mysterious bugs just by installing (possibly
malevolent) plugins that change the way the standard actions are
expected to work.
Of course, it is possible in plugin code to write a new action (plus
another REST action endpoint) which works similarly to the index/bulk
actions but also performs the additional _source modification you want.
For example, I have implemented another bulk-style action in ES which
works with a different BulkProcessor class and has a different style of
error handling. That was not possible by modifying the existing bulk
action, only by adding another bulk action.
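As a very rough sketch (untested, against the 1.x plugin API), such a
plugin mainly has to wire up its own action and REST endpoint instead of
touching the built-in index/bulk actions. The MySourceTransform* classes
below are just placeholder names for what you would have to implement;
only the registration hooks are standard plugin API:

import org.elasticsearch.action.ActionModule;
import org.elasticsearch.plugins.AbstractPlugin;
import org.elasticsearch.rest.RestModule;

// Sketch only: MySourceTransformAction, TransportMySourceTransformAction and
// RestMySourceTransformAction are hypothetical classes the plugin would define.
public class SourceTransformPlugin extends AbstractPlugin {

    @Override
    public String name() {
        return "source-transform";
    }

    @Override
    public String description() {
        return "Custom index-style action that may rewrite _source before parsing";
    }

    // Register the new transport action next to the standard index/bulk actions.
    public void onModule(ActionModule actionModule) {
        actionModule.registerAction(MySourceTransformAction.INSTANCE,
                TransportMySourceTransformAction.class);
    }

    // Expose the action under its own REST endpoint.
    public void onModule(RestModule restModule) {
        restModule.addRestAction(RestMySourceTransformAction.class);
    }
}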
Jörg
On Thu, Jun 12, 2014 at 11:53 PM, Jakub Kotowski
<jakub@sindicetech.com> wrote:
Hi again,
I have a follow-up question about the ParseContext and processing
documents before indexing.
Now I need to modify a document before it is parsed by ElasticSearch.
I tried to do it by modifying context.source(), but that leads to a
corrupt index. I guess that's because context.parser() is also
initialized with the same byte array (at least w.r.t. its contents)
as context.source(). So in order to mutate the byte array, I would
need to do it in the parser too. The parser, however, is already
started and, by the time I get to it, has already processed at least
two tokens. That means I could conceivably try to restart the parser
with the modified byte array and lead it to the same (or
corresponding) state it would originally have reached thanks to the
ObjectMapper's actions on it. This would, however, very clearly be a
very fragile hack... One way of avoiding that might be to somehow
arrange to be the first root mapper executed by the ObjectMapper, but
I think that order is hardcoded and cannot be easily changed (there's
no client API for it, AFAIK).
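Roughly, what I tried looks like this (just a sketch, not working code;
rewriteSource() is a placeholder for our transformation, and I'm assuming
ParseContext's source() has a matching setter, as described above):

// Inside our custom root mapper -- sketch of the attempt only.
@Override
public void parse(ParseContext context) throws IOException {
    BytesReference original = context.source();
    BytesReference rewritten = rewriteSource(original); // placeholder transformation
    // Replace the stored source; the problem is that context.parser() was created
    // from the original bytes and has already consumed tokens, so the indexed
    // fields and the stored _source no longer match.
    context.source(rewritten);
}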
Is there some way of modifying a document before ElasticSearch
gets to parse it?
Basically, I need to send a document to ES that contains some JSON
subobjects understood by the custom parser of our plugin. It doesn't
make much sense for ElasticSearch to index them as they are, so
ideally we would like to transform them a bit.
Thanks for any pointers.
Jakub
On Friday, May 23, 2014 6:56:32 PM UTC+1, Jörg Prante wrote:
In answer to (1), in each custom mapper, you have access to
ParseContext in the method
public void parse(ParseContext context) throws IOException
In the ParseContext, you can access _source with the source()
method to do whatever you want, e.g. copy it, parse it, index
it again etc.
(2) is a slight misconception, since _source is not a field,
but a "field container": a byte array passed through the
ES API so the field mappers can do their work.
(3) As said, it is possible to copy _source, but only
internally in the code of a custom field mapper, not by
configuration in the mapping, since _source is reserved for
special treatment inside ES and users should not be able to
tamper with it.
So a customized mapper in a plugin could work like this in the
root object:
"mappings" : {
"properties" : {
...
"_siren" : { "type" : "siren" }
}
}
and in the corresponding code in the custom mapper, when the field
_siren is processed because of the type "siren", it copies the byte
array from _source in the ParseContext. (It need not be the field
name _siren; this is just an example name.)
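As a rough sketch (untested; SirenField and sirenFieldType() are just
placeholder names for whatever your plugin indexes), the parse method
of such a mapper could look like this:

@Override
public void parse(ParseContext context) throws IOException {
    // The complete document, exactly as it was sent to the index action.
    BytesReference source = context.source();
    // Copy the raw bytes and hand them to the plugin's own field/analysis
    // chain, without touching the original _source.
    byte[] copy = source.toBytes();
    context.doc().add(new SirenField(names().indexName(), copy, sirenFieldType()));
}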
Jörg
On Fri, May 23, 2014 at 5:38 PM, Jakub Kotowski
<ja...@sindicetech.com> wrote:
Hi Jörg,
thanks for the reply. Yes, what you suggest is a way to
improve our current approach so that we get a subdoc
instead of JSON encoded in a string field.
What we would like to achieve is to always be able to
process any document that comes to elasticsearch as a
whole, i.e. be it { "title": "my title", "content" : "my
content"} or {"name" : "john", "surname" : "doe"}.
For that, we would need to either (1) set an analyzer
for the whole input document, or (2) set an analyzer for
the _source field, which already contains the whole doc, or
(3) copy the _source field to a normal field, let's say
_siren, and set an analyzer for that.
(1) and (2) seem to be impossible.
So we are exploring option (3), which also seems difficult.
Jakub
On Friday, May 23, 2014 4:24:39 PM UTC+1, Jörg Prante wrote:
Not sure what the plugin is doing, but if you want to
process dedicated JSON data in an ES document, you
could prepare an analyzer for a new field type. So
users can assign special meaning in the mapping to a
field of their preference.
E.g. a mapping with
"mappings: {
"mycontent" : { "type" : "siren" }
}
and a given document would look like
"mycontent" : {
"title" : "foo",
"name" : "bar"
...
}
and then you could extract the whole JSON subdoc from
the doc under "mycontent" into your analyzer plugin
and process it.
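Concretely, the mapper behind the "siren" type could pull the whole
subobject out of the document parser, roughly like this (just a sketch;
processSirenJson() stands for whatever your plugin does with it):

@Override
public void parse(ParseContext context) throws IOException {
    // The parser is positioned at the "mycontent" subobject when this mapper runs.
    XContentParser parser = context.parser();
    XContentBuilder copy = XContentFactory.jsonBuilder();
    copy.copyCurrentStructure(parser); // consumes the whole JSON subtree
    processSirenJson(copy.bytes());    // placeholder: hand the subdoc to the plugin
}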
For an example, you could look into plugins like the
StandardNumber analyzer, where I defined a new type
"standardnumber" for analysis:
https://github.com/jprante/elasticsearch-analysis-standardnumber/blob/master/src/main/java/org/xbib/elasticsearch/index/mapper/standardnumber/StandardNumberMapper.java
Jörg
On Fri, May 23, 2014 at 4:48 PM, Jakub Kotowski
<ja...@sindicetech.com> wrote:
Hello all,
we are trying to implement a SIREn plugin for
ElasticSearch for indexing and querying documents.
We already implemented a version which uses SIREn
to index and query a specific field (called
"contents" below) which contains a JSON document
as a string. An example of a doc:
{
  "id": 3,
  "contents": "{\"title\":\"This is an another article about SIREn.\",\"content\":\"bla bla bla \"}"
}
Instead, we would like to index the whole document
as it is posted to ElasticSearch to avoid the need
for a special loader that transforms an input JSON
to the required form. So then the user would
simply post a document such as:
{
  "id": 3,
  "title": "This is an another article about SIREn.",
  "content": "bla bla bla "
}
and it would be indexed as a whole both by
ElasticSearch and by the SIREn plugin.
One problem we encountered is that it is not
possible to use copyTo for the _source field and
then only configure an analyzer for the copy.
It seems that the cleanest solution would be to
modify the SourceFieldMapper class to allow copyTo.
As a workaround, we are going to create a class
that extends SourceFieldMapper, set copyTo for
the _source field to a new field that will then be
used for SIREn, and register it as follows:

mapperService.documentMapperParser().putRootTypeParser("_source",
        new ModifiedSourceFieldMapper.TypeParser());
Does it sound OK or is there a simpler/cleaner
solution?
Thank you in advance,
Jakub