ES Hadoop--Index only new documents without killing job from exceptions?


(James Campbell) #1

Hi ES-Hadoop users--

I have a large list of simple documents that I would like to index for an
autocomplete feature. At batch processing time, I do not know which values
are new (never seen before) and which are not (some other part of the
update process changed, but the autocomplete-relevant portion of the
document did not).

I believe I could simply write all of the documents to the index whenever I
run a new batch with the default es.write.operation=index, but that will
cause ES to reindex the document each time even if it wasn't updated.

On the other hand, if I choose to use es.write.operation=create, then any
existing documents will cause the job to fail.
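
To make the two modes concrete, here is a minimal sketch of the settings involved, written as plain Python dictionaries of the kind one might hand to a job as configuration. `es.write.operation`, `es.mapping.id`, `es.resource` and `es.nodes` are ES-Hadoop settings; the host, the `autocomplete/suggestion` resource and the `id` field name are placeholders assumed only for illustration.

```python
# Settings shared by both variants.
# Host, index/type and id field are assumed placeholders, not values from this thread.
base_conf = {
    "es.nodes": "localhost:9200",
    "es.resource": "autocomplete/suggestion",
    "es.mapping.id": "id",  # document field used as the Elasticsearch _id
}

# Default mode: a document whose _id already exists is replaced (reindexed).
index_conf = {**base_conf, "es.write.operation": "index"}

# Create mode: a document whose _id already exists is rejected with an error.
create_conf = {**base_conf, "es.write.operation": "create"}
```

The only difference between the two variants is the value of `es.write.operation`; everything else about the job stays the same.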

Is there a way to combine those behaviors, so that I can allow
elasticsearch to simply ignore requests to reindex existing documents
(based on _id) but not to throw an exception that kills the entire job?

James Campbell

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/2e5b93ef-0c42-4068-bc2c-33e4efbe429b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Costin Leau) #2

I would recommend indexing the document, since it's a 'cheap' operation per
document and it covers any potential differences between the docs. Also,
from a performance POV you are not going to lose much, since you are sending
the doc to ES anyway, which does the hashing and returns the error to the user.
So the only saving you might potentially see is the actual indexing, which
should become a problem only when dealing with large amounts of docs.

That being said, there's already an issue open [1] for trapping/handling
errors during a job (to prevent it from being cancelled), which could
potentially be used for this purpose as well. Feel free to add your comments
to it.

[1] https://github.com/elasticsearch/elasticsearch-hadoop/issues/160

In reply to James Campbell's message of Thu, Jul 3, 2014 at 8:49 PM.


(James Campbell) #3

Thanks, Costin. That makes sense; I've also commented on the issue you
mentioned on GitHub.

Having more control over when to fail a job or when to ignore certain
errors would definitely be a great feature from my perspective. I've
encountered a few different areas where I think extra control would be
valuable:

(1) Ability to fail on indexing failures (that persist despite the retry
policy). Currently, multiple failed bulk retries are reported only via a
counter. Since job control programs such as Oozie don't make it easy to
fail a workflow based on a counter, I think it makes more sense to be able
to fail a job in which batches failed completely; otherwise the documents
may never become searchable in Elasticsearch.

(2) DocumentAlreadyExists exceptions with the "create" write mode. Given
the batch nature of Hadoop, there are cases (e.g. building autocomplete)
where it may make sense to update an index only with new data. To avoid a
reindex cost, it would be nice to be able to have a job succeed even if ES
throws a DocumentAlreadyExists exception, so we can just throw data over to
ES, let it check whether each document exists, and ignore the request if it
does.

(3) Malformed/bad data. Despite (2) above, it would be ideal to still be
able to throw errors and fail a job in the case of invalid data,
particularly in the case of legitimately invalid JSON (such as unescaped
special characters that may have occurred in data that is being batch
processed from a binary container format in HDFS).
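
The three cases above amount to a per-item error policy. The sketch below is one hypothetical way to express it in plain Python; it is not how ES-Hadoop behaves or is implemented. It assumes a response shaped like the Elasticsearch bulk API's (an `items` list whose entries carry an `_id` and an HTTP-style `status`, with 409 signalling that the document already exists). The function names and the sample response are invented for illustration.

```python
import json

# Assumption: 409 (conflict) is the status returned when a "create" hits an existing _id.
IGNORABLE_STATUSES = {409}


def classify_bulk_items(bulk_response):
    """Split bulk items into written, ignored (already exists) and fatal."""
    written, ignored, fatal = [], [], []
    for item in bulk_response.get("items", []):
        # Each item is a single-key dict such as {"create": {...}}.
        _action, result = next(iter(item.items()))
        status = result.get("status", 0)
        doc_id = result.get("_id")
        if 200 <= status < 300:
            written.append(doc_id)
        elif status in IGNORABLE_STATUSES:
            ignored.append(doc_id)  # case (2): existing document, skip quietly
        else:
            # cases (1) and (3): anything else should surface as a failure
            fatal.append((doc_id, status, result.get("error")))
    return written, ignored, fatal


def should_fail_job(bulk_response):
    """Fail only when something other than an existing document went wrong."""
    _, _, fatal = classify_bulk_items(bulk_response)
    return bool(fatal)


def is_valid_json(raw):
    """Pre-flight check for case (3): reject records that do not parse as JSON."""
    try:
        json.loads(raw)
        return True
    except ValueError:
        return False


# Hypothetical response: one new document, one existing document, one bad document.
sample = {
    "errors": True,
    "items": [
        {"create": {"_id": "a", "status": 201}},
        {"create": {"_id": "b", "status": 409, "error": "DocumentAlreadyExistsException"}},
        {"create": {"_id": "c", "status": 400, "error": "MapperParsingException"}},
    ],
}
```

With this policy a batch containing only new and already-existing documents would let the job succeed, while a batch like `sample`, which also contains a rejected document, would make `should_fail_job` return true.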

In reply to Costin Leau's message of Sun, Jul 6, 2014 at 4:38 PM.

