I am building a system in which I will have two sources of updates:
Bulk updates from the source of truth (db) <- always inserting complete documents
Live updates <- inserts and updates (complete and incomplete docs)
Also, let's assume that each insert/update carries a timestamp which we trust (not the ES timestamp).
The idea is to have a complete, up-to-date index once the bulk update finishes. To achieve this I need to guarantee that I end up with the correct data. This would work well if everything we did were upserts and the inserts/updates coming into ES had strictly increasing timestamps.
But one can imagine a problematic situation. While performing bulk indexing, we:
a) read an object from the db,
b) process it,
c) send it to ES.
Suppose a live update for the same object arrives after step (a) and before the object makes it to ES in phase (c). That is, ES gets an update with new data, and only after that do we get the insert with the entire document, carrying older data, from the source of truth. Hence, at the moment of phase (c), ES already holds a document with a newer timestamp than the newly added one.
My theoretical solution: for each operation, keep the timestamp of that change (the timestamp from the system that made the change, not from Elasticsearch). Let's say that all of the operations we perform are upserts. Then, whenever we get an insert or an update (let's call it doc), we run the following script (pseudo-MVEL) inside ES:
{
  if (doc.timestamp > ctx._source.timestamp) {
    // doc is newer than what is in ES:
    // update the index with all of the info from the new doc
    upsert(doc);
  } else {
    // the document already in ES has a newer timestamp; note that it may be
    // an incomplete document (a partial update), so fill the missing fields
    // in the ES document with values from doc
  }
}
My questions are:
Is there a better approach?
If so, is there a simple way of doing the 'fill the missing fields in the ES document with values from doc' operation/script?
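To make the intended merge concrete, here is a minimal Python sketch of the 'fill the missing fields' step, done outside ES (the function name `merge_preferring_stored` is mine, purely for illustration):

```python
def merge_preferring_stored(stored: dict, incoming: dict) -> dict:
    """Keep every field of the stored (newer, possibly partial) document
    and fill in only the fields it lacks from the incoming (older,
    complete) document."""
    merged = dict(incoming)  # start from the older, complete doc
    merged.update(stored)    # newer values win wherever both have a field
    return merged

# The partial live update (newer) only touched "price";
# the bulk doc (older) has the full record.
stored = {"id": 1, "price": 12, "timestamp": 200}
incoming = {"id": 1, "price": 10, "name": "widget", "timestamp": 100}
print(merge_preferring_stored(stored, incoming))
# -> {'id': 1, 'price': 12, 'name': 'widget', 'timestamp': 200}
```

In dict terms the rule is simply "start from the older complete doc, then overlay the newer one": newer values win and the gaps get filled.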
On Thursday, May 1, 2014 2:47:39 PM UTC-7, Michał Zgliczyński wrote:
Hi,
Thank you for your response. I have looked through this blog post: Elasticsearch Platform — Find real-time answers at scale | Elastic
It looks as if external versioning would be the way to go: have the timestamps act as version numbers and let ES keep only the document with the newest version. However, in the situation I presented above, ES will fail. A quote from the post:
"With version_type set to external, Elasticsearch will store the version
number as given and will not increment it. Also, instead of checking for an
exact match, Elasticsearch will only return a version collision error if
the version currently stored is greater or equal to the one in the indexing
command. This effectively means “only store this information if no one else
has supplied the same or a more recent version in the meantime”.
Concretely, the above request will succeed if the stored version number is
smaller than 526. 526 and above will cause the request to fail."
In my example we would have exactly that situation: a partial doc with a larger version number (later timestamp) is already stored in ES, and then we get the complete document with a smaller timestamp. In that case we would like to merge the two documents so that we keep all of the fields from the partial doc, while the other fields (not currently present in the ES document) are filled in from the complete document.
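If I read the quoted semantics correctly, the version_type=external acceptance rule boils down to a strict greater-than check, which is exactly what bites us here. A small Python sketch of that rule (my own illustration, not ES code):

```python
def external_version_accepts(stored_version, incoming_version):
    """Models ES with version_type=external: a version-conflict error is
    raised whenever the stored version is greater than or equal to the
    incoming one; a missing document always accepts the write."""
    return stored_version is None or incoming_version > stored_version

# Using the 526 example from the post:
print(external_version_accepts(525, 526))  # True: stored < 526, succeeds
print(external_version_accepts(526, 526))  # False: version conflict
# And our race: a partial doc at ts=200 is already stored, so the
# complete bulk doc arriving with ts=100 is rejected outright, never merged.
print(external_version_accepts(200, 100))  # False
```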
Thanks!
Michal Zgliczynski
On Thursday, May 1, 2014 at 2:58:31 PM UTC-7, Rob Ottaway wrote:
I missed that the later doc would only be partial. What is the reason to use the partial doc? That really complicates things.
Filling in missing fields is going to be a very large headache, and you'll probably kill performance trying to do it, too. It will likely be so complex that it causes a lot more trouble than it saves.
I think if you can present the overall use cases more fully, you will get better insight into how to work this out.
On Thursday, May 1, 2014 4:51:03 PM UTC-7, Michał Zgliczyński wrote:
My system is changing rapidly. The end goal is to have all of the data inside the ES index. The way I have it set up currently, two different systems write to the ES index:
1. A bulk job: run through all the dbs, fetch things in batches of 5k, and send them to ES.
2. A live updating job: pick up the newest changes and send them to ES, as either updates or inserts. Note: the updates don't contain full documents.
After steps (1) and (2) I would like an (almost) 100% guarantee that the index is complete and up to date.
I think this is quite a common use case if you want an index with live data, not data that is stale as of the start of the bulk job.
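To make the race concrete, here is a toy Python simulation of the two writers against a plain last-write-wins store (all names are mine, purely illustrative): the live update lands first, and the older bulk write then silently clobbers it.

```python
# Toy last-write-wins store: whoever writes last wins, timestamps ignored.
index = {}

def plain_upsert(doc_id, doc):
    index[doc_id] = doc

# Bulk job reads the row from the db (step a) ...
bulk_doc = {"id": 1, "price": 10, "timestamp": 100}
# ... meanwhile a live update for the same object reaches ES first:
plain_upsert(1, {"id": 1, "price": 12, "timestamp": 200})
# ... and only then does the bulk job finish step (c):
plain_upsert(1, bulk_doc)

print(index[1])  # -> {'id': 1, 'price': 10, 'timestamp': 100}  (stale!)
```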
On Thursday, May 1, 2014 at 7:45:53 PM UTC-7, Rob Ottaway wrote:
I agree wholly that this is a pretty normal use case: wanting to bulk index now and again while maintaining an index with updates in (near) real time. But I question your proposed solution. If the "live updating job" does not feed you whole, complete documents from the system of truth (the db, right?), then you aren't going to be able to use versioned documents, and versioned docs are how you solve this type of situation in Elasticsearch.
You would do better to alter your "live updates" to fetch the whole doc from the db (it has to be updated there anyhow, right?) for live-update index operations, and then index it using versions to solve the race condition with the bulk job.
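A toy Python model of this suggestion, under the assumption that the live updater can re-read the full row from the db: external versioning is simulated as "accept only strictly greater versions", and the late bulk write's conflict is simply swallowed. The class and exception names are mine, not an Elasticsearch API.

```python
class VersionConflict(Exception):
    pass

class VersionedIndex:
    """Toy model of version_type=external: an index op succeeds only
    when its version is strictly greater than the stored one."""
    def __init__(self):
        self.docs = {}  # doc_id -> (version, doc)

    def index(self, doc_id, doc, version):
        stored = self.docs.get(doc_id)
        if stored is not None and version <= stored[0]:
            raise VersionConflict(f"stored {stored[0]} >= incoming {version}")
        self.docs[doc_id] = (version, doc)

es = VersionedIndex()

# Live update: fetch the WHOLE row from the db and index it with its
# change timestamp as the external version.
es.index(1, {"id": 1, "price": 12, "name": "widget"}, version=200)

# Bulk job arrives late with the older snapshot; the conflict tells us
# the index already holds something newer, so we just drop the write.
try:
    es.index(1, {"id": 1, "price": 10, "name": "widget"}, version=100)
except VersionConflict:
    pass  # safe to ignore: a newer full document is already indexed

print(es.docs[1])  # -> (200, {'id': 1, 'price': 12, 'name': 'widget'})
```

Because every write is now a complete document, "older version loses" is always the right outcome and no field-by-field merge is needed.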