Batch doc updates and real-time search


(Otis Gospodnetić) #1

Hello,

Does ES have anything in it that makes it better than Lucene when one
needs to modify a large number of docs (e.g. modify a "tags" field for
50K documents in the result set) and see the changes reflected in
real-time?

With straight Lucene, one could get the NRT part, but modifying 50K
docs would trigger 50K doc deletes and 50K doc adds for just slightly
modified documents.
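
To make the cost described above concrete, here is a minimal sketch of an append-only index in which a document cannot be edited in place. `ToyIndex` and its methods are hypothetical names invented for illustration, not Lucene or ES APIs; the sketch only assumes that an "update" means marking the old copy deleted and writing a complete new copy.

```python
class ToyIndex:
    """Toy append-only index: a stored document is never modified in place."""

    def __init__(self):
        self.segments = []    # (doc_id, fields) entries in write order
        self.deleted = set()  # positions of entries marked as deleted
        self.live = {}        # doc_id -> position of the current version
        self.deletes = 0
        self.adds = 0

    def add(self, doc_id, fields):
        self.segments.append((doc_id, dict(fields)))
        self.live[doc_id] = len(self.segments) - 1
        self.adds += 1

    def update(self, doc_id, changes):
        # No in-place edit: mark the old copy deleted, then write a whole new copy.
        pos = self.live[doc_id]
        _, old_fields = self.segments[pos]
        self.deleted.add(pos)
        self.deletes += 1
        self.add(doc_id, {**old_fields, **changes})

    def get(self, doc_id):
        return self.segments[self.live[doc_id]][1]


index = ToyIndex()
for i in range(50_000):
    index.add(i, {"title": f"doc {i}", "tags": ["old"]})
index.adds = 0  # count only the work done by the batch update below

# Change just the "tags" field on every document.
for i in range(50_000):
    index.update(i, {"tags": ["new"]})

print(index.deletes, index.adds)  # 50000 50000
```

Even though only one small field changes per document, the batch costs one delete and one full add for each of the 50K documents.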

Does ES happen to have a solution for this?

Thanks,
Otis

Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch


(Shay Banon) #2

Hi,

No, ES does not handle this; you will need to do a full update of the
document to change a single field (for example, to rename a tag). There are
hacks to do it on top of Lucene (for example, by maintaining two parallel
indices), but they are not really manageable. I'm not sure how NRT fits into
this, though. Is it because of the deletes and the cloning? If so, note that
the NRT reader is only reopened on a schedule (though there is an API to
trigger it).

The nice thing is that this will be much, much faster since you go
distributed and basically spread the load.
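
As a rough illustration of the two points above (a field change means reindexing the whole document, and changes only become searchable once the reader is reopened), here is a toy model. `ToyNrtIndex`, `index`, `get`, `refresh`, and `search_tag` are hypothetical names made up for this sketch and are not the ES or Lucene API; `refresh()` simply stands in for reopening the NRT reader, whether on a schedule or through an explicit call.

```python
class ToyNrtIndex:
    """Toy index: writes are buffered and only searchable after refresh()."""

    def __init__(self):
        self._pending = {}     # written but not yet visible to search
        self._searchable = {}  # snapshot that searches run against

    def index(self, doc_id, doc):
        # A "full update" is just indexing the whole document again.
        self._pending[doc_id] = dict(doc)

    def get(self, doc_id):
        # Latest version by ID, whether or not it is searchable yet.
        if doc_id in self._pending:
            return self._pending[doc_id]
        return self._searchable[doc_id]

    def refresh(self):
        # Stand-in for reopening the NRT reader.
        self._searchable = {**self._searchable, **self._pending}
        self._pending = {}

    def search_tag(self, tag):
        return sorted(i for i, d in self._searchable.items() if tag in d["tags"])


idx = ToyNrtIndex()
idx.index(1, {"title": "a", "tags": ["lucene"]})
idx.index(2, {"title": "b", "tags": ["lucene"]})
idx.refresh()

# Rename the tag: fetch the full doc, change one field, reindex it whole.
for doc_id in idx.search_tag("lucene"):
    doc = dict(idx.get(doc_id))
    doc["tags"] = ["search" if t == "lucene" else t for t in doc["tags"]]
    idx.index(doc_id, doc)

before = idx.search_tag("search")  # refresh has not run yet
idx.refresh()                      # explicit refresh instead of waiting
after = idx.search_tag("search")
print(before, after)  # [] [1, 2]
```

In this model the renamed tag is invisible to searches until the refresh, which is why the timing of the reopen matters for a large batch.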

-shay.banon

On Thu, Jun 10, 2010 at 10:55 PM, Otis otis.gospodnetic@gmail.com wrote:

Hello,

Does ES have anything in it that makes it better than Lucene when one
needs to modify a large number of docs (e.g. modify a "tags" field for
50K documents in the result set) and see the changes reflected in
real-time?


(Otis Gospodnetić) #3

Thanks Shay,

More below.

On Jun 10, 4:03 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

The nice thing is that this will be much, much faster since you go
distributed and basically spread the load.

Could you please expand on this a bit? Since I didn't mention
distributed search, I wonder what you are referring to.

Are you saying that IF I were to involve multiple shards (and thus
multiple nodes/servers), batch doc updates would be faster because the
docs would be spread over multiple nodes, so updates of subsets of docs
would happen in parallel and the overall time needed to update a large
batch would be shorter?

Thanks,
Otis

Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch


(Shay Banon) #4

Yep, I meant that the indexing or update process would be spread
across several nodes, and thus would be faster. Sadly, there is no simple
solution for this in the Lucene world as far as I know.
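
A minimal sketch of the idea of spreading a batch update across shards follows. Everything in it is an assumption for illustration: the shard count, the CRC32-based `shard_for` routing, and the `retag_shard` helper are hypothetical, and the worker threads merely stand in for separate nodes. It shows how the work partitions, not how ES implements it, and it is not a benchmark.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(doc_id: str) -> int:
    # Assumed routing rule: hash of the document ID modulo the shard count.
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS


# Each dict plays the role of one shard living on its own node.
shards = [dict() for _ in range(NUM_SHARDS)]
for i in range(50_000):
    doc_id = f"doc-{i}"
    shards[shard_for(doc_id)][doc_id] = {"title": doc_id, "tags": ["old"]}


def retag_shard(shard: dict) -> int:
    # Each shard rewrites only its own subset of the documents.
    for doc_id, doc in list(shard.items()):
        shard[doc_id] = {**doc, "tags": ["new"]}
    return len(shard)


# One worker per shard, so the subsets are processed side by side.
with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
    updated_per_shard = list(pool.map(retag_shard, shards))

total_updated = sum(updated_per_shard)
print(total_updated)  # 50000
```

Each worker still pays the full delete-and-add cost per document; the gain comes only from every shard handling a fraction of the 50K documents at the same time.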

-shay.banon

On Thu, Jun 10, 2010 at 11:24 PM, Otis otis.gospodnetic@gmail.com wrote:

Could you please expand on this a bit? Since I didn't mention
distributed search, I wonder what you are referring to.

Are you saying that IF I were to involve multiple shards (and thus
multiple nodes/servers), batch doc updates would be faster because the
docs would be spread over multiple nodes, so updates of subsets of docs
would happen in parallel and the overall time needed to update a large
batch would be shorter?

