Re-Index Strategies


(Andrew Harvey-2) #1

Hi All,

I'm wondering what re-index strategies you guys are using with elasticsearch.

I primarily interact with ES through the REST API, and my Java-fu is not exactly strong, but I'm not keen on the overhead of pulling all my docs out via HTTP and putting them back in via HTTP. Is there a better way, or should I just bite the bullet and do a scrolled search to re-index all my documents?

Why am I re-indexing you might ask? I need to add stopwords to my analyser, and I'm fairly sure that's going to require a re-index, but I'm willing to be proven wrong.

Any help would be most appreciated.

Andrew
Andrew Harvey / Developer
lexer
m/
t/ +61 2 9019 6379
w/ http://lexer.com.au
Help put an end to whaling. Visit http://www.givewhalesavoice.com.au/

Please consider the environment before printing this email
This email transmission is confidential and intended solely for the person or organisation to whom it is addressed. If you are not the intended recipient, you must not copy, distribute or disseminate the information, or take any action in relation to it and please delete this e-mail. Any views expressed in this message are those of the individual sender, except where the send specifically states them to be the views of any organisation or employer. If you have received this message in error, do not open any attachment but please notify the sender (above). This message has been checked for all known viruses powered by McAfee.

For further information visit http://www.mcafee.com/us/threat_center/default.asp
Please rely on your own virus check as no responsibility is taken by the sender for any damage rising out of any virus infection this communication may contain.

This message has been scanned for malware by Websense. www.websense.com


(Shay Banon) #2

A simple way to improve indexing performance without writing any Java is to
use the memcached protocol, which has been shown to be faster. But I would
measure first before concluding that HTTP is too slow. It might be good
enough, especially if you fork the indexing into several processes / threads
(for example, by executing a search that segments the data using a different
filter for each process / thread).
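The scroll-and-reindex loop over HTTP could look roughly like the sketch below. It is only an illustration under stated assumptions: the endpoint paths (`_search?scroll=1m`, `_search/scroll`, `_bulk`) follow the REST API of later Elasticsearch releases and may not match the version discussed in this thread, and `send` is a hypothetical transport function you would implement with your HTTP client of choice.

```python
import json
from typing import Callable, Iterator

# Hypothetical transport: send(method, path, body) -> parsed JSON response.
# It is injected so the loop can be exercised without a running cluster.
Send = Callable[[str, str, object], dict]


def scroll_hits(send: Send, index: str, query: dict,
                page_size: int = 500) -> Iterator[dict]:
    """Yield every hit matching `query`, one scroll page at a time."""
    body = {"size": page_size, "query": query}
    page = send("POST", f"/{index}/_search?scroll=1m", body)
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        page = send("POST", "/_search/scroll",
                    {"scroll": "1m", "scroll_id": page["_scroll_id"]})


def bulk_body(hits: list, dest_index: str) -> str:
    """Build a newline-delimited bulk payload that re-indexes `hits`."""
    lines = []
    for hit in hits:
        action = {"index": {"_index": dest_index, "_id": hit["_id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(hit["_source"]))
    return "\n".join(lines) + "\n"


def reindex(send: Send, source: str, dest: str, query: dict,
            batch_size: int = 500) -> int:
    """Copy documents matching `query` from `source` to `dest`.

    Returns the number of documents sent to the bulk endpoint.
    """
    batch, total = [], 0
    for hit in scroll_hits(send, source, query, batch_size):
        batch.append(hit)
        if len(batch) >= batch_size:
            send("POST", "/_bulk", bulk_body(batch, dest))
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        send("POST", "/_bulk", bulk_body(batch, dest))
        total += len(batch)
    return total
```

To split the work as suggested above, each process or thread would call `reindex` with its own `query`, for example a range filter over a date or numeric field, so that the slices do not overlap.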

Regarding adding stopwords without reindexing: if you change them, they will
only be applied to future indexing and search requests (which might be
enough for you, especially on the search side).
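To see why a stopword change only affects future indexing and search requests, here is a toy inverted index in Python. It is not how Elasticsearch or Lucene is implemented; the `ToyIndex` class and its methods are made up purely to illustrate that terms written before the change stay in the index until the documents are indexed again.

```python
from collections import defaultdict


class ToyIndex:
    """A deliberately tiny inverted index with a mutable stopword set."""

    def __init__(self, stopwords=()):
        self.stopwords = set(stopwords)
        self.postings = defaultdict(set)  # term -> set of document ids

    def analyze(self, text):
        # Lowercase, split on whitespace, drop the *current* stopwords.
        return [t for t in text.lower().split() if t not in self.stopwords]

    def index(self, doc_id, text):
        for term in self.analyze(text):
            self.postings[term].add(doc_id)

    def search(self, text):
        # AND query over the analyzed terms.
        terms = self.analyze(text)
        if not terms:
            return set()
        return set.intersection(*(self.postings.get(t, set()) for t in terms))

    def terms(self):
        # All terms currently stored, which is what a terms facet would see.
        return sorted(t for t, ids in self.postings.items() if ids)


idx = ToyIndex()
idx.index(1, "the whale")      # indexed before the stopword change
idx.stopwords.add("the")       # analyzer settings change
idx.index(2, "the harpoon")    # indexed after the change

print(idx.terms())             # "the" is still stored for document 1
print(idx.search("the whale")) # the query side already ignores "the"
```

The stale term only disappears from the stored terms once document 1 is indexed again with the new analyzer, which is why anything that reads the stored terms directly still needs a re-index.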

-shay.banon

In reply to Andrew Harvey's message of Wed, Jul 7, 2010 at 5:16 AM.


(Andrew Harvey-2) #3

Right. I'll just go down the HTTP route. It shouldn't be too slow.

The reason I need to change the stop words is for term facets. I'd wager they will require a re-index.

Andrew

In reply to Shay Banon's message of 07/07/2010, 5:12 PM.


(Shay Banon) #4

Yes, in 0.9 that will require a reindex, but you raise a very good point. I
think a nice feature would be to define an exclude term set as an optional
setting for terms facets (it could be set as a setting, or per request, or
both). This means that the stop words can remain independent of the terms
facets behavior. What do you think? If it makes sense, can you open an issue
for this?
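The idea of an exclude set for terms facets can be illustrated with a few lines of Python. This is a conceptual sketch only, not the Elasticsearch implementation: `terms_facet` and its parameters are hypothetical names, and documents are modelled as plain dicts holding a list of already-analyzed terms.

```python
from collections import Counter


def terms_facet(docs, field, size=10, exclude=()):
    """Count terms in `field` across `docs`, skipping excluded terms.

    The stored terms are untouched; the exclusion happens at request
    time, so no re-index is needed to change it.
    """
    excluded = set(exclude)
    counts = Counter(
        term
        for doc in docs
        for term in doc.get(field, [])
        if term not in excluded
    )
    return counts.most_common(size)


docs = [
    {"tags": ["the", "whale"]},
    {"tags": ["the", "ocean", "whale"]},
]
print(terms_facet(docs, "tags"))                   # includes "the"
print(terms_facet(docs, "tags", exclude=["the"]))  # "the" filtered out
```

Because the filter is applied when the facet is computed, the analyzer's stopword list and the facet's exclude list can be changed independently of each other.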

-shay.banon

In reply to Andrew Harvey's message of Wed, Jul 7, 2010 at 10:15 AM.


(Shay Banon) #5

OK, I opened http://github.com/elasticsearch/elasticsearch/issues/issue/246
for this. The per-request part (terms to exclude) is simple to implement and
should be in master later today.

-shay.banon

In reply to Shay Banon's message of Wed, Jul 7, 2010 at 2:24 PM.

