Making documents searchable only on explicit request


(Hillel Taub-Tabib) #1

Hi all,

I'm doing bulk indexing, and I need my documents to become searchable
only on explicit request. I was hoping to achieve this by disabling
auto-refresh. I changed "index.refresh_interval" to -1, and indeed,
when indexing small batches of documents, the documents do not become
searchable. However, when I'm indexing large batches (~40,000 docs),
the index appears to refresh itself several times during the
process (I assume due to memory pressure).

Is there a way to bulletproof the process and make sure that
documents do not become searchable until explicitly requested?
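
For reference, the setup described above comes down to two requests: a
settings update that turns periodic refresh off, and an explicit refresh
when the batch should become visible. A minimal sketch, assuming a
placeholder index name `docs_v1`; it only builds the request pieces and
sends nothing:

```python
import json


def refresh_interval_body(interval):
    """JSON body for PUT /<index>/_settings.

    An interval of "-1" disables the periodic refresh.
    """
    return json.dumps({"index": {"refresh_interval": interval}})


def refresh_path(index):
    """Path of the explicit refresh call, to be sent as a POST."""
    return f"/{index}/_refresh"


if __name__ == "__main__":
    # "docs_v1" is a hypothetical index name used for illustration.
    print("PUT /docs_v1/_settings", refresh_interval_body("-1"))
    print("POST", refresh_path("docs_v1"))
```

As the rest of the thread shows, this alone is not enough for large
batches, because other internal operations can also trigger a refresh.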


(Igor Motov) #2

It might be more convenient to do it at the application level. You can add
a numeric field called "availability" and create a filtered alias for the
index. The filter in the alias would exclude all records whose
availability value is higher than a certain threshold. Each new batch should
be indexed with an availability value higher than the current threshold, so
the records from the new batch would not appear in search results run
through the alias. When you want to make a new batch of records available,
you just recreate the alias with a new filter that has a higher threshold.
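
The suggestion above can be sketched roughly as follows. This is only an
illustration under assumptions: the index name `docs_v1`, the alias name
`docs_search`, and the helper names are made up, and the range filter
syntax (`lte`) may differ between Elasticsearch versions. The code builds
the body for the `_aliases` endpoint and simulates the filter locally; it
does not talk to a cluster.

```python
import json

FIELD = "availability"  # field name taken from the suggestion above


def alias_update_body(index, alias, threshold):
    """Body for POST /_aliases that points `alias` at `index` with a
    filter passing only documents whose availability is <= threshold."""
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": alias,
                    "filter": {"range": {FIELD: {"lte": threshold}}},
                }
            }
        ]
    }


def tag_batch(docs, batch_no):
    """Return copies of `docs` stamped with the batch's availability value."""
    return [dict(doc, **{FIELD: batch_no}) for doc in docs]


def visible(docs, threshold):
    """Local stand-in for what the alias filter would let through."""
    return [doc for doc in docs if doc[FIELD] <= threshold]


if __name__ == "__main__":
    indexed = tag_batch([{"title": "old"}], 1)
    indexed += tag_batch([{"title": "new"}], 2)  # above current threshold 1
    print([doc["title"] for doc in visible(indexed, 1)])
    # Publishing batch 2 means recreating the alias with threshold 2.
    print(json.dumps(alias_update_body("docs_v1", "docs_search", 2)))
```

Searches have to go through the alias rather than the index itself for
the filter to apply. Re-adding an alias under the same name is assumed
here to replace the old filter; if your version behaves differently, put
a "remove" action before the "add" in the same request.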

(In reply to Hillel Taub-Tabib's post of Wednesday, April 11, 2012, 9:28:25 AM UTC-4.)


(Shay Banon) #3

Let me just explain why it happens: periodically, a flush happens in
elasticsearch, and a flush also involves refreshing the index. You could
potentially increase the translog flush thresholds, but they should not be
set very high; see
http://www.elasticsearch.org/guide/reference/index-modules/translog.html
The settings can be updated on an open index using the indices update
settings API.
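
As a rough sketch of what such an update could look like: the code below
builds (but does not send) a PUT request to the update settings endpoint.
The setting names `index.translog.flush_threshold_ops` and
`index.translog.flush_threshold_size` are the ones I recall from the
translog page of that era, and later releases may have dropped or renamed
some of them, so treat them as assumptions and check the docs for your
version. The host, index name, and values are placeholders.

```python
import json
import urllib.request


def translog_settings_request(base_url, index, ops=None, size=None):
    """Build a PUT /<index>/_settings request raising the flush thresholds.

    `ops` is a number of operations, `size` a size string such as "500mb".
    Only the thresholds that are passed in are included in the body.
    """
    index_settings = {}
    if ops is not None:
        index_settings["translog.flush_threshold_ops"] = ops
    if size is not None:
        index_settings["translog.flush_threshold_size"] = size
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=json.dumps({"index": index_settings}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = translog_settings_request(
        "http://localhost:9200", "docs_v1", ops=50000, size="500mb"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually apply it: urllib.request.urlopen(req)
```

Higher thresholds only make flush-triggered refreshes less frequent
during a bulk load; they do not guarantee that none will happen.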

(In reply to Igor Motov's post of Wed, Apr 11, 2012 at 6:49 PM.)


(Hillel Taub-Tabib) #4

I implemented something similar to Igor's suggestion and it seems to
be working well.

Igor, Shay, thanks for your help.

(In reply to Shay Banon's post of Apr 11, 9:48 pm.)

