Decoupling Data and indexing

asthanaamish · November 10, 2014, 9:13pm

Hi,
Is there a way to decouple data from its associated mapping/indexing
within Elasticsearch itself?
Basically, store the raw data as source (JSON or some other format) so that
various mappings/indexes can be applied on top of it.
I understand that one can use an outside database or file system, but can
this be achieved natively in ES?

Basically, we are trying to see how our ES instance will behave when we have
to change the mapping of existing and continuously incoming data without any
downtime for the end user.
We have an added wrinkle: our indexing has to be edit-aware for
versioning purposes, unlike ES, where each edit is a new record.
Regards and thanks,
amish

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0bb1f5ef-3991-4568-9891-018baf79ebae%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

jprante · November 11, 2014, 9:21am

I know that the FAST search engine, ten years ago, had a two-phase
commit for distributed search and indexing. One server could listen on the
API and keep the (compressed) input stored, and all the other indexing
servers were supplied with this input in a second phase to create binary
indexes, either automatically or by a manual operation called the
"suspend/resume indexing API".

The advantage was that data could be received continuously via the API while
FAST indexing could be stopped temporarily in order to balance
indexing and search performance on limited hardware.

Are you thinking of something like that for Elasticsearch? This
architecture could be implemented as a plugin.

Jörg

asthanaamish · November 11, 2014, 6:02pm

I am not familiar with FAST, but the idea looks promising.
However, it might not be that easy to do with just a plugin for ES, as the
data itself is distributed across different machines.
It would not be possible to keep the data on just one server, as that
would become a single point of failure.
Regards and thanks,
amish

asthanaamish · November 11, 2014, 6:46pm

With existing Elasticsearch, I can think of an architecture like this.

Index: indexForDataDump, with no mapping (is that possible?) or a minimal
mapping. It is used only to dump data from the external system. There is
some primary key.

There are different search indexes with different mappings: search-index1,
search-index2, etc.
These indexes get populated from indexForDataDump using the technique
described here:
http://www.elasticsearch.org/blog/changing-mapping-with-zero-downtime/
This way I can drop a search index as desired and create a new one with
a new mapping.
Are there any pros/cons or issues with this approach? There will be data
duplication, but I am hoping it is minimal. (Is there any way to quantify it?)

regards and thanks
amish
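
The alias technique from the linked blog post can be sketched as follows. This is a minimal sketch, assuming clients query an alias (the hypothetical name "search") rather than a concrete index name; it only builds the JSON body for the `_aliases` endpoint and does not contact a cluster. Repopulating the new index from indexForDataDump would still have to be done by a client before the swap.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build the body for POST /_aliases that moves `alias` in a single request."""
    return {
        "actions": [
            # Both actions travel in one request, so searches never see a gap.
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: "search" currently points at search-index1, and
# search-index2 has just been built with the new mapping.
body = alias_swap_actions("search", "search-index1", "search-index2")
print(json.dumps(body, indent=2))
```

Once the alias points at the new index, the old search index can be dropped without the end user noticing.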

jprante · November 11, 2014, 8:07pm

FAST stored the source data on distributed machines; only the control API
was not distributed (similar to ES HTTP curl requests, which also connect
to one host only).

Of course, you could index raw JSON into a preparer index with a single
field, with _all disabled and the field set to "not indexed", so that there
is no Lucene indexing activity on it. This preparer index could also hold
the mappings, in special documents, for the indexing runs.

The data duplication factor depends on the complexity of the mapping(s)
and the characteristics of the data (dictionary size, analyzer/tokenizer
output, norms, etc.).

A plugin would do no magic at all. It could bundle the calls that a client
would otherwise have to execute remotely, and add some convenience
commands for managing the prepare stage (e.g. suspend/resume) and showing
the current state of indexing.

If redundant data is a no-go, then the whole approach is counterintuitive.

Jörg
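
The preparer index described above might look roughly like this. It is a sketch only, assuming the mapping syntax of the Elasticsearch 1.x releases current at the time of this thread (later releases changed the mapping syntax, so check the documentation for your version). The type name "raw" and the field name "payload" are hypothetical.

```python
import json


def preparer_index_body(type_name="raw", field_name="payload"):
    """Index-creation body: one string field, not indexed, with _all disabled."""
    return {
        "mappings": {
            type_name: {
                "_all": {"enabled": False},
                "properties": {
                    # "index": "no" keeps the value out of the inverted index;
                    # it remains retrievable from _source.
                    field_name: {"type": "string", "index": "no"}
                },
            }
        }
    }


def wrap_raw_document(raw_json_text, field_name="payload"):
    """Wrap the original JSON text, unparsed, into the single preparer field."""
    json.loads(raw_json_text)  # fail early if the input is not valid JSON
    return {field_name: raw_json_text}


body = preparer_index_body()
doc = wrap_raw_document('{"id": 1}')
print(json.dumps(body, indent=2))
print(json.dumps(doc))
```

Keeping the payload as an opaque string means the preparer index never has to understand the document structure, so mapping changes only affect the search indexes built from it.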

asthanaamish · November 11, 2014, 10:40pm

Thanks Jörg, that makes sense.
A few minor questions:
a) With the current ES architecture, is this the best/recommended way?
b) Is there any project on the roadmap to provide more support for it?

regards and thanks
amish

jprante · November 12, 2014, 8:22am

There is currently no method to redirect indexing to a preparer index for
delayed indexing while searching is still enabled.

When using rivers, you can close the _river index; some rivers (not all) may
take this as a signal to stop indexing until the _river index is
reopened. I consider this a workaround and not a feature.

From my understanding, the preferred method to implement delayed
indexing currently is to set up a durable message queue (such as RabbitMQ and
Logstash) for external document persistence. By stopping/starting and
reconfiguring the message queue, the data can be indexed wherever you like.

If you would like to see delayed indexing as a core feature in ES and not
as a plugin, then you should open an issue with the suggestion. To be honest,
I assume it will be rejected in favor of a queue in front of ES, as
described in this blog post:

http://dopey.io/logstash-rabbitmq-tuning.html

Jörg
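
To illustrate the suspend/resume idea behind a queue in front of ES, here is a minimal sketch. It assumes an in-memory deque as a stand-in for a durable queue such as RabbitMQ, and a hypothetical `index_fn` callback in place of a real indexing call; it is not durable and is not how Logstash or RabbitMQ work internally.

```python
from collections import deque


class DelayedIndexer:
    """Accepts documents at all times; forwards them only while not suspended."""

    def __init__(self, index_fn):
        self._index_fn = index_fn   # hypothetical callback that would send a doc to ES
        self._pending = deque()     # in-memory stand-in for a durable queue
        self._suspended = False

    def submit(self, doc):
        self._pending.append(doc)
        if not self._suspended:
            self._drain()

    def suspend(self):
        self._suspended = True

    def resume(self):
        self._suspended = False
        self._drain()

    def pending_count(self):
        return len(self._pending)

    def _drain(self):
        while self._pending:
            # Remove a document only after the callback returned without error.
            self._index_fn(self._pending[0])
            self._pending.popleft()


indexed = []
indexer = DelayedIndexer(indexed.append)
indexer.submit({"id": 1})        # forwarded immediately
indexer.suspend()
indexer.submit({"id": 2})        # buffered while suspended
indexer.submit({"id": 3})
held = indexer.pending_count()   # documents waiting for resume()
indexer.resume()                 # buffered documents are forwarded in order
```

Because intake and indexing are separated, the target index (and its mapping) can be changed while indexing is suspended, and the buffered documents are then sent to the new target.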

asthanaamish · November 12, 2014, 6:33pm

Thanks, Jörg.
