Document ordering in index

I'm researching the use of Elasticsearch as a key-value store for binary
blobs. The nature of the data is such that documents with adjacent keys
are more likely to be accessed together. Many documents could fit into a
single page of memory. I would therefore like to be able to control the
document ordering at the index level to make the best use of the system IO
cache. Does this make sense? Does anybody have any experience with this?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/06de4fd7-42ef-4870-9614-fca8423788c5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

The closest thing that ES has would probably be routing -
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html#index-routing
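
To make the routing idea concrete, here is a minimal sketch, assuming numeric keys and a hypothetical bucket size of 1000. Elasticsearch picks a shard from a hash of the routing value modulo the number of primary shards. The hash below is only a stand-in, not the function ES actually uses, and all class and method names are made up for illustration.

```java
/** Sketch only: derive one routing value per run of adjacent keys. All names are hypothetical. */
final class RoutingSketch {
    // Assumption: how many adjacent keys should share a routing value.
    static final long BUCKET_SIZE = 1000;

    /** Routing value shared by every key that falls in the same bucket. */
    static String routingFor(long key) {
        return "bucket-" + Math.floorDiv(key, BUCKET_SIZE);
    }

    /** Stand-in for shard selection: hash(routing) mod shard count. Not the real ES hash. */
    static int shardFor(String routing, int numShards) {
        return Math.floorMod(routing.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 5; // assumed number of primary shards
        for (long key : new long[] {41_998L, 41_999L, 42_000L, 42_001L}) {
            String routing = routingFor(key);
            System.out.println(key + " -> " + routing + " -> shard " + shardFor(routing, numShards));
        }
    }
}
```

Keys in the same bucket always land on the same shard. Note that this only controls which shard a document goes to, not the order of documents inside that shard.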

Routing would definitely be a part of the solution. But I'm wondering if
any simple tweak could be done at the Lucene level.

Well, a shard is a Lucene index, so the only way to ensure the same shards
are on the same FS is to either use routing or to have a single node.

At least, that's my understanding.

As Elasticsearch requires indexed documents to be in JSON format, you will
need to base64 encode any binary blobs in order to store them. This will
increase the size on disk significantly and have an impact on performance.
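
As a rough illustration of that overhead, here is a small sketch using the JDK's `java.util.Base64`. The class and method names are hypothetical. It only measures the length of the encoded text, not what Elasticsearch actually writes to disk after compression.

```java
import java.util.Base64;

/** Sketch only: how much longer a blob gets once it is Base64 text. Names are hypothetical. */
final class Base64OverheadSketch {
    /** Number of Base64 characters needed to carry a blob of the given size. */
    static int encodedLength(int blobSize) {
        byte[] blob = new byte[blobSize]; // contents do not affect the encoded length
        return Base64.getEncoder().encodeToString(blob).length();
    }

    public static void main(String[] args) {
        for (int size : new int[] {3, 1000, 3000}) {
            int encoded = encodedLength(size);
            System.out.printf("%d bytes -> %d Base64 characters (x%.2f)%n",
                    size, encoded, (double) encoded / size);
        }
    }
}
```

Base64 emits four characters for every three input bytes, so the encoded form is about a third larger before any JSON quoting is added.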

Unless you plan to utilise the search features in Elasticsearch at a later
stage, you may well be better served by a pure key/value store.

Best regards,

Christian

I am actually investigating other key-value stores. However, I am also using Elasticsearch for other purposes. I noticed Lucene has the SortingMergePolicy. Are there plans to support it, or is there already a way to use it in Elasticsearch?

It is possible to write a plugin which implements SortingMergePolicy in ES.

Jörg

Thanks, Jörg. I'll look into that. Has a similar plugin been written that
could be used for reference?

I went ahead and gave it a shot, though I haven't run it yet. If anyone
wants to, take a look at the gist and let me know your thoughts:
https://gist.github.com/ebradshaw/d29c80a9b843a5d1e77a

I don't love using reflection directly to instantiate the delegated merge
policy provider. I'm not sure how to bind multiple providers at
configuration time though. There's probably a cleaner way of handling this.

Right now this is just thrown together to support field sorting.

Now if that works to sort on merge, is there a way to ensure that all
segments are sorted, even those that have been flushed without merging?

So it looks like the SortingMergePolicyProvider instantiated the necessary
MergePolicies as planned, but it is not behaving as expected. On an optimize
call on a test index, which forces a merge, the overridden functions in
SortingOneMerge are not called. These functions, like getMergeReaders(),
instruct ES to retrieve a sorted view of the existing segments prior to
merging.

After some digging, it appears that this is caused by the
ElasticsearchMergePolicy.IndexUpgraderMergeSpecification. It looks like
this was designed with the understanding that ES merge policies would only
be used for deciding which segments to merge. The
IndexUpgraderMergeSpecification strips out any custom merge logic with the
following:

    @Override
    public void add(OneMerge merge) {
        super.add(new IndexUpgraderOneMerge(merge.segments));
    }

I doubt that this was the original intent, with SortingMergePolicy only
being a recent addition to the Lucene codebase. I've created an issue
request: https://github.com/elasticsearch/elasticsearch/issues/9731
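
For anyone who wants to see the effect in isolation, here is a minimal, self-contained sketch of the pattern. The classes below are simplified stand-ins that I made up for illustration. They are not the real Lucene or Elasticsearch classes, and `describeReaders()` is a hypothetical placeholder for overridden behaviour such as getMergeReaders().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: hypothetical stand-ins, not the real Lucene/ES classes. */
final class MergeWrapSketch {
    /** A merge: the segments to merge plus behaviour a subclass may override. */
    static class OneMerge {
        final List<String> segments;
        OneMerge(List<String> segments) { this.segments = segments; }
        String describeReaders() { return "unsorted view of " + segments; }
    }

    /** A merge that overrides the behaviour to read its segments in sorted order. */
    static class SortingOneMerge extends OneMerge {
        SortingOneMerge(List<String> segments) { super(segments); }
        @Override String describeReaders() { return "sorted view of " + segments; }
    }

    /** The wrapper's own merge type; it knows nothing about sorting. */
    static class UpgraderOneMerge extends OneMerge {
        UpgraderOneMerge(List<String> segments) { super(segments); }
    }

    static class MergeSpecification {
        final List<OneMerge> merges = new ArrayList<>();
        void add(OneMerge merge) { merges.add(merge); }
    }

    /** Mirrors the quoted add(): only the segment list is copied, so the subclass is dropped. */
    static class UpgraderMergeSpecification extends MergeSpecification {
        @Override void add(OneMerge merge) { super.add(new UpgraderOneMerge(merge.segments)); }
    }

    /** Adds a sorting merge to the given specification and reports what would actually run. */
    static String readersAfterAdd(MergeSpecification spec) {
        spec.add(new SortingOneMerge(Arrays.asList("_0", "_1")));
        return spec.merges.get(0).describeReaders();
    }

    public static void main(String[] args) {
        System.out.println(readersAfterAdd(new MergeSpecification()));
        System.out.println(readersAfterAdd(new UpgraderMergeSpecification()));
    }
}
```

With the plain specification the sorting override is used. With the wrapping one, the merge is rebuilt from its segment list alone, so the override never runs, which matches what I am seeing.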

The merge policy code seems to have many opportunities for improvement. ES
comes with some pre-coded merge strategies (tiered, log byte-size, log doc)
but only one (tiered) is used at a time in a node. I don't know the reason
why an index cannot have its own merge policy at creation time. This could
simplify the development of custom implementations, e.g. blob indices or
key-value stores.

Jörg
