Using Elasticsearch as a fast lookup table

I have following use case:

I store scientific data in HDF5 files (approx. 6 million entries per file).
There are two important fields (position and score) and all the 6 million
entries are stored sorted by score in descending order. However I usually
only display the top 0.1% or 6000 entries respectively.

Each of these 6 million entries has a specific annotation. Now when i want
to display the top 6000 entries I also want to add the corresponding
annotation to it.
I was thinking about using a key/value storage (i.e. MongoDB) to store the
annotation for the 6 million entries once using the position field as a key
and the annotation data as a value.
When I want to retrieve the 6000/0.1% top entries I would do a lookup for
these entries in MongoDB , retrieve and add the annotation.

However I already have a elasticsearch instance running (for something
different) and I am considering using elasticsearch instead of setting up a
new/dedicated MongoDB db.
The benefit would be that I could have fulltext search features on the
annotation data although this is not the primary use case.

  • How is the performance of elasticsearch compared to MongoDB regarding the
    lookup of 6000 entries ?
    I guess in elasticsearch I would use the position field (that is used as a
    key in MongoDB) as the _id field.

  • Can I do a bulk REST request with those 6000 position fields?

The annotation data is relatively static. It won't be changed. At the most
new entries will be added.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Ümit

Each of these 6 million entries has a specific annotation. Now when i
want to display the top 6000 entries I also want to add the
corresponding annotation to it.
I was thinking about using a key/value storage (i.e. MongoDB) to store
the annotation for the 6 million entries once using the position field
as a key and the annotation data as a value.
When I want to retrieve the 6000/0.1% top entries I would do a lookup
for these entries in MongoDB , retrieve and add the annotation.

Some examples of actual data and what you want to achieve with it will
make this question a lot easier to understand. I find that an abstract
description is often quite confusing, and becomes a lot simpler when
real data is presented.

  • How is the performance of elasticsearch compared to MongoDB
    regarding the lookup of 6000 entries ?

Very good.

One potential issue is if you need to sort the data. Sorting 6000
across eg 5 shards means return 5 x 6000 records to the node handling
the request, to be resorted.

This is slower than doing it on a single node, but for 6000 records
won't be so bad. For 600,000, it'd be tricky :slight_smile:

I guess in elasticsearch I would use the position field (that is used
as a key in MongoDB) as the _id field.

  • Can I do a bulk REST request with those 6000 position fields?

Again, real data would help us to understand.

clint

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Clint,
Sorry I should have provided an example.

The data in the scientific hdf5 files look like the following:

position | score
54023 | 10.2
123410 | 9.5
230 | 7.4
12304 | 4.4
.....

6M entries like above.
position is a long value between 1 and 30M and score is between 0 and 10.
The data is sorted by score so that I can easily slice out the top 6000
entries.

Now I have multiple of these files however each unique position has the
same annotation. So in order not to save redundantly in each of these hdf5
files together with the position and score I want to index it once in
elasticsearch.
The annotation looks something like this:

annotation {
"properties": {
"position":{"type":"integer"},
"name":{"type":"string"},
"description": {"type":"string},
......
}
}

My idea was to use the position as _id.
Basically I see 2 options:

  1. Use multi-get and pass array of 6000 positions as ids to retrieve the
    corresponding documents
  2. Use match_all and filter ids to retrieve the corresponding documents.

The advantage of option 1 is that the order of the returned documents
corresponds to the order of the ids in the passed array, so I don't have to
sort them.
Option 2 doesn't maintain the sorting.

What is the performance difference between both approaches ?
I guess option 1 doesn't support caching, whereas option 2 does.
However I think that the chance that I have multiple requests with the
exact same 6000 ids/positions is quite slim so cache hit would be low.

On Wed, Mar 27, 2013 at 8:59 PM, Clinton Gormley clint@traveljury.comwrote:

Hi Ümit

Each of these 6 million entries has a specific annotation. Now when i
want to display the top 6000 entries I also want to add the
corresponding annotation to it.
I was thinking about using a key/value storage (i.e. MongoDB) to store
the annotation for the 6 million entries once using the position field
as a key and the annotation data as a value.
When I want to retrieve the 6000/0.1% top entries I would do a lookup
for these entries in MongoDB , retrieve and add the annotation.

Some examples of actual data and what you want to achieve with it will
make this question a lot easier to understand. I find that an abstract
description is often quite confusing, and becomes a lot simpler when
real data is presented.

  • How is the performance of elasticsearch compared to MongoDB
    regarding the lookup of 6000 entries ?

Very good.

One potential issue is if you need to sort the data. Sorting 6000
across eg 5 shards means return 5 x 6000 records to the node handling
the request, to be resorted.

This is slower than doing it on a single node, but for 6000 records
won't be so bad. For 600,000, it'd be tricky :slight_smile:

I guess in elasticsearch I would use the position field (that is used
as a key in MongoDB) as the _id field.

  • Can I do a bulk REST request with those 6000 position fields?

Again, real data would help us to understand.

clint

--
You received this message because you are subscribed to a topic in the
Google Groups "elasticsearch" group.
To unsubscribe from this topic, visit
https://groups.google.com/d/topic/elasticsearch/HqhazTyvZIA/unsubscribe?hl=en-US
.
To unsubscribe from this group and all its topics, send an email to
elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hiya Ümit

The data in the scientific hdf5 files look like the following:

position | score
54023 | 10.2
123410 | 9.5
230 | 7.4
12304 | 4.4
.....

6M entries like above.
position is a long value between 1 and 30M and score is between 0 and
10.
The data is sorted by score so that I can easily slice out the top
6000 entries.

Now I have multiple of these files however each unique position has
the same annotation. So in order not to save redundantly in each of
these hdf5 files together with the position and score I want to index
it once in elasticsearch.
The annotation looks something like this:

annotation {
"properties": {
"position":{"type":"integer"},
"name":{"type":"string"},
"description": {"type":"string},
......
}
}

My idea was to use the position as _id.
Basically I see 2 options:

  1. Use multi-get and pass array of 6000 positions as ids to retrieve
    the corresponding documents
  2. Use match_all and filter ids to retrieve the corresponding
    documents.

The advantage of option 1 is that the order of the returned documents
corresponds to the order of the ids in the passed array, so I don't
have to sort them.
Option 2 doesn't maintain the sorting.

Option 1 is probably more efficient than a query. GET knows exactly
where to retrieve each doc. Certainly it would be so for smaller numbers
of GETs. I'm not sure if, as the number of GETs increases, the query may
become faster. You'd have to test that.

The bit I'm missing is where you get the list of IDs from. You're
maintaining this list outside of ES?

One possibility (which may or may not be useful to you) is to store the
list itself in ES as well, possibly in two versions: short (top 6000
ids), and long (full list of ids).

Then you could use the "terms lookup mechanism" described on
http://www.elasticsearch.org/guide/reference/query-dsl/terms-filter/
to search for the matching docs, and sort by score descending

The IDs list from these short/long docs would be cached, so if these ID
lists change, then you would need to clear the cached version manually.

clint

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Clint,
Thanks for feedback.
I will go ahead and index the data and benchmark it and see what the
difference is.

Yes the ids are maintained outside of ES in HDF5 files.
Basically those 6 million position/score pairs are results of a single
analysis. Potentially there could be thousands of these analysis/results
and thus thousands of 6M position/score values. Each analysis/result is
stored in a separate HDF5 file (http://www.hdfgroup.org/HDF5/).

The reason why we store them in HDF5 files is that these files are self
contained, self-described, portable and usually allow for implicit sharding
when fetching the data. What I mean with sharding is that typically I am
only interested in the top 6000 of one analysis at a time (for
visualizing).
To have the same kind of sharding I would have to store each analysis
result in its separate index or create a separate type right?.
Typically the results with low scores are not interested and thus I didn't
want to pollute the index with those values (if I have thousands of those 6
M entries the index might get huge).

However I have a use case where I have an id and want to retrieve all the
analysis in which the id had a significant score. So I am actually thinking
of taking a combined approach:
Continue to store the 6M entries in HDF5 (basically an archive) and store
the top 6000 entries together with some information about the analysis in
ES.

The annotation would be stored once and then I could use parent/child
mapping to connect them.

On Thu, Mar 28, 2013 at 11:07 AM, Clinton Gormley clint@traveljury.comwrote:

Hiya Ümit

The data in the scientific hdf5 files look like the following:

position | score
54023 | 10.2
123410 | 9.5
230 | 7.4
12304 | 4.4
.....

6M entries like above.
position is a long value between 1 and 30M and score is between 0 and
10.
The data is sorted by score so that I can easily slice out the top
6000 entries.

Now I have multiple of these files however each unique position has
the same annotation. So in order not to save redundantly in each of
these hdf5 files together with the position and score I want to index
it once in elasticsearch.
The annotation looks something like this:

annotation {
"properties": {
"position":{"type":"integer"},
"name":{"type":"string"},
"description": {"type":"string},
......
}
}

My idea was to use the position as _id.
Basically I see 2 options:

  1. Use multi-get and pass array of 6000 positions as ids to retrieve
    the corresponding documents
  2. Use match_all and filter ids to retrieve the corresponding
    documents.

The advantage of option 1 is that the order of the returned documents
corresponds to the order of the ids in the passed array, so I don't
have to sort them.
Option 2 doesn't maintain the sorting.

Option 1 is probably more efficient than a query. GET knows exactly
where to retrieve each doc. Certainly it would be so for smaller numbers
of GETs. I'm not sure if, as the number of GETs increases, the query may
become faster. You'd have to test that.

The bit I'm missing is where you get the list of IDs from. You're
maintaining this list outside of ES?

One possibility (which may or may not be useful to you) is to store the
list itself in ES as well, possibly in two versions: short (top 6000
ids), and long (full list of ids).

Then you could use the "terms lookup mechanism" described on
http://www.elasticsearch.org/guide/reference/query-dsl/terms-filter/
to search for the matching docs, and sort by score descending

The IDs list from these short/long docs would be cached, so if these ID
lists change, then you would need to clear the cached version manually.

clint

--
You received this message because you are subscribed to a topic in the
Google Groups "elasticsearch" group.
To unsubscribe from this topic, visit
https://groups.google.com/d/topic/elasticsearch/HqhazTyvZIA/unsubscribe?hl=en-US
.
To unsubscribe from this group and all its topics, send an email to
elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Ümit

The reason why we store them in HDF5 files is that these files are
self contained, self-described, portable and usually allow for
implicit sharding when fetching the data. What I mean with sharding is
that typically I am only interested in the top 6000 of one analysis at
a time (for visualizing).

To have the same kind of sharding I would have to store each analysis
result in its separate index or create a separate type right?.

No - I was thinking of storing each analysis as a single doc.

Typically the results with low scores are not interested and thus I
didn't want to pollute the index with those values (if I have
thousands of those 6 M entries the index might get huge).

You can limit it to including only the first 6,000 annotations if you
like.

However I have a use case where I have an id and want to retrieve all
the analysis in which the id had a significant score. So I am actually
thinking of taking a combined approach:
Continue to store the 6M entries in HDF5 (basically an archive) and
store the top 6000 entries together with some information about the
analysis in ES.

The annotation would be stored once and then I could use parent/child
mapping to connect them.

Not sure that you even need parent/child for this. I've put together an
example below:

create your index with type 'annotation' and type 'analysis':

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
"mappings" : {
"analysis" : {
"properties" : {
"name" : {
"type" : "string"
},
"annotations" : {
"type" : "integer"
}
}
},
"annotation" : {
"_id" : {
"path" : "position"
},
"properties" : {
"name" : {
"type" : "string"
},
"position" : {
"type" : "integer"
},
"score" : {
"type" : "double"
},
"description" : {
"type" : "string"
}
}
}
}
}
'

Add some annotation data:

curl -XPOST 'http://127.0.0.1:9200/test/annotation?pretty=1' -d '
{
"name" : "foo_1",
"position" : 54023,
"score" : 10.2
}
'
curl -XPOST 'http://127.0.0.1:9200/test/annotation?pretty=1' -d '
{
"name" : "foo_2",
"position" : 123410,
"score" : 9.5
}
'
curl -XPOST 'http://127.0.0.1:9200/test/annotation?pretty=1' -d '
{
"name" : "foo_3",
"position" : 230,
"score" : 7.4
}
'
curl -XPOST 'http://127.0.0.1:9200/test/annotation?pretty=1' -d '
{
"name" : "foo_4",
"position" : 12304,
"score" : 4.4
}
'

And some analysis data:

curl -XPUT 'http://127.0.0.1:9200/test/analysis/1?pretty=1' -d '
{
"name" : "First analysis",
"annotations" : [
54023,
123410,
230,
12304
]
}
'

Return all annotations for analysis '1' in order of score:

curl -XGET 'http://127.0.0.1:9200/test/annotation/_search?pretty=1' -d
'
{
"sort" : {
"score" : "desc"
},
"query" : {
"constant_score" : {
"filter" : {
"terms" : {
"position" : {
"index" : "test",
"path" : "annotations",
"id" : 1,
"type" : "analysis"
}
}
}
}
}
}
'

{

"hits" : {

"hits" : [

{

"_source" : {

"position" : 54023,

"name" : "foo_1",

"score" : 10.2

},

"sort" : [

10.2

],

"_score" : null,

"_index" : "test",

"_id" : "54023",

"_type" : "annotation"

},

{

"_source" : {

"position" : 123410,

"name" : "foo_2",

"score" : 9.5

},

"sort" : [

9.5

],

"_score" : null,

"_index" : "test",

"_id" : "123410",

"_type" : "annotation"

},

{

"_source" : {

"position" : 230,

"name" : "foo_3",

"score" : 7.4

},

"sort" : [

7.4

],

"_score" : null,

"_index" : "test",

"_id" : "230",

"_type" : "annotation"

},

{

"_source" : {

"position" : 12304,

"name" : "foo_4",

"score" : 4.4

},

"sort" : [

4.4

],

"_score" : null,

"_index" : "test",

"_id" : "12304",

"_type" : "annotation"

}

],

"max_score" : null,

"total" : 4

},

"timed_out" : false,

"_shards" : {

"failed" : 0,

"successful" : 5,

"total" : 5

},

"took" : 12

}

Find analyses which contain particular annotations:

curl -XGET 'http://127.0.0.1:9200/test/analysis/_search?pretty=1' -d '
{
"query" : {
"constant_score" : {
"filter" : {
"terms" : {
"annotations" : [
12304,
230
]
}
}
}
}
}
'

{

"hits" : {

"hits" : [

{

"_source" : {

"name" : "First analysis",

"annotations" : [

54023,

123410,

230,

12304

]

},

"_score" : 1,

"_index" : "test",

"_id" : "1",

"_type" : "analysis"

}

],

"max_score" : 1,

"total" : 1

},

"timed_out" : false,

"_shards" : {

"failed" : 0,

"successful" : 5,

"total" : 5

},

"took" : 4

}

Clint

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

one thing to add: if your analysis documents can change, then you will
need to specify a _cache_key on the 'terms' filter, and remove that
filter from the cache after updating your analysis doc.

See http://www.elasticsearch.org/guide/reference/query-dsl/terms-filter/
for an example

clint

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Many thanks for the the detailed example.
However I think it won't entirely solve my use case because I can't have
multiple analysis that contain annotations with different score fields
right?
The 6000 position/score basically measurements/calculations of a
statistical analysis.
Each position has a reference to a annotation. There are in total 6 M
possible annotations.

How about I split up annotation into measurements and
measurements_annoation like this:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
"mappings" : {
"analysis" : {
"properties" : {
"name" : {
"type" : "string"
},
"measurements" : {
"type" : "nested",
"properties": {
"position":{"type":"integer"},"score":{"type":"double"}
}
}
}
},
"measurements_annotation" : {
"_id" : {
"path" : "position"
},
"properties" : {
"name" : {
"type" : "string"
},
"position" : {
"type" : "integer"
},
"description" : {
"type" : "string"
},
...... many other fields
}
}
}
}

On Thu, Mar 28, 2013 at 1:51 PM, Clinton Gormley clint@traveljury.comwrote:

one thing to add: if your analysis documents can change, then you will
need to specify a _cache_key on the 'terms' filter, and remove that
filter from the cache after updating your analysis doc.

See http://www.elasticsearch.org/guide/reference/query-dsl/terms-filter/
for an example

clint

--
You received this message because you are subscribed to a topic in the
Google Groups "elasticsearch" group.
To unsubscribe from this topic, visit
https://groups.google.com/d/topic/elasticsearch/HqhazTyvZIA/unsubscribe?hl=en-US
.
To unsubscribe from this group and all its topics, send an email to
elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.