Elasticsearch as multi-key value cache


(geegalrawat) #1

I am writing a service that will create and manage user records, 100+
million of them. For each new user, the service will generate a unique
user id and write the record to a database. The database is sharded on
that generated user id.

Each user record has several fields. One of the requirements is that the
service be able to check whether a user with a matching field value
exists, so those fields are declared as indexes in the database schema.

However, since the database is sharded on the primary key (the unique user
id), I would need to search all shards to find a user record that matches
a particular column.

To make that lookup fast, I am thinking of setting up an Elasticsearch
cluster. The service would write to the ES cluster every time it creates a
new user record, and ES would index the record on the relevant fields.
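
A minimal Python sketch of the write-and-lookup idea described above. The
field names, the helpers `to_index_doc` and `exact_match_query`, and the
sample record are all hypothetical; the post only says that five columns
are searchable. The query body uses the Elasticsearch `term` query, which
matches exact values and assumes the fields are mapped so that they are not
analyzed. No cluster is contacted here; the code only builds the JSON bodies.

```python
import json

# Hypothetical names for the five searchable columns (an assumption).
INDEXED_FIELDS = ("email", "phone", "username", "device_id", "external_ref")


def to_index_doc(user_record):
    """Keep only the searchable fields; the full record stays in the database."""
    return {f: user_record[f] for f in INDEXED_FIELDS if f in user_record}


def exact_match_query(field, value):
    """Build a search body that asks for one exact match on a single field."""
    if field not in INDEXED_FIELDS:
        raise ValueError("field is not indexed: " + field)
    return {"query": {"term": {field: value}}, "size": 1}


# Hypothetical record; the user id would serve as the document id.
record = {
    "user_id": "u-1001",
    "email": "a@example.com",
    "username": "alice",
    "password_hash": "not-searchable",
}
doc = to_index_doc(record)
print(json.dumps(doc, sort_keys=True))
print(json.dumps(exact_match_query("email", "a@example.com"), sort_keys=True))
```

Only the searchable fields are sent to the search cluster in this sketch, so
the sharded database remains the source of truth for the full record.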

My questions are:

-- What kind of performance can I expect from ES here? Assume I have 100+
million user records where 5 columns of each record need to be indexed. I
know it depends on the hardware configuration as well, but please assume
well-tuned hardware.

-- I am trying to use ES as a memcache alternative, since ES gives me a
multi-key value store. I want the whole dataset to be in memory, and it
does not need to be durable. Is ES the right tool for that?

Any comments or recommendations based on experience with Elasticsearch for
large datasets are very much appreciated.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f0338e27-9fd0-43eb-86a7-ab6ed590a0f0%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Mark Walkom) #2

Performance depends on a lot of variables; you'd be best placed to run up
a small cluster, test against a sample data set, and extrapolate. However,
100 million isn't a massive data set; that's 30% less than one of our
weekly indexes from just one of our data feeds.

ES doesn't store everything in memory but it caches things very
aggressively, and it will also provide durability.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com



(geegalrawat) #3

Thanks, Mark. I am trying to prepare a test setup for benchmarking.
So would you answer my question below with "Yes"?

Here i am trying to use ES as a memcache alternative as ES provides me
multi-key-value store. So i want all dataset to be in memory and does not
need to be durable. Is ES right tool to do that ?



(Mark Walkom) #4

It's not a 100% match based on your wishes, but I think it will be suitable.




(Otis Gospodnetić) #5

Hi,

Multi-key... as in key1, key2, key3 ==> key1-key2-key3? Look, just one
key now.
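
One way to read the suggestion above, as a small Python sketch. The helper
name `composite_key` and the "-" separator check are assumptions for
illustration. Note that a composite key only helps when every part is known
at lookup time; it does not by itself answer a lookup on just one field.

```python
def composite_key(*parts, sep="-"):
    """Join several key parts into one key for a plain key-value store."""
    for part in parts:
        # Reject parts containing the separator so two different
        # combinations cannot collapse into the same joined key.
        if sep in part:
            raise ValueError("key part contains separator: " + part)
    return sep.join(parts)


key = composite_key("key1", "key2", "key3")
print(key)
```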

Why not use Voldemort or Redis?

Otis

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



(Jörg Prante) #6

Do not consider ES a simple memcache replacement. It is not a cache; it is
a distributed search engine. The main difference is that a search engine
uses not a direct index but an inverted index to generate fast query
results, and an analyzer framework to generate the field values in its
dictionary.

For maintaining a user index, you surely want a direct index without an
analyzer framework, and ES is not a direct index. In 1.0.0, ES will provide
"doc values" for loading field values of 100+ million docs from disk very
efficiently, like a direct index, but this is a bit slower than from RAM.
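
A toy Python sketch of the distinction, under one possible reading of
"direct index": a map from the exact, unanalyzed field value to document
ids, as opposed to an inverted index built from analyzed terms. The
`analyze` function is a deliberately simplistic stand-in for an analyzer,
and all names and sample documents here are hypothetical.

```python
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split it on whitespace."""
    return text.lower().split()


def build_direct_index(docs, field):
    """Map the whole, unmodified field value to the ids of matching docs."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        index[doc[field]].add(doc_id)
    return index


def build_inverted_index(docs, field):
    """Map each analyzed term to the ids of the docs that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in analyze(doc[field]):
            index[term].add(doc_id)
    return index


docs = {1: {"name": "Alice Smith"}, 2: {"name": "Bob Smith"}}
direct = build_direct_index(docs, "name")
inverted = build_inverted_index(docs, "name")

print(sorted(direct.get("Alice Smith", set())))  # exact value matches doc 1
print(sorted(inverted.get("smith", set())))      # term matches docs 1 and 2
print(sorted(direct.get("smith", set())))        # no exact value "smith"
```

The exact-value map behaves like a key-value lookup, while the term map
supports matching on parts of a value at the cost of running an analyzer.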

For 100+ million user sessions, you should test Redis Sets
(http://redis.io/topics/data-types#sets) to see if they meet your
requirements.

With "Sharded Jedis"
(https://github.com/xetorthio/jedis/wiki/AdvancedUsage), you have something
like "shards" in your Java client if a single Redis server does not scale
to your requirements.

Just my 2p.

Jörg



(geegalrawat) #7

Thanks a lot for the replies.

Jörg, I want to better understand your statement about direct and inverted
indexes.

The main difference is that search engine uses not a direct index but an
inverted index for generating fast query result and an analyzer framework
for generating field values in a dictionary. For maintaining a user index,
you surely want a direct index without analyzer framework. And ES is not a
direct index.

Can you please elaborate, or point me to some links I can read to
understand it a bit more?

I do see how Redis can do what I want while still being all in memory, and
I like the idea. However, I would like to be able to cluster Redis for
horizontal scalability, and for that you suggested using Jedis. Do you
think the benefits of using Redis here are worth doing our own sharding
and handling nodes going down or being added? One thing I see with ES is
that it manages its own sharding and handles hosts going down and coming
back up. Any thoughts?



(geegalrawat) #8

Hey Jörg,

I have another doubt about using Redis instead of ES. My user record has
one primary key and a few fields (let's say A, B, C, D), and I want the
user record to be indexed by A, B, C and D. If I use Redis and cluster it,
what key will I use to shard the data? In the RDBMS (my durable storage) I
am sharding on the primary key. If I use the same scheme to shard the data
across the Redis cluster, then to find a user record that matches A1 (a
particular value of A) I will have to search each of the Redis shards for
a matching value. Please correct me if I am getting this wrong.
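
A minimal Python sketch of one common alternative to the scheme described
above: shard the lookup entries by the field value instead of by the
primary key, so that a search for A1 touches exactly one shard. This is an
illustration only, not something proposed in the thread. The shard count,
the "field:value" key format, and the function names are assumptions, and
plain dicts stand in for the Redis servers. With Redis, each entry would
roughly correspond to a set filled with SADD and read with SMEMBERS.

```python
import zlib

NUM_SHARDS = 4  # hypothetical shard count

# One dict per shard stands in for one Redis server.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key):
    """Pick a shard from a stable hash of the lookup key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def index_user(user_id, fields):
    """Write one lookup entry per indexed field, routed by field value."""
    for name, value in fields.items():
        key = name + ":" + value
        shards[shard_for(key)].setdefault(key, set()).add(user_id)


def find_users(name, value):
    """Look up user ids for a field value on the single shard that owns it."""
    key = name + ":" + value
    return shards[shard_for(key)].get(key, set())


index_user("u1", {"A": "A1", "B": "B1"})
index_user("u2", {"A": "A2", "B": "B1"})
print(sorted(find_users("A", "A1")))
print(sorted(find_users("B", "B1")))
```

The trade-off in this sketch is on the write side: creating one user now
writes to several shards, and those entries have to be kept in step with
the primary database.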


