I am writing a service that will create and manage user records, 100+ million of them. For each new user, the service will generate a unique user ID and write it to the database. The database is sharded on that generated user ID.
Each user record has several fields. One of the requirements is that the service be able to check whether a user exists with a matching field value, so those fields are declared as indexes in the database schema.
However, since the database is sharded on the primary key (the unique user ID), I will need to search all shards to find a user record that matches a particular column.
To make that lookup fast, one thing I am thinking of doing is setting up an Elasticsearch cluster. The service will write to the ES cluster every time it creates a new user record, and the ES cluster will index the user record on the relevant fields.
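To make the problem concrete, here is a minimal sketch of the setup described above, with plain dicts standing in for the database shards (all names here are illustrative, not from any real client library). Because the shard is chosen from the user ID alone, an equality search on any other field has to visit every shard:

```python
import hashlib

NUM_SHARDS = 8

def shard_for(user_id: str) -> int:
    # The shard is derived only from the primary key (the user id).
    return int(hashlib.sha1(user_id.encode()).hexdigest(), 16) % NUM_SHARDS

# Toy stand-in for the sharded database: one dict per shard.
shards = [dict() for _ in range(NUM_SHARDS)]

def create_user(user_id: str, record: dict) -> None:
    shards[shard_for(user_id)][user_id] = record

def find_by_field(field: str, value) -> list:
    # Because sharding uses only the user id, an equality search on any
    # other field has to scatter to every shard.
    return sorted(uid for s in shards for uid, rec in s.items()
                  if rec.get(field) == value)
```

Mirroring each `create_user` into a search index is exactly what avoids the scatter in `find_by_field`.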
My questions are:
-- What kind of performance can I expect from ES here, assuming I have 100+ million user records where 5 columns of each record need to be indexed? I know it depends on hardware config as well, but please assume well-tuned hardware.
-- Here I am trying to use ES as a memcache alternative, since ES gives me a multi-key-value store. I want the whole dataset to be in memory, and it does not need to be durable. Is ES the right tool for that?
Any comments/recommendations based on experience with Elasticsearch on large datasets are very much appreciated.
Performance depends on a lot of variables; you'd be best placed to run up a small cluster, test with a sample dataset, and extrapolate. However, 100 million isn't a massive dataset: it's about 30% smaller than one of our weekly indexes from just one of our data feeds.
ES doesn't store everything in memory, but it caches things very aggressively, and it will also provide durability.
Thanks Mark. I am preparing a test setup for benchmarking purposes. So would you answer my question below with "Yes"?
Here I am trying to use ES as a memcache alternative, since ES gives me a multi-key-value store. I want the whole dataset to be in memory, and it does not need to be durable. Is ES the right tool for that?
Do not consider ES a simple memcache replacement. It is not a cache; it is a distributed search engine. The main difference is that a search engine uses not a direct index but an inverted index to generate fast query results, plus an analyzer framework for generating field values in a dictionary.
For maintaining a user index, you surely want a direct index without an analyzer framework, and ES is not a direct index. In 1.0.0, ES will provide "doc values" for loading the field values of 100+ million docs from disk very efficiently, like a direct index, but this is a bit slower than reading from RAM.
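The direct-vs-inverted distinction above can be shown in a few lines (toy dicts only, not any real ES data structure): a direct index maps doc id to field values, as in a key-value store, while an inverted index maps each field value back to the docs containing it, which is what makes an existence check a single lookup:

```python
# Direct (forward) index: doc id -> field values, as in a key-value store.
direct = {
    "u1": {"email": "a@example.com", "city": "pune"},
    "u2": {"email": "b@example.com", "city": "pune"},
}

# Inverted index: (field, value) -> set of doc ids, as in a search engine.
inverted = {}
for doc_id, fields in direct.items():
    for field, value in fields.items():
        inverted.setdefault((field, value), set()).add(doc_id)

def user_exists_with(field: str, value) -> bool:
    # The "does a user with this field value exist" check is one lookup.
    return bool(inverted.get((field, value)))
```

Looking up a doc's own fields is fast in `direct`; answering "who has this value" is fast only in `inverted`, which is the trade-off Jorg describes.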
Jorg, I want to understand your statement about direct and inverted indexes better.
The main difference is that a search engine uses not a direct index but an inverted index to generate fast query results, plus an analyzer framework for generating field values in a dictionary. For maintaining a user index, you surely want a direct index without an analyzer framework, and ES is not a direct index.
Can you please elaborate, or point me to some links I can read to understand it a bit more?
I do see how Redis can do the same thing I want while still being all in memory, and I like the idea. However, I would like to be able to cluster Redis for horizontal scalability, and for that you suggested using Jedis. Do you think the benefits of using Redis here are worth doing our own sharding and managing nodes going down or being added, etc.? One thing I see with ES is that it manages its own sharding and hosts going down and coming up. Any thoughts?
With "Sharded Jedis", you get something like shards in your Java client (see the AdvancedUsage page on the redis/jedis GitHub wiki) if a single Redis server does not scale to your requirements.
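Conceptually, ShardedJedis is client-side key hashing: the client hashes each key and routes the command to one of several independent Redis servers. A minimal sketch of that idea (plain dicts stand in for the Redis servers; the class and method names are illustrative, not the Jedis API):

```python
import hashlib

class ShardedKV:
    """Client-side key sharding in the spirit of ShardedJedis: each key is
    hashed to one of N independent stores."""

    def __init__(self, num_shards: int):
        self.stores = [dict() for _ in range(num_shards)]

    def _store_for(self, key: str) -> dict:
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.stores[h % len(self.stores)]

    def set(self, key: str, value) -> None:
        self._store_for(key)[key] = value

    def get(self, key: str):
        # The same hash routes reads to the shard that holds the key.
        return self._store_for(key).get(key)
```

Note the trade-off the thread touches on: with this scheme the client owns the shard map, so adding or removing servers is your problem, whereas ES handles shard allocation and node failures itself.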
Another doubt I have about using Redis instead of ES: my user record has one primary key and a few fields (let's say A, B, C, D), and I want the user record to be indexed by A, B, C, and D. Now, if I use Redis and cluster it, what key will I use to shard the data? In the case of the RDBMS (my durable storage), I am sharding on the primary key. If I use the same scheme to shard the data across the Redis cluster, then to find a user record that matches A1 (a particular value of A), I will have to search each of the Redis shards for a matching value. Please correct me if I am getting this wrong.
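The reasoning in the question is right if the Redis shards are keyed by user ID. One common workaround (a sketch of a standard technique, not something suggested elsewhere in this thread) is to shard a secondary index by the lookup value itself, e.g. by a "field:value" key, so that all user IDs with A = A1 land on one shard and the lookup touches exactly one node:

```python
import hashlib

NUM_SHARDS = 8
# Secondary index shards: each maps a "field:value" key -> set of user ids.
index_shards = [dict() for _ in range(NUM_SHARDS)]

def shard_of(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % NUM_SHARDS

def index_user(user_id: str, fields: dict) -> None:
    # Route each entry by "field:value", so equal values land on one shard.
    for field, value in fields.items():
        key = f"{field}:{value}"
        index_shards[shard_of(key)].setdefault(key, set()).add(user_id)

def lookup(field: str, value) -> set:
    key = f"{field}:{value}"
    # Exactly one shard needs to be queried; no scatter-gather.
    return index_shards[shard_of(key)].get(key, set())
```

The cost is a second, differently-sharded dataset to keep in sync with the primary store, which is essentially the bookkeeping that ES does for you when you mirror writes into it.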