Interesting profile. What are your server specs?
The 2 billion row test ran on: 2U server, 4 7200RPM disks in RAID-5, 2 quad-core CPUs, 16 GB memory (8 GB heap).
The 20 billion row test will run on something slightly beefier: 2U server, 5 7200RPM disks in RAID-5, 2 quad-core CPUs, 24 GB memory (12 GB or 16 GB heap).
What's your query load like?
Inserting at a sustained 3000 docs/sec, with bursts up to 6000 docs/sec. It takes 7-10 days to load 2 billion docs. I am testing multiple indexers to see if I can increase the indexing rate to 20k docs/sec so that a larger test will be easier.
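As a side note, the usual way to push indexing rates higher is the bulk API with several writers running in parallel. A minimal sketch of the idea (my illustration, not our actual indexer; host, index, type, and field names are made up):

    import json
    import requests

    # Batch documents into a single _bulk request; run several of these
    # loops in parallel processes to multiply throughput.
    def bulk_index(docs, index="logs-2011.12.07", host="http://localhost:9200"):
        lines = []
        for doc in docs:
            lines.append(json.dumps({"index": {"_index": index, "_type": "doc"}}))
            lines.append(json.dumps(doc))
        resp = requests.post(host + "/_bulk", data="\n".join(lines) + "\n")
        resp.raise_for_status()

    bulk_index([{"msg": "example", "level": "INFO"}] * 1000)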
This is all done on a single server rather than a distributed model, which of course would allow even better performance. I haven't done the RAID-10 vs. RAID-5 tests yet. We need to maximize disk capacity in a single physical enclosure, so we use the biggest disks available, which means dealing with the slower speeds of 7200RPM drives and RAID-5.
Our environment is write-intensive. Queries are relatively infrequent, perhaps 1000 per day at most. This probably means we should not be using RAID-5, which carries a write penalty but no read penalty, but we want the disk capacity. Since 6000 msgs/sec is sufficient for us at present, I haven't done much testing on disk write performance.
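For a rough sense of that penalty, a back-of-the-envelope calculation (my assumptions, not measurements: ~75 random IOPS per 7200RPM disk; 4 I/Os per small RAID-5 write vs. 2 for RAID-10):

    # Classic RAID write-penalty arithmetic: RAID-5 turns one small random
    # write into 4 I/Os (read data, read parity, write data, write parity);
    # RAID-10 needs 2 (write both mirrors).
    DISK_IOPS = 75   # assumed per-disk random IOPS at 7200RPM
    DISKS = 4

    raw = DISK_IOPS * DISKS
    print("RAID-5  random-write IOPS ~", raw / 4)   # ~75
    print("RAID-10 random-write IOPS ~", raw / 2)   # ~150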
For the queries themselves, as with any indexing app, queries spanning lots of data always take a long time. It was discussed on the mailing list how a query cannot be "cancelled", for example when a user makes a mistake, queries data for the last 10 months instead of the last 10 days, and wants to cancel the query. In our app, having lots of key/value pairs helps make the queries much faster. We can also tolerate a lag of at least 30 seconds from when documents are inserted to when they are available in a query, perhaps even longer, but we flush every 30 seconds.
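That 30-second window corresponds to relaxing the index refresh. A minimal sketch (my illustration; the index name is made up, and I'm using refresh_interval, which is the setting that governs when docs become searchable):

    import requests

    # Raise the refresh interval from the default 1s to 30s so newly
    # indexed docs become searchable within ~30 seconds.
    resp = requests.put(
        "http://localhost:9200/logs-2011.12.07/_settings",
        data='{"index": {"refresh_interval": "30s"}}',
    )
    print(resp.status_code, resp.text)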
I'm working on another project using 100 Amazon EC2 micro instances, but first I need to build the automation layer, as I don't want to administer 100 instances by hand. In this case I need the distributed read performance and will be testing HBase, MongoDB, and ES in separate trials. I am not sure the micro instances give me enough memory, but I am trying to build something relatively inexpensive. Someone suggested IRC chat; we should have a weekly one-hour chat session.
On Wed, Dec 7, 2011 at 5:12 PM, Michael Sick <michael.sick@serenesoftware.com> wrote:
Tom,
Interesting profile. What are your server specs? What's your query load like?
On Wed, Dec 7, 2011 at 8:09 PM, Tom Le dottom@gmail.com wrote:
We have inserted 2 billion documents: one index per day, one server, no replicas, 500 days (= 500 indexes = 500 shards). Average document size is 600 bytes (source and all key/value pairs), with compression enabled when the source > 500 bytes. The only issues we had were memory consumption, which was resolved by adjusting the max segment size, and disk usage, since data is not stored compressed the way it is in some commercial solutions.
Am currently testing 20 billion documents. This is all on a single
server.
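For anyone wanting to reproduce the per-day layout, a minimal sketch (my illustration, not our exact setup; index and type names are made up, and the _source compress options are the 0.x-era mapping settings):

    import json
    import requests

    # One index per day, one shard, no replicas, with _source compressed
    # only when it exceeds 500 bytes.
    body = {
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
        "mappings": {
            "doc": {
                "_source": {"compress": True, "compress_threshold": "500b"}
            }
        },
    }
    resp = requests.put("http://localhost:9200/logs-2011.12.07",
                        data=json.dumps(body))
    print(resp.status_code, resp.text)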
On Dec 7, 2011 3:42 PM, "Karussell" tableyourtime@googlemail.com wrote:
Here is one approach: "Convenient rolling index method" (elastic/elasticsearch issue #1500 on GitHub).
Would be nice as a plugin though ...
Regards,
Peter.
On 7 Dec., 23:34, Michael Sick michael.s...@serenesoftware.com wrote:
Anyone interested in kicking around some requirements/thoughts on IRC in the coming days on what's needed here? I have to build some of this and would be happy to write it as a plugin.
On Wed, Dec 7, 2011 at 11:56 AM, Berkay Mollamustafaoglu mber...@gmail.com wrote:
I think you're right. ES does not do this and it has to be done in the external app, but it's probably better to do it that way. ES provides the APIs and the external app can do all the index management, from creating/assigning aliases to creating/opening/closing/deleting indices. I think the external app can even go further and keep some metadata about the indices. For example, in the case of time-based indices like an index per day, the external app can track the start/end date and time of the docs so that it can determine which indices to run a query against, which index to reopen if it was closed, etc.
I'd imagine that there are many different use cases, hence it's probably better to keep these types of capabilities out of ES.
Berkay
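A minimal sketch of the bookkeeping described above (my illustration, not Berkay's code; assumes daily indices named logs-YYYY.MM.DD):

    from datetime import date, timedelta

    # Given the date range a query covers, derive the daily index names to
    # search, so the query hits 10 indices instead of all 500.
    def indices_for_range(start, end, prefix="logs-"):
        names, day = [], start
        while day <= end:
            names.append(prefix + day.strftime("%Y.%m.%d"))
            day += timedelta(days=1)
        return names

    names = indices_for_range(date(2011, 11, 27), date(2011, 12, 7))
    print(",".join(names))  # pass as the comma-separated index list in the search URL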
What might be nice to add to ES is the ability to say:
- "Hey, I'm interested in searching only the last N shards, so please manage the index alias for me to point to only the last N shards, so I don't have to manage this from my app"
or
- "Hey, I'm interested in searching only shards that contain documents added in the last N days, so please manage the index alias for me to point to only the last N shards"
I think neither is doable today, other than managing the index alias and its mapping to appropriate indices or shards manually, with an external app that talks to ES to add/remove indices from the alias?
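That manual alias management can be scripted today. A minimal sketch using the _aliases endpoint (the endpoint is real; the index and alias names here are made up):

    import json
    import requests

    # Atomically repoint a "last-10-days" alias at the most recent daily
    # indices: drop the day that aged out, add the new day.
    def repoint_alias(alias, drop, add, host="http://localhost:9200"):
        actions = [{"remove": {"index": i, "alias": alias}} for i in drop]
        actions += [{"add": {"index": i, "alias": alias}} for i in add]
        resp = requests.post(host + "/_aliases",
                             data=json.dumps({"actions": actions}))
        resp.raise_for_status()

    repoint_alias("last-10-days",
                  drop=["logs-2011.11.27"],
                  add=["logs-2011.12.07"])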