Performance issues using Elasticsearch as a time window storage


(Daniel Galinkin) #1

We are using Elasticsearch almost as a cache, storing documents found in a
time window. We continuously insert a lot of documents of different sizes,
and then we search in ES using text queries combined with a date filter,
so that the current thread does not get documents it has already seen.
Something like this:

"((word1 AND word2) OR (word3 AND word4)) AND insertedDate > 1389000"

We keep the data in Elasticsearch for 30 minutes, using the TTL feature.
Today we have at least 3 machines inserting new documents in bulk requests,
one request per machine per minute, and searching with queries like the one
above practically continuously.
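A query like the one above maps onto the Elasticsearch query DSL of that era as a query string wrapped in a `filtered` query with a range filter on the insertion date. A minimal sketch of the request body as a plain dict; the `text` field name and the exact query layout are assumptions, only `insertedDate` comes from the example above:

```python
def build_window_query(last_seen_ts):
    """Build a filtered query: a full-text clause plus a date floor,
    so a consumer only sees documents newer than its last read."""
    return {
        "query": {
            "filtered": {
                "query": {
                    "query_string": {
                        "default_field": "text",  # assumed field name
                        "query": "(word1 AND word2) OR (word3 AND word4)",
                    }
                },
                # Putting the date cut-off in a filter (rather than in the
                # query string) lets ES skip scoring it and cache the result.
                "filter": {
                    "range": {"insertedDate": {"gt": last_seen_ts}}
                },
            }
        }
    }

body = build_window_query(1389000)
```

Moving the date condition out of the query string and into a filter is usually the cheaper option, since filters are not scored.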

We are having a lot of trouble indexing and retrieving these documents; we
are not getting good throughput for documents being indexed and returned by
ES. We can't even get 200 documents indexed per second.

We believe the problem lies in the simultaneous queries, inserts, and TTL
deletes. We don't need to keep old data in Elasticsearch; we just need a
small time window of documents indexed at any given time.
What should we do to improve our performance?

Thanks in advance

Machine type:

  • An Amazon EC2 medium instance (3.7 GB of RAM)

Additional information:

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Martijn Van Groningen) #2

How strict is this 30 minutes TTL? What would improve your data flow is
working with multiple indices. You create a new index and start indexing
into it. After 30 minutes you create a second index and start indexing into
that one. After another 30 minutes you delete the first index, create a
third, and start indexing into that. So at most you have 2 active indices,
and by using aliases you can point read operations at the active ones. I
expect better indexing performance this way.
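The rotation above can be sketched as a pure function of the clock: name each index after its 30-minute bucket, and at each rotation step emit the alias actions (mirroring the shape of the `_aliases` API) plus the index to drop. Index and alias names here are invented for illustration:

```python
WINDOW_SECONDS = 30 * 60  # 30-minute buckets

def bucket_index(ts):
    """Name an index after the 30-minute bucket the timestamp falls in."""
    return "window-%d" % (int(ts) // WINDOW_SECONDS)

def rotation_actions(ts, alias="window-read"):
    """One rotation step: the current and previous buckets stay readable
    through the alias; anything two buckets old is dropped."""
    current = bucket_index(ts)
    previous = bucket_index(ts - WINDOW_SECONDS)
    expired = bucket_index(ts - 2 * WINDOW_SECONDS)
    actions = {
        "actions": [
            {"add": {"index": current, "alias": alias}},
            {"add": {"index": previous, "alias": alias}},
            {"remove": {"index": expired, "alias": alias}},
        ]
    }
    # The expired index itself would then be deleted outright.
    return actions, expired
```

Dropping a whole expired index is a near-free operation, whereas TTL deletes documents one at a time and forces constant segment merges, which is likely what is competing with your indexing throughput.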


--
Kind regards,

Martijn van Groningen



(Daniel Galinkin) #3

Thanks for the reply!

Talking on the IRC channel, the nice guys there also suggested something
along these lines. I'm currently implementing it and will try it soon.

On another note: we used to have many mappings inside the same index, and
we would use those mappings to filter the results for each type.
We changed that, and now have one index for each type that used to have a
mapping. Which is the theoretically better option?



(Martijn Van Groningen) #4

How many mappings are we talking about? Having a single mapping per index
is a bit more optimized, but you can easily have a bunch of types per
index.
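The practical difference shows up in how a per-type search is addressed in each layout; a rough sketch, with the `events`/`click` index and type names invented for illustration:

```python
def search_target_shared(doc_type):
    """One shared index with many types: a type-scoped search hits
    /index/type/_search and carries an implicit _type filter."""
    return "/events/%s/_search" % doc_type

def search_target_split(doc_type):
    """One index per former type: the search only ever touches that
    index's own shards and segments, with no type filter needed."""
    return "/events-%s/_search" % doc_type
```

With only 10 types, either layout is workable; separate indices avoid the implicit `_type` filter but multiply the shard count, so the shared index is usually fine at this scale.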



(Daniel Galinkin) #5

We currently have 10 mappings.



(Daniel Galinkin) #6

Sorry about the long delay to give you some feedback. Things were kind of
hectic here at our company, and I chose to wait for calmer times to give a
more detailed account of how we solved our issue. We still have to do some
benchmarks to measure the actual improvements, but the point is that we
solved the issue :slight_smile:

I posted how we solved our issue in my StackOverflow question; if you are
interested, check it out:



(system) #7