Decrease "Real time" latency for large indices


(Karussell) #1

Hi,

I'm using one big index to feed a lot of tweets of one week. The
larger this index gets the longer it takes a tweet appears in the
search **

How can I reduce this time?

I can think of two solutions:

  1. Having one small feeding index which will be merged into the bigger
    index every 6 hours. But then the following questions arise:
  • How to merge one millions of docs of the 6 hours into a larger
    index? This approach does not work very well for more than 300k
    documents with this code: https://gist.github.com/819239 - see the
    mergeIndices method which uses scroll API and requires a lot of memory
    and takes more than 5 minutes for 300k docs (while 100k only require
    10sec)

  • While merging no indexing can happen OR I have to avoid that the
    newly merged tweets in the big index appear in the search before the
    older ones are removed from the small index - how to do that when
    there is no 'commit'?

  1. Having one index per day, but then I need to route every new tweet
    to the correct index.

Do you know of a better solution with ES? My thinking is too much Solr-
ified so perhaps I'm thinking the wrong way ...

Kind regards,
Peter.

**
In my case: the larger the index gets the larger the time a forced
'refresh' takes. For my app I'm using this refresh, because I need to
query for existing tweets before indexing (determining replies and
retweets etc) - yes, I'm lazy and using ES as datastore ...


(Lukáš Vlček) #2

Hi,

On Sat, Feb 12, 2011 at 9:00 PM, Karussell tableyourtime@googlemail.comwrote:

Hi,

I'm using one big index to feed a lot of tweets of one week. The
larger this index gets the longer it takes a tweet appears in the
search **

How can I reduce this time?

I can think of two solutions:

  1. Having one small feeding index which will be merged into the bigger
    index every 6 hours. But then the following questions arise:
  • How to merge one millions of docs of the 6 hours into a larger
    index? This approach does not work very well for more than 300k
    documents with this code: https://gist.github.com/819239 - see the
    mergeIndices method which uses scroll API and requires a lot of memory
    and takes more than 5 minutes for 300k docs (while 100k only require
    10sec)

  • While merging no indexing can happen OR I have to avoid that the
    newly merged tweets in the big index appear in the search before the
    older ones are removed from the small index - how to do that when
    there is no 'commit'?

  1. Having one index per day, but then I need to route every new tweet
    to the correct index.

Did you check on index aliases?
http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases.html

You can have an index "foo" and give it an alias "bar". Then you can index
document by sending it to index "bar" and it will be indexed into index
"foo". Now imagine the index "foo" is called "tweets-from-ww23"... Isn't
this what you are looking for?

Do you know of a better solution with ES? My thinking is too much Solr-
ified so perhaps I'm thinking the wrong way ...

Kind regards,
Peter.

**
In my case: the larger the index gets the larger the time a forced
'refresh' takes. For my app I'm using this refresh, because I need to
query for existing tweets before indexing (determining replies and
retweets etc) - yes, I'm lazy and using ES as datastore ...

Though it is a "search engine" there is an evidence that people can use ES
as a datastore. I think ES is not a general replacement for database but if
you know that you are doing then why not?

Regards,
Lukas


(Karussell) #3

Hi Lukas,

This would be solution 2, right?

Then I would have to implement routing myself, right? The
"determineIndex" method here:

BulkRequestBuilder brb = client.prepareBulk();
for (Tweet tweet : tweets) {
String id = Long.toString(tweet.getTwitterId());
XContentBuilder source = createDoc(tweet);

brb.add(Requests.indexRequest(determineIndex(tweet)).type(getIndexType()).id(id).source(source));
}

But how I would use the alias function you mentioned? Or do you mean
that I'm doing one alias per day so that I don't need to change the
name of the feeding index?

I think ES is not a general replacement for database but
if you know that you are doing then why not?

Thanks for support :slight_smile:

There are two problems:

  • reindexing is hard, if not impossible at the moment ... Shay
    mentioned a scan-search could be my solution
    https://github.com/elasticsearch/elasticsearch/issues/#issue/605
  • and 'findById' isn't real time ... I'm working around this with an
    in-memory cache and doing an explicit 'refresh' after batch
    indexing ... to make sure all new tweets are available for findById in
    the next batch index

Regards,
Peter.

On 12 Feb., 21:43, Lukáš Vlček lukas.vl...@gmail.com wrote:

Hi,

On Sat, Feb 12, 2011 at 9:00 PM, Karussell tableyourt...@googlemail.comwrote:

Hi,

I'm using one big index to feed a lot of tweets of one week. The
larger this index gets the longer it takes a tweet appears in the
search **

How can I reduce this time?

I can think of two solutions:

  1. Having one small feeding index which will be merged into the bigger
    index every 6 hours. But then the following questions arise:
  • How to merge one millions of docs of the 6 hours into a larger
    index? This approach does not work very well for more than 300k
    documents with this code:https://gist.github.com/819239- see the
    mergeIndices method which uses scroll API and requires a lot of memory
    and takes more than 5 minutes for 300k docs (while 100k only require
    10sec)
  • While merging no indexing can happen OR I have to avoid that the
    newly merged tweets in the big index appear in the search before the
    older ones are removed from the small index - how to do that when
    there is no 'commit'?
  1. Having one index per day, but then I need to route every new tweet
    to the correct index.

Did you check on index aliases?http://www.elasticsearch.org/guide/reference/api/admin-indices-aliase...

You can have an index "foo" and give it an alias "bar". Then you can index
document by sending it to index "bar" and it will be indexed into index
"foo". Now imagine the index "foo" is called "tweets-from-ww23"... Isn't
this what you are looking for?

Do you know of a better solution with ES? My thinking is too much Solr-
ified so perhaps I'm thinking the wrong way ...

Kind regards,
Peter.

**
In my case: the larger the index gets the larger the time a forced
'refresh' takes. For my app I'm using this refresh, because I need to
query for existing tweets before indexing (determining replies and
retweets etc) - yes, I'm lazy and using ES as datastore ...

Though it is a "search engine" there is an evidence that people can use ES
as a datastore. I think ES is not a general replacement for database but if
you know that you are doing then why not?

Regards,
Lukas


(Lukáš Vlček) #4

Hi,

I am not too familiar with the Java API but what I was trying to say is that
you could create twitter indices by weeks (twitter-ww01, twitter-ww02, ...
etc) and dynamically assign alias "twitter-actual" to the latest one (and
remove from the others). This way you would always index into
"twitter-actual" index only.
So rather them merging indices into bigger index, let's keep them as they
are and just play around with aliases.

May be I just did not understand your problem.

Regards,
Lukas

On Sat, Feb 12, 2011 at 10:05 PM, Karussell tableyourtime@googlemail.comwrote:

Hi Lukas,

This would be solution 2, right?

Then I would have to implement routing myself, right? The
"determineIndex" method here:

BulkRequestBuilder brb = client.prepareBulk();
for (Tweet tweet : tweets) {
String id = Long.toString(tweet.getTwitterId());
XContentBuilder source = createDoc(tweet);

brb.add(Requests.indexRequest(determineIndex(tweet)).type(getIndexType()).id(id).source(source));
}

But how I would use the alias function you mentioned? Or do you mean
that I'm doing one alias per day so that I don't need to change the
name of the feeding index?

I think ES is not a general replacement for database but
if you know that you are doing then why not?

Thanks for support :slight_smile:

There are two problems:

  • reindexing is hard, if not impossible at the moment ... Shay
    mentioned a scan-search could be my solution
    https://github.com/elasticsearch/elasticsearch/issues/#issue/605
  • and 'findById' isn't real time ... I'm working around this with an
    in-memory cache and doing an explicit 'refresh' after batch
    indexing ... to make sure all new tweets are available for findById in
    the next batch index

Regards,
Peter.

On 12 Feb., 21:43, Lukáš Vlček lukas.vl...@gmail.com wrote:

Hi,

On Sat, Feb 12, 2011 at 9:00 PM, Karussell <tableyourt...@googlemail.com
wrote:

Hi,

I'm using one big index to feed a lot of tweets of one week. The
larger this index gets the longer it takes a tweet appears in the
search **

How can I reduce this time?

I can think of two solutions:

  1. Having one small feeding index which will be merged into the bigger
    index every 6 hours. But then the following questions arise:
  • How to merge one millions of docs of the 6 hours into a larger
    index? This approach does not work very well for more than 300k
    documents with this code:https://gist.github.com/819239- see the
    mergeIndices method which uses scroll API and requires a lot of memory
    and takes more than 5 minutes for 300k docs (while 100k only require
    10sec)
  • While merging no indexing can happen OR I have to avoid that the
    newly merged tweets in the big index appear in the search before the
    older ones are removed from the small index - how to do that when
    there is no 'commit'?
  1. Having one index per day, but then I need to route every new tweet
    to the correct index.

Did you check on index aliases?
http://www.elasticsearch.org/guide/reference/api/admin-indices-aliase...

You can have an index "foo" and give it an alias "bar". Then you can
index
document by sending it to index "bar" and it will be indexed into index
"foo". Now imagine the index "foo" is called "tweets-from-ww23"... Isn't
this what you are looking for?

Do you know of a better solution with ES? My thinking is too much Solr-
ified so perhaps I'm thinking the wrong way ...

Kind regards,
Peter.

**
In my case: the larger the index gets the larger the time a forced
'refresh' takes. For my app I'm using this refresh, because I need to
query for existing tweets before indexing (determining replies and
retweets etc) - yes, I'm lazy and using ES as datastore ...

Though it is a "search engine" there is an evidence that people can use
ES
as a datastore. I think ES is not a general replacement for database but
if
you know that you are doing then why not?

Regards,
Lukas


(Karussell) #5

but what I was trying to say is that you could create twitter indices by weeks (twitter-ww01, twitter-ww02, ...
etc)

Thanks, I'll have to think about how I'm now going further.

Regards,
Peter.


(Shay Banon) #6

Sounds like having an index per day might make more sense (with lower number of default shards). What is wrong with finding the correct index to place a tweet at?

Yes, the scan feature is planned post 0.15 (was hoping to get it to 0.15, but it got delayed). Also, merging indices can potentially be implemented in an optimized manner if both indices share the same number of shards (not replicas). Possible future (very delicate) API.

-shay.banon
On Saturday, February 12, 2011 at 11:47 PM, Karussell wrote:

but what I was trying to say is that you could create twitter indices by weeks (twitter-ww01, twitter-ww02, ...
etc)

Thanks, I'll have to think about how I'm now going further.

Regards,
Peter.


(Karussell) #7

Thanks Shay for the infos.

What is wrong with finding the correct index to place a tweet at?

Nothing, I only wasn't sure if this isn't a bit too much overhead ...
because some special things can happen in my app:

  • there can come in even old tweets either from search or because to
    update retweets
  • I need to call refresh after indexing a batch of tweets (again to
    find existing tweets -> update retweets)
    -> wouldn't this be expensive for so many indices? (Hmmh, but the
    most common operation should be to feed the current day ...)

still have to think about this :slight_smile:

Also, merging indices can potentially be implemented in an optimized manner
if both indices share the same number of shards (not replicas). Possible future (very delicate) API.

This would be good :slight_smile:
You mean implemented directly via file copying?

Regards,
Peter.

On 13 Feb., 23:06, Shay Banon shay.ba...@elasticsearch.com wrote:

Sounds like having an index per day might make more sense (with lower number of default shards). What is wrong with finding the correct index to place a tweet at?

Yes, the scan feature is planned post 0.15 (was hoping to get it to 0.15, but it got delayed). Also, merging indices can potentially be implemented in an optimized manner if both indices share the same number of shards (not replicas). Possible future (very delicate) API.

-shay.banon

On Saturday, February 12, 2011 at 11:47 PM, Karussell wrote:

but what I was trying to say is that you could create twitter indices by weeks (twitter-ww01, twitter-ww02, ...
etc)

Thanks, I'll have to think about how I'm now going further.

Regards,
Peter.


(Shay Banon) #8

On Monday, February 14, 2011 at 8:32 PM, Karussell wrote:
Thanks Shay for the infos.

What is wrong with finding the correct index to place a tweet at?

Nothing, I only wasn't sure if this isn't a bit too much overhead ...
because some special things can happen in my app:

  • there can come in even old tweets either from search or because to
    update retweets
  • I need to call refresh after indexing a batch of tweets (again to
    find existing tweets -> update retweets)
    -> wouldn't this be expensive for so many indices? (Hmmh, but the
    most common operation should be to feed the current day ...)

still have to think about this :slight_smile:
A better option that refresh each time is to try and update, and see if you don't get a version conflict (in 0.15).

Also, merging indices can potentially be implemented in an optimized manner
if both indices share the same number of shards (not replicas). Possible future (very delicate) API.

This would be good :slight_smile:
You mean implemented directly via file copying?
Well, first it will be fetching those index files over to the relevant nodes where merge will happen, and then merge it.

Regards,
Peter.

On 13 Feb., 23:06, Shay Banon shay.ba...@elasticsearch.com wrote:

Sounds like having an index per day might make more sense (with lower number of default shards). What is wrong with finding the correct index to place a tweet at?

Yes, the scan feature is planned post 0.15 (was hoping to get it to 0.15, but it got delayed). Also, merging indices can potentially be implemented in an optimized manner if both indices share the same number of shards (not replicas). Possible future (very delicate) API.

-shay.banon

On Saturday, February 12, 2011 at 11:47 PM, Karussell wrote:

but what I was trying to say is that you could create twitter indices by weeks (twitter-ww01, twitter-ww02, ...
etc)

Thanks, I'll have to think about how I'm now going further.

Regards,
Peter.


(Karussell) #9

A better option that refresh each time is to try and update, and see if you don't get a version conflict (in 0.15).

I thought about this too, but the problem is that I do not only need
to overwrite an existing tweet.

I also need to determine a retweet. E.g. the tweet B is "RT @user:
text" now I search in the timeline of 'user' to get the original tweet
A with 'text'.
Without forcing the refresh this won't give me tweet A

Regards,
Peter.

BTW: when using the twitter api for determining retweets I would have
to use one api point for every request which means >100k per hour and
I only have 350 per hour (and user)


(system) #10