Parent/Child inserting _ids in memory while indexing


(haarts) #1

All,

We love ES, but we've run into a snag. We are continuously indexing about a
hundred documents a second, adding to a collection of 700M documents.
The corpus is divided into parent and child documents, and all our
queries use the has_child query. These queries are very, very slow (think
seconds or minutes). This is because, when using the parent/child
functionality, all _ids are loaded into memory for efficient joins. That
loading is done at query time, and that is what takes its good time.
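For reference, the shape of the queries we run is roughly this (a minimal sketch; the child type and field names here are made up for illustration):

```python
# Sketch of a has_child query body, as used in the Elasticsearch query DSL.
# "comment" and "text" are hypothetical names, not from our real mapping.
def build_has_child_query(child_type, field, term):
    """Return parent documents that have at least one matching child."""
    return {
        "query": {
            "has_child": {
                "type": child_type,
                "query": {"term": {field: term}},
            }
        }
    }

body = build_has_child_query("comment", "text", "elasticsearch")
```

Every one of these triggers the parent/child join, which is where the _id loading cost shows up.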

Our _ids are 16 bytes each. We have 700M of them, so that would amount to
11.2GB of heap space. Very doable! So we don't mind having all of that in
memory.
We are now considering fiddling with the source of ES to add the _ids to
this in-memory map when a document is indexed, not when it is queried.
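The back-of-the-envelope arithmetic, for anyone checking (this counts raw id bytes only; per-entry JVM object overhead would add to it):

```python
# Raw heap estimate for keeping every _id in memory.
num_ids = 700_000_000      # documents in the collection
id_size_bytes = 16         # size of one _id

total_bytes = num_ids * id_size_bytes
total_gb = total_bytes / 1e9
print(f"{total_gb:.1f} GB")  # 11.2 GB
```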

Does this seem like a viable approach? Any pointers would be very much
appreciated.

With kind regards,


(Shay Banon) #2

The way caching works is that the values are loaded once and stored per Lucene segment. Large segments rarely change, so the cost of loading those values is paid "once". But yes, there is an open issue to add "warmup" code/queries, so that when refreshing, part of the refresh will be to warm up the caches before searches execute on the new segment.
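The warmup idea, in toy form (this is not ES code; the classes and method names below are made up to show the shape of it): after a refresh makes a new segment visible, load its per-segment cache before real searches arrive, so the query path never pays the loading cost.

```python
class Segment:
    """Toy stand-in for a Lucene segment with a lazily built _id cache."""

    def __init__(self, ids):
        self.ids = ids
        self.id_cache = None  # built once per segment, on first use

    def load_id_cache(self):
        if self.id_cache is None:
            # The expensive, once-per-segment load that currently
            # happens at query time.
            self.id_cache = set(self.ids)
        return self.id_cache


def refresh_with_warmup(new_segments):
    """Warm each new segment's cache before it serves searches."""
    for seg in new_segments:
        seg.load_id_cache()  # pay the cost here, not in the first query
    return new_segments


segments = refresh_with_warmup([Segment(["p1", "p2"]), Segment(["p3"])])
```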

On Tuesday, February 28, 2012 at 1:51 PM, haarts wrote:



(haarts) #3

I'm afraid I don't understand.
Are you referring to this
issue? https://github.com/elasticsearch/elasticsearch/issues/1006
And by "large segments" you mean big chunks of ID mappings in memory,
right?

BTW, I absolutely love the names the ES nodes get; mine is called Atom-Smasher!

On Wednesday, 29 February 2012 14:58:00 UTC+1, kimchy wrote:



(Shay Banon) #4

Yes, it's the issue I mentioned. Segments are basically how Lucene works: it creates immutable (up to deletes) internal segments (each pretty much a self-sufficient inverted index) and keeps merging them as they accumulate, to keep their number at bay.
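A toy illustration of that merge behaviour (not Lucene's actual merge policy, just the shape of it): segments are immutable once written, and a merge replaces several small ones with one larger one so the total count stays bounded.

```python
def merge_if_needed(segments, max_segments=4):
    """Merge the smallest segments together whenever there are too many.

    Each segment is a frozenset of doc ids (immutable, like a written
    Lucene segment); merging produces a new, larger segment and the old
    ones are discarded.
    """
    segments = list(segments)
    while len(segments) > max_segments:
        segments.sort(key=len)
        a, b = segments.pop(0), segments.pop(0)
        segments.append(a | b)  # new merged segment replaces a and b
    return segments


segs = [frozenset({i}) for i in range(10)]
merged = merge_if_needed(segs)  # 10 tiny segments collapse into 4
```

Because the large, merged segments then change rarely, a per-segment cache loaded "once" stays valid for a long time, which is why the existing caching scheme mostly works and only the freshly refreshed segments need warming.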

On Wednesday, February 29, 2012 at 5:05 PM, haarts wrote:


