2 Node cluster hanging while bulk indexing and adding types

Greg_Brown · December 21, 2011, 3:10pm

Hi,

We're trying to index a few million responses to many (10000+) forms.
We're creating a separate type for each form since they can have
different fields. So each form is associated with a type named form-
.

In bulk indexing the existing data, we are creating the new types and
their mappings on the fly, interspersed with bulk indexing the
documents. After running through about 200k responses the server
stopped responding and appeared to be running out of memory. The
cluster node stats are pasted below.

This two node cluster (8 GB machines) has indices for other
applications (of similar size) already running successfully on it, and
we can bulk index them without a problem. So I assume the large
number of types is causing the issue.

So some questions:

I'm planning to try creating all of the types first and then bulk
index. Is intermixing type creation and adding docs expected to run
into performance problems?

It seems from the cluster stats that there is a surprising amount
of data going in one direction. Could this be the performance problem?

We're planning to do term facet searches (for word cloud
generation) for each of the above types. I think I read that the
entire index gets loaded when a term facet is done. If I do a term
facet on a particular type, will only the portion of the index for
that type be loaded, or will it still be the whole thing? If the whole
thing, is there any way I can move the data into separate indices
without having 10k indices show up when I go to look at the size of my
indices? There is no need for them to be in the same index; I just
don't want the 10k tables making it harder for me to examine the
indices for other applications.

Thanks for the help
-Greg

The index in question:
"index" : {
"primary_size" : "55.8mb",
"primary_size_in_bytes" : 58586978,
"size" : "111.7mb",
"size_in_bytes" : 117177113
},
"translog" : {
"operations" : 0
},
"docs" : {
"num_docs" : 214308,
"max_doc" : 214314,
"deleted_docs" : 6
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 6426,
"total_time" : "4.8m",
"total_time_in_millis" : 289292
},
"flush" : {
"total" : 2702,
"total_time" : "1.6m",
"total_time_in_millis" : 97459
},

/_cluster/node/stats
{
"cluster_name" : "mgs2",
"nodes" : {
"x25SzznQQZ2uPGQvgGjs4Q" : {
"name" : "Nut",
"indices" : {
"store" : {
"size" : "1.6gb",
"size_in_bytes" : 1806371221
},
"docs" : {
"count" : 557654,
"deleted" : 114387
},
"indexing" : {
"index_total" : 18577042,
"index_time" : "40m",
"index_time_in_millis" : 2403162,
"index_current" : 0,
"delete_total" : 0,
"delete_time" : "0s",
"delete_time_in_millis" : 0,
"delete_current" : 0
},
"get" : {
"total" : 86,
"time" : "60ms",
"time_in_millis" : 60,
"exists_total" : 24,
"exists_time" : "40ms",
"exists_time_in_millis" : 40,
"missing_total" : 62,
"missing_time" : "20ms",
"missing_time_in_millis" : 20,
"current" : 0
},
"search" : {
"query_total" : 20121,
"query_time" : "4.8m",
"query_time_in_millis" : 288251,
"query_current" : 0,
"fetch_total" : 15337,
"fetch_time" : "57.4s",
"fetch_time_in_millis" : 57470,
"fetch_current" : 0
},
"cache" : {
"field_evictions" : 0,
"field_size" : "4.4mb",
"field_size_in_bytes" : 4675324,
"filter_count" : 3,
"filter_evictions" : 0,
"filter_size" : "92.9kb",
"filter_size_in_bytes" : 95144
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 56,
"total_time" : "15.4s",
"total_time_in_millis" : 15492,
"total_docs" : 1887347,
"total_size" : "495.7mb",
"total_size_in_bytes" : 519823123
},
"refresh" : {
"total" : 6748,
"total_time" : "4.8m",
"total_time_in_millis" : 290519
},
"flush" : {
"total" : 2972,
"total_time" : "1.6m",
"total_time_in_millis" : 101684
}
},
"os" : {
"timestamp" : 1324472103660,
"uptime" : "-1 seconds",
"uptime_in_millis" : -1000,
"load_average" : [ ]
},
"process" : {
"timestamp" : 1324472103660,
"open_file_descriptors" : 1257
},
"jvm" : {
"timestamp" : 1324472103660,
"uptime" : "15 hours, 58 minutes, 35 seconds and 124
milliseconds",
"uptime_in_millis" : 57515124,
"mem" : {
"heap_used" : "2.3gb",
"heap_used_in_bytes" : 2524820736,
"heap_committed" : "4.9gb",
"heap_committed_in_bytes" : 5340397568,
"non_heap_used" : "46mb",
"non_heap_used_in_bytes" : 48276152,
"non_heap_committed" : "69.8mb",
"non_heap_committed_in_bytes" : 73248768
},
"threads" : {
"count" : 72,
"peak_count" : 86
},
"gc" : {
"collection_count" : 16609,
"collection_time" : "4 minutes, 25 seconds and 568
milliseconds",
"collection_time_in_millis" : 265568,
"collectors" : {
"ParNew" : {
"collection_count" : 16600,
"collection_time" : "4 minutes, 24 seconds and 807
milliseconds",
"collection_time_in_millis" : 264807
},
"ConcurrentMarkSweep" : {
"collection_count" : 9,
"collection_time" : "761 milliseconds",
"collection_time_in_millis" : 761
}
}
}
},
"network" : {
},
"transport" : {
"server_open" : 14,
"rx_count" : 243057,
"rx_size" : "452.2mb",
"rx_size_in_bytes" : 474228725,
"tx_count" : 321262,
"tx_size" : "4.1gb",
"tx_size_in_bytes" : 4459909980
},
"http" : {
"current_open" : 8,
"total_opened" : 7208
}
},
"fxJdAyuKSZeLRQ0nrBybGg" : {
"name" : "Bantam",
"indices" : {
"store" : {
"size" : "1.7gb",
"size_in_bytes" : 1889182815
},
"docs" : {
"count" : 557472,
"deleted" : 114386
},
"indexing" : {
"index_total" : 18576863,
"index_time" : "1.1h",
"index_time_in_millis" : 4074696,
"index_current" : 5,
"delete_total" : 0,
"delete_time" : "0s",
"delete_time_in_millis" : 0,
"delete_current" : 0
},
"get" : {
"total" : 91,
"time" : "34ms",
"time_in_millis" : 34,
"exists_total" : 43,
"exists_time" : "16ms",
"exists_time_in_millis" : 16,
"missing_total" : 48,
"missing_time" : "18ms",
"missing_time_in_millis" : 18,
"current" : 0
},
"search" : {
"query_total" : 20060,
"query_time" : "11.4m",
"query_time_in_millis" : 686600,
"query_current" : 5,
"fetch_total" : 15146,
"fetch_time" : "2.4m",
"fetch_time_in_millis" : 148005,
"fetch_current" : 1
},
"cache" : {
"field_evictions" : 0,
"field_size" : "4.4mb",
"field_size_in_bytes" : 4674888,
"filter_count" : 2,
"filter_evictions" : 0,
"filter_size" : "92.5kb",
"filter_size_in_bytes" : 94760
},
"merges" : {
"current" : 5,
"current_docs" : 214061,
"current_size" : "55.8mb",
"current_size_in_bytes" : 58566804,
"total" : 55,
"total_time" : "5.6m",
"total_time_in_millis" : 341870,
"total_docs" : 2010928,
"total_size" : "527.5mb",
"total_size_in_bytes" : 553224978
},
"refresh" : {
"total" : 7365,
"total_time" : "7.4m",
"total_time_in_millis" : 447627
},
"flush" : {
"total" : 2977,
"total_time" : "13.4m",
"total_time_in_millis" : 804327
}
},
"os" : {
"timestamp" : 1324472120587,
"uptime" : "-1 seconds",
"uptime_in_millis" : -1000,
"load_average" : [ ]
},
"process" : {
"timestamp" : 1324472120587,
"open_file_descriptors" : 1438
},
"jvm" : {
"timestamp" : 1324472120589,
"uptime" : "15 hours, 58 minutes, 25 seconds and 155
milliseconds",
"uptime_in_millis" : 57505155,
"mem" : {
"heap_used" : "4.9gb",
"heap_used_in_bytes" : 5317828824,
"heap_committed" : "4.9gb",
"heap_committed_in_bytes" : 5340397568,
"non_heap_used" : "44.4mb",
"non_heap_used_in_bytes" : 46654016,
"non_heap_committed" : "68.6mb",
"non_heap_committed_in_bytes" : 71991296
},
"threads" : {
"count" : 94,
"peak_count" : 96
},
"gc" : {
"collection_count" : 15361,
"collection_time" : "7 minutes, 38 seconds and 876
milliseconds",
"collection_time_in_millis" : 458876,
"collectors" : {
"ParNew" : {
"collection_count" : 15245,
"collection_time" : "6 minutes, 6 seconds and 39
milliseconds",
"collection_time_in_millis" : 366039
},
"ConcurrentMarkSweep" : {
"collection_count" : 116,
"collection_time" : "1 minute, 32 seconds and 837
milliseconds",
"collection_time_in_millis" : 92837
}
}
}
},
"network" : {
},
"transport" : {
"server_open" : 14,
"rx_count" : 242790,
"rx_size" : "4.1gb",
"rx_size_in_bytes" : 4459575446,
"tx_count" : 250201,
"tx_size" : "452mb",
"tx_size_in_bytes" : 474044111
},
"http" : {
"current_open" : 3,
"total_opened" : 7074
}
}
}
}
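
To put numbers on the heap and transport figures in the stats above, a small sketch like the following could summarize each node. The byte counts are copied from the pasted stats; the `NODES` dict and the `summarize` helper are illustrative names only, not part of Elasticsearch.

```python
# Byte counts copied from the node stats pasted above.
NODES = {
    "Nut": {
        "heap_used": 2524820736,
        "heap_committed": 5340397568,
        "rx": 474228725,
        "tx": 4459909980,
    },
    "Bantam": {
        "heap_used": 5317828824,
        "heap_committed": 5340397568,
        "rx": 4459575446,
        "tx": 474044111,
    },
}


def summarize(node):
    """Return (heap fill ratio, tx/rx ratio) for one node's stats."""
    heap_fill = node["heap_used"] / node["heap_committed"]
    tx_over_rx = node["tx"] / node["rx"]
    return heap_fill, tx_over_rx


for name, stats in NODES.items():
    heap_fill, tx_over_rx = summarize(stats)
    print(f"{name}: heap {heap_fill:.1%} of committed, tx/rx = {tx_over_rx:.2f}")
```

On these numbers Bantam's heap is essentially full while Nut's is under half, and nearly all of the transport traffic flows from Nut to Bantam.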

Craig_Brown · December 21, 2011, 6:47pm

Greg, have you checked/increased open file handle limits for your machine?
ES/Lucene tend to require lots of file handles. I believe that more indices
require more file handles. I had a case last night where I was trying to
index 1m records and it hung at about 340K because it ran out of file
handles. Once I increased the open file handle limit, I was able to index
1m without problems.

Craig
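
As a rough illustration of the check Craig describes, the sketch below reads the open-file limit for the current process on a Unix-like system. It is only an assumption-laden example: it reports the limit of the Python process that runs it, so the limit that matters for ES has to be checked for the user and shell that start the ES process. The helper name `open_file_limits` is made up for this example.

```python
import resource  # Unix-only standard library module


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
print(f"soft limit: {soft}, hard limit: {hard}")
```

The node stats above report 1257 and 1438 open file descriptors, which can be compared against the soft limit found this way.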



Karussell1 · December 21, 2011, 7:08pm

Greg, have you checked/increased open file handle limits for your machine?

First, check/post your logs. If too many files were open, ES would log that.

Peter.

kimchy · December 22, 2011, 12:54am

My guess is that the problem is with creating so many types, which ends up
being a large overhead in the system. Each time a type is introduced, it
needs to be broadcast to the rest of the nodes and persisted as part of
the cluster metadata. Can you try just indexing into the same type as a
test and see if it still happens?


Greg_Brown · December 22, 2011, 2:37pm

Checking through the logs, there isn't any mention of there not being
enough file handles. The errors I am running into are out-of-memory
errors on the heap space.

Shay,

Thanks, will give that a try and let you know. Will have to wait until
after the weekend so I can set up a development cluster. I've brought
down the production cluster a few too many times this week, and it's
time to be more careful.

Thanks for the fast responses.
-Greg


Greg_Brown · January 10, 2012, 3:45am

Indexing all data to a single type did work fine (3.3mil docs) as
expected.

I submitted a bug (elasticsearch.github.com/issues/134) on the large
number of types because I was able to get the server to become
unresponsive even when there was only a single server and I tried to
add many types.

For the moment I am going ahead with using all of the documents in a
single index. However, this significantly reduces query performance
compared to having a separate type for each set of documents. I looped
and profiled the following queries on the larger sets of documents
(10k-70k): https://gist.github.com/1586723 (Profiling queries for
building word clouds). This was all run on a single server, with the
queries issued from a different machine.

The first query has each set of docs in its own Type. On average it
took about 11 ms to complete.

The second has all of the docs in one index with a field pd_id to
distinguish the sets. The query uses the facet_filter and averages ~190
ms.

The third uses the same index as the second, but uses a query to do
the "filtering" of the docs, and averages ~140 ms. I was surprised that this was
faster than the facet_filter.
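
For readers without the gist at hand, the three variants described above might look roughly like the request bodies below. This is a reconstruction under assumptions, not the gist's actual content: the field names `q1` and `pd_id` come from this thread, while the helper names and the example id `"42"` are hypothetical.

```python
import json


def terms_facet(field, size=100):
    """A terms facet on `field`, as used for the word clouds."""
    return {"terms": {"field": field, "size": size}}


def per_type_body(field):
    """Variant 1: plain facet; the type is selected in the URL path."""
    return {"size": 0, "query": {"match_all": {}}, "facets": {field: terms_facet(field)}}


def facet_filter_body(field, pd_id):
    """Variant 2: one shared type, facet restricted with facet_filter."""
    facet = terms_facet(field)
    facet["facet_filter"] = {"term": {"pd_id": pd_id}}
    return {"size": 0, "query": {"match_all": {}}, "facets": {field: facet}}


def filtered_query_body(field, pd_id):
    """Variant 3: one shared type, documents narrowed by the query itself."""
    return {
        "size": 0,
        "query": {"term": {"pd_id": pd_id}},
        "facets": {field: terms_facet(field)},
    }


print(json.dumps(facet_filter_body("q1", "42"), indent=2))
```

In the second variant the query still matches every document and only the facet is restricted, whereas the third narrows the document set before the facet is computed.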

Any suggestions on how to improve the last two queries?

Any ideas on how to create multiple types without creating 10k
separate indices? In this case all I am using the Type for is a
partitioning/grouping of multiple separate indices, since the Mapping
of each Type is identical.

Thanks for the help.
-Greg


kimchy · January 10, 2012, 9:40am

Just add the "type" as a field to the doc, and filter by it.
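
A minimal sketch of that suggestion, assuming a bulk request in the newline-delimited format the bulk API accepts: every document goes into one shared type and carries its group id as an ordinary field. The index name `pd-test-0` and field `pd_id` appear elsewhere in this thread; the type name `response`, the helper `bulk_body`, and the sample documents are made up for illustration.

```python
import json


def bulk_body(docs, index, doc_type, group_field="pd_id"):
    """Build a bulk-API body: one action line plus one source line per doc.

    Each doc is a (group_id, fields) pair; the group id is stored as an
    ordinary field so it can be filtered on later.
    """
    lines = []
    for group_id, fields in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        source = dict(fields)
        source[group_field] = group_id
        lines.append(json.dumps(source))
    # The bulk API expects a trailing newline after the last line.
    return "\n".join(lines) + "\n"


docs = [("form-1", {"q1": "great service"}), ("form-2", {"q1": "too slow"})]
print(bulk_body(docs, index="pd-test-0", doc_type="response"), end="")
```

If the group field holds values with hyphens or mixed case, it probably needs to be mapped as not_analyzed so that a term filter matches the exact value.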


Greg_Brown · January 10, 2012, 3:48pm

That's what I did. It works functionally, but it is 10x slower to query
using either a query to filter or a facet_filter. Is there another
way? According to the docs: "search filters restrict only returned
documents — but not facet counts"


kimchy · January 10, 2012, 5:50pm

10x slower than types? It makes little sense since types, at the end of the
day, are just a field called _type in a document, and when you search within
a type, the query you provide is simply wrapped in a filtered query with a
filter on the type. So, you can do it yourself: just wrap your query in a
filtered query with a filter on your "type".


Greg_Brown · January 11, 2012, 4:35pm

Ah, I see what you are saying. But I am totally flummoxed as to how to
formulate the query to get a filtered query that matches the
performance of the default filtered query used for the type.

I think what you are suggesting is:
# ${ID} is spliced in outside the single quotes so the shell expands it.
curl -XGET "${SERVER}/pd-test-0/_search?pretty" \
  -d '{
  "size" : 0,
  "query" : {
    "filtered" : {
      "query" : { "match_all" : { } },
      "filter" : {
        "term" : { "pd_id" : "'"${ID}"'" }
      }
    }
  },
  "facets" : {
    "q1" : {
      "terms" : {
        "field" : "q1",
        "size" : 100
      }
    }
  }
}'

Is the match_all query correct? This query takes about 125 ms vs 13
ms for a query on the type (curl -XGET "${SERVER}/pd-0/${ID}/_search").

From what I can gather from the Java code, this should mostly match
that query, but I don't know the code well enough. Is there an easy way
to enable logging that would let me compare the structure of the
parsed queries for debugging this?
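
Separately from inspecting the parsed queries, the latency comparison itself can be made repeatable with a small timing loop such as the sketch below. It is an assumption-based example: `average_latency_ms` and `fake_query` are hypothetical names, and the stand-in workload would need to be replaced with a function that actually sends the search request.

```python
import time


def average_latency_ms(run_query, repeats=50, warmup=5):
    """Call run_query repeatedly and return the mean wall-clock time in ms."""
    for _ in range(warmup):
        run_query()  # let caches warm up before measuring
    start = time.perf_counter()
    for _ in range(repeats):
        run_query()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / repeats


# Stand-in workload; replace with a function that sends the search request.
def fake_query():
    time.sleep(0.001)


print(f"{average_latency_ms(fake_query, repeats=20):.2f} ms")
```

Running the warm-up calls first matters here because filter and field caches are populated by the first few requests, which would otherwise skew the average.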

Thanks for all the help, Shay! Much appreciated.
-Greg

On Jan 10, 10:50 am, Shay Banon kim...@gmail.com wrote:

10x slower than types? That makes little sense, since a type, at the end
of the day, is just a field called _type in a document, and when you search
within a type, the query you provide is simply wrapped in a filtered query
with a filter on the type. So you can do it yourself: just wrap your query
in a filtered query with a filter on your "type" field.
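
The wrapping described above can be sketched in Python as follows. This only illustrates the description in this message; it is not taken from the Elasticsearch source, and the helper name and the type name "12345" are hypothetical.

```python
import copy

def wrap_in_type_filter(search_body, type_name):
    """Return a copy of search_body whose query is wrapped in a filtered query on _type."""
    wrapped = copy.deepcopy(search_body)
    # Fall back to match_all when the caller supplied no query at all.
    inner = wrapped.get("query", {"match_all": {}})
    wrapped["query"] = {
        "filtered": {
            "query": inner,
            "filter": {"term": {"_type": type_name}},
        }
    }
    return wrapped

original = {"size": 0, "query": {"match_all": {}}}
by_type = wrap_in_type_filter(original, "12345")  # hypothetical type name
```

Replacing "_type" with a custom field name gives the do-it-yourself variant suggested here.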

On Tue, Jan 10, 2012 at 5:48 PM, Greg Ichneumon Brown
gbrown5...@gmail.com wrote:

That's what I did. Functionally works, but it is 10x slower to query
using either a query to filter or a facet_filter. Is there another
way? According to the docs: "search filters restrict only returned
documents — but not facet counts"

On Jan 10, 2:40 am, Shay Banon kim...@gmail.com wrote:

Just add the "type" as a field to the doc, and filter by it.
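
As a rough illustration of that suggestion, the sketch below builds a bulk request body in which every document goes into one type and carries its group id as an ordinary field. The field name pd_id and the index name pd-test-0 appear elsewhere in this thread; the type name "response", the document ids, and the sample values are made up for the example.

```python
import json

def bulk_lines(index, doc_type, docs):
    """Yield bulk API lines: one action line and one source line per document."""
    for doc_id, pd_id, fields in docs:
        yield json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}})
        source = dict(fields)
        source["pd_id"] = pd_id  # the former per-form type becomes an ordinary field
        yield json.dumps(source)

docs = [("r1", 7, {"q1": "adopt"}), ("r2", 8, {"q1": "adoption"})]
# The bulk body is newline-delimited and must end with a newline.
payload = "\n".join(bulk_lines("pd-test-0", "response", docs)) + "\n"
```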

On Tue, Jan 10, 2012 at 5:45 AM, Greg Brown gbrown5...@gmail.com
wrote:

Indexing all data to a single type did work fine (3.3mil docs) as
expected.

I submitted a bug (elasticsearch · GitHub
elasticsearch.github.com/issues/134) on the large number of types
because I was able to get the server to become unresponsive even when
there was only a single server and I tried to add many types.

For the moment I am going ahead with using all of the documents in a
single index. However, this significantly reduces query performance
compared to having a separate type for each set of documents. I looped
and profiled the following queries on the larger sets of documents
(10k-70k): https://gist.github.com/1586723
This was all run on a single server, with the queries sent from a
different machine.

The first query has each set of docs in its own Type. On average it
took about 11 ms to complete.

The second has all of the docs in one index with a field pd_id to
distinguish the sets. The query uses the facet_filter and averages
~190 ms.

The third uses the same index as the second, but uses a query to do
the "filtering" of the docs. ~140 ms. I was surprised that this was
faster than the facet_filter.
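
The gist itself is not reproduced in this thread, so the exact requests are unknown. As an assumption-laden sketch, the second and third variants might look like the bodies below; the function names, the facet field q1, and the choice of a term query for the third variant are guesses made for illustration.

```python
def facet_filter_body(pd_id, facet_field="q1", size=100):
    """Second variant: match_all query, with only the facet restricted by facet_filter."""
    return {
        "size": 0,
        "query": {"match_all": {}},
        "facets": {
            facet_field: {
                "terms": {"field": facet_field, "size": size},
                "facet_filter": {"term": {"pd_id": pd_id}},
            }
        },
    }

def term_query_body(pd_id, facet_field="q1", size=100):
    """Third variant (assumed form): a term query narrows the documents the facet sees."""
    return {
        "size": 0,
        "query": {"term": {"pd_id": pd_id}},
        "facets": {facet_field: {"terms": {"field": facet_field, "size": size}}},
    }
```

In the first body the facet is computed over whatever passes facet_filter while the hit total still reflects match_all; in the second, both the hit total and the facet are limited by the query.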

Any suggestions on how to improve the last two queries?

Any ideas on how to create multiple types without creating 10k
separate indices? In this case all I am using the Type for is a
partitioning/grouping of multiple separate indices, since the Mapping
of each Type is identical.

Thanks for the help.
-Greg

On Dec 22 2011, 7:37 am, Greg Brown gbrown5...@gmail.com wrote:

Checking through the logs, there isn't any mention of there not being
enough file handles. The errors I am running into are out-of-memory
(heap space) errors.

Shay,

Thanks, will give that a try and let you know. Will have to wait
until after the weekend so I can set up a development cluster. I've
brought down the production cluster a few too many times this week,
and it's time to be more careful.

Thanks for the fast responses.
-Greg

kimchy · January 11, 2012, 6:00pm

Yes, what you posted is exactly the query that you will end up with when
searching against a type. Maybe you just didn't let it do its caching bit?
How fast is the second or third execution using the same pd_id? (The term
filter result is cached.)
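
One way to check the warm-cache timing is to send the same body several times and compare the server-reported "took" values. The sketch below assumes the search endpoint accepts a POST with a JSON body; the helper names and any URL passed in (for example a local node on port 9200) are assumptions, not details from this thread.

```python
import json
import urllib.request

def http_search(url, body):
    """POST a search body to an Elasticsearch _search URL and return the parsed response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def took_per_run(search, body, runs=3):
    """Run the same search repeatedly and return the "took" value (ms) of each run."""
    return [search(body)["took"] for _ in range(runs)]

# Usage against a real node (not executed here; the URL is an assumption):
# url = "http://localhost:9200/pd-test-0/_search"
# print(took_per_run(lambda b: http_search(url, b), {"size": 0}))
```

If the filter result is being cached, the later entries in the list should be noticeably smaller than the first.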

Greg_Brown · January 11, 2012, 8:14pm

No matter how many times I repeat that query I am getting "took" :
125, whereas the type query gives "took" : 13. I also built a filtered
query using _type and that completes in 13 ms also.

Could my mapping for this field be the culprit? I am doing:

	'pd_id' => array( 'type' => 'long', 'store' => 'yes', 'index' => 'not_analyzed' )

Each document has an integer as an id. I used long as the storage type
to reduce memory; does this not work with the term filter?
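
For reference, the PHP array above corresponds to a JSON mapping fragment along the lines of the sketch below. The type name "response" is hypothetical, and the outer shape (type name, then "properties") is the usual put-mapping layout rather than something shown in this thread.

```python
import json

def pd_id_mapping(doc_type, field_type="long"):
    """Mapping body for pd_id, mirroring the PHP array: stored, indexed, not analyzed."""
    return {
        doc_type: {
            "properties": {
                "pd_id": {"type": field_type, "store": "yes", "index": "not_analyzed"}
            }
        }
    }

as_long = pd_id_mapping("response")                 # the mapping as posted
as_integer = pd_id_mapping("response", "integer")   # alternative numeric type to compare
print(json.dumps(as_long, indent=2))
```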

I tried setting _cache to true in the filter to see if that forced the
caching, but then I get the correct number of total hits, but no facet
results:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 11588,
    "max_score" : 1.0,
    "hits" : [ ]
  }
}

Should be:
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 11588,
    "max_score" : 1.0,
    "hits" : [ ]
  },
  "facets" : {
    "q1" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 11720,
      "other" : 56,
      "terms" : [ {
        "term" : "adopt",
        "count" : 11475
      }, {
        "term" : "adoption",
        "count" : 39
      }, {
etc...
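
A small Python sketch for telling the two responses apart programmatically: it pulls the (term, count) pairs out of a terms facet and returns an empty list when the "facets" section is missing. The helper name is made up; the sample values are the ones shown in the expected response above.

```python
def facet_term_counts(response, facet_name):
    """Return (term, count) pairs from a terms facet, in the order the server sent them."""
    facet = response.get("facets", {}).get(facet_name)
    if facet is None:
        return []  # the response carried no such facet
    return [(entry["term"], entry["count"]) for entry in facet["terms"]]

expected = {
    "took": 13,
    "facets": {
        "q1": {
            "_type": "terms",
            "terms": [
                {"term": "adopt", "count": 11475},
                {"term": "adoption", "count": 39},
            ],
        }
    },
}
without_facets = {"took": 1, "hits": {"total": 11588, "hits": []}}
```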

kimchy · January 11, 2012, 8:38pm

Regarding the caching, you did not show an example of how you tried to set
it, so I can't help, but the _cache is probably set in the wrong place. In
any case, you don't need to set it, as the term filter is cached by default.
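
For readers wondering where the flag would go: as far as the filter DSL of that era is concerned, _cache sits inside the term filter object next to the field being matched. The sketch below is written under that assumption, and the helper name is hypothetical.

```python
def term_filter(field, value, cache=None):
    """Build a term filter; _cache is only emitted when explicitly requested."""
    body = {field: value}
    if cache is not None:
        body["_cache"] = cache  # sibling of the field entry, inside "term"
    return {"term": body}

default_filter = term_filter("pd_id", 12345)             # relies on default caching
explicit_filter = term_filter("pd_id", 12345, cache=True)
```

Either filter would then be used as the "filter" part of a filtered query; the rest of the search body, including "facets", stays where it was.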

Regarding the field type of pd_id, you define it as numeric, which is fine.
Note that _type is a String, but it does not really matter that much since
we are caching the filter results. Note, though, that long is 64-bit signed
and integer is 32-bit signed.
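
To make the size difference concrete, this short sketch checks which of the two signed types is the smallest that can hold a set of ids. The function name and sample ids are illustrative only.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def smallest_signed_type(values):
    """Return "integer" if every value fits in 32 bits, otherwise "long"."""
    lo, hi = min(values), max(values)
    if INT32_MIN <= lo and hi <= INT32_MAX:
        return "integer"
    if INT64_MIN <= lo and hi <= INT64_MAX:
        return "long"
    raise ValueError("value does not fit in a 64-bit signed integer")

print(smallest_signed_type([1, 11588]))
print(smallest_signed_type([2**31]))
```

Ids that stay below roughly 2.1 billion fit in the 32-bit type, so long is not needed for them.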

I don't really understand where this difference is coming from; it makes
very little sense. You can dropbox the ES data directory you are working
with, and two sample curl queries, one against type and one against pd_id,
and I can have a look.
