2 Node cluster hanging while bulk indexing and adding types


(Greg Brown) #1

Hi,

We're trying to index a few million responses to many (10000+) forms.
We're creating a separate type for each form since they can have
different fields. So each form is associated with a type named form-
.

In bulk indexing the existing data, we are creating the new types and
their mappings on the fly, interspersed with bulk indexing the
documents. After running through about 200k responses the server
stopped responding and appeared to be running out of memory. The
cluster node stats are pasted below.

This two node cluster (8 GB machines) has indices for other
applications (of similar size) already running successfully on it, and
we can bulk index them without a problem. So I assume the large
number of types is causing the issue.

So some questions:

  1. I'm planning to try creating all of the types first and then bulk
    index. Is intermixing type creation and adding docs expected to run
    into performance problems?

  2. It seems from the cluster stats that there is a surprising amount
    of data going in one direction. Could this be the performance problem?

  3. We're planning to do term facet searches (for word cloud
    generation) for each of the above types. I think I read that the
    entire index gets loaded when a term facet is done. If I do a term
    facet on a particular type, will only the portion of the index for
    that type be loaded, or will it still be the whole thing? If the whole
    thing, is there any way I can move the data into separate indices
    without having 10k indices show up when I go to look at the size of my
    indices? There is no need for them to be in the same index; I just
    don't want the 10k indices making it harder for me to examine the
    indices for other applications.

Thanks for the help
-Greg

The index in question:
"index" : {
"primary_size" : "55.8mb",
"primary_size_in_bytes" : 58586978,
"size" : "111.7mb",
"size_in_bytes" : 117177113
},
"translog" : {
"operations" : 0
},
"docs" : {
"num_docs" : 214308,
"max_doc" : 214314,
"deleted_docs" : 6
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 6426,
"total_time" : "4.8m",
"total_time_in_millis" : 289292
},
"flush" : {
"total" : 2702,
"total_time" : "1.6m",
"total_time_in_millis" : 97459
},

/_cluster/node/stats
{
"cluster_name" : "mgs2",
"nodes" : {
"x25SzznQQZ2uPGQvgGjs4Q" : {
"name" : "Nut",
"indices" : {
"store" : {
"size" : "1.6gb",
"size_in_bytes" : 1806371221
},
"docs" : {
"count" : 557654,
"deleted" : 114387
},
"indexing" : {
"index_total" : 18577042,
"index_time" : "40m",
"index_time_in_millis" : 2403162,
"index_current" : 0,
"delete_total" : 0,
"delete_time" : "0s",
"delete_time_in_millis" : 0,
"delete_current" : 0
},
"get" : {
"total" : 86,
"time" : "60ms",
"time_in_millis" : 60,
"exists_total" : 24,
"exists_time" : "40ms",
"exists_time_in_millis" : 40,
"missing_total" : 62,
"missing_time" : "20ms",
"missing_time_in_millis" : 20,
"current" : 0
},
"search" : {
"query_total" : 20121,
"query_time" : "4.8m",
"query_time_in_millis" : 288251,
"query_current" : 0,
"fetch_total" : 15337,
"fetch_time" : "57.4s",
"fetch_time_in_millis" : 57470,
"fetch_current" : 0
},
"cache" : {
"field_evictions" : 0,
"field_size" : "4.4mb",
"field_size_in_bytes" : 4675324,
"filter_count" : 3,
"filter_evictions" : 0,
"filter_size" : "92.9kb",
"filter_size_in_bytes" : 95144
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 56,
"total_time" : "15.4s",
"total_time_in_millis" : 15492,
"total_docs" : 1887347,
"total_size" : "495.7mb",
"total_size_in_bytes" : 519823123
},
"refresh" : {
"total" : 6748,
"total_time" : "4.8m",
"total_time_in_millis" : 290519
},
"flush" : {
"total" : 2972,
"total_time" : "1.6m",
"total_time_in_millis" : 101684
}
},
"os" : {
"timestamp" : 1324472103660,
"uptime" : "-1 seconds",
"uptime_in_millis" : -1000,
"load_average" : [ ]
},
"process" : {
"timestamp" : 1324472103660,
"open_file_descriptors" : 1257
},
"jvm" : {
"timestamp" : 1324472103660,
"uptime" : "15 hours, 58 minutes, 35 seconds and 124
milliseconds",
"uptime_in_millis" : 57515124,
"mem" : {
"heap_used" : "2.3gb",
"heap_used_in_bytes" : 2524820736,
"heap_committed" : "4.9gb",
"heap_committed_in_bytes" : 5340397568,
"non_heap_used" : "46mb",
"non_heap_used_in_bytes" : 48276152,
"non_heap_committed" : "69.8mb",
"non_heap_committed_in_bytes" : 73248768
},
"threads" : {
"count" : 72,
"peak_count" : 86
},
"gc" : {
"collection_count" : 16609,
"collection_time" : "4 minutes, 25 seconds and 568
milliseconds",
"collection_time_in_millis" : 265568,
"collectors" : {
"ParNew" : {
"collection_count" : 16600,
"collection_time" : "4 minutes, 24 seconds and 807
milliseconds",
"collection_time_in_millis" : 264807
},
"ConcurrentMarkSweep" : {
"collection_count" : 9,
"collection_time" : "761 milliseconds",
"collection_time_in_millis" : 761
}
}
}
},
"network" : {
},
"transport" : {
"server_open" : 14,
"rx_count" : 243057,
"rx_size" : "452.2mb",
"rx_size_in_bytes" : 474228725,
"tx_count" : 321262,
"tx_size" : "4.1gb",
"tx_size_in_bytes" : 4459909980
},
"http" : {
"current_open" : 8,
"total_opened" : 7208
}
},
"fxJdAyuKSZeLRQ0nrBybGg" : {
"name" : "Bantam",
"indices" : {
"store" : {
"size" : "1.7gb",
"size_in_bytes" : 1889182815
},
"docs" : {
"count" : 557472,
"deleted" : 114386
},
"indexing" : {
"index_total" : 18576863,
"index_time" : "1.1h",
"index_time_in_millis" : 4074696,
"index_current" : 5,
"delete_total" : 0,
"delete_time" : "0s",
"delete_time_in_millis" : 0,
"delete_current" : 0
},
"get" : {
"total" : 91,
"time" : "34ms",
"time_in_millis" : 34,
"exists_total" : 43,
"exists_time" : "16ms",
"exists_time_in_millis" : 16,
"missing_total" : 48,
"missing_time" : "18ms",
"missing_time_in_millis" : 18,
"current" : 0
},
"search" : {
"query_total" : 20060,
"query_time" : "11.4m",
"query_time_in_millis" : 686600,
"query_current" : 5,
"fetch_total" : 15146,
"fetch_time" : "2.4m",
"fetch_time_in_millis" : 148005,
"fetch_current" : 1
},
"cache" : {
"field_evictions" : 0,
"field_size" : "4.4mb",
"field_size_in_bytes" : 4674888,
"filter_count" : 2,
"filter_evictions" : 0,
"filter_size" : "92.5kb",
"filter_size_in_bytes" : 94760
},
"merges" : {
"current" : 5,
"current_docs" : 214061,
"current_size" : "55.8mb",
"current_size_in_bytes" : 58566804,
"total" : 55,
"total_time" : "5.6m",
"total_time_in_millis" : 341870,
"total_docs" : 2010928,
"total_size" : "527.5mb",
"total_size_in_bytes" : 553224978
},
"refresh" : {
"total" : 7365,
"total_time" : "7.4m",
"total_time_in_millis" : 447627
},
"flush" : {
"total" : 2977,
"total_time" : "13.4m",
"total_time_in_millis" : 804327
}
},
"os" : {
"timestamp" : 1324472120587,
"uptime" : "-1 seconds",
"uptime_in_millis" : -1000,
"load_average" : [ ]
},
"process" : {
"timestamp" : 1324472120587,
"open_file_descriptors" : 1438
},
"jvm" : {
"timestamp" : 1324472120589,
"uptime" : "15 hours, 58 minutes, 25 seconds and 155
milliseconds",
"uptime_in_millis" : 57505155,
"mem" : {
"heap_used" : "4.9gb",
"heap_used_in_bytes" : 5317828824,
"heap_committed" : "4.9gb",
"heap_committed_in_bytes" : 5340397568,
"non_heap_used" : "44.4mb",
"non_heap_used_in_bytes" : 46654016,
"non_heap_committed" : "68.6mb",
"non_heap_committed_in_bytes" : 71991296
},
"threads" : {
"count" : 94,
"peak_count" : 96
},
"gc" : {
"collection_count" : 15361,
"collection_time" : "7 minutes, 38 seconds and 876
milliseconds",
"collection_time_in_millis" : 458876,
"collectors" : {
"ParNew" : {
"collection_count" : 15245,
"collection_time" : "6 minutes, 6 seconds and 39
milliseconds",
"collection_time_in_millis" : 366039
},
"ConcurrentMarkSweep" : {
"collection_count" : 116,
"collection_time" : "1 minute, 32 seconds and 837
milliseconds",
"collection_time_in_millis" : 92837
}
}
}
},
"network" : {
},
"transport" : {
"server_open" : 14,
"rx_count" : 242790,
"rx_size" : "4.1gb",
"rx_size_in_bytes" : 4459575446,
"tx_count" : 250201,
"tx_size" : "452mb",
"tx_size_in_bytes" : 474044111
},
"http" : {
"current_open" : 3,
"total_opened" : 7074
}
}
}
}
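
The figures that matter in the dump above are easy to lose among the rest. As a reading aid, here is a small Python sketch that pulls out heap usage, transport traffic and old-generation GC counts for the two nodes. The byte counts are copied from the stats above; the dictionary layout and helper names are made up for illustration and are not part of any Elasticsearch API.

```python
# Values copied from the node stats above; layout and helper names are illustrative only.
nodes = {
    "Nut": {
        "heap_used": 2524820736,
        "heap_committed": 5340397568,
        "tx_bytes": 4459909980,
        "rx_bytes": 474228725,
        "cms_collections": 9,
    },
    "Bantam": {
        "heap_used": 5317828824,
        "heap_committed": 5340397568,
        "tx_bytes": 474044111,
        "rx_bytes": 4459575446,
        "cms_collections": 116,
    },
}


def heap_fraction(stats):
    """Fraction of the committed heap that is currently in use."""
    return stats["heap_used"] / stats["heap_committed"]


def tx_rx_ratio(stats):
    """Bytes sent over the transport layer per byte received."""
    return stats["tx_bytes"] / stats["rx_bytes"]


for name, stats in nodes.items():
    print(
        f"{name}: heap {heap_fraction(stats):.1%}, "
        f"tx/rx {tx_rx_ratio(stats):.2f}, "
        f"CMS collections {stats['cms_collections']}"
    )
```

Read this way, Bantam is using almost all of its committed heap and has run far more ConcurrentMarkSweep collections than Nut, which would be consistent with a node running out of memory. Nut has sent roughly nine times as much data as it has received.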


(Craig Brown) #2

Greg, have you checked/increased open file handle limits for your machine?
ES/Lucene tend to require lots of file handles. I believe that more indices
require more file handles. I had a case last night where I was trying to
index 1m records and it hung at about 340K because it ran out of file
handles. Once I increased the open file handle limit, I was able to index
1m without problems.

  • Craig


(Karussell) #3

Greg, have you checked/increased open file handle limits for your machine?

First, check/post your logs. If too many files were open, ES would log that.

Peter.


(Shay Banon) #4

My guess is that the problem is with creating so many types, which ends up
being a large overhead in the system. Each time a type is introduced, it
needs to be broadcast to the rest of the nodes and persisted as part of
the cluster metadata. Can you try just indexing into the same type as a
test and see if it still happens?


(Greg Brown) #5

Checking through the logs, there is no mention of running out of file
handles; the errors I am running into are out-of-memory errors on the
heap space.

Shay,

Thanks, will give that a try and let you know. Will have to wait until
after the weekend so I can set up a development cluster. I've brought
down the production cluster a few too many times this week, and it's
time to be more careful. :)

Thanks for the fast responses.
-Greg


(Greg Brown) #6

Indexing all data to a single type did work fine (3.3mil docs) as
expected.

I submitted a bug
(https://github.com/elasticsearch/elasticsearch.github.com/issues/134)
on the large number of types
because I was able to get the server to become unresponsive even when
there was only a single server and I tried to add many types.

For the moment I am going ahead with using all of the documents in a
single index. However, this significantly reduces query performance
compared to having a separate type for each set of documents. I looped
and profiled the following queries on the larger sets of documents
(10k-70k): https://gist.github.com/1586723 This was all run on a
single server, and the query from a different machine.

The first query has each set of docs in its own Type. On average it
took about 11 ms to complete.

The second has all of the docs in one index with a field pd_id to
distinguish the sets. The query uses the facet_filter and averages
~190 ms.

The third uses the same index as the second, but uses a query to do
the "filtering" of the docs. ~140 ms. I was surprised that this was
faster than the facet_filter.

Any suggestions on how to improve the last two queries?

Any ideas on how to create multiple types without creating 10k
separate indices? In this case all I am using the Type for is a
partitioning/grouping of multiple separate indices, since the Mapping
of each Type is identical.

Thanks for the help.
-Greg


(Shay Banon) #7

Just add the "type" as a field to the doc, and filter by it.
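
A minimal sketch of what that could look like at indexing time, written in Python for illustration only. It builds a newline-delimited bulk body in which every document goes into one shared type and carries its form identifier in a field. The field name pd_id and the index name pd-test-0 come from elsewhere in this thread; the type name "response", the sample documents and the helper name are assumptions.

```python
import json


def bulk_lines(index, doc_type, docs):
    """Build a newline-delimited bulk body: one action line plus one source line per doc."""
    lines = []
    for doc in docs:
        # Action line: every doc goes to the same index and the same shared type.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the form identifier travels inside the document itself.
        lines.append(json.dumps(doc))
    # The bulk format expects a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical sample responses; "response" is an assumed shared type name.
responses = [
    {"pd_id": "1001", "q1": "great service"},
    {"pd_id": "1002", "q1": "too slow"},
]
body = bulk_lines("pd-test-0", "response", responses)
print(body, end="")
```

Because no new type is introduced per form, the mapping does not have to change for each new form.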


(Greg Brown) #8

That's what I did. Functionally works, but it is 10x slower to query
using either a query to filter or a facet_filter. Is there another
way? According to the docs: "search filters restrict only returned
documents — but not facet counts"
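
To make the distinction in that docs quote concrete, here is a hedged Python sketch that builds the two request bodies side by side. It assumes the facets syntax of the Elasticsearch versions discussed in this thread, where a top-level "filter" narrows only the returned hits and a "facet_filter" inside the facet narrows what the facet counts. The ID value "1001" and the helper name are made up.

```python
import json


def terms_facet(field, size=100, facet_filter=None):
    """Build a terms facet, optionally restricted by its own facet_filter."""
    facet = {"terms": {"field": field, "size": size}}
    if facet_filter is not None:
        facet["facet_filter"] = facet_filter
    return facet


# Hypothetical form ID; pd_id is the field name used in this thread.
form_filter = {"term": {"pd_id": "1001"}}

# Top-level filter: narrows the returned hits, but the facet still counts every doc.
hits_only = {
    "size": 0,
    "query": {"match_all": {}},
    "filter": form_filter,
    "facets": {"q1": terms_facet("q1")},
}

# facet_filter: narrows what this particular facet counts.
facet_scoped = {
    "size": 0,
    "query": {"match_all": {}},
    "facets": {"q1": terms_facet("q1", facet_filter=form_filter)},
}

print(json.dumps(facet_scoped, indent=2))
```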


(Shay Banon) #9

10x slower than types? That makes little sense, since a type, at the end of
the day, is just a field called _type in a document, and when you search
within a type, the query you provide is simply wrapped in a filtered query
with a filter on the type. So you can do it yourself: just wrap your query
in a filtered query with a filter on your "type".
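
A minimal sketch of that wrapping, in Python for illustration. The helper name restrict_to, the example user query and the ID value are assumptions; only the filtered-query shape (a "query" plus a "filter") is taken from the description above.

```python
import json


def restrict_to(query, field, value):
    """Wrap any query in a filtered query with a term filter on the given field."""
    return {
        "filtered": {
            "query": query,
            "filter": {"term": {field: value}},
        }
    }


# Hypothetical user query and form ID, used only to show the wrapping.
user_query = {"term": {"q1": "slow"}}
body = {"size": 0, "query": restrict_to(user_query, "pd_id", "1001")}

print(json.dumps(body, indent=2))
```

The original query is passed through untouched, so the same helper works whether the inner query is match_all or something more specific.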


(Greg Brown) #10

Ah, I see what you are saying. But I am totally flummoxed as to how to
formulate the query to get a filtered query that matches the
performance of the default filtered query used for the type.

I think what you are suggesting is:
curl -XGET "${SERVER}/pd-test-0/_search?pretty" \
  -d '{
  "size" : 0,
  "query" : {
    "filtered" : {
      "query" : { "match_all" : { } },
      "filter" : {
        "term" : { "pd_id" : "'"${ID}"'" }
      }
    }
  },
  "facets" : {
    "q1" : {
      "terms" : {
        "field" : "q1",
        "size" : 100
      }
    }
  }
}'

Is the match_all query correct there? This query takes about 125 ms vs 13
ms for a query on the type (curl -XGET "${SERVER}/pd-0/${ID}/_search").

From what I can gather from the Java code this should mostly match
that query, but I don't know the code well enough. Is there an easy way
to enable logging that would let me compare the structure of the
parsed queries for debugging this?

Thanks for all the help, Shay! Much appreciated.
-Greg


(Shay Banon) #11

Yes, what you posted is exactly the query that you end up with when
searching against a type. Maybe you just didn't let it do its caching bit?
How fast is the second or third execution using the same pd_id? (The term
filter result is cached.)
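
A rough way to check this is to run the same request a few times and compare the "took" values. The sketch below is only one way to script that. extract_took is a hypothetical helper that pulls the took field out of a response with sed, and the commented loop assumes SERVER is set and that a file named query.json (also an assumption) holds the request body.

```shell
#!/bin/sh
# Hypothetical helper: read a search response on stdin and print the
# value of its "took" field (milliseconds).
extract_took() {
  sed -n 's/.*"took"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Example with a canned response fragment:
printf '{\n  "took" : 13,\n  "timed_out" : false\n}\n' | extract_took

# Against a live node (assumes SERVER is set and query.json holds the body):
# for i in 1 2 3; do
#   curl -s -XGET "${SERVER}/pd-test-0/_search?pretty" -d @query.json | extract_took
# done
```

If the filter result is being cached, the later runs would be expected to report a lower value than the first.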


(Greg Brown) #12

No matter how many times I repeat that query I am getting "took" :
125, whereas the type query gives "took" : 13. I also built a filtered
query using _type and that completes in 13 ms also.

Could my mapping for this field be the culprit? I am doing:

	'pd_id' => array( 'type' => 'long', 'store' => 'yes', 'index' => 'not_analyzed' )

Each document has an integer as an id. I used long as the storage type
to reduce memory, but does this not work with the term filter?

I tried setting _cache to true in the filter to see if that forced the
caching, but then I get the correct number of total hits, but no facet
results:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 11588,
    "max_score" : 1.0,
    "hits" : [ ]
  }
}

Should be:
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 11588,
    "max_score" : 1.0,
    "hits" : [ ]
  },
  "facets" : {
    "q1" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 11720,
      "other" : 56,
      "terms" : [ {
        "term" : "adopt",
        "count" : 11475
      }, {
        "term" : "adoption",
        "count" : 39
      }, {
etc...


(Shay Banon) #13

Regarding the caching, you did not show an example of how you tried to set
it, so I can't help, but the _cache is probably set in the wrong place. In
any case, you don't need to set it, as the term filter is cached by default.
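
For reference, a sketch of where _cache would sit if it were set explicitly, assuming the filter DSL of this era: it goes inside the term filter object, next to the field being filtered, rather than beside "filter" or at the top level of the request. This is a guess at what was intended, since the original request was not posted. build_cached_filter is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print a term filter on pd_id with _cache set explicitly.
# _cache sits inside the "term" object, next to the field being filtered.
build_cached_filter() {
  cat <<EOF
{
  "term" : {
    "pd_id" : ${1},
    "_cache" : true
  }
}
EOF
}

build_cached_filter 123
```

The printed fragment would take the place of the "filter" value in the filtered query. Because the term filter is cached by default, setting the flag to true should not change behavior.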

Regarding the field type of pd_id, you define it as numeric, which is fine.
Note that _type is a string, but it does not really matter that much since
we are caching the filter results. Note, though, that long is 64-bit signed
and integer is 32-bit signed.
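
To make the size difference concrete, here is a small sketch that checks whether an id fits in a 32-bit signed integer field. fits_in_integer is a hypothetical helper, the sample ids are illustrative, and the arithmetic assumes a shell with 64-bit arithmetic (the usual case on current systems).

```shell
#!/bin/sh
# Largest value a 32-bit signed integer can hold: 2^31 - 1.
INT_MAX=$(( (1 << 31) - 1 ))
echo "integer max: ${INT_MAX}"

# Hypothetical check: does a given pd_id fit in an "integer" field?
fits_in_integer() {
  [ "$1" -ge $(( -INT_MAX - 1 )) ] && [ "$1" -le "$INT_MAX" ]
}

fits_in_integer 11588 && echo "11588 fits in integer"
fits_in_integer 3000000000 || echo "3000000000 needs long"
```

If every pd_id stays within that range, mapping the field as integer rather than long would be the smaller of the two choices.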

I don't really understand where this change is coming from, it makes very
little sense. You can dropbox the ES data directory you are working with,
and two sample curl queries, one against type and one against pd_id, and I
can have a look.


(system) #14