2 Node cluster hanging while bulk indexing and adding types

No matter how many times I repeat that query I am getting "took" :
125, whereas the type query gives "took" : 13. I also built a filtered
query using _type and that completes in 13 ms also.

Could my mapping for this field be the culprit? I am doing:

    'pd_id' => array( 'type' => 'long', 'store' => 'yes', 'index' => 'not_analyzed' )

Each document has an integer as an id. I used long as the storage type
to reduce memory, but does that not work with the term filter?
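
For reference, I believe that PHP array ends up as a JSON mapping roughly like the
following (a sketch only; "your_type" is a placeholder, since the actual type name
isn't shown in the thread):

curl -XPUT "${SERVER}/pd-test-0/your_type/_mapping" -d '{
  "your_type" : {
    "properties" : {
      "pd_id" : { "type" : "long", "store" : "yes", "index" : "not_analyzed" }
    }
  }
}'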

I tried setting _cache to true in the filter to see if that forced the
caching (the cached filter is sketched after the expected output below),
but then I get the correct number of total hits and no facet results:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 11588,
    "max_score" : 1.0,
    "hits" :
  }
}

Should be:
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 11588,
    "max_score" : 1.0,
    "hits" :
  },
  "facets" : {
    "q1" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 11720,
      "other" : 56,
      "terms" : [ {
        "term" : "adopt",
        "count" : 11475
      }, {
        "term" : "adoption",
        "count" : 39
      }, {
      etc...
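
For reference, this is roughly how the _cache flag mentioned above was set,
directly inside the term filter (just a fragment of the full query shown
further down):

"filter" : {
  "term" : {
    "pd_id" : "$ID",
    "_cache" : true
  }
}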

On Jan 11, 11:00 am, Shay Banon kim...@gmail.com wrote:

Yes, what you posted is exactly the query that you will end up with when
searching against a type. Maybe you just didn't let it do its caching bit?
How fast is the 2nd or 3rd execution using the same pd_id (the term filter
result is cached)?
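
A quick way to check that, assuming the same query body is saved in a file
(query.json is just a hypothetical name here), is to fire it a few times in a
row and compare the "took" values:

for i in 1 2 3 4 5; do
  curl -s -XGET "${SERVER}/pd-test-0/_search" -d @query.json | grep -o '"took" *: *[0-9]*'
done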

On Wed, Jan 11, 2012 at 6:35 PM, Greg Ichneumon Brown gbrown5...@gmail.com wrote:

Ah, I see what you are saying. But I am totally flummoxed as to how to
formulate the query to get a filtered query that matches the
performance of the default filtered query used for the type.

I think what you are suggesting is:
curl -XGET "${SERVER}/pd-test-0/_search?pretty" -d '{
  "size" : 0,
  "query" : {
    "filtered" : {
      "query" : { "match_all" : { } },
      "filter" : {
        "term" : { "pd_id" : "$ID" }
      }
    }
  },
  "facets" : {
    "q1" : {
      "terms" : {
        "field" : "q1",
        "size" : 100
      }
    }
  }
}'

Is that match_all query correct? This query takes about 125 ms vs. 13 ms
for a query on the type (curl -XGET "${SERVER}/pd-0/${ID}/_search").

From what I can gather from the Java code this should mostly match
that query, but I don't know the code well enough. Is there an easy way
to enable logging that would let me compare the structure of the
parsed queries for debugging this?
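
One option that might help here (an assumption on my part, not something I've
verified against this exact version): setting "explain" : true in the search
body returns an _explanation for each hit, which at least shows how each
document was matched and scored and can be compared between the two query forms:

curl -XGET "${SERVER}/pd-test-0/_search?pretty" -d '{
  "explain" : true,
  "size" : 1,
  "query" : {
    "filtered" : {
      "query" : { "match_all" : { } },
      "filter" : { "term" : { "pd_id" : "$ID" } }
    }
  }
}'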

Thanks for all the help, Shay! Much appreciated.
-Greg

On Jan 10, 10:50 am, Shay Banon kim...@gmail.com wrote:

10x slower than types? That makes little sense, since a type, at the end of
the day, is just a field called _type in a document, and when you search
within a type your query is simply wrapped in a filtered query with a filter
on the type. So you can do it yourself: just wrap your query in a filtered
query with a filter on your "type".
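
A sketch of that wrapping, with "your_type" as a placeholder for the actual
type name:

curl -XGET "${SERVER}/pd-test-0/_search?pretty" -d '{
  "query" : {
    "filtered" : {
      "query" : { "match_all" : { } },
      "filter" : { "term" : { "_type" : "your_type" } }
    }
  }
}'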

On Tue, Jan 10, 2012 at 5:48 PM, Greg Ichneumon Brown gbrown5...@gmail.com wrote:

That's what I did. It works functionally, but it is 10x slower to query
using either a query to filter or a facet_filter. Is there another
way? According to the docs: "search filters restrict only returned
documents — but not facet counts"
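
To illustrate that quote with the fields from this thread (a sketch, not the
exact queries from the gist): a top-level filter only trims the returned hits,
so the facet needs its own facet_filter if its counts should be restricted too:

curl -XGET "${SERVER}/pd-test-0/_search?pretty" -d '{
  "size" : 0,
  "query" : { "match_all" : { } },
  "filter" : { "term" : { "pd_id" : "$ID" } },
  "facets" : {
    "q1" : {
      "terms" : { "field" : "q1", "size" : 100 },
      "facet_filter" : { "term" : { "pd_id" : "$ID" } }
    }
  }
}'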

On Jan 10, 2:40 am, Shay Banon kim...@gmail.com wrote:

Just add the "type" as a field to the doc, and filter by it.
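
In other words, something like this at index time (a sketch; the type name
"doc", the id, and the field values are placeholders), and then filter on
pd_id at search time:

curl -XPUT "${SERVER}/pd-test-0/doc/1" -d '{
  "pd_id" : 42,
  "q1" : "adopt"
}'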

On Tue, Jan 10, 2012 at 5:45 AM, Greg Brown gbrown5...@gmail.com wrote:

Indexing all data to a single type did work fine (3.3 mil docs), as expected.

I submitted a bug (elasticsearch.github.com/issues/134) on the large number
of types, because I was able to get the server to become unresponsive even
when there was only a single server and I tried to add many types.

For the moment I am going ahead with putting all of the documents in a
single index. However, this significantly reduces query performance
compared to having a separate type for each set of documents. I looped
and profiled the following queries on the larger sets of documents
(10k-70k): https://gist.github.com/1586723 This was all run on a
single server, and the query from a different machine.

The first query has each set of docs in its own Type. On average it
took about 11 ms to complete.

The second has all of the docs in one index with a field pd_id to
distinguish the sets. The query uses the facet_filter and averages ~190 ms.

The third uses the same index as the second, but uses a query to do
the "filtering" of the docs: ~140 ms. I was surprised that this was
faster than the facet_filter.

Any suggestions on how to improve the last two queries?

Any ideas on how to create multiple types without creating 10k
separate indices? In this case all I am using the Type for is a
partitioning/grouping of multiple separate indices, since the Mapping
of each Type is identical.

Thanks for the help.
-Greg

On Dec 22 2011, 7:37 am, Greg Brown gbrown5...@gmail.com wrote:

Checking through the logs, there isn't any mention of there not being
enough file handles; the errors I am running into are out-of-memory
(heap space) errors.

Shay,

Thanks, will give that a try and let you know. I'll have to wait until
after the weekend so I can set up a development cluster. I've brought
down the production cluster a few too many times this week, and it's
time to be more careful. :)

Thanks for the fast responses.
-Greg

On Dec 21, 6:54 pm, Shay Banon kim...@gmail.com wrote:

My guess is that the problem is with creating so many types, which ends up
being a large overhead in the system. Each time a type is introduced, it
needs to be broadcast to the rest of the nodes and persisted as part of
the cluster meta data. Can you try just indexing into the same type as a
test and see if it still happens?
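
For what it's worth, a minimal sketch of indexing everything into one type via
the bulk API (the type name "doc", the ids, and the field values are
placeholders, and the body has to end with a newline):

curl -XPOST "${SERVER}/pd-test-0/doc/_bulk" --data-binary '{ "index" : { "_id" : "1" } }
{ "pd_id" : 1, "q1" : "adopt" }
{ "index" : { "_id" : "2" } }
{ "pd_id" : 1, "q1" : "adoption" }
'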

On Wed, Dec 21, 2011 at 9:08 PM, Karussell <tableyourt...@googlemail.com> wrote:

Greg, have you checked/increased the open file handle limits for your
machine?

First, check/post your logs. If too many files were open, ES would log that.

Peter.
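
For reference, on Linux the limit for the user running ES can be checked and
raised roughly like this (the "elasticsearch" user name and the 65536 value
are just examples; the permanent mechanism varies by distro):

# show the open-file limit for the current shell/user
ulimit -n

# raise it for this shell before starting elasticsearch
ulimit -n 65536

# make it permanent for the elasticsearch user, e.g. in /etc/security/limits.conf:
#   elasticsearch  soft  nofile  65536
#   elasticsearch  hard  nofile  65536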