Ok,
I have deleted everything, restarted the ES service on both nodes, and
rebuilt the indexes. I have changed my shard count to 1 -- thank you for
explaining that; I didn't realize the overhead was as large as it is. I do
have some indexes that are much larger than others; the largest one has
around 400k issues. Search performance seems fine with 1 shard though, so
I'll leave it at that. I can introduce a per-project shard count setting
if needed later on (see the example below).
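If I do go that route, my understanding is it would just mean creating the
bigger project's index with an explicit shard count at creation time,
something like this (the index name and the count of 2 are placeholders,
not necessarily what I'd use):

curl -XPUT 'http://localhost:9200/bigproject' -d '{
  "settings": { "number_of_shards": 2, "number_of_replicas": 1 }
}'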
Results:
My search count issue appears to be resolved; it was probably related to my
large shard count.
However, the slow queries every 10 minutes remain. They're even still on
the same schedule -- always at a wall clock time ending in 9 minutes, e.g.
12:29pm. Here is the result of a hot_threads call I ran while the slow
searches were running (I replaced machine names and IPs with tokens for
obscurity).
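The call itself was just the stock hot_threads API with its defaults,
roughly this (host and port are placeholders):

curl 'http://localhost:9200/_nodes/hot_threads'

The output: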
::: [][TbKES0BpQ---dAP1J9kKKQ][][inet[/:9300]]
Hot threads at 2015-04-21T19:29:27.841Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
::: [][aPof4TmzQlm7Ck7frfGeuA][][inet[/:9300]]
Hot threads at 2015-04-21T19:29:26.250Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
0.0% (0s out of 500ms) cpu usage by thread 'Attach Listener'
unique snapshot
unique snapshot
unique snapshot
unique snapshot
unique snapshot
unique snapshot
unique snapshot
unique snapshot
unique snapshot
unique snapshot
Here is the slow search log for the same timeframe (the first number is the
time my tester took to get the response, the second is the .took field from
the ES response JSON):
4/21/2015, 12:29:27 PM 8586 8361
4/21/2015, 12:29:27 PM 9681 9454
4/21/2015, 12:29:27 PM 7685 7457
There's nothing in the ES logs after the indexes were initially created, and
what is there is just INFO-type entries like this:
[2015-04-21 10:48:05,029][INFO ][cluster.metadata ] []
updating number_of_replicas to [1] for indices []
It's so interesting that it happens every 10 minutes and always at x9
minutes, even though I've restarted ES. There could still be an external
reason for this, but like I said, there's no spike in CPU usage or disk I/O
during these periods, so it's very strange. I have no leads other than
these slow search times.
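One other thing I'm considering is turning on the ES search slow log so the
nodes record slow queries on their side too; if I'm reading the docs right
it would be something like this (the threshold is just an example):

curl -XPUT 'http://localhost:9200/_all/_settings' -d '{
  "index.search.slowlog.threshold.query.warn": "2s"
}'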
Thanks so much for your help. Any other troubleshooting ideas would be
greatly appreciated.
-Dave
On Tuesday, April 21, 2015 at 12:12:38 AM UTC-7, David Pilato wrote:
Some notes. You are using the defaults, so you have 5 shards per index:
with 200 indices that is 1000 primary shards. With 1 replica that is 2000
shards in total, which means roughly 1000 shards per node -- roughly 1000
Lucene instances per node.
First thing to do IMO is to use only one shard per index unless you need
more for "big" indices.
Then, have 3 nodes and set minimum_master_nodes to 2. You probably ran
into a split-brain issue, which could explain the difference you are seeing.
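For example, something like this in elasticsearch.yml on each node:

discovery.zen.minimum_master_nodes: 2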
I would probably set replicas to 0 and then back to 1 again. But if you
follow my first advice you have to reindex anyway, so reindex with 1 shard
and 0 replicas, then set replicas to 1.
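Roughly like this (the index name is a placeholder):

curl -XPUT 'http://localhost:9200/myproject' -d '{
  "settings": { "number_of_shards": 1, "number_of_replicas": 0 }
}'
# ... reindex the documents ...
curl -XPUT 'http://localhost:9200/myproject/_settings' -d '{
  "index": { "number_of_replicas": 1 }
}'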
My 2 cents
David
On Apr 21, 2015, at 08:31, Dave Reed <infin...@gmail.com> wrote:
The logs are basically empty. There's activity from when I was creating the
indexes, but that's about it. Is there a logging level that could be
increased? I will run hot_threads as soon as I can in the morning and post
the results.
I have each index set to 1 replica. If it matters, I first import them
with 0 replicas and then set it to 1 when they are done. I will have to
check the shard count and get back to you, but it's whatever the default
is; we haven't tweaked that.
I have 200+ indexes because I have 200+ different "projects" represented,
and each one has its own set of mappings, which could collide on field
names. I originally tried having a single index with each project
represented by a Type instead, but a conversation I had on this forum led
me away from that, because two different projects may sometimes have the
same field name with different types. Most searches (99.9%) are done on a
per-project basis, and changes to a project can create a need to reindex
that project, so having the segregation is nice, lest I have to reimport
the entire thing.
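For example (the field name here is made up), one project might map a field
as a string while another maps the same field name as an integer, which
would clash if they shared an index:

Project A: { "issue": { "properties": { "priority": { "type": "string" } } } }
Project B: { "issue": { "properties": { "priority": { "type": "integer" } } } }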
In case this is related, I also found that a document was changed and
reindexed, but searches would sometimes include it and sometimes not. I
could literally just refresh the search over and over again, and it would
appear in the results roughly 50% of the time: 0 results, then 0, 0, then 1
result, 1, 1, then 0 again, etc. I was running the search against one of
the nodes via _head. The cluster health was green during all of this. That
was surprising and something I was going to investigate more on my own, but
perhaps these two problems are related.
I'm really hitting the limit of what I know how to troubleshoot with ES,
so I'm really hoping for help here.
On Monday, April 20, 2015 at 10:23:46 PM UTC-7, David Pilato wrote:
Could you run a hot_threads API call when this happens?
Anything in logs about GC?
BTW 200 indices is a lot for 2 nodes. And how many shards/replicas do you
have?
Why do you need so many indices for 2m docs?
David
On Apr 21, 2015, at 01:16, Dave Reed <infin...@gmail.com> wrote:
I have a 2-node cluster running on some beefy machines with 12 GB and 16 GB
of heap space. About 2.1 million documents, each relatively small in size,
spread across 200 or so indexes. The refresh interval is 0.5s (I don't need
real-time, but I do need relatively quick refreshes). Documents are
continuously modified by the app, so reindex requests trickle in
constantly; by trickle I mean maybe a dozen a minute. All index requests
are made with _bulk, although a lot of the time there's only one document
in the list.
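For reference, a typical single-document request looks roughly like this
(index, type, and field names are made up):

curl -XPOST 'http://localhost:9200/_bulk' --data-binary '{ "index": { "_index": "someproject", "_type": "issue", "_id": "123" } }
{ "title": "Example issue", "status": "open" }
'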
Searches are very fast -- normally taking 50ms or less.
But oddly, exactly every 10 minutes, searches slow down for a moment. The
exact same query that normally takes <50ms takes 9000ms, for example, and
any other query, regardless of what it is, also takes multiple seconds to
complete. Once this moment passes, search queries return to normal.
I have a tester I wrote that continuously posts the same query and logs
the results, which is how I discovered this pattern.
Here's an excerpt. Notice that query time is great at 3:49:10, then at
:11 things stop for 10 seconds. At :21 the queued-up searches finally come
through. The numbers reported are the "took" field from the ES search
response. Then things resume as normal. This is true no matter which node I
run the search against.
This pattern has repeated every 10 minutes, to the second, for days now.
3:49:09, 47
3:49:09, 63
3:49:10, 31
3:49:10, 47
3:49:11, 47
3:49:11, 62
3:49:21, 8456
3:49:21, 5460
3:49:21, 7457
3:49:21, 4509
3:49:21, 3510
3:49:21, 515
3:49:21, 1498
3:49:21, 2496
3:49:22, 2465
3:49:22, 2636
3:49:22, 6506
3:49:22, 7504
3:49:22, 9501
3:49:22, 4509
3:49:22, 1638
3:49:22, 6646
3:49:22, 9641
3:49:22, 655
3:49:22, 3667
3:49:22, 31
3:49:22, 78
3:49:22, 47
3:49:23, 47
3:49:23, 47
3:49:24, 93
I've ruled out any obvious periodic process running on either node in
the cluster; it's pretty clean. Disk I/O, CPU, RAM, etc. stay pretty
consistent and flat even during one of these blips. These are beefy
machines like I said, so there's plenty of CPU and RAM available.
Any advice on how I can figure out what ES is waiting for would be
helpful. Is there any process ES runs every 10 minutes by default?
Thanks!
-Dave