Bulk indexing, how to prevent flooding?

I am aware of Apache mod_qos, but I'm rather wary of using it for
ES indexing. It's not that I expect shaping traffic on port 9300
between client and server to cause unpredictable behavior; it's just
the amount of extra work on top of the ES programming.

My use case is highly-structured data with well-defined
characteristics, a lot of fields, short values, but only a few
reasonable types of boolean queries. For query control, I have some
methods prepared, one is query language translation, i.e. mapping from
a simple query language domain into an ES DSL query, so I am able to
add constraints to ES requests easily.
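
The translation step described above could look roughly like the
following sketch. The toy query syntax (`field:value` terms joined by
`AND`), the class name `SimpleQueryTranslator`, and the `maxSize` cap
are assumptions made for illustration and not part of any ES API; only
the `bool`/`must`/`term` structure and the `size` parameter come from
the ES query DSL.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical translator: "title:foo AND year:2011" -> ES bool query JSON.
final class SimpleQueryTranslator {

    // Upper bound for "size" that the proxy enforces (assumed policy).
    private final int maxSize;

    SimpleQueryTranslator(int maxSize) {
        this.maxSize = maxSize;
    }

    // Builds a search body and clamps the requested size to [0, maxSize].
    String translate(String query, int requestedSize) {
        List<String> clauses = new ArrayList<>();
        for (String part : query.trim().split("\\s+AND\\s+")) {
            int colon = part.indexOf(':');
            if (colon <= 0 || colon == part.length() - 1) {
                throw new IllegalArgumentException("expected field:value, got: " + part);
            }
            String field = part.substring(0, colon).trim();
            String value = part.substring(colon + 1).trim();
            clauses.add("{\"term\":{" + quote(field) + ":" + quote(value) + "}}");
        }
        int size = Math.max(0, Math.min(requestedSize, maxSize));
        return "{\"size\":" + size
                + ",\"query\":{\"bool\":{\"must\":[" + String.join(",", clauses) + "]}}}";
    }

    // Minimal JSON string escaping: quote, backslash and control characters.
    private static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }
}
```

Because the client never writes DSL directly, a constraint such as the
size cap is applied in exactly one place.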

True, there will be latency problems when sorting large result sets. I
know this can also overwhelm the ES cluster in certain situations.
I know of no solution for this class of problems, so a reverse
proxy will have to know about some 'evil queries' and will have to cut
off 'evil' sort requests in an ad-hoc manner. Other candidates for
evil queries are some weird wildcard searches.
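
An ad-hoc cut-off of this kind might be sketched as follows. Everything
here is hypothetical: `SearchRequestInfo` stands for whatever the proxy
manages to extract from an incoming request, and both thresholds are
placeholders that would have to be tuned per use case. The checks are
crude heuristics, since the real cost of a sort depends on how many
documents match, which the proxy cannot know in advance.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical summary of a search request, as extracted by the proxy.
final class SearchRequestInfo {
    final int from;
    final int size;
    final boolean sorted;
    final List<String> wildcardPatterns;

    SearchRequestInfo(int from, int size, boolean sorted, List<String> wildcardPatterns) {
        this.from = from;
        this.size = size;
        this.sorted = sorted;
        this.wildcardPatterns = wildcardPatterns;
    }
}

// Rejects requests that match simple "evil query" heuristics.
final class EvilQueryFilter {
    private final int maxSortedWindow;   // max from + size when a sort is requested
    private final int minWildcardPrefix; // min literal chars before the first * or ?

    EvilQueryFilter(int maxSortedWindow, int minWildcardPrefix) {
        this.maxSortedWindow = maxSortedWindow;
        this.minWildcardPrefix = minWildcardPrefix;
    }

    // Returns a rejection reason, or empty if the request may pass.
    Optional<String> reject(SearchRequestInfo r) {
        if (r.sorted && (long) r.from + r.size > maxSortedWindow) {
            return Optional.of("sorted result window too large");
        }
        for (String pattern : r.wildcardPatterns) {
            if (literalPrefixLength(pattern) < minWildcardPrefix) {
                return Optional.of("wildcard pattern too broad: " + pattern);
            }
        }
        return Optional.empty();
    }

    // Number of characters before the first wildcard character.
    static int literalPrefixLength(String pattern) {
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '*' || c == '?') {
                return i;
            }
        }
        return pattern.length();
    }
}
```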

Full tenant isolation requires authentication in addition to quotas.
And I need to provide "search views" via the reverse proxy. A "search
view" provides styles for viewing the same ES documents in different
ways: only a subset of fields, an XML representation via Atom feeds,
etc. But that's a different topic from QoS.

Jörg

On Apr 21, 7:32 pm, David Williams williams.da...@gmail.com wrote:

The reverse proxy could limit the size of the uploads & the number of
concurrent requests per IP to some reasonable level.
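
As a rough illustration of such a check, here is a minimal sketch of an
admission gate a proxy could run before forwarding a request. The class
name `PerIpLimiter` and both limits are assumptions; a real proxy would
also need to evict idle entries and cope with clients behind a shared
address.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;

// Hypothetical admission check: caps body size and concurrent requests per IP.
final class PerIpLimiter {
    private final long maxBodyBytes;
    private final int maxConcurrentPerIp;
    // One semaphore per client address; never evicted in this sketch.
    private final ConcurrentMap<String, Semaphore> inFlight = new ConcurrentHashMap<>();

    PerIpLimiter(long maxBodyBytes, int maxConcurrentPerIp) {
        this.maxBodyBytes = maxBodyBytes;
        this.maxConcurrentPerIp = maxConcurrentPerIp;
    }

    // Returns true if the request is admitted. The caller must call
    // release(ip) exactly once when an admitted request has finished.
    boolean tryAcquire(String ip, long contentLength) {
        if (contentLength < 0 || contentLength > maxBodyBytes) {
            return false; // unknown or oversized upload
        }
        Semaphore permits =
                inFlight.computeIfAbsent(ip, k -> new Semaphore(maxConcurrentPerIp));
        return permits.tryAcquire(); // non-blocking: reject instead of queueing
    }

    void release(String ip) {
        Semaphore permits = inFlight.get(ip);
        if (permits != null) {
            permits.release();
        }
    }
}
```

Rejecting immediately rather than queueing keeps the proxy itself from
accumulating open connections when a client misbehaves.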

But it's not just bulk indexing you'd need to worry about; there
are different kinds of searches you'd have to worry about too (sorting
large numbers of results being the most obvious one). It's difficult
for Elasticsearch to determine reasonable values for any of these
without adding lots and lots of complexity to it. So your best bet, in
my opinion, would be to determine what reasonable limits are for
your use case and have the proxy enforce those limits. Fool-proof
multi-tenancy with public access is going to require you to write
intelligence into the proxy anyway to enforce tenant isolation &
security. Extending it to add per-user resource quotas is only a
little bit harder.

-david

On Thu, Apr 21, 2011 at 7:16 AM, jprante joergpra...@gmail.com wrote:

Hi,

how can I prevent bad clients from flooding an Elasticsearch cluster,
especially when using bulk indexing?

Imagine remote indexing by a TransportClient with bad habits, i.e. it
ignores the messages in ActionListener and continues to
submit bulk index requests.
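
For contrast, a well-behaved client would bound the number of bulk
requests it has in flight. The sketch below shows the pattern with plain
JDK types: the `sender` function is a stand-in for the asynchronous bulk
call, and `BoundedBulkSubmitter` is a hypothetical name. With the ES
client, the permit would be returned in the `ActionListener` callbacks
for both success and failure. This does nothing against an
uncooperative client, which is why server-side protection is still
needed.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Sketch of a cooperative client: at most maxInFlight requests outstanding.
final class BoundedBulkSubmitter<Q, R> {
    private final Semaphore permits;
    // Stand-in for the asynchronous bulk call (assumption, not an ES API).
    private final Function<Q, CompletableFuture<R>> sender;

    BoundedBulkSubmitter(int maxInFlight, Function<Q, CompletableFuture<R>> sender) {
        this.permits = new Semaphore(maxInFlight);
        this.sender = sender;
    }

    // Blocks while maxInFlight requests are outstanding. The permit is
    // returned when the response or the failure arrives.
    CompletableFuture<R> submit(Q request) {
        permits.acquireUninterruptibly();
        CompletableFuture<R> future;
        try {
            future = sender.apply(request);
        } catch (RuntimeException e) {
            permits.release(); // the request was never sent
            throw e;
        }
        return future.whenComplete((response, failure) -> permits.release());
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```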

I tried exercising a cluster that way, and sooner or later the RHEL6
server of the Elasticsearch master node starts to drown. Even with a
max open files setting of more than 60,000, the JDK starts to report
too many open files, obviously because of the pile of open network
connections. The shell stopped working: bash could not execute
commands and reported messages like

-bash: start_pipeline: pgrp pipe: Too many open files in system
-bash: /bin/ls: Too many open files in system

About 10 minutes after stopping the bad client, after GC'ing and
possibly working through the pile of Java exceptions, like

[2011-04-21 15:40:54,257][WARN ][netty.channel.socket.nio.NioServerSocketPipelineSink] Failed to accept a connection.
java.io.IOException: Zu viele offene Dateien im System
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
    at org.elasticsearch.common.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:244)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

("Zu viele offene Dateien im System" = "Too many open files in system")

the system became responsive again and the "too many open files"
message disappeared.

Because I would like a fool-proof multi-tenancy setup with public
access for remote indexing/search via a reverse proxy, I am very
interested in ways to prevent an Elasticsearch cluster from being
flooded via the (bulk) index API by TransportClients.

You might call this a feature request for QoS in Elasticsearch
(bulk) indexing.

Can someone give me a hint on how to implement this feature? Thank you
in advance for your kind help!

Best Regards,

Jörg