Is there a way to completely drop incoming documents from indexing based on some criteria?

I've noticed that I occasionally need to shield my ES cluster from certain
documents, which are too numerous, too big, or otherwise poison ES.
Usually I can formulate a fairly simple query or criterion to detect those
documents, and I'm looking for a way to block them from entering the index.

Is there such a pre-indexing filtering mechanism? Maybe Transforms can be
used for that purpose?

Thank you!
Konstantin

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/1bd11432-fd57-412f-8a22-52cf5249ddf1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

The best way to do it is on the client side, I believe. You could probably
abuse transforms to just blow up when you see something you don't like. I
don't think they have the ability to manipulate the operation (to turn it
into a noop), though. If they do, there certainly aren't any tests to make
sure that doesn't break.

Nik
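In the spirit of the client-side suggestion, filtering could be done with a small predicate applied before documents are sent to ES. This is only a sketch: the size threshold, the `level` field, and the criteria themselves are hypothetical examples, not anything built into Elasticsearch.

```python
import json

# Assumed threshold for "too big" documents; tune to your cluster.
MAX_DOC_BYTES = 64 * 1024

def should_index(doc):
    """Return False for documents we want to keep out of the index.

    Both criteria below are illustrative: a serialized-size cap and a
    hypothetical log-level field used to drop overly verbose entries.
    """
    if len(json.dumps(doc).encode("utf-8")) > MAX_DOC_BYTES:
        return False
    if doc.get("level") == "TRACE":  # example criterion, not an ES concept
        return False
    return True

# Filter a batch before handing it to whatever bulk-indexing client you use.
docs = [
    {"level": "INFO", "msg": "service started"},
    {"level": "TRACE", "msg": "very noisy internal detail"},
]
to_index = [d for d in docs if should_index(d)]
```

The same predicate can sit in front of a bulk helper so rejected documents never reach the cluster at all.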

On Fri, Dec 12, 2014 at 5:11 PM, Konstantin Erman konste@gmail.com wrote:



Writing a river plugin to receive and filter documents based on whatever criteria you need would be the best approach.


Why are you putting business logic of this type in ES? It belongs in your
workflow. At the ES indexer level you will have no idea of the source of
truth of the questionable content. Unless you're web crawling, which means
you're using the wrong search platform altogether, imo.

On Friday, December 12, 2014 5:11:05 PM UTC-5, Konstantin Erman wrote:


It might be the only approach you can come up with, Telax. Don't sweat it.

On Saturday, December 13, 2014 9:43:38 AM UTC-5, Telax wrote:



I don't crawl the web; I just collect rather verbose logs from multiple
private cloud services and try to keep the ES cluster just big enough for
comfortable searching of those logs. The monitored services are under
development, and occasionally (because of bugs or specifics of the source
data) they start sending a torrent of logs orders of magnitude higher than
usual. When this happens, the ES cluster very soon becomes non-responsive
and drops logs from all services, misbehaving or not.

We cannot afford to keep a cluster sized for those peak loads (and idling
most of the time). What we need is something like denial-of-service
prevention logic: when some client goes over its log quota, it should be
blocked rather than melting the cluster down.

A river plugin looks like overkill to me, especially considering the
deprecation of rivers.
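The per-client quota described above could be sketched as a token bucket in front of the indexer. This is purely illustrative (the rates, `client_id` keys, and the `admit` helper are assumptions, not an existing ES or Logstash feature): each client earns tokens at a steady rate, bursts draw down the bucket, and a client over its quota gets its documents dropped instead of overwhelming the cluster.

```python
import time

class TokenBucket:
    """Per-client quota: allow `rate` docs/sec with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per log source; a client over quota simply gets rejected.
buckets = {}

def admit(client_id, rate=100, capacity=200):
    bucket = buckets.setdefault(client_id, TokenBucket(rate, capacity))
    return bucket.allow()
```

A misbehaving service then only loses its own excess logs; well-behaved clients keep their own buckets and keep indexing.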

On Saturday, December 13, 2014 7:33:05 PM UTC-8, BillyEm wrote:


We solve problems like this in two ways: queueing, or concurrent request
limits.

Queueing buys you retries for free and can absorb temporary shocks. You can
also get things like priority, backlog monitoring, and manual backlog
grooming. I think Logstash already supports this, but I don't know it very
well.

Concurrent request limits are more brutal: you just throw away indexing
requests if there are too many in flight. You can make it more granular by
giving each incoming application its own pool and limits. We implement
these using a simple server called poolcounter; you can find it by
searching for WMF poolcounterd.

Either way you would have to implement a small application to integrate
these. Well, maybe someone has already built the queueing one; I don't
know.
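A simplified, in-process version of the concurrent-request-limit idea might look like the sketch below. This is not the actual poolcounter protocol (that runs as a separate server); it just shows the load-shedding shape, with one limiter per source application and requests beyond the limit dropped rather than queued.

```python
import threading

class ConcurrencyLimiter:
    """Shed load: reject indexing requests beyond `limit` in flight."""

    def __init__(self, limit):
        self._sem = threading.Semaphore(limit)

    def try_acquire(self):
        # Non-blocking: returns False immediately when the pool is full.
        return self._sem.acquire(blocking=False)

    def release(self):
        self._sem.release()

# Hypothetical limit of 2 concurrent indexing requests for one application.
limiter = ConcurrencyLimiter(limit=2)

def index_request(doc):
    """Index `doc` if a slot is free; otherwise drop it."""
    if not limiter.try_acquire():
        return False  # too many in flight: throw the request away
    try:
        # The real ES client call would go here.
        return True
    finally:
        limiter.release()
```

Giving each application its own limiter keeps one runaway service from starving the others, which is the granularity point above.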

Nik
On Dec 13, 2014 11:21 PM, "Konstantin Erman" konste@gmail.com wrote:
