Alerting based on new docs based on saved users queries

Hello,
We have a system where we check every new document that comes in
against a large number of saved user queries. We currently do this by
running <USER_QUERY> AND <NEWDOCUMENT>. It isn't the most
efficient setup, since for every doc we need to run every saved user
query.

So, there are two questions:

  1. Within ES, is there a different approach that could be taken to
    streamline this?
  2. When running the USER_QUERY AND NEWDOC query, what kind of caching
    makes the most sense? From my limited understanding, I think it'd make
    sense to filter-cache both the USER_QUERY and the NEWDOC query, since
    both of these should get repeated multiple times.

Thanks!
Paul

Hey,

On Thu, Nov 4, 2010 at 5:36 PM, Paul ppearcy@gmail.com wrote:

Hello,
We have a system where we check every new document that comes in
against a large number of saved user queries. We currently do this by
running <USER_QUERY> AND <NEWDOCUMENT>. It isn't the most
efficient setup, since for every doc we need to run every saved user
query.

What exactly is the NEWDOCUMENT part? How do you create it?

So, there are two questions:

  1. Within ES, is there a different approach that could be taken to
    streamline this?

There are different ways to implement this internally, certainly with lower
overhead than what you are doing now. One of the main problems here
is the messaging aspect, which brings additional requirements of its own.
Can messages be lost? Are duplicates allowed? If a client that is registered
to receive notifications is down, do docs need to be replayed for it? And
then, of course, there is actually implementing the notification protocol.

  2. When running the USER_QUERY AND NEWDOC query, what kind of caching
    makes the most sense? From my limited understanding, I think it'd make
    sense to filter-cache both the USER_QUERY and the NEWDOC query, since
    both of these should get repeated multiple times.

That depends on what NEWDOC actually is. Is it a term query on the
doc id?

Hey,
Thanks for the follow-up, and sorry for not being totally clear. Here
is an explicit example:
A user saves a search for "elasticsearch AND cool" to be alerted on.
Every time a new document comes in, we run every registered user query
against that doc. So when a new doc with id 1 comes in, we end up
with:
(elasticsearch AND cool) AND (_id:1)

I'm considering wrapping each part in a filter query for the least
overhead. At the moment these are two separate query strings, but it'd
be easy to update on my side to make the _id match a term query.

Like I said, not efficient, but simple to understand and work with.

If it matches we fire an alert.

I believe that any of the alerting should be done outside the ES
realm, and I think (I really haven't looked, we have some custom setup)
the alerting could be handled with a RabbitMQ setup or similar.
What would be cool, though I have no idea how feasible it is, would be to:

  1. Register queries against an index
  2. When a document is indexed, have the response list which
    registered queries the document matched
  3. Let the application logic handle any alerting or other actions
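
To make the three steps above concrete, here is a minimal in-memory sketch. It is only a toy model of the idea, not an Elasticsearch API: `QueryRegistry`, `register`, `index`, and the `matched_queries` key are hypothetical names, and it only understands plain "term AND term" queries.

```python
import re
from typing import Dict, List, Set


class QueryRegistry:
    """Toy stand-in for 'register queries against an index' (hypothetical)."""

    def __init__(self) -> None:
        # query_id -> terms that must ALL appear in a document (AND semantics)
        self._queries: Dict[str, Set[str]] = {}

    def register(self, query_id: str, query: str) -> None:
        # Step 1: only "a AND b AND c" strings are handled in this sketch.
        terms = {t.lower() for t in re.split(r"\s+AND\s+", query.strip()) if t}
        self._queries[query_id] = terms

    def index(self, doc_id: str, text: str) -> Dict[str, object]:
        # Step 2: tokenize the new doc once, then test every registered query.
        tokens = set(re.findall(r"\w+", text.lower()))
        matched: List[str] = sorted(
            qid for qid, terms in self._queries.items() if terms <= tokens
        )
        return {"_id": doc_id, "matched_queries": matched}


registry = QueryRegistry()
registry.register("paul-alert", "elasticsearch AND cool")

# Step 3: the application decides what to do with the matches.
result = registry.index("1", "Elasticsearch is cool")
for query_id in result["matched_queries"]:
    print(f"alert {query_id} fired for doc {result['_id']}")
```

The point of the shape is that the document is analyzed once and the saved queries are checked against it, rather than one search per saved query against the whole index.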

Thanks,
Paul

Hi,

Regarding the filtering option, I suggest wrapping the alerting query in
a filter and expressing the _id match as a term query. Make sure to
prefix the id field with the type name (my_type._id) to give it type
isolation. There's no need to cache the _id part.
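
As a rough sketch of what such a request body might look like, the snippet below builds it as a plain Python dict. The `filtered`, `fquery`, and `_cache` names follow the older Elasticsearch query DSL as I understand it and are assumptions here, not something confirmed in this thread; `build_alert_check` is a hypothetical helper. Check the DSL against the version you actually run.

```python
import json
from typing import Any, Dict


def build_alert_check(user_query: str, doc_type: str, doc_id: str) -> Dict[str, Any]:
    """Build a search body that hits only if doc `doc_id` satisfies `user_query`."""
    return {
        "query": {
            "filtered": {
                # Term match on the type-prefixed id field (e.g. "my_type._id").
                # It differs for every new doc, so it is left uncached.
                "query": {"term": {f"{doc_type}._id": doc_id}},
                # The saved alerting query wrapped as a filter; "_cache" asks
                # for the filter result to be cached (assumed DSL, see above).
                "filter": {
                    "fquery": {
                        "query": {"query_string": {"query": user_query}},
                        "_cache": True,
                    }
                },
            }
        }
    }


body = build_alert_check("elasticsearch AND cool", "my_type", "1")
print(json.dumps(body, indent=2))
```

The saved query is the part that repeats for every incoming doc, so that is the part worth caching; the _id term is different each time.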

Returning, for each indexed document, whether it matched is a good idea.
The problem with how elasticsearch works now is that the relevant shard
has to be refreshed before the indexed doc is visible for search.
Another option is to index the doc into a small in-memory index and run
the queries against that, but this means indexing the doc twice (at least
with the way things currently work).

This feature certainly requires some more thinking :).

-shay.banon

Paul,

Can you run these queries periodically, every x seconds, instead of once
for each new event? You'd need to track the last processed document and use
that in the query. Something like "elasticsearch AND cool" AND docId > 5
(5 being the id of the last processed doc). There may be one or more
matches, and you'd take action for each matching doc.
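
A minimal sketch of that polling loop, assuming doc ids are numeric and strictly increasing (an assumption; the thread doesn't say how ids are assigned). `poll_once` and the lambda matcher are hypothetical stand-ins for a real search with a docId > last-seen clause.

```python
from typing import Any, Callable, Dict, List, Tuple

Doc = Dict[str, Any]


def poll_once(
    docs: List[Doc],
    saved_queries: Dict[str, Callable[[Doc], bool]],
    last_seen_id: int,
) -> Tuple[List[Tuple[str, int]], int]:
    """Check saved queries against docs newer than last_seen_id.

    Returns (alerts, new_watermark); alerts are (query_name, doc_id) pairs.
    """
    # Equivalent of the "docId > last_seen_id" clause.
    new_docs = [d for d in docs if d["id"] > last_seen_id]
    alerts = [
        (name, d["id"])
        for d in new_docs
        for name, matches in saved_queries.items()
        if matches(d)
    ]
    # Advance the watermark only as far as the docs we actually saw.
    new_watermark = max((d["id"] for d in new_docs), default=last_seen_id)
    return alerts, new_watermark


docs = [
    {"id": 5, "text": "old elasticsearch cool post"},
    {"id": 6, "text": "elasticsearch is cool"},
    {"id": 7, "text": "unrelated"},
]
queries = {
    "es-cool": lambda d: {"elasticsearch", "cool"} <= set(d["text"].lower().split()),
}

alerts, watermark = poll_once(docs, queries, last_seen_id=5)
print(alerts, watermark)
```

Each saved query then runs once per interval over a batch of new docs, instead of once per doc.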

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

Thanks for the feedback and ideas. There is definitely lots of room
for improvement in how we are matching, and there are trade-offs between
the number of registered user queries and the doc volume, as well as
in the amount of state that needs to be stored.

FWIW, another potential option in ES is to provide a callback to
notify the app via Thrift or HTTP, but I have pretty much no idea what
I'm talking about here, in terms of feasibility or ES internals :)

Thanks,
Paul

On Fri, Nov 5, 2010 at 7:26 PM, Paul ppearcy@gmail.com wrote:

Thanks for the feedback and ideas. There is definitely lots of room
for improvement in how we are matching, and there are trade-offs between
the number of registered user queries and the doc volume, as well as
in the amount of state that needs to be stored.

FWIW, another potential option in ES is to provide a callback to
notify the app via Thrift or HTTP, but I have pretty much no idea what
I'm talking about here, in terms of feasibility or ES internals :)

It is possible, everything is possible :). The point is that once ES
provides callbacks, it gets into the messaging aspect: what happens if
a client disconnects, does ES provide reliable notifications by replaying
missed ones, and so on. This is very different from returning the
matched queries on index operations (simpler to implement, more burden on
the client).
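
To illustrate why callbacks pull in messaging concerns, here is a small sketch of the bookkeeping a server would need just to replay notifications to a client that was away. It is purely illustrative and not an ES feature; `NotificationLog` and its methods are hypothetical names.

```python
from typing import Dict, List, Tuple

Notification = Tuple[str, str]  # (query_id, doc_id)


class NotificationLog:
    """Append-only log with a per-client acknowledged offset (hypothetical)."""

    def __init__(self) -> None:
        self._entries: List[Notification] = []
        self._acked: Dict[str, int] = {}  # client -> count of acked entries

    def publish(self, query_id: str, doc_id: str) -> None:
        self._entries.append((query_id, doc_id))

    def pending(self, client: str) -> List[Notification]:
        # Everything not yet acknowledged, including entries published
        # while the client was disconnected.
        return self._entries[self._acked.get(client, 0):]

    def ack(self, client: str, count: int) -> None:
        # Advance the client's offset, never past the end of the log.
        current = self._acked.get(client, 0)
        self._acked[client] = min(current + count, len(self._entries))


log = NotificationLog()
log.publish("es-cool", "1")
log.publish("es-cool", "2")

first_batch = log.pending("paul")   # both notifications delivered
log.ack("paul", 1)                  # client crashes after acking only one

log.publish("es-cool", "3")         # arrives while the client is down
replayed = log.pending("paul")      # doc 2 is redelivered along with doc 3
print(replayed)
```

Even this toy version has to store per-client state and accept duplicate delivery (doc 2 is sent twice), which is the extra burden that returning matched queries in the index response avoids.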
