Classification with percolator


(Arthur Denning) #1

I am considering using the percolator API to classify documents: I would
register queries such as "football" and "art" with the percolator, and when
new documents are added, the percolator should return the matching tags. My
concern is that, with thousands of tags to identify this way, it could
become a performance nightmare. Are thousands of queries implicitly running
behind the scenes?

What would be the recommended way to tackle this kind of classification
problem in Elasticsearch?

It seems that Lucene has a classification API. Is it already integrated
anywhere in Elasticsearch? Is there a roadmap for its integration?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8cd363be-5c9b-4b10-925c-fb4f1de4d4c3%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Binh Ly) #2

Arthur,

You should be able to use filters in your percolator queries; for example,
you can use a term/terms filter. Also, in ES 1.0 you can shard the index
that holds the percolator queries, so that the percolation load is
distributed across shards for better scalability. The best way is to
experiment with it:
http://www.elasticsearch.org/downloads/1-0-0-RC1.
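
As a rough sketch of the two suggestions above, the snippet below builds the JSON bodies only and does not contact a cluster. It assumes the ES 1.x query DSL (a `filtered` query wrapping a `terms` filter); the field name `body`, the tag terms, and the shard count are made up for illustration and are not from this thread.

```python
import json


def tag_query(field, terms):
    """Build a percolator document whose query is a terms filter (ES 1.x DSL).

    `field` and `terms` are hypothetical; adapt them to your own mapping.
    """
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"terms": {field: list(terms)}},
            }
        }
    }


def index_settings(shards):
    """Index creation body; percolator queries live in this index,
    so its shard count is what spreads the percolation load."""
    return {"settings": {"number_of_shards": shards}}


body = tag_query("body", ["football", "soccer"])
settings = index_settings(4)  # 4 is an arbitrary example value

print(json.dumps(settings))
print(json.dumps(body, indent=2))
```

The first body would be sent when creating the index, the second when registering a tag under the `.percolator` type. Whether a terms filter is precise enough for a given tag depends on how the field is analyzed, so it is worth testing against real documents.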

I actually worked for a company that did content classification this way,
and the percolator was a perfect fit for that use-case.

In reply to Arthur Denning's message of Tuesday, January 21, 2014 10:01:36 AM UTC-5.


(Arthur Denning) #3

Hey Binh, thanks a lot. It is really nice to hear from someone with
practical experience on this. Is it correct to say that if I had a thousand
tags, I would need to make a thousand calls like

curl -XPUT 'localhost:9200/my-index1/.percolator/tagname1'

to register each tag? Were there any pitfalls or nice tricks in your
implementation that are worth noting?

In reply to Binh Ly's message of Wednesday, January 22, 2014 8:27:03 AM UTC+8.


(Binh Ly) #4

Arthur,

I am assuming that you will define a query/rule for each tag, so in your
case yes, that would be the way to define the percolator queries.
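
One query per tag does not necessarily mean one HTTP call per tag. Since percolator queries are stored as documents under the `.percolator` type, it should be possible to register many of them in a single bulk request. The sketch below only builds the newline-delimited payload; the tag names, the `body` field, and the use of the bulk endpoint for this purpose are assumptions to verify against your Elasticsearch version.

```python
import json


def bulk_register(index, tag_queries):
    """Build a bulk API payload that registers one percolator query per tag.

    `tag_queries` maps a tag name (used as the document id) to a query.
    """
    lines = []
    for tag, query in sorted(tag_queries.items()):
        # Action line: index a document of type ".percolator" with id = tag.
        action = {"index": {"_index": index, "_type": ".percolator", "_id": tag}}
        lines.append(json.dumps(action))
        # Source line: the percolator document itself.
        lines.append(json.dumps({"query": query}))
    # The bulk API expects a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical tags and field name, for illustration only.
tags = {
    "football": {"match": {"body": "football"}},
    "art": {"match": {"body": "art"}},
}
payload = bulk_register("my-index1", tags)
print(payload, end="")
```

The resulting text would be posted to the `_bulk` endpoint. For a few thousand tags it may be sensible to send it in batches rather than as one request.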

A couple of things you might want to be aware of:

  1. Percolation is CPU intensive.
  2. The fewer queries you percolate against, the better. So when you call
    the percolate API, see if you can also pass in query criteria to
    limit the queries to percolate against.
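
The second point can be sketched as follows, assuming the ES 1.0 percolate request format, where extra fields stored next to a registered query can be targeted by a `filter` in the percolate request. The `category` field and its values are hypothetical, and the snippet only builds the JSON bodies.

```python
import json


def percolator_doc(query, **metadata):
    """Percolator document with extra metadata fields stored beside the query."""
    doc = {"query": query}
    doc.update(metadata)
    return doc


def percolate_request(doc, query_filter=None):
    """Percolate request body; the optional filter narrows which
    registered queries are evaluated against `doc`."""
    body = {"doc": doc}
    if query_filter is not None:
        body["filter"] = query_filter
    return body


# Register the "football" tag with a hypothetical "category" metadata field.
registered = percolator_doc({"match": {"body": "football"}}, category="sports")

# Percolate a document against the "sports" queries only.
request = percolate_request(
    {"body": "The football final is tonight"},
    {"term": {"category": "sports"}},
)
print(json.dumps(request))
```

This only helps if something outside the percolator already knows a coarse category for the incoming document, for example its source or section; otherwise every registered query still has to be considered.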

In reply to Arthur Denning's message of Wednesday, January 22, 2014 5:12:54 AM UTC-5.

