Classification with percolator

Arthur_Denning · January 21, 2014, 3:01pm

I am considering using the percolator API to classify documents: I would
register queries such as "football" and "art" with the percolator, and when
new documents are added, the percolator should return the matching tags. My
concern is this: if there are thousands of tags to identify this way, would
it be a performance nightmare? Are thousands of queries implicitly run
behind the scenes?

And what would be the recommended way to tackle this kind of
classification problem in Elasticsearch?

It seems that Lucene has a classification API. Is it already integrated
anywhere in Elasticsearch? Is there any roadmap concerning its
implementation?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8cd363be-5c9b-4b10-925c-fb4f1de4d4c3%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Binh_Ly · January 22, 2014, 12:27am

Arthur,

You should be able to use filters in your percolator queries; for example,
you can use a term/terms filter. Also, in ES 1.0 you can shard the index
that holds the percolator queries, so that the percolation load is
distributed across shards for better scalability. The best way is to
experiment with it; see the documentation on the Elastic website.

I actually worked for a company that did content classification this way,
and the percolator was a perfect fit for that use case.
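
As a rough illustration of the filter-based approach described above, the sketch below builds the request path and JSON body for registering one tag as a percolator query. The field name `body`, the keyword list, the extra `tag` metadata field, and the helper names are all hypothetical; the `constant_score`/`terms` wrapping follows the ES 1.x query DSL and should be checked against the version in use. No HTTP call is made, so it runs without a cluster.

```python
import json


def tag_query(tag, keywords, field="body"):
    """Build the body of a percolator query that matches any of `keywords`.

    `field`, `keywords` and the stored `tag` value are illustrative
    assumptions, not something taken from the thread.
    """
    return {
        "query": {
            "constant_score": {
                # terms filter: matches when the field contains any listed term
                "filter": {"terms": {field: sorted(keywords)}}
            }
        },
        # Extra metadata stored next to the query; it can be used later to
        # narrow down which queries are percolated.
        "tag": tag,
    }


def registration_path(index, tag):
    # In ES 1.0, percolator queries live under the reserved `.percolator` type.
    return "/{}/.percolator/{}".format(index, tag)


if __name__ == "__main__":
    print("PUT", registration_path("my-index1", "football"))
    print(json.dumps(tag_query("football", ["football", "soccer"]), indent=2))
```

Note that a terms filter compares against indexed terms, so on an analyzed field the keywords would normally need to be lowercase.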


Arthur_Denning · January 22, 2014, 10:12am

Hey Binh, thanks a lot. It is really nice to hear from someone with
practical experience on this. Is it correct to say that if I had a thousand
tags, I would need to make a thousand calls like

curl -XPUT 'localhost:9200/my-index1/.percolator/tagname1'

to register each tag? In your implementation, were there any pitfalls or
nice tricks worth noting?


Binh_Ly · January 22, 2014, 5:28pm

Arthur,

I am assuming that you will define a query/rule for each tag, so in your
case yes, that would be the way to define the percolator queries.
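
Registering one query per tag does not have to mean writing each request by hand. As a sketch, the loop below turns a tag-to-keywords mapping into one `PUT` path and body per tag. The mapping, the field name `body`, and the helper name are made up for illustration, and the HTTP call itself is left out so the snippet runs without a cluster.

```python
import json

# Hypothetical rules: one entry per tag, each with the words that imply it.
TAG_RULES = {
    "football": ["football", "soccer"],
    "art": ["painting", "sculpture"],
}


def registration_requests(index, rules, field="body"):
    """Yield (path, json_body) pairs, one per tag, intended for PUT requests."""
    for tag, words in sorted(rules.items()):
        path = "/{}/.percolator/{}".format(index, tag)
        body = {
            "query": {
                "bool": {
                    # At least one of the per-word match clauses must hit.
                    "should": [{"match": {field: w}} for w in words],
                    "minimum_should_match": 1,
                }
            }
        }
        yield path, json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    for path, body in registration_requests("my-index1", TAG_RULES):
        print("PUT", path, body)
```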

A couple of things you might want to be aware of:

1. Percolation is CPU intensive.
2. The fewer queries you percolate against, the better. So when you call
the percolate API, see if you can also pass in query criteria to limit the
queries to percolate against.
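
The advice about limiting the queries to percolate against can be sketched as follows. The snippet builds a percolate request body that carries a filter alongside the document, so that only registered queries whose stored metadata matches are evaluated. It assumes each query was registered with an extra `category` field, which is a hypothetical name; the `doc` and `filter` keys follow the ES 1.0 percolate API and should be verified against the version in use.

```python
import json


def percolate_body(doc, category=None):
    """Build the body for a request to /<index>/<type>/_percolate.

    `category` is a hypothetical metadata field stored with each registered
    query; when given, only queries carrying that value are percolated.
    """
    body = {"doc": doc}
    if category is not None:
        body["filter"] = {"term": {"category": category}}
    return body


if __name__ == "__main__":
    doc = {"body": "The football final ended in a draw"}
    print(json.dumps(percolate_body(doc, category="sports"), indent=2))
```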

