Ann: Elasticsearch Index Termlist Plugin

Hi,

I almost forgot to release a new plugin, the Termlist plugin, but here it is.
You can find it here:

GitHub - jprante/elasticsearch-index-termlist: Elasticsearch Index Termlist

This plugin extends Elasticsearch with a term list capability. Term lists
can be generated from one or more indexes, or even from all of the indexes
in the cluster.

Getting the list of all indexed terms is useful for various purposes, for
example:

  • building dictionaries
  • checking the overall effects of analyzers on the indexed terms
  • automatic query building on indexed terms, e.g. for load tests
  • input to linguistic analysis tools
  • other post-processing of the indexed terms outside of Elasticsearch
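
As an illustration of the "automatic query building" use case, here is a minimal sketch in plain Java that turns a term list into one term query body per term. The class and method names (TermQuerySketch, buildTermQueries, escapeJson) are made up for this example, and the choice of a term query against the field "test" is only an assumption about what a load test might send.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not part of the Termlist plugin.
class TermQuerySketch {

    /** Escapes backslashes and double quotes; control characters are not handled here. */
    static String escapeJson(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds one term query body per term, all against the same (assumed) field. */
    static List<String> buildTermQueries(String field, List<String> terms) {
        List<String> bodies = new ArrayList<>();
        for (String term : terms) {
            bodies.add("{\"query\":{\"term\":{\"" + escapeJson(field) + "\":\""
                    + escapeJson(term) + "\"}}}");
        }
        return bodies;
    }

    public static void main(String[] args) {
        // Terms as they might come back from a term list request.
        List<String> terms = Arrays.asList("hello", "prant", "world");
        for (String body : buildTermQueries("test", terms)) {
            System.out.println(body);
        }
    }
}
```

Each printed line could then be sent as the body of a search request. Because the terms come straight from the index, they are already in their analyzed form, which is what a term query matches against.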

Example of getting the term list of the index "test":

curl -XPUT 'http://localhost:9200/test/'
curl -XPUT 'http://localhost:9200/test/test/1' -d '{ "test": "Hello World" }'
curl -XPUT 'http://localhost:9200/test/test/2' -d '{ "test": "Hello Jörg Prante" }'
curl -XPUT 'http://localhost:9200/test/test/3' -d '{ "message": "elastic search" }'
curl -XGET 'http://localhost:9200/test/_termlist'

{"ok":true,"_shards":{"total":5,"successful":5,"failed":0},"terms":["hello","prant","world","elastic","search","jorg"]}

Have a nice weekend,

Jörg

Heya,

I had a very quick look at the code. You can register a custom action by
giving the plugin class an onModule method that takes an ActionModule, and
then registering the action on the ActionModule itself.

On Sun, Mar 25, 2012 at 1:16 PM, Jörg Prante joergprante@gmail.com wrote:

Hi,

almost forgot to release a new plugin, the Termlist plugin, but here it
is. You can find it here:

GitHub - jprante/elasticsearch-index-termlist: Elasticsearch Index Termlist

Can we use this plugin in the following way?

  • I want to see all the words present in the documents matched by
    a particular query.
  • I also want to know how frequently each word occurs.

For example, I want to know the hot topic in the news between dates X and Y.
I could run a range query, look at the word occurrences, and see which
word occurs most often.

Thanks
Vineeth

Thanks Shay,

I dug deeper and found out that something like

    public void onModule(ActionModule module) {
        // register the custom action together with its transport implementation
        module.registerAction(TermlistAction.INSTANCE, TransportTermlistAction.class);
    }

in the plugin class seems to be all I need to do, indeed.
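
For readers unfamiliar with the onModule convention, here is a toy model of the idea in plain Java. It is only a sketch under stated assumptions: ToyActionModule, ToyRestModule, ToyTermlistPlugin and dispatch are hypothetical names invented for illustration, none of them are Elasticsearch classes, and finding the hook by reflection on the method name and parameter type is a simplification of whatever the plugin service really does.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these classes are Elasticsearch classes.
class OnModuleSketch {

    /** Hypothetical stand-in for a module that collects action registrations. */
    public static class ToyActionModule {
        public final List<String> registered = new ArrayList<>();

        public void registerAction(String actionName, Class<?> transportClass) {
            registered.add(actionName + " -> " + transportClass.getSimpleName());
        }
    }

    /** Hypothetical stand-in for some unrelated module the plugin ignores. */
    public static class ToyRestModule { }

    /** Hypothetical stand-in for the transport implementation of the action. */
    public static class ToyTransportTermlistAction { }

    /** Hypothetical plugin: it only declares a hook for ToyActionModule. */
    public static class ToyTermlistPlugin {
        public void onModule(ToyActionModule module) {
            module.registerAction("termlist", ToyTransportTermlistAction.class);
        }
    }

    /**
     * Invokes every public one-argument onModule method of the plugin whose
     * parameter type accepts the given module. Returns the number of calls made.
     */
    static int dispatch(Object plugin, Object module) {
        int calls = 0;
        for (Method m : plugin.getClass().getMethods()) {
            if (m.getName().equals("onModule")
                    && m.getParameterCount() == 1
                    && m.getParameterTypes()[0].isInstance(module)) {
                try {
                    m.invoke(plugin, module);
                } catch (ReflectiveOperationException e) {
                    throw new RuntimeException(e);
                }
                calls++;
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        ToyTermlistPlugin plugin = new ToyTermlistPlugin();
        ToyActionModule actionModule = new ToyActionModule();

        dispatch(plugin, new ToyRestModule()); // no matching hook, nothing happens
        dispatch(plugin, actionModule);        // the onModule hook registers the action

        System.out.println(actionModule.registered);
    }
}
```

The point of the pattern is that the plugin does not need to know when modules are created; it only declares which module type it wants to be handed.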

Jörg

On Sunday, March 25, 2012 2:49:00 PM UTC+2, kimchy wrote:

Heya,

Had a very quick look at the code, you can register a custom action by
having the plugin be called with ActionModule (on the plugin, with the
onModule method), and then register the action on the ActionModule itself.

Hi,

On Sunday, March 25, 2012 6:53:32 PM UTC+2, Vineeth Mohan wrote:

Can we use this plugin in the following way?

  • I want to see all the words present in the documents matched by
    a particular query.

Unfortunately, dumping term lists is a long way from evaluating
queries and tracing all the participating fields. But I added a field
restriction: in version 1.1.0, it is possible to dump the terms of a given
field in a given index.

  • I also want to know how frequently each word occurs.

The frequency available in the Lucene IndexReader is the "document
frequency" of a term, that is, the number of documents containing it.
Retrieving it looks like a very expensive call for each term, so I'm
hesitant to implement it. See also the javadoc of
org.apache.lucene.search.similar.MoreLikeThis (Lucene 3.5.0 API).

For example, I want to know the hot topic in the news between dates X and Y.
I could run a range query, look at the word occurrences, and see which
word occurs most often.

If index names or field names can be organized to represent time periods,
the term list dump could be used to dump all the terms of interest.
Have you considered the percolator feature? It looks like the
percolator also comes close to a "hot topics" feature.

Jörg

Just another hint. With Lucene 4.0, it will be possible to get the total
term frequency, see

https://issues.apache.org/jira/browse/LUCENE-2862

So, let's look forward :-)
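
To make the difference between the two frequencies concrete, here is a small self-contained sketch in plain Java. It does not use Lucene at all: the in-memory document list, the whitespace tokenizer and the names TermFrequencySketch, documentFrequency and totalTermFrequency are assumptions made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; it does not call any Lucene API.
class TermFrequencySketch {

    /** Very rough stand-in for an analyzer: lowercase, then split on whitespace. */
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).trim().split("\\s+");
    }

    /** Document frequency: number of documents containing the term at least once. */
    static int documentFrequency(List<String> docs, String term) {
        int count = 0;
        for (String doc : docs) {
            if (Arrays.asList(tokenize(doc)).contains(term)) {
                count++;
            }
        }
        return count;
    }

    /** Total term frequency: number of occurrences summed over all documents. */
    static long totalTermFrequency(List<String> docs, String term) {
        long count = 0;
        for (String doc : docs) {
            for (String token : tokenize(doc)) {
                if (token.equals(term)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("Hello World", "Hello hello Jörg");
        System.out.println(documentFrequency(docs, "hello"));  // prints 2
        System.out.println(totalTermFrequency(docs, "hello")); // prints 3
    }
}
```

For a "hot topics" style ranking, the total term frequency is usually the more telling number, since a word repeated many times in a few documents still counts fully.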

Jörg

On Monday, March 26, 2012 9:15:48 PM UTC+2, Jörg Prante wrote:

  • Also i want to know the frequency of occurrence of the word.

The frequency available in the Lucene IndexReader is the "document
frequency" of a term. This looks like a very expensive call for each
term. I'm hesitating to implement it. See also the javadoc of
org.apache.lucene.search.similar.MoreLikeThis (Lucene 3.5.0 API)
