Ann: Elasticsearch Index Termlist Plugin

jprante · March 25, 2012, 11:16am

Hi,

I almost forgot to release a new plugin, the Termlist plugin, but here it is.
You can find it here:

GitHub - jprante/elasticsearch-index-termlist: Elasticsearch Index Termlist

This plugin extends Elasticsearch with a term list capability. Term lists
can be generated from a single index, or even from all of the indexes in the
cluster.

Getting the list of all indexed terms is useful for various purposes, for
example:

- building dictionaries
- checking the overall effect of analyzers on the indexed terms
- automatic query building on indexed terms, e.g. for load tests
- input to linguistic analysis tools
- other post-processing of the indexed terms outside of Elasticsearch

Example of getting the term list of the index "test":

curl -XPUT 'http://localhost:9200/test/'
curl -XPUT 'http://localhost:9200/test/test/1' -d '{ "test": "Hello World" }'
curl -XPUT 'http://localhost:9200/test/test/2' -d '{ "test": "Hello Jörg Prante" }'
curl -XPUT 'http://localhost:9200/test/test/3' -d '{ "message": "elastic search" }'
curl -XGET 'http://localhost:9200/test/_termlist'

{"ok":true,"_shards":{"total":5,"successful":5,"failed":0},"terms":["hello","prant","world","elastic","search","jorg"]}
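
To make concrete what "the list of all indexed terms" means, here is a toy sketch in plain Java. It is not the plugin's code and it does not use Lucene: the class name TermListSketch and the method termList are made up for illustration, and the tokenization (split on non-letters, lowercase) is an assumed simplification. In particular it does no stemming or ASCII folding, so it will not reproduce terms like "prant" or "jorg" from the response above.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Toy illustration only: not the plugin's code and not Lucene's analysis chain.
class TermListSketch {

    // Splits each text on runs of non-letter characters, lowercases the
    // tokens, and returns the distinct tokens in sorted order.
    static Set<String> termList(List<String> texts) {
        Set<String> terms = new TreeSet<>();
        for (String text : texts) {
            for (String token : text.split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    terms.add(token.toLowerCase(Locale.ROOT));
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // The o-umlaut is written as a Unicode escape to avoid
        // source-encoding surprises.
        List<String> texts = List.of(
                "Hello World",
                "Hello J\u00f6rg Prante",
                "elastic search");
        System.out.println(termList(texts));
    }
}
```

For the three sample documents this yields six distinct terms (elastic, hello, jörg, prante, search, world). "hello" occurs in two documents but is listed once, which is the point of a term list.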

Have a nice weekend,

Jörg

kimchy · March 25, 2012, 12:49pm

Heya,

Had a very quick look at the code. You can register a custom action by
having the plugin be called with the ActionModule (via an onModule method
on the plugin), and then registering the action on the ActionModule itself.

vineeth_mohan · March 25, 2012, 4:53pm

Can we use this plugin in the following way?

I want to see which words are present in the documents matched by a
particular query.
I also want to know how frequently each word occurs.

For example, I want to know the hot topic in the news between date X and
date Y. I could run a range query, look at the occurrences of the words,
and see which word comes up most often.

Thanks
Vineeth

jprante · March 26, 2012, 6:52pm

Thanks Shay,

I dug deeper and found out that something like

public void onModule(ActionModule module) {
    module.registerAction(TermlistAction.INSTANCE, TransportTermlistAction.class);
}

in the plugin class seems to be all I need to do, indeed.

Jörg

jprante · March 26, 2012, 7:15pm

Hi,

On Sunday, March 25, 2012 6:53:32 PM UTC+2, Vineeth Mohan wrote:

> Can we use this plugin in the following way
>
> I want to see what all words are present in the matched documents
> for a particular query.

Unfortunately, dumping term lists is a long way from evaluating queries
and tracing all the participating fields. But I added a way to restrict
the dump to a field: as of version 1.1.0, it is possible to dump the terms
of a given field in a given index.

> Also I want to know the frequency of occurrence of the word.

The frequency available in the Lucene IndexReader is the "document
frequency" of a term. This looks like a very expensive call to make for
each term, so I'm hesitant to implement it. See also the javadoc of
org.apache.lucene.search.similar.MoreLikeThis (Lucene 3.5.0 API).

> Like I want to know the hot topic on news between date X and Y.
> I can run a range query, see the occurrence of the words and see which is
> the word that comes up the most.

If index names or field names can be organized to represent time periods,
the term list dump could be used to dump all the terms that are of
interest. Have you considered the percolator feature? It looks like the
percolator also comes close to a "hot topics" feature.
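
As a rough sketch of the time-period idea, assume one term list has been dumped per period (for instance, one index per month) and loaded into a set. Terms that show up in the later period but not in the earlier one are then candidates for "new" topics. The class name PeriodTermsSketch, the method newTerms, and the sample terms are all hypothetical, and because a plain term list carries no counts, this only finds new terms, not the most frequent ones.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for comparing term lists dumped for two time periods.
class PeriodTermsSketch {

    // Returns the terms that are in the current period's term list
    // but not in the previous period's term list, in sorted order.
    static Set<String> newTerms(Set<String> previous, Set<String> current) {
        Set<String> result = new TreeSet<>(current);
        result.removeAll(previous);
        return result;
    }

    public static void main(String[] args) {
        // Made-up term lists, standing in for two per-month dumps.
        Set<String> march = Set.of("election", "budget");
        Set<String> april = Set.of("election", "budget", "eclipse");
        System.out.println(newTerms(march, april)); // prints [eclipse]
    }
}
```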

Jörg

jprante · March 26, 2012, 7:42pm

Just another hint: with Lucene 4.0, it will be possible to get the total
term frequency, see

https://issues.apache.org/jira/browse/LUCENE-2862

So, let's look forward to that.
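
To illustrate the difference between the two numbers, here is a small sketch in plain Java that computes both over an in-memory list of already tokenized documents. It does not call Lucene; the class and method names are invented for this example. Document frequency counts how many documents contain a term, while total term frequency counts every occurrence.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: plain Java over in-memory token lists, no Lucene calls.
class TermFrequencySketch {

    // Document frequency: number of documents containing the term at least once.
    static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : docs) {
            // De-duplicate within a document so each document counts once.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Total term frequency: number of occurrences across all documents.
    static Map<String, Long> totalTermFrequency(List<List<String>> docs) {
        Map<String, Long> ttf = new TreeMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                ttf.merge(term, 1L, Long::sum);
            }
        }
        return ttf;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("hello", "world"),
                List.of("hello", "hello", "jorg"));
        System.out.println(documentFrequency(docs));  // {hello=2, jorg=1, world=1}
        System.out.println(totalTermFrequency(docs)); // {hello=3, jorg=1, world=1}
    }
}
```

In the sample, "hello" appears in two documents but three times overall, so the two counts differ. For a "most frequent word" question, the total term frequency is probably the more relevant number.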

Jörg
