New to ES – count words when indexing documents


(Isabella) #1

Hi.

I am new to ES and I have to write a simple plugin that counts the words
in every document that is indexed. This word count should then be added
to the document in a new field. The jdbc-river-plugin synchronizes my
database with elasticsearch. My documents look like this:

"text": "Text to analyze"

So I only have one field with a text. Afterwards it should look like this:

"text": "text to analyze"
"wordcount":"3"

I have already tried to write a token filter to count the words of my field
"text", but unfortunately that didn't work.

How can elasticsearch recognize that new documents are being indexed, and
how can I add a new field “wordcount” to the document?

Thanks for your help!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Nik Everett) #2

I'm working on this on and off and should have something in ES for it in the next few weeks. If you search for 'term count' in the GitHub issues you should find it.



(Isabella) #3

Thanks for your answer.

Is there a tutorial on how to write such a token filter?
And how can I add a field to a document?



(Ivan Brusic) #4

Why did your custom token filter not work?

An analyzer applies token filters to each term returned by the tokenizer.
Many token filters expand the number of tokens, such as the ngram and
synonym token filters. Decide whether you want to count the tokens before or
after the other token filters. That said, an analyzer works on a single
field and you want to create a new field. I would pre-tokenize the text on
the indexing side and create the field at that point.
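
As a rough illustration of pre-tokenizing on the indexing side, the sketch below counts the words in client code and adds the result to the document before it is sent to elasticsearch. It is a minimal sketch under assumptions, not how elasticsearch itself analyzes text: the class WordCountPreprocessor, its methods, and the regex-based tokenizer are hypothetical stand-ins, and the count is stored as a number rather than a string. The toy synonym expansion is only there to show that the count depends on whether it is taken before or after other token filters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; none of these names come from elasticsearch.
final class WordCountPreprocessor {

    /** Splits on runs of non-letter, non-digit characters (a rough stand-in for a tokenizer). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Toy synonym expansion: emits each token followed by its synonyms, if any. */
    static List<String> expandSynonyms(List<String> tokens, Map<String, List<String>> synonyms) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token);
            out.addAll(synonyms.getOrDefault(token, Collections.<String>emptyList()));
        }
        return out;
    }

    /** Returns a copy of the document with an added "wordcount" field (counted before any expansion). */
    static Map<String, Object> withWordCount(Map<String, Object> doc) {
        Map<String, Object> out = new LinkedHashMap<>(doc);
        Object text = doc.get("text");
        int count = (text == null) ? 0 : tokenize(text.toString()).size();
        out.put("wordcount", count);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("text", "Text to analyze");
        System.out.println(withWordCount(doc)); // prints {text=Text to analyze, wordcount=3}

        List<String> tokens = tokenize("Text to analyze");
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("analyze", Collections.singletonList("analyse"));
        System.out.println(tokens.size());                           // prints 3 (before expansion)
        System.out.println(expandSynonyms(tokens, synonyms).size()); // prints 4 (after expansion)
    }
}
```

The map returned by withWordCount is what would then be serialized and indexed, so the new field ends up in the stored document as well as in the index.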

Cheers,

Ivan



(Isabella) #5

Hi Ivan,

Thanks for your answer.

> Why did your custom token filter not work?

Unfortunately I don’t know why my token filter doesn’t work. I tried to
implement a simple token filter (at the moment it does nothing at all),
just to try some logging. But it seems the token filter is never even called;
at least I don’t get any logging output.

> I would pre-tokenize the text on the indexing side and create the field at
> that point.

Which Java class or module do I need for this pre-tokenization? Do I have
to use an AnalysisModule, like this?

    public void onModule(AnalysisModule module) {
        module.addProcessor(new WordCountAnalysisBinderProcessor());
    }

And then create a TokenFilterFactory and the token filter, or is there
another way to implement a token filter?

Unfortunately I am really new to ES, so I have a few other questions.

I don’t know how to create a connection to my elasticsearch server. Do I
have to use a NodeBuilder to get a Client, or is there another way to
connect to the server?

And how can I create a new field to store my word count? Do I have to use an
XContentBuilder, or is there another way?

Thanks for your help!

Best regards,
Isabella



(Jörg Prante) #6

Take a look at my plugins
https://github.com/jprante/elasticsearch-langdetect/ and, for token
filtering, https://github.com/jprante/elasticsearch-analysis-baseform

In langdetect, I do not count words but I detect the language of a field
and add the detected language codes into a sub-field named "lang".

In the baseform analysis, I use a token filter to inject new tokens into
the token graph.

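For adding fields, you have to use the ES field mapper. The version for
langdetect is here:

https://github.com/jprante/elasticsearch-langdetect/blob/master/src/main/java/org/xbib/elasticsearch/index/mapper/langdetect/LangdetectMapper.java

There are many plugins you can borrow code from; for example, I started
by studying the attachment mapper.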

Jörg



(Isabella) #7

Thanks for your help.

I have one question regarding the ES field mapper. Is it possible to make
the field visible in the document? I tested your plugin langdetect,
and the sub-field "lang" is searchable but not visible in your articles. Do
you know how to make the field visible in the document? Do I have to
configure it in the mapping?

Thanks a lot.

Isabella



(Isabella) #8

Hi!

Does nobody know how to make a field visible in the document?

Currently, I can only search for a document by a given number of
words. However, I want to make the field "wordcount" visible, so that I can
see it in the _source field, below my "text" field. Is this possible?
Currently, I am using an ES field mapper, but I can't find any documentation
on how to make this field visible in the document.

This is what I see in the _source field when I search for a document:

"_source" : { "text" : "Test text" }

And this is how it should look:

"_source" : { "text" : "Test text",
              "wordcount" : "2" }

I would be really grateful for any help.
Thanks a lot.

Best regards,
Isabella


