Elasticsearch Tokenizer (for TF-IDF)

Hi,

We would like to use Elasticsearch to generate an IDF score for each
token (for the TF-IDF algorithm).

What types of built-in tokenizers does Elasticsearch have? Should we
specify which tokenizer to use at indexing time (when inserting the
data) or when performing a search on it?

Is it also possible to make Elasticsearch use a different tokenizer (one
that I implemented myself)?

Thanks,
Lital

--
This message may contain confidential and/or privileged information.
If you are not the addressee or authorized to receive this on behalf of the
addressee you must not use, copy, disclose or take action based on this
message or any information herein.
If you have received this message in error, please advise the sender
immediately by reply email and delete this message. Thank you.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/58047432-3f73-4a55-84cd-20051ff8738f%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi,

There are several tokenizers: some split on whitespace, others
generate n-grams, etc. The built-in tokenizers are listed in the
Elasticsearch reference documentation.

It is also possible to plug in your own Lucene tokenizer through
AnalysisModule.addTokenizer.
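
To make the difference between the two tokenizer families mentioned above more concrete, here is a minimal Python sketch. It is not Elasticsearch or Lucene code; the function names and the min_gram/max_gram defaults are made up for illustration, and the real tokenizers have more options than shown here.

```python
def whitespace_tokenize(text):
    """Split text on runs of whitespace, keeping everything else intact."""
    return text.split()


def ngram_tokenize(text, min_gram=2, max_gram=3):
    """Emit every character n-gram of length min_gram..max_gram, by start position."""
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                tokens.append(text[start:start + size])
    return tokens


# whitespace_tokenize("quick brown fox") -> ['quick', 'brown', 'fox']
# ngram_tokenize("fox")                  -> ['fo', 'fox', 'ox']
```

The choice matters for TF-IDF because the tokenizer decides what counts as a "term": the same text yields three terms with the first function and many overlapping ones with the second, so document frequencies differ accordingly.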

--
Adrien Grand

Lital, why do you need Elasticsearch for this? It is going to be way easier
for you to use Lucene directly to do this.

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Author of RavenDB in Action http://manning.com/synhershko/

You can look at analysis plugins: http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/modules-plugins.html#analysis-plugins
They provide analyzers, tokenizers, …

You can probably copy one of these projects and add your own custom tokenizer.

For example: https://github.com/elasticsearch/elasticsearch-analysis-stempel

BTW, built-in tokenizers are here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-tokenizers.html

HTH

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr

Hi,

We are already using Elasticsearch, so I thought we might get the TF-IDF
algorithm for "free" from it. Another option is to implement it ourselves.
Is it easy to use the Lucene embedded in Elasticsearch for this?

Thanks,
Lital

It is actually going to be easier, since Elasticsearch only exposes the
tf/idf data in the query Explanation, in string format. If all you need
is the tf/idf, you are better off indexing the data locally and integrating
with Lucene's similarity / explanation classes yourself.

Lucene is just a dependency if you are on Java.
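
For reference, the tf and idf factors that Lucene's classic TF-IDF similarity (DefaultSimilarity in the 4.x line) computes can be sketched in a few lines of Python. This is only an illustration of those two factors, assuming the 4.x formulas; the function names are made up, and the full Lucene score also involves norms, boosts and query normalization that are left out here.

```python
import math


def classic_tf(term_freq):
    """Term-frequency factor: square root of the raw in-document count."""
    return math.sqrt(term_freq)


def classic_idf(doc_freq, num_docs):
    """Inverse document frequency: 1 + ln(num_docs / (doc_freq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def tf_idf(term_freq, doc_freq, num_docs):
    """Product of the two factors for one term in one document."""
    return classic_tf(term_freq) * classic_idf(doc_freq, num_docs)


# A term occurring 4 times in a document and present in 9 of 10 documents:
# tf = 2.0, idf = 1.0, so tf_idf(4, 9, 10) == 2.0
```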

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Author of RavenDB in Action http://manning.com/synhershko/

Depending on what you need to accomplish, the new Term Vector API in 1.0
will likely provide what you need. When you enable both field_statistics
and term_statistics, it will show you TF + DF both in your dictionary and
in the document.

-Zach
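
As a rough sketch of how the statistics from such a response could be turned into TF-IDF weights, the Python below works on a hand-written dictionary shaped like a Term Vector API response with field_statistics and term_statistics enabled. The sample numbers, the field name "text" and the helper name are invented for illustration, and the plain tf * ln(N / df) weighting is just one common variant, not what Elasticsearch uses for scoring. The statistics may also be computed per shard, so treat them as approximate.

```python
import math

# Hand-written sample (not real output), shaped like a term vector response.
sample_response = {
    "term_vectors": {
        "text": {
            "field_statistics": {"doc_count": 4, "sum_doc_freq": 9, "sum_ttf": 12},
            "terms": {
                "elasticsearch": {"term_freq": 2, "doc_freq": 1, "ttf": 2},
                "search": {"term_freq": 1, "doc_freq": 4, "ttf": 6},
            },
        }
    }
}


def tf_idf_from_term_vectors(response, field):
    """Return {term: term_freq * ln(doc_count / doc_freq)} for one document field."""
    vector = response["term_vectors"][field]
    doc_count = vector["field_statistics"]["doc_count"]
    scores = {}
    for term, stats in vector["terms"].items():
        idf = math.log(doc_count / stats["doc_freq"])
        scores[term] = stats["term_freq"] * idf
    return scores


scores = tf_idf_from_term_vectors(sample_response, "text")
# "search" appears in every document, so its weight is 0.0;
# "elasticsearch" appears in one of four documents and gets 2 * ln(4).
```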

Hi,

You can also get raw term statistics stored in the index, such as doc
frequency, term frequency etc., within a script (>= 0.90.10).

You can use this information to calculate your own score. If you want
to use a native script, there are also examples of (very simple)
implementations of common scoring functions (tf-idf, cosine and
language model).

If you also need the field lengths for scoring, you can access them by
defining a field of type token_count.

Cheers,
Britta
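
To give an idea of what a score built from such raw statistics might look like, here is a small Python sketch of TF-IDF weighting followed by cosine similarity. It is not the native script code referred to above and does not use any Elasticsearch API; the function names and the tf * ln(N / df) weighting are assumptions chosen for illustration.

```python
import math


def tf_idf_vector(term_freqs, doc_freqs, num_docs):
    """Weight each term's raw frequency by ln(num_docs / doc_freq)."""
    return {
        term: freq * math.log(num_docs / doc_freqs[term])
        for term, freq in term_freqs.items()
    }


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)
```

A query and a document would each be turned into a vector with tf_idf_vector (using the same document frequencies) and then compared with cosine_similarity; field lengths, if needed, would enter as an extra normalization factor.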
