Custom tokenizer or analyzer?

James_Cook_3 · May 5, 2012, 5:50pm

I have a locale property in my documents. It can be a simple language code
or a language/country combo.

For example:

{ locale: 'en' } or { locale: 'en_GB' }

My search truth looks like this:

                   document locale
  search locale    en     en_GB   en_US
  en               yes    yes     yes
  en_GB            yes    yes     yes
  es               no     no      no

I would like to tokenize these values in this way:
'en' => 'en'
'en_GB' => 'en', 'en_gb'

This is intended to allow me to find the correct document. I am not sure
how to integrate such a tokenization into ES. Any hints?
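
The tokenization described above can be sketched in plain Python, independent of Elasticsearch. The function names `locale_tokens` and `matches`, and the rule that a search matches a document when the two share at least one token, are assumptions made for illustration. They reproduce the truth table in this post but are not ES's own analysis code.

```python
def locale_tokens(locale):
    """Return the lowercased language token plus, when a country part is
    present, the lowercased full locale: 'en_GB' -> ['en', 'en_gb']."""
    value = locale.strip().lower()
    language = value.split("_", 1)[0]
    # Avoid emitting the same token twice when there is no country part.
    return [language] if value == language else [language, value]


def matches(search_locale, document_locale):
    """Assumed matching rule: a hit needs at least one shared token."""
    shared = set(locale_tokens(search_locale)) & set(locale_tokens(document_locale))
    return bool(shared)


if __name__ == "__main__":
    # Print one truth-table row per search locale.
    for search in ("en", "en_GB", "es"):
        row = [matches(search, doc) for doc in ("en", "en_GB", "en_US")]
        print(search, row)
```

Because both the search value and the document value go through the same function, a search for 'en_GB' still finds 'en_US' documents through the shared 'en' token, which is what the table asks for.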

Of course this is just matching, not sorting at this moment. For sorting, I
would want the most specific match to score higher.

Clinton_Gormley · May 7, 2012, 9:02am

Hi Jim

On Sat, 2012-05-05 at 10:50 -0700, James Cook wrote:

I have a locale property in my documents. It can be a simple language
code or a language/country combo.

{ locale: 'en' } or { locale: 'en_GB' }

I would like to tokenize these values in this way:
'en' => 'en'
'en_GB' => 'en', 'en_gb'

I think the easiest way to achieve this is by using a multi-field:

https://gist.github.com/clintongormley/2626785

# Create your mapping
# -------------------

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "mappings" : {
      "test" : {
         "properties" : {
            "lang" : {
               "fields" : {

[The excerpt is truncated here in the archive; the full file is in the gist linked above.]
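
For the custom-analyzer route that the thread title asks about, one option is Elasticsearch's built-in `path_hierarchy` tokenizer with `_` as the delimiter, followed by the `lowercase` token filter. That combination should emit `en` and `en_gb` for `en_GB`. This is an assumption about a workable alternative, not the content of the truncated gist. The sketch below only builds the index-settings body in Python; the names `locale_path` and `locale_hierarchy` are made up, and the field mapping that would reference the analyzer is left out because its syntax differs between ES versions.

```python
import json


def locale_analysis_settings():
    """Build an index-settings body defining a locale analyzer.

    'locale_path' and 'locale_hierarchy' are hypothetical names chosen
    for this sketch; 'path_hierarchy' and 'lowercase' are built-in ES
    analysis components.
    """
    return {
        "settings": {
            "analysis": {
                "tokenizer": {
                    # Splits 'en_GB' into the prefixes 'en' and 'en_GB'.
                    "locale_path": {"type": "path_hierarchy", "delimiter": "_"}
                },
                "analyzer": {
                    "locale_hierarchy": {
                        "type": "custom",
                        "tokenizer": "locale_path",
                        # Lowercases each prefix emitted by the tokenizer.
                        "filter": ["lowercase"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # The printed JSON could be sent as the body of an index-creation request.
    print(json.dumps(locale_analysis_settings(), indent=2))
```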

Of course this is just matching, not sorting at this moment. For
sorting, I would want the most specific match to score higher.

In the gist above, I show one way that you could sort your results.
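
The sorting approach in the gist is not visible in the truncated excerpt, so the following is only a plain-Python sketch of one way to put the most specific match first. It mimics a multi-field by storing the locale twice, once as the language and once as the exact lowercased value, and awards one point per matching sub-field. The sub-field names `language` and `exact` and the scoring weights are assumptions.

```python
def index_locale(locale):
    """Store a locale under two hypothetical sub-fields, as a multi-field would."""
    value = locale.strip().lower()
    return {"language": value.split("_", 1)[0], "exact": value}


def score(search_locale, document_locale):
    """Return 0 for no match, 1 for a language match, 2 for an exact match."""
    query = index_locale(search_locale)
    doc = index_locale(document_locale)
    if query["language"] != doc["language"]:
        return 0
    return 2 if query["exact"] == doc["exact"] else 1


def rank(search_locale, document_locales):
    """Return matching locales, highest score first; ties keep input order."""
    scored = [(score(search_locale, d), d) for d in document_locales]
    hits = [pair for pair in scored if pair[0] > 0]
    # sorted() is stable, so equally scored documents stay in input order.
    return [d for _, d in sorted(hits, key=lambda pair: -pair[0])]


if __name__ == "__main__":
    print(rank("en_GB", ["en", "en_US", "en_GB", "es"]))
```

With these weights a search for 'en_GB' ranks the 'en_GB' document first and leaves 'en' and 'en_US' tied behind it; a different weighting would be needed to prefer the plain 'en' document over other countries.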

clint

James_Cook_3 · May 7, 2012, 1:01pm

That is so cool, thanks.

I gotta be honest, after using ES for two years now and building three
projects that use ES as their datastore, I know very little about the
Lucene side (mapping and querying) of things. I hope a book is forthcoming,
as this is really hard and somewhat unapproachable for those of us without
the Lucene experience. An example like yours would be an awesome cookbook
entry. Don't know why we haven't made an ES cookbook section on the website.

Thanks again.

Ivan · May 7, 2012, 6:15pm

Well, if you lack knowledge on the Lucene side, then there already is
a great book on the subject: Lucene in Action, Second Edition

The analysis/tokenization chapters are very good. Some of the book is
already outdated, but the analysis part remains the same. Lucene does
have a documentation problem: there is very little on the website, and the
contrib libs have almost no documentation. The mailing list, however, is
good and not very chatty (much less than this one), so don't be afraid
to sign up.

An ES book would be great, but I see the system as still being a
moving target. Plus, if Shay were to write a book, that would mean less
time working on ES!

--
Ivan

James_Cook_3 · May 7, 2012, 7:44pm

I know we have toyed with an online cookbook for a number of years now.
Perhaps I can find some time to make a pull request towards this effort.

I need to sleep two hours less a day. Edison must have been on to something.

-- jim

James_Cook_3 · June 7, 2012, 8:42pm

The approach of using multifield types for a locale value works well in my
use case of returning a list of matching articles in order of the most
specific locale value.

Assume I have a property called 'key' which uniquely identifies an article
topic, and the combination of 'key' and 'locale' identifies a specific
document in my index.

For argument's sake, assume I have several documents with the same 'key',
but different locales. The query and mapping Clinton presented works very
well to return me a list of those matching documents in order of most
specific locale match for a specific key.

But what happens if I want a list of all documents (multiple keys) and pass
a specific locale? I will get a list of all matching documents (by locale)
for each key. But I really want a single match for each document key.

Any thoughts on how to filter this list?
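
One way to end up with a single hit per key is to collapse the results after they come back, keeping only the best-scoring document for each 'key'. The sketch below does that in plain Python on the client side; it says nothing about what ES itself can do. The `specificity` scoring (2 for an exact locale, 1 for the same language) and the document shape are assumptions.

```python
def specificity(search_locale, document_locale):
    """Assumed score: 0 = different language, 1 = same language, 2 = exact locale."""
    search = search_locale.strip().lower()
    doc = document_locale.strip().lower()
    if search.split("_", 1)[0] != doc.split("_", 1)[0]:
        return 0
    return 2 if search == doc else 1


def best_per_key(search_locale, documents):
    """Keep the most specific matching document for each 'key'."""
    best = {}  # key -> (score, document); dicts preserve insertion order
    for doc in documents:
        current_score = specificity(search_locale, doc["locale"])
        if current_score == 0:
            continue  # locale does not match at all
        previous = best.get(doc["key"])
        if previous is None or current_score > previous[0]:
            best[doc["key"]] = (current_score, doc)
    return [doc for _, doc in best.values()]


if __name__ == "__main__":
    docs = [
        {"key": "a", "locale": "en"},
        {"key": "a", "locale": "en_GB"},
        {"key": "b", "locale": "en_US"},
        {"key": "c", "locale": "es"},
    ]
    print(best_per_key("en_GB", docs))
```

The trade-off is that every matching document still has to be fetched before collapsing, which may be costly when each key has many locale variants.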
