Custom tokenizer or analyzer?


(James Cook-3) #1

I have a locale property in my documents. It can be a simple language code
or a language/country combo.

e.g.

{ locale: 'en' } or { locale: 'en_GB' }

My search truth looks like this:

                  document locale
search locale     en     en_GB   en_US
en                yes    yes     yes
en_GB             yes    yes     yes
es                no     no      no

I would like to tokenize these values in this way:
'en' => 'en'
'en_GB' => 'en', 'en_gb'
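
As a minimal sketch of the tokenization described above, written in plain Python rather than as an ES analyzer: the function names `locale_tokens` and `matches` are hypothetical, and treating "shares at least one token" as a match is an assumption that happens to reproduce the truth table.

```python
def locale_tokens(locale: str) -> list[str]:
    """Return the lowercase language code, plus the full locale if it has a country part."""
    normalized = locale.strip().lower()
    language = normalized.split("_", 1)[0]
    if normalized == language:
        return [language]
    return [language, normalized]


def matches(search_locale: str, document_locale: str) -> bool:
    """Assumed rule: a document matches when it shares a token with the search locale."""
    shared = set(locale_tokens(search_locale)) & set(locale_tokens(document_locale))
    return bool(shared)
```

With this rule, a search for 'en_GB' still finds an 'en_US' document through the shared 'en' token, which is what the table asks for.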

This is intended to allow me to find the correct document. I am not sure
how to integrate such a tokenization into ES. Any hints?

Of course this is just matching, not sorting at this moment. For sorting, I
would want the most specific match to score higher.
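
To make the sorting requirement concrete, here is a rough sketch in plain Python. The scoring values (2 for an exact locale match, 1 for a language-only match) and the names `specificity_score` and `rank` are assumptions for illustration, not anything ES provides.

```python
def specificity_score(search_locale: str, document_locale: str) -> int:
    """Return 2 for an exact locale match, 1 for a language-only match, 0 otherwise."""
    search = search_locale.strip().lower()
    document = document_locale.strip().lower()
    if search == document:
        return 2
    if search.split("_", 1)[0] == document.split("_", 1)[0]:
        return 1
    return 0


def rank(search_locale: str, document_locales: list[str]) -> list[str]:
    """Drop non-matching locales and put the most specific matches first."""
    scored = [(specificity_score(search_locale, d), d) for d in document_locales]
    matching = [pair for pair in scored if pair[0] > 0]
    # sorted() is stable, so locales with equal scores keep their input order.
    ordered = sorted(matching, key=lambda pair: pair[0], reverse=True)
    return [document for _, document in ordered]
```

For example, `rank("en_GB", ["en", "en_US", "en_GB", "es"])` puts 'en_GB' first, followed by the language-only matches, and drops 'es'.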


(Clinton Gormley) #2

Hi Jim

On Sat, 2012-05-05 at 10:50 -0700, James Cook wrote:

I have a locale property in my documents. It can be a simple language
code or a language/country combo.

{ locale: 'en' } or { locale: 'en_GB' }

I would like to tokenize these values in this way:
'en' => 'en'
'en_GB' => 'en', 'en_gb'

I think the easiest way to achieve this is by using a multi-field:

https://gist.github.com/2626785

Of course this is just matching, not sorting at this moment. For
sorting, I would want the most specific match to score higher.

In the gist above, I show one way that you could sort your results.
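
For readers without access to the gist, here is one possible shape of a multi-field mapping and a matching query, sketched as Python dicts. It is not taken from the gist. It assumes the syntax of recent Elasticsearch releases (sub-fields declared under `fields`) rather than the `multi_field` type that existed in 2012, and the names `language_prefix`, `locale_language`, `lowercase_keyword`, and `locale_query` are made up for illustration.

```python
import json

# Index body: the main "locale" field holds the whole value, lowercased;
# the "locale.language" sub-field holds only the leading language code.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "tokenizer": {
                # Emit capture group 1: the letters before any underscore.
                "language_prefix": {
                    "type": "pattern",
                    "pattern": "^([A-Za-z]+)",
                    "group": 1,
                }
            },
            "analyzer": {
                "locale_language": {
                    "type": "custom",
                    "tokenizer": "language_prefix",
                    "filter": ["lowercase"],
                }
            },
            "normalizer": {
                "lowercase_keyword": {"type": "custom", "filter": ["lowercase"]}
            },
        }
    },
    "mappings": {
        "properties": {
            "locale": {
                "type": "keyword",
                "normalizer": "lowercase_keyword",
                "fields": {
                    "language": {"type": "text", "analyzer": "locale_language"}
                },
            }
        }
    },
}


def locale_query(search_locale: str) -> dict:
    """Require a language match; boost documents whose full locale also matches."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"locale.language": search_locale}}],
                "should": [
                    {"term": {"locale": {"value": search_locale.lower(), "boost": 2.0}}}
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(locale_query("en_GB"), indent=2))
```

The `must` clause decides whether a document matches at all, while the `should` clause only adds to the score, so an exact 'en_GB' document would sort ahead of plain 'en' or 'en_US' documents.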

clint


(James Cook-3) #3

That is so cool, thanks.

I gotta be honest, after using ES for two years now and building three
projects that use ES as their datastore, I know very little about the
Lucene side (mapping and querying) of things. I hope a book is forthcoming,
as this is really hard and somewhat unapproachable for those of us without
the Lucene experience. An example like yours would be an awesome cookbook
entry. Don't know why we haven't made an ES cookbook section on the website.

Thanks again.


(Ivan Brusic) #4

Well, if you lack knowledge on the Lucene side, then there already is
a great book on the subject: http://www.manning.com/hatcher3/

The analysis/tokenization chapters are very good. Some of the book is
already outdated, but the analysis part remains the same. Lucene does
have a documentation problem: there is very little on the website, and
the contrib libs have almost no documentation. The mailing list, however,
is good and not very chatty (much less than this one), so don't be afraid
to sign up.

An ES book would be great, but I see the system as still being a
moving target. Plus, if Shay were to write a book, that would mean less
time working on ES!

--
Ivan


(James Cook-3) #5

I know we have toyed with an online cookbook for a number of years now.
Perhaps I can find some time to make a pull request towards this effort.

I need to sleep two hours less a day. Edison must have been on to something.

-- jim


(James Cook-3) #6

The approach of using multifield types for a locale value works well in my
use case of returning a list of matching articles in order of the most
specific locale value.

Assume I have a property called 'key' which uniquely identifies an article
topic, and the combination of 'key' and 'locale' identifies a specific
document in my index.

For argument's sake, assume I have several documents with the same 'key',
but different locales. The query and mapping Clinton presented works very
well to return me a list of those matching documents in order of most
specific locale match for a specific key.

But what happens if I want a list of all documents (multiple keys) and pass
a specific locale? I will get a list of all matching documents (by locale)
for each key. But I really want a single match for each document key.

Any thoughts on how to filter this list?
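
One way to frame the problem is as a client-side pass over the hits, sketched below in Python. The hit structure (dicts with 'key', 'locale', and 'score') and the function name `best_hit_per_key` are hypothetical, not an ES response format. Later Elasticsearch releases also added a `collapse` search option that groups results by a field on the server, which may be a better fit where it is available.

```python
def best_hit_per_key(hits: list[dict]) -> list[dict]:
    """Keep only the highest-scoring hit for each 'key', best hits first."""
    best = {}
    for hit in hits:
        key = hit["key"]
        # On a tie, the hit seen first is kept.
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    return sorted(best.values(), key=lambda hit: hit["score"], reverse=True)


if __name__ == "__main__":
    sample_hits = [
        {"key": "intro", "locale": "en", "score": 1.0},
        {"key": "intro", "locale": "en_GB", "score": 2.0},
        {"key": "faq", "locale": "en_US", "score": 1.0},
    ]
    for hit in best_hit_per_key(sample_hits):
        print(hit["key"], hit["locale"])
```

The drawback of doing this on the client is that paging becomes awkward, since a page of raw hits can collapse to fewer than a page of distinct keys.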

On Monday, May 7, 2012 5:02:19 AM UTC-4, Clinton Gormley wrote:

I think the easiest way to achieve this is by using a multi-field:

https://gist.github.com/2626785

