Filename search using nGram-tokenizer and span_near-query


(da-mkay) #1

Hi,

I am trying to build a simple filename search in which the user can search
for any part of the name.

Let's say the following filenames are indexed:
[1] My_file_2012.01.12.txt
[2] My_file_2012.01.05.txt
[3] My_file_2012.05.01.txt
[4] My_file_2012.08.27.txt
[5] My_file_2012.12.12.txt
[6] My_file_2011.12.12.txt
[7] file_01_2012.09.09.txt

Then the user might search for:
"ile_20" (finds the first six documents)
"12.txt" (finds 1, 5, 6)
"12" followed by "01" (finds 1, 2, 3 - NOT 7)
"2012" followed by "01" (finds 1, 2, 3 - NOT 7)

(Note: Yes, the user might really search for strings like "ile_20" ... e.g.
because of copy-and-paste mistakes :slight_smile: )

Therefore I use an nGram tokenizer to index every possible input string. This
works fine so far.
To support the "followed by" search mentioned above, I need a query that
respects the order of the terms, no matter how much text lies between the
two terms (okay, let's say max. 100 characters :slight_smile: ).

Since a "text_phrase"-query with a "slop" does not respect the ordering of
the terms correctly, I decided to use a "span_near" query. This works fine
in most cases.

See here my full example-index: https://gist.github.com/3487909
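
The gist is not reproduced here. As a minimal sketch, a "span_near" query of the kind described could be built like this. The field name `filename` and the helper `followed_by_span_query` are assumptions for illustration, not taken from the gist. Note that "slop" counts token positions, not characters.

```python
import json


def followed_by_span_query(field, first, second, max_gap=100):
    """Build a span_near body that requires `first` to appear before `second`.

    `field` and `max_gap` are illustrative; slop is measured in token
    positions, not in characters.
    """
    return {
        "query": {
            "span_near": {
                "clauses": [
                    {"span_term": {field: first}},
                    {"span_term": {field: second}},
                ],
                "slop": max_gap,
                "in_order": True,
            }
        }
    }


# "12" followed by "01" on a hypothetical "filename" field
body = followed_by_span_query("filename", "12", "01")
print(json.dumps(body, indent=2))
```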

As mentioned in the example above, the query "'2012' followed by '01'" does
not work, because the position values that the nGram tokenizer assigns to
the tokens are not very useful to the "span_near" query. While indexing,
the term "2012" is assigned a position value (e.g. 50) that is bigger than
the position value of the term "01" (e.g. 10), even though "2012" comes
first in the filename. Since 50 and 10 are not in order, the query returns
no results. The in-order matching only works correctly for terms of the
same length (e.g. "'12' followed by '01'") or if the shorter term comes
first (e.g. "'20' followed by '.12'").
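
The effect can be reproduced with a small simulation. This is only a sketch under an assumption: that the tokenizer emits all grams of the smallest size first, then the next size, and gives each token the next consecutive position, which matches the behaviour described above. The gram sizes 2 to 4 and the helper names are hypothetical and not taken from the gist.

```python
def ngram_tokens(text, min_gram=2, max_gram=4):
    """Return (token, position, start_offset) tuples, shortest grams first.

    Assumption: positions are handed out consecutively in emission order,
    so every longer gram gets a higher position than every shorter gram.
    """
    tokens = []
    position = 0
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            tokens.append((text[start:start + size], position, start))
            position += 1
    return tokens


def occurrences(tokens, term):
    """Return all (position, start_offset) pairs for tokens equal to term."""
    return [(pos, start) for token, pos, start in tokens if token == term]


tokens = ngram_tokens("My_file_2012.01.12.txt")
occ_2012 = occurrences(tokens, "2012")
occ_01 = occurrences(tokens, "01")

# Positions put "2012" after "01", start offsets put it before.
print("2012 (position, start_offset):", occ_2012)
print("01   (position, start_offset):", occ_01)
```

Under that assumption, an in-order span query compares the positions and therefore sees "01" before "2012", although the start offsets say the opposite.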

So how can I achieve the correct search-behaviour? I just want the ability
to search for any part(s) of the filename while respecting the order of the
terms. :slight_smile:
Maybe there is a way to tell "span_near" to not use the position but
instead the "start_offset"?
Or is there another query I can use?

Best regards,
da-mkay

--


(da-mkay) #2

Does nobody have an idea? :slight_smile:

--


(Clinton Gormley) #3

Have a look at this explanation:
http://stackoverflow.com/questions/9421358/filename-search-with-elasticsearch/9432450#9432450

clint

--


(da-mkay) #4

Thanks, but that is my own question on SO, which I posted a few months ago :slight_smile:
The answer there uses a text_phrase query, which I cannot use. This is
why I tried a span query. See my original post. :wink:

Best regards,
da-mkay

On 04.09.2012 12:20, Clinton Gormley wrote:

Have a look at this explanation:
http://stackoverflow.com/questions/9421358/filename-search-with-elasticsearch/9432450#9432450

clint

--


(da-mkay) #5

I found a solution. Because of the nGram tokenizer, each possible keyword is
indexed as a separate token, so I can use a "query_string" query with a
wildcard on the filename field, e.g. "2012*01".
This takes the order of the terms into account. But I think those
queries can get a bit slow with heavy wildcard usage, right? Especially
with many requests in parallel.
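
A minimal sketch of how such a request body could be assembled from the ordered search parts. The field name `filename` and both helper functions are assumptions for illustration. The escaping covers the reserved single characters of the query_string syntax and does not handle whitespace inside a part.

```python
import json
import re

# Characters with a special meaning in the query_string syntax.
_RESERVED = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def escape_part(part):
    """Backslash-escape reserved characters so they are matched literally."""
    return _RESERVED.sub(r"\\\1", part)


def followed_by_wildcard_query(field, parts):
    """Join the ordered parts with '*' and wrap them in a query_string body."""
    pattern = "*".join(escape_part(p) for p in parts)
    return {
        "query": {
            "query_string": {
                "default_field": field,
                "query": pattern,
            }
        }
    }


# "2012" followed by "01" on a hypothetical "filename" field
body = followed_by_wildcard_query("filename", ["2012", "01"])
print(json.dumps(body, indent=2))
```

The wildcard has to match a single indexed token, so the pattern can only span as many characters as the largest gram size allows.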

Best regards,
da-mkay

--