Length Token Filter


(windoz) #1

I'm new to Elasticsearch and want to know how I can use the length token
filter. I'm trying to limit my search to exclude two letter words.


(Drew Raines) #2

windoz wrote:

I'm new to Elasticsearch and want to know how I can use the length
token filter. I'm trying to limit my search to exclude two letter
words.

Can you give us an example of what you've tried based on the
documentation?

http://www.elasticsearch.org/guide/reference/index-modules/analysis/

Note that there's a length token filter at the bottom of the sample
config:

myTokenFilter2 :
   type : length
   min : 0
   max : 2000
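
For the two-letter-word case in the question, one possible shape is a length filter with min set to 3, wired into a custom analyzer. This is only a sketch: the names min_three_chars and no_short_words are made up for illustration, and length_filter is a local stand-in that imitates the filter rather than calling Elasticsearch.

```python
import json

# Hypothetical names: "min_three_chars" (filter) and "no_short_words" (analyzer).
settings = {
    "analysis": {
        "filter": {
            # Tokens shorter than 3 characters would be dropped.
            "min_three_chars": {"type": "length", "min": 3}
        },
        "analyzer": {
            "no_short_words": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "min_three_chars"],
            }
        },
    }
}


def length_filter(tokens, min_len=0, max_len=2000):
    """Local illustration only: keep tokens whose length is in [min_len, max_len]."""
    return [t for t in tokens if min_len <= len(t) <= max_len]


# JSON body that could be sent when creating an index.
print(json.dumps({"settings": settings}, indent=2))
# What the filter would keep from a small token stream.
print(length_filter(["it", "is", "a", "length", "token", "filter"], min_len=3))
```

The analyzer still has to be referenced from a field mapping before it affects anything that gets indexed.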

-Drew


(windoz) #3

I've been reading around and found that I can use a custom analyzer with
a custom stop filter, but it is not working. I posted the analyzer below
(custom.txt) using curl -X POST --data "@custom.txt"
http://localhost:9200/sample/test/1 on Windows. Is this the correct way
to use analyzers?

Here is custom.txt
{
  "analysis": {
    "analyzer": {
      "symphony_fulltext" : {
        "type": "custom",
        "tokenizer" : "standard",
        "filter": ["stop", "asciifolding", "snowball", "lowercase",
                   "custom_synonyms", "custom_stop"]
      },
      "symphony_autocomplete" : {
        "type": "custom",
        "tokenizer" : "standard",
        "filter": ["asciifolding", "lowercase"]
      }
    },
    "filter" : {
      "custom_synonyms": {
        "type": "synonym",
        "ignore_case": "true",
        "synonyms": [
          "i-pod, i pod => ipod",
          "definately, definitly, definetly => definitely"
        ]
      },
      "custom_stop": {
        "type": "stop",
        "stopwords": ["a", "an", "and", "are", "as", "at", "be", "but", "by",
          "into", "is", "it", "of", "on", "or", "such", "that", "the", "their",
          "there", "these", "they", "this", "to", "was", "will"]
      }
    }
  }
}


(Drew Raines) #4

windoz wrote:

I've been reading around and found that I can use a custom analyzer with
a custom stop filter, but it is not working. I posted the analyzer below
(custom.txt) using curl -X POST --data "@custom.txt"
http://localhost:9200/sample/test/1 on Windows. Is this the correct way
to use analyzers?

[...]

By sending that data to /sample/test/1, you're just indexing it as a
regular doc in ES. You need to store it as part of your index
settings. Try something like this:

curl -s -XPUT localhost:9200/test \
  -d @<(curl -s http://p.draines.com/13413250381814c87452d.txt)

Then you can check the settings with:

curl -s localhost:9200/test/_settings?pretty=1
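
To make the difference concrete: the analysis block belongs in the body of the request that creates the index, under "settings", not in a document sent to /sample/test/1. Here is a rough Python sketch of that request. It assumes a node on localhost:9200, and the helper create_index_request and the small analysis dict are invented for the example. It only builds the request and does not send it.

```python
import json
import urllib.request

ES = "http://localhost:9200"  # assumption: a local node on the default port


def create_index_request(index, analysis):
    """Build (but do not send) PUT /<index> with the analysis under "settings"."""
    body = json.dumps({"settings": {"analysis": analysis}}).encode("utf-8")
    return urllib.request.Request(
        f"{ES}/{index}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical, trimmed-down analysis section for illustration.
analysis = {
    "filter": {"custom_stop": {"type": "stop", "stopwords": ["a", "an", "the"]}},
    "analyzer": {
        "symphony_fulltext": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "custom_stop"],
        }
    },
}

req = create_index_request("test", analysis)
print(req.get_method(), req.full_url)
# To actually send it against a running node: urllib.request.urlopen(req)
```

The target is the index itself (/test), not a type and document id, so the body is stored as index settings instead of being indexed as a document.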

-Drew


(windoz) #5

I tried what you suggested, but when I search for the ten most used
words in the documents, the stop words still show up, so they are still
being indexed. What could be the problem?


(Ivan Brusic) #6

Are you correctly applying your analyzer in the mapping of your field?
Can you gist your mapping as well?

--
Ivan


(windoz) #7

I'm now trying a different approach, shown below. If I use a query to find
the top ten words in the message field of the docs index, I still get the
words [is, the, this, ...] that I included in the stop word list of my
custom filter. The search_analyzer handles the search side and the
index_analyzer handles the indexing side.

Here is the mapping, analyzers and filters.
{
  "mappings" : {
    "message" : {
      "properties" : {
        "title" : {
          "type" : "string",
          "search_analyzer" : "str_search_analyzer",
          "index_analyzer" : "str_index_analyzer"
        }
      }
    }
  },
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "str_search_analyzer" : {
          "tokenizer" : "keyword",
          "filter" : ["lowercase","custom_stop"]
        },
        "str_index_analyzer" : {
          "tokenizer" : "keyword",
          "filter" : ["lowercase", ]
        }
      },
      "filter" :
        "custom_stop": {
          "type": "stop",
          "stopwords": ["a", "an", "and", "are", "as", "at", "be", "but", "by",
            "into", "is", "it", "of", "on", "or", "such", "that", "the", "their",
            "there", "these", "they", "this", "to", "was", "will","we"]
        }
      }
    }
  }
}


(Drew Raines) #8

windoz wrote:

I'm now trying a different approach, shown below. If I use a query to find
the top ten words in the message field of the docs index, I still get the
words [is, the, this, ...] that I included in the stop word list of my
custom filter.

Can you provide a script that reproduces what you're seeing, and describe
what you would like it to do instead? Something that sets up your index,
indexes a document, runs the query, and then shows how the result differs
from what you expected.

http://www.elasticsearch.org/help/

-Drew


(Igor Motov) #9

Hi windoz,

There are a couple of syntax errors in your example. A curly bracket is
missing here:

"filter" :
"custom_stop": {

And a filter is missing in the str_index_analyzer definition:

   "str_index_analyzer" : {
      "tokenizer" : "keyword",
      "filter" : ["lowercase", ]
   }
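
One quick way to catch this kind of mistake before sending anything to Elasticsearch is to run the body through a JSON parser. The sketch below does that with Python's json module. The corrected text is a reconstruction based on the two fixes described in this post, not a copy of any gist. Putting custom_stop in the index analyzer is an assumption about which filter was meant to follow the comma, and the stop word list is shortened.

```python
import json

# Reconstruction with both syntax fixes applied (assumption, see lead-in):
# an opening brace after "filter" and no dangling comma in the filter list.
corrected = """
{
  "settings": {
    "analysis": {
      "analyzer": {
        "str_search_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase", "custom_stop"]
        },
        "str_index_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase", "custom_stop"]
        }
      },
      "filter": {
        "custom_stop": {
          "type": "stop",
          "stopwords": ["a", "an", "and", "is", "the", "this"]
        }
      }
    }
  }
}
"""

# The problematic fragment from the original example, wrapped so it stands alone.
broken_fragment = '{"filter": ["lowercase", ]}'


def is_valid_json(text):
    """Return True if text parses as strict JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(corrected))
print(is_valid_json(broken_fragment))
```

This only checks syntax. It says nothing about whether the tokenizer and filters are a sensible combination.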
    

I also don't think that the "keyword" tokenizer is what you want in your case.
It emits the content of the entire field as a single token, which prevents the
stop word filter from doing its job unless your fields consist of single words.
I think it might be better to use the standard tokenizer instead. With these
changes, this is how your example might look: https://gist.github.com/3071582
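
The tokenizer point can be illustrated without a cluster. The following is a toy simulation in plain Python, not Elasticsearch code. keyword_tokenize imitates the keyword tokenizer, simple_word_tokenize is only a rough stand-in for the standard tokenizer, and the stop word set is a small subset chosen for the example.

```python
import re

STOPWORDS = {"a", "an", "and", "is", "the", "this", "to"}  # subset, for illustration


def keyword_tokenize(text):
    """Imitate the keyword tokenizer: the whole field value becomes one token."""
    return [text]


def simple_word_tokenize(text):
    """Rough stand-in for the standard tokenizer: split on non-word characters."""
    return re.findall(r"\w+", text)


def analyze(text, tokenize):
    """Tokenize, lowercase each token, then drop tokens that are stop words."""
    tokens = [t.lower() for t in tokenize(text)]
    return [t for t in tokens if t not in STOPWORDS]


text = "This is the message"
print(analyze(text, keyword_tokenize))      # ['this is the message']
print(analyze(text, simple_word_tokenize))  # ['message']
```

With a single token per field, the stop filter compares the whole string against the list and never sees the individual words, so nothing is removed.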

Igor


(windoz) #10

Thanks, Motov!

Your code seems to be working fine so far.


