Stop words not used by the analyzer


(Joaquin Cuenca Abela) #1

Hi,

I'm trying to use the snowball stemmer in the latest (yesterday's)
version of elasticsearch (from git). I have the following index:

index:
  analysis:
    analyzer:
      stemming:
        type: custom
        tokenizer: standard
        language: Spanish
        filter: [standard, lowercase, stop, asciifolding, snowball]
        stop_words: ["de", "la", "que", "el", "en", "y", "a", "los",
"del", "se", "las", "por", "un", "para", "con", "no", "una", "su",
"al", "lo", "como", "más", "pero", "sus", "le", "ya", "o", "este",
"sí", "porque", "esta", "entre", "cuando", "muy", "sin", "sobre",
"también", "me", "hasta", "hay", "donde", "quien", "desde", "todo",
"nos", "durante", "todos", "uno", "les", "ni", "contra", "otros",
"ese", "eso", "ante", "ellos", "e", "esto", "mí", "antes", "algunos",
"qué", "unos", "yo", "otro", "otras", "otra", "él", "tanto", "esa",
"estos", "mucho", "quienes", "nada", "muchos", "cual", "poco", "ella",
"estar", "estas", "algunas", "algo", "nosotros", "mi", "mis", "tú",
"te", "ti", "tu", "tus", "ellas", "nosotras", "vosotros", "vosotras",
"os", "mío", "mía", "míos", "mías", "tuyo", "tuya", "tuyos", "tuyas",
"suyo", "suya", "suyos", "suyas", "nuestro", "nuestra", "nuestros",
"nuestras", "vuestro", "vuestra", "vuestros", "vuestras", "esos",
"esas", "estoy", "estás", "está", "estamos", "estáis", "están",
"esté", "estés", "estemos", "estéis", "estén", "estaré", "estarás",
"estará", "estaremos", "estaréis", "estarán", "estaría", "estarías",
"estaríamos", "estaríais", "estarían", "estaba", "estabas",
"estábamos", "estabais", "estaban", "estuve", "estuviste", "estuvo",
"estuvimos", "estuvisteis", "estuvieron", "estuviera", "estuvieras",
"estuviéramos", "estuvierais", "estuvieran", "estuviese",
"estuvieses", "estuviésemos", "estuvieseis", "estuviesen", "estando",
"estado", "estada", "estados", "estadas", "estad", "he", "has", "ha",
"hemos", "habéis", "han", "haya", "hayas", "hayamos", "hayáis",
"hayan", "habré", "habrás", "habrá", "habremos", "habréis", "habrán",
"habría", "habrías", "habríamos", "habríais", "habrían", "había",
"habías", "habíamos", "habíais", "habían", "hube", "hubiste", "hubo",
"hubimos", "hubisteis", "hubieron", "hubiera", "hubieras",
"hubiéramos", "hubierais", "hubieran", "hubiese", "hubieses",
"hubiésemos", "hubieseis", "hubiesen", "habiendo", "habido", "habida",
"habidos", "habidas", "soy", "eres", "es", "somos", "sois", "son",
"sea", "seas", "seamos", "seáis", "sean", "seré", "serás", "será",
"seremos", "seréis", "serán", "sería", "serías", "seríamos",
"seríais", "serían", "era", "eras", "éramos", "erais", "eran", "fui",
"fuiste", "fue", "fuimos", "fuisteis", "fueron", "fuera", "fueras",
"fuéramos", "fuerais", "fueran", "fuese", "fueses", "fuésemos",
"fueseis", "fuesen", "siendo", "sido", "tengo", "tienes", "tiene",
"tenemos", "tenéis", "tienen", "tenga", "tengas", "tengamos",
"tengáis", "tengan", "tendré", "tendrás", "tendrá", "tendremos",
"tendréis", "tendrán", "tendría", "tendrías", "tendríamos",
"tendríais", "tendrían", "tenía", "tenías", "teníamos", "teníais",
"tenían", "tuve", "tuviste", "tuvo", "tuvimos", "tuvisteis",
"tuvieron", "tuviera", "tuvieras", "tuviéramos", "tuvierais",
"tuvieran", "tuviese", "tuvieses", "tuviésemos", "tuvieseis",
"tuviesen", "teniendo", "tenido", "tenida", "tenidos", "tenidas",
"tened"]

But when I analyze a text like "la carreta", I correctly get a
stemmed version of "carreta" ("carret"), yet I also get a token for
"la" (a stop word). Why isn't "la" removed from the text?

$ curl "http://localhost:9200/presspeople/_analyze?text=la%20carreta&analyzer=stemming&pretty=true"
{
"tokens" : [ {
"token" : "la",
"start_offset" : 0,
"end_offset" : 2,
"type" : "",
"position" : 1
}, {
"token" : "carret",
"start_offset" : 3,
"end_offset" : 10,
"type" : "",
"position" : 2
} ]
}

--
Joaquin Cuenca Abela


(Shay Banon) #2

The position of the stop filter in the filter list is important. Is there a chance that it is simply being applied after the stemming (if I understood the question correctly)?
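
To illustrate the ordering point, here is a minimal sketch of two analyzers that use the same filters in a different order. The analyzer names are made up for this example, and the built-in stop and snowball filters are used with their defaults. Filters run left to right, so in the second analyzer the stemmer sees every token before the stop filter does, and a stemmed token may no longer match an entry in the stop word list.

```yaml
index:
  analysis:
    analyzer:
      stop_before_stem:      # hypothetical name: stop words removed first
        type: custom
        tokenizer: standard
        filter: [standard, lowercase, stop, snowball]
      stem_before_stop:      # hypothetical name: stemming runs first
        type: custom
        tokenizer: standard
        filter: [standard, lowercase, snowball, stop]
```

Putting lowercase ahead of stop is usually a good idea as well, since the stop filter compares tokens case-sensitively unless told otherwise.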

(Joaquin Cuenca Abela) #3

Hi Shay,

this doesn't seem to be the problem.

Simplifying, what I'm doing is:

$ curl -XPUT http://localhost:9200/test -d 'index:
  analysis:
    analyzer:
      mine:
        type: custom
        tokenizer: standard
        language: Spanish
        filter: [standard, stop]
        stop_words: ['de', 'la']
'
$ curl "http://localhost:9200/test/_analyze?text=la%20casa&pretty=true"
{
"tokens" : [ {
"token" : "la",
"start_offset" : 0,
"end_offset" : 2,
"type" : "",
"position" : 1
}, {
"token" : "casa",
"start_offset" : 3,
"end_offset" : 7,
"type" : "",
"position" : 2
} ]
}

For the _analyze query, I was not expecting to get the token "la", as it's a
stop word.

I'm also getting hits if I do a query using a stop word:

$ curl -XPUT "http://localhost:9200/test/product/1" -d '{"name": "la casa"}'
{"ok":true,"_index":"test","_type":"product","_id":"1","_version":1}

$ curl "http://localhost:9200/test/product/_search?q=la&pretty=true"
{
"took" : 6,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.13561106,
"hits" : [ {
"_index" : "test",
"_type" : "product",
"_id" : "1",
"_version" : 1,
"_score" : 0.13561106, "_source" : {"name": "la casa"}
} ]
}
}

Am I using "stop_words" incorrectly?

--
Joaquin Cuenca Abela


(Joaquin Cuenca Abela) #4

BTW, I see in the docs that I should use "stopwords" instead of
"stop_words", and that I don't need to quote my stop words, but either
way I am getting the stop words in my index:

...
settings: {
index.analysis.analyzer.mine.filter.0: standard
index.analysis.analyzer.mine.filter.1: stop
index.analysis.analyzer.mine.type: custom
index.analysis.analyzer.mine.stopwords.1: la
index.analysis.analyzer.mine.tokenizer: standard
index.analysis.analyzer.mine.stopwords.0: de
index.analysis.analyzer.mine.language: Spanish
index.number_of_shards: 5
index.number_of_replicas: 1
}
...

and using "stop_words" or "stopwords", the stop words "de" and "la"
are still indexed:

$ curl -XPUT http://localhost:9200/test -d 'index:
  analysis:
    analyzer:
      mine:
        type: custom
        tokenizer: standard
        language: Spanish
        filter: [standard, stop]
        stopwords: [de,la]
'
$ curl -XPUT "http://localhost:9200/test/products/1" -d '{"name": "la casa"}'
$ curl "http://localhost:9200/test/products/_search?q=la&pretty=true"
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.13561106,
"hits" : [ {
"_index" : "test",
"_type" : "products",
"_id" : "1",
"_version" : 1,
"_score" : 0.13561106, "_source" : {"name": "la casa"}
} ]
}
}

--
Joaquin Cuenca Abela


(Shay Banon) #5

It's because you register the stop words on the custom analyzer. You should put them on a custom filter of type stop, and then reference that filter in the custom analyzer definition.
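
As a rough sketch of that suggestion, applied to the simplified "mine" analyzer from earlier in the thread: the stop words move into a named filter of type stop, and the analyzer's filter list refers to that name instead of the built-in stop filter. The filter name my_stop is an assumption made up for this example, and the lowercase filter is added so that capitalized forms such as "La" are also caught; neither comes from the original posts.

```yaml
index:
  analysis:
    filter:
      my_stop:               # hypothetical name for the custom stop filter
        type: stop
        stopwords: [de, la]
    analyzer:
      mine:
        type: custom
        tokenizer: standard
        # reference the custom filter by name, not the built-in "stop"
        filter: [standard, lowercase, my_stop]
```

When checking the result with _analyze, pass analyzer=mine in the query string; otherwise the text is likely to be analyzed with the index's default analyzer rather than this one.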
