How to do 'Stemming' on attachment / encoded content


(Radhasankar C) #1

I installed Elasticsearch 5.0.1 and the ingest attachment plugin.
I have indexed a PDF document using the ingest attachment processor.
Now I want to apply 'Stemming' to the content of the attachment. I tried the following:

  1. Created an index and set the analyzer on it.
curl -XGET 'http://localhost:9200/idx_analyser?pretty'
{
  "idx_analyser" : {
    "aliases" : { },
    "mappings" : {
      "test" : {
        "properties" : {
          "attachment" : {
            "properties" : {
              "content" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              },
              "content_length" : {
                "type" : "long"
              },
              "content_type" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              },
              "language" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          },
          "data" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "text" : {
            "type" : "text",
            "analyzer" : "custom_lowercase_stemmed"
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "number_of_shards" : "5",
        "provided_name" : "idx_analyser",
        "creation_date" : "1479885039440",
        "analysis" : {
          "filter" : {
            "custom_english_stemmer" : {
              "name" : "english",
              "type" : "stemmer"
            }
          },
          "analyzer" : {
            "custom_lowercase_stemmed" : {
              "filter" : [
                "lowercase",
                "custom_english_stemmer"
              ],
              "tokenizer" : "standard"
            }
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "FrJEtt-BSgq2ROka2PZ4CA",
        "version" : {
          "created" : "5000199"
        }
      }
    }
  }
}
  2. Indexed base64-encoded content using the "pipeline=attachment" processor.
curl -XPUT 'http://localhost:9200/idx_analyser/test/1?pipeline=attachment&pretty' -d'
{
  "text": "VGhpcyBpbmRleCBoYXZpbmcgaW5mb3JtYXRpb24="
}'
{
  "_index" : "idx_analyser",
  "_type" : "test",
  "_id" : "1",
  "_version" : 2,
  "found" : true,
  "_source" : {
    "data" : "VGhpcyBpbmRleCBoYXZpbmcgaW5mb3JtYXRpb24=",
    "attachment" : {
      "content_type" : "text/plain; charset=ISO-8859-1",
      "language" : "en",
      "content" : "This index having information",
      "content_length" : 30
    }
  }
}

Searching the content for 'having' returns the expected result:

 curl -XGET 'http://localhost:9200/idx_analyser/_search?q=attachment.content=having'

However, I want to get the same result when I search for 'have' (shown below), and that search returns nothing:

 curl -XGET 'http://localhost:9200/idx_analyser/_search?q=attachment.content=have'

Am I doing anything wrong here? Please help me resolve this.


(David Pilato) #2

Please format your code using the </> icon. It will make your post more readable.

Here you applied your analyzer to the text field, but ingest is writing the extracted content to attachment.content.
Apply your analyzer to attachment.content instead.
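
A minimal sketch of what that could look like, reusing the analyzer definition from the first post. The index name `idx_stemmed`, the file name `idx_stemmed.json` and the `ES` variable are hypothetical, and the cluster is assumed to be Elasticsearch 5.x on localhost:9200. Since the analyzer of an existing field cannot be changed, the sketch creates a fresh index with the mapping in place before any document is indexed.

```shell
# Assumption: Elasticsearch 5.x reachable at this URL; override ES if needed.
ES="${ES:-http://localhost:9200}"

# Index body (hypothetical file name). The custom analyzer is declared under
# settings.analysis and attached to attachment.content, the field that the
# attachment processor writes the extracted text to.
cat > idx_stemmed.json <<'EOF'
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_english_stemmer": { "type": "stemmer", "name": "english" }
      },
      "analyzer": {
        "custom_lowercase_stemmed": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "custom_english_stemmer"]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "attachment": {
          "properties": {
            "content": { "type": "text", "analyzer": "custom_lowercase_stemmed" }
          }
        }
      }
    }
  }
}
EOF

# Create the index. If a document is indexed first, dynamic mapping creates
# attachment.content with the default analyzer and stemming is not applied.
curl -sS -XPUT "$ES/idx_stemmed?pretty" -H 'Content-Type: application/json' \
  -d @idx_stemmed.json || echo "request failed: is Elasticsearch reachable at $ES?"
```

Only `attachment.content` is mapped explicitly here; the other attachment fields are left to dynamic mapping.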

BTW I doubt that adding a keyword subfield to your attachment.content field is a good idea.


(Radhasankar C) #3

Many thanks, David.
That attachment field (with the keyword subfield) was not created by me. I guess it was created automatically by the ingest attachment processor when indexing the encoded PDF content.
However, I have now:
--> created a new index
--> created a mapping for the field "attachment"."content" with the 'stem' analyzer on it
--> indexed base64-encoded content
--> then ran the search
It worked perfectly.
Thank you again.
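
The last two steps can be sketched roughly as follows. This assumes an index (hypothetically named `idx_stemmed`) already exists with the stemming analyzer mapped on `attachment.content`, and that the `attachment` pipeline reads the base64 payload from a field called `data`; the thread never shows the pipeline definition, so that part is a guess based on the `_source` in the first post. Note that a field-qualified query-string search is written `field:term`, with a colon.

```shell
# Assumption: Elasticsearch 5.x reachable at this URL; override ES if needed.
ES="${ES:-http://localhost:9200}"
JSON='Content-Type: application/json'

# Assumed pipeline definition: extract text from the base64 content in "data".
curl -sS -XPUT "$ES/_ingest/pipeline/attachment?pretty" -H "$JSON" -d '
{
  "description": "Extract attachment information",
  "processors": [ { "attachment": { "field": "data" } } ]
}' || echo "request failed: is Elasticsearch reachable at $ES?"

# Base64-encode the sample sentence used in the thread.
DATA=$(printf 'This index having information' | base64)

# Index the document through the pipeline; refresh=true makes it searchable at once.
curl -sS -XPUT "$ES/idx_stemmed/test/1?pipeline=attachment&refresh=true&pretty" \
  -H "$JSON" -d "{ \"data\": \"$DATA\" }" || echo "indexing request failed"

# Optional check: list the tokens that the analyzer of attachment.content
# produces for 'having'.
curl -sS -XGET "$ES/idx_stemmed/_analyze?pretty" -H "$JSON" \
  -d '{ "field": "attachment.content", "text": "having" }' || echo "analyze request failed"

# Search for the stemmed form on the extracted content.
curl -sS -XGET "$ES/idx_stemmed/_search?q=attachment.content:have&pretty" \
  || echo "search request failed"
```

If the `_analyze` call returns the same token for 'having' and 'have', the search for 'have' should match the document.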

