I have a large collection of strings that each contain information about a
certain product. For example:
wine Bardolo red 1L 12b 12%
La Tulipe, 13* box 3 bottles, 2005
Great Johnny Walker 7CL 22% red label
Wisky Jonny Walken .7 Red limited editon
The number of product names is limited, as are most other properties, but
they might be misspelled.
I would like to extract keywords from all those strings. Product name,
product type, volume, etc. But I'm not sure what the best approach would be
and whether Elasticsearch would be the tool of choice. I've looked at
PostgreSQL's trigram extension (pg_trgm) since all the data sits in a
PostgreSQL db at the moment, but that seems limited (I've sketched the kind of
query I mean further down). I was thinking about creating some kind of master
list of proper keywords and trying to match the words from a string against
those keywords. These words could be misspelled, meaning they would have to
be:
fuzzy matched (see the sketch right after this list)
matched by hand
matched by some sort of neural network trained on existing data
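To make the fuzzy matching idea concrete, this is roughly what I had in mind,
just using Python's difflib against a tiny made-up master list (the real list,
the threshold and the tokenisation would obviously need work):

    import difflib

    # Made-up master list of canonical keywords; the real one would come
    # from the database and be much larger.
    MASTER_KEYWORDS = ["johnny", "walker", "whisky", "wine", "bardolino",
                       "tulipe", "red", "label", "bottle", "box"]

    def match_tokens(raw):
        """For each token, return the closest canonical keyword, if any."""
        matches = {}
        for token in raw.lower().split():
            # cutoff=0.75 is a guess; it trades recall for precision
            hits = difflib.get_close_matches(token, MASTER_KEYWORDS,
                                             n=1, cutoff=0.75)
            if hits:
                matches[token] = hits[0]
        return matches

    print(match_tokens("Wisky Jonny Walken .7 Red limited editon"))
    # e.g. {'wisky': 'whisky', 'jonny': 'johnny', 'walken': 'walker', 'red': 'red'}

This catches the obvious misspellings, but it knows nothing about context (it
can't tell that ".7" is a volume), which is part of why I'm wondering whether
a search engine would do a better job.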
Someone suggested "analyzing the entire string as an ngram using the ngram
tokenizer", but I'm not sure. Any pointers on where I should direct my effort
would be highly appreciated!
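In case it helps the discussion, this is how I understood the ngram-tokenizer
suggestion. It's only a sketch, assuming Elasticsearch 1.x and the official
Python client, and the index/field names are made up:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # assumes a local node on localhost:9200

    # Index whose "description" field is analysed into lowercase trigrams,
    # so misspelled words still share most of their grams with correct ones.
    es.indices.create(index="products", body={
        "settings": {
            "analysis": {
                "tokenizer": {
                    "trigrams": {"type": "ngram", "min_gram": 3, "max_gram": 3}
                },
                "analyzer": {
                    "trigram_analyzer": {
                        "type": "custom",
                        "tokenizer": "trigrams",
                        "filter": ["lowercase"]
                    }
                }
            }
        },
        "mappings": {
            "product": {
                "properties": {
                    "description": {"type": "string",
                                    "analyzer": "trigram_analyzer"}
                }
            }
        }
    })

    es.index(index="products", doc_type="product",
             body={"description": "Wisky Jonny Walken .7 Red limited editon"})
    es.indices.refresh(index="products")

    # A match query against the trigram field tolerates the misspellings.
    res = es.search(index="products", body={
        "query": {"match": {"description": "Johnny Walker red"}}
    })
    print([hit["_source"]["description"] for hit in res["hits"]["hits"]])

If I understand it correctly, this gives me fuzzy retrieval, but it doesn't by
itself label which token is the product name, which is the volume, and so on,
which is the part I'm most unsure about.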
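And for reference, this is roughly the kind of pg_trgm lookup I meant when I
said it seems limited; again only a sketch, with made-up table and column
names, and it assumes CREATE EXTENSION pg_trgm has already been run:

    import psycopg2

    conn = psycopg2.connect("dbname=products")  # connection details are placeholders
    cur = conn.cursor()

    term = "Jonny Walken"
    cur.execute(
        """
        SELECT keyword, similarity(keyword, %s) AS sml
        FROM master_keywords
        WHERE keyword %% %s   -- trigram match above the default 0.3 threshold
        ORDER BY sml DESC
        LIMIT 5
        """,
        (term, term),
    )
    print(cur.fetchall())

It does fuzzy lookups fine, but I'd still have to split the strings myself and
decide which words to look up, which is where it felt limited to me.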
I'm still researching this, but I have too little experience with it to draw a
conclusion with any certainty. Do any of you Elasticsearch experts know
whether ES is the right tool for the job?