I have a solution in mind for a problem, and I would like some feedback
on it before I embark on experimenting.
My situation is as follows:
I have a large data set of named locations (venues, neighborhoods,
streets, POIs, etc.), each of which may have several alternative names,
synonyms, and so on. The data set includes tens of millions of names.
Elasticsearch is great for indexing this type of data and I already use ES
on this data for other use cases.
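For reference, the kind of index this assumes could look something like the sketch below: a hypothetical mapping with a completion field that carries every name variant as a suggester input. The field names ("name", "synonyms", "suggest") are illustrative assumptions, not taken from my actual index.

```python
# Hypothetical mapping for a location index whose names and synonyms
# all feed a single FST-backed completion suggester field.
location_mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "synonyms": {"type": "text"},
            # completion field backing the suggester
            "suggest": {"type": "completion"},
        }
    }
}

def to_doc(name, synonyms):
    """Build an index document whose suggest field covers all name variants."""
    return {
        "name": name,
        "synonyms": synonyms,
        "suggest": {"input": [name] + synonyms},
    }
```

The point of folding all variants into one completion field is that a single suggest query then matches any of a location's names.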
My problem is identifying relevant references to those locations in text.
So, given e.g. a newspaper article that mentions several locations by
name, I'd like to come up with a short list of possible matches in my
location index. It doesn't have to be very exact and it doesn't have to be
very fast. False positives are OK (and inevitable) but false negatives
would be a concern. If I can get a reasonably short list of locations, I
can post process the results to get rid of false positives.
The normal approach for this kind of problem is to use NLP with some
machine learning model that picks apart the text and comes up with a list
of locations. My main issue with this approach is that it misses some
references (false negatives) and it doesn't really benefit from all the
location metadata that I have indexed. I've played around with a few NLP
products and libraries and am so far not overly impressed with the results.
It's a very hard problem and there is just a lot of ambiguity in text.
Looking at the recent suggester and finite state machine work in Lucene
that makes it possible to fire thousands of completion queries per second
at my huge index, it occurred to me that I might just generate thousands of
queries from the article and brute force my way through the problem and get
a reasonably short list of possible locations for any article. If each
query takes e.g. 2ms and I need to run 20,000 of them to figure out whether
combinations of words in the text match any location names, that is
feasible in well under a minute, and with ES I can just throw more
hardware at the problem if I need it to be faster. If I could
get that down to a few seconds, that would be perfectly acceptable. A
reasonably short list of possible matches would allow me to use other
techniques in a post processing step to figure out the best ones.
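To make the brute-force idea concrete, here is a minimal sketch of what I have in mind: slice the article into word n-grams and build one completion-suggest request body per n-gram. Batching the bodies through the multi-search API would amortize per-request overhead. The field name "suggest" and the size parameter are assumptions matching a hypothetical index, not a finished design.

```python
import re

def ngrams(text, max_len=4):
    """Yield every 1..max_len consecutive-word phrase from the text."""
    words = re.findall(r"\w+", text, re.UNICODE)
    for i in range(len(words)):
        for n in range(1, max_len + 1):
            if i + n <= len(words):
                yield " ".join(words[i:i + n])

def suggest_body(phrase, field="suggest", size=5):
    """Build a completion-suggest request body for one candidate phrase."""
    return {
        "suggest": {
            "loc": {
                "prefix": phrase,
                "completion": {"field": field, "size": size},
            }
        }
    }

# Even a short sentence yields dozens of candidate queries:
bodies = [suggest_body(g) for g in ngrams("A fire broke out near Union Square")]
```

Running each body as a suggest request (or packing them into _msearch) would give the short candidate list I want to post-process.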
Does this sound like a good plan? Are there other features in ES that I
might use? Or am I being naive?
You received this message because you are subscribed to the Google Groups "elasticsearch" group.