How to match words already indexed?

I have a big list of words. Each word has a meaning.
Example:
{ word : "new york",
meaning :"city" }
{ word : "today",
meaning :"date" }
I have indexed all these words, each one as its own document.

Now I have a big dataset of sentences. For each sentence, I want to find which meanings it contains.
For example: I will go to Paris today => ["city", "date"]

What is the best way to do this? Do I have to index the sentences to make it fast?
And how should I do it when I have 1M+ words indexed and 1M+ sentences in which to identify the meanings?

Thank you very much.

This sounds like a case for percolation.

Instead of storing your word lists in ordinary indices, you can index them as queries using the percolator field type. This creates a percolator index where each percolator (document) holds a query that acts as a rule. For instance, in your case one document can match a long list of cities and thus map city names to the concept of "city", while another document can match various date formats and map those to the concept of "date".
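As a rough sketch of what such rule documents might look like, the Python below only builds the request bodies; it does not talk to a cluster. The field names `query`, `text` and `meaning`, the helper `make_rule`, and the sample phrases are assumptions made for illustration, and the typeless mapping layout assumes Elasticsearch 7 or later.

```python
import json

# Index body for the percolator index (field names are illustrative).
# "query" stores the rule, "text" describes how percolated sentences are
# analyzed, and "meaning" is the label we want back.
PERCOLATOR_INDEX_BODY = {
    "mappings": {
        "properties": {
            "query": {"type": "percolator"},
            "text": {"type": "text"},
            "meaning": {"type": "keyword"},
        }
    }
}


def make_rule(meaning, phrases):
    """Build one percolator document that matches any of the given phrases."""
    return {
        "meaning": meaning,
        "query": {
            "bool": {
                "should": [{"match_phrase": {"text": p}} for p in phrases],
                "minimum_should_match": 1,
            }
        },
    }


# One document per meaning instead of one document per word.
rules = [
    make_rule("city", ["new york", "paris"]),
    make_rule("date", ["today", "tomorrow"]),
]

print(json.dumps(rules[0], indent=2))
```

Grouping many phrases under one rule keeps the number of percolators small, but whether that is faster than one percolator per word is something to benchmark rather than assume.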

To use the percolator index, simply take a list of documents (they don't have to be indexed first) and percolate each of them against the percolator index using the special Percolate Query. This query will return the matching percolators, where each percolator tells you something about the contents of the document that matched, for instance that the document mentioned a "city" and a "date".
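To make the percolate step concrete, here is a minimal Python sketch that builds a percolate query body for several sentences and reads the meanings back from a response. The field names `query`, `text` and `meaning` and both helper functions are assumptions. As far as I know, when several documents are percolated at once, each hit carries a `_percolator_document_slot` field listing which of the submitted documents matched; please verify this against your Elasticsearch version. The response here is a hand-written stand-in, not real cluster output.

```python
def build_percolate_query(sentences, field="query", text_field="text"):
    """Build the body of a search request that percolates all sentences at once."""
    return {
        "query": {
            "percolate": {
                "field": field,
                "documents": [{text_field: s} for s in sentences],
            }
        }
    }


def meanings_per_sentence(response, sentence_count):
    """Collect, for each submitted sentence, the meanings of the matching rules."""
    found = [set() for _ in range(sentence_count)]
    for hit in response["hits"]["hits"]:
        meaning = hit["_source"]["meaning"]
        # Slots are the positions of the matching sentences in "documents".
        for slot in hit.get("fields", {}).get("_percolator_document_slot", []):
            found[slot].add(meaning)
    return [sorted(m) for m in found]


sentences = ["I will go to Paris today", "Nothing to see here"]
body = build_percolate_query(sentences)

# Hand-written stand-in for what the cluster might return.
fake_response = {
    "hits": {
        "hits": [
            {"_source": {"meaning": "city"}, "fields": {"_percolator_document_slot": [0]}},
            {"_source": {"meaning": "date"}, "fields": {"_percolator_document_slot": [0]}},
        ]
    }
}

result = meanings_per_sentence(fake_response, len(sentences))
print(result)
```

Note that a search returns only a limited number of hits by default, so if many rules can match one batch you would probably need to raise `size` on the request.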

Thank you for your reply !
I have already tested the percolator type.
I tested with 1700 words and 500+ sentences => it takes more than 2 minutes.
I have found a tool that matches words (flashtext, it's open source) and does it in 12-15 seconds, but it doesn't match regexes like the percolator does.

How can I make it faster in ES?

I don't think there is a simple answer; I always have to benchmark different solutions to find the fastest. But the obvious parameters to play with are:

  1. Number of primary shards in the percolator index. Because queries are run in parallel on the shards, more shards (up to a point) means more data is percolated in parallel.
  2. Number of data nodes in the cluster. The more data nodes you have in the cluster the more parallel queries you can run (as long as each node gets some of the shards).
  3. Percolation batch size. The number of documents (your "sentences") in one percolation query ("batch query") can also affect the time it takes, since the coordinating node must allocate memory to gather all the results before returning them back to the client. In my percolator tests I've done 50, 100, 200 and 500 documents per batch and usually find the 50-100 value slightly faster (I usually measure Average Documents Per Second over a data set of perhaps a million documents).
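The batch-size experiment from point 3 could be set up roughly as follows. This is only a sketch: `batched`, `docs_per_second` and `fake_percolate` are hypothetical helpers, the shard count of 4 is an arbitrary starting value, and the stand-in function would have to be replaced by a real call that sends one percolate query per batch.

```python
import time
from itertools import islice

# Settings for the percolator index; the shard count is something to tune.
INDEX_SETTINGS = {"settings": {"number_of_shards": 4}}


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def docs_per_second(sentences, batch_size, percolate_fn):
    """Percolate all sentences in batches and return the average throughput."""
    start = time.perf_counter()
    for batch in batched(sentences, batch_size):
        percolate_fn(batch)
    elapsed = time.perf_counter() - start
    return len(sentences) / elapsed if elapsed > 0 else float("inf")


def fake_percolate(batch):
    """Stand-in for one percolate request; replace with a real client call."""
    return [[] for _ in batch]


sentences = [f"sentence {i}" for i in range(230)]
batch_sizes = [len(b) for b in batched(sentences, 100)]
print(batch_sizes)

for size in (50, 100, 200, 500):
    rate = docs_per_second(sentences, size, fake_percolate)
    print(size, "documents per batch:", round(rate), "docs/s")
```

With a real cluster behind `percolate_fn`, running the loop over a large sample of sentences gives the documents-per-second figure for each batch size, which can then be compared across shard and node counts.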

I hope you can try some of these suggestions in your cluster to see if they help speed up the percolation. Good luck!

Thank you very much !
