Sentence corpus


(Hector) #1

Hi all,
I have a collection of sentences from different sources (e.g.
Wikipedia) in different languages (e.g. English, Spanish). In other
words, it is a multilingual corpus.

I'm currently using Python + MongoDB to store and query these
sentences, but would like to start exploring Elasticsearch. The
following is a description of my use case (preferably using Python):

(1) Take a bunch of plain text files;
(2) Do some pre-processing on them to convert each sentence into a
JSON document that Elasticsearch can ingest;
(3) Be able to control which words in each sentence are actually
indexed; for example, I don't need stemming (though I might use
lemmatization at some point), and I don't need to index numbers;
(4) Bulk load all these sentences into Elasticsearch;
(5) Create an index (if it's not already there);
(6) Then be able to search something like "from corpus C, in language
L, find the set of sentences S that only contain words which are in a
permissible set of words P".
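
To make the steps above more concrete, here is a rough Python sketch of
what I have in mind, using only the standard library. Everything named
here is an assumption for illustration: the index name "sentences", the
field names (corpus, lang, text, tokens), the sample sentences and the
three helper functions. The bulk body follows the newline-delimited
action/source layout that, as far as I understand, the _bulk endpoint
expects in current versions. The last function performs the step (6)
check on the client side, only to pin down the semantics I am after; it
is not an Elasticsearch query.

```python
import json
import re

# Hypothetical tokenizer: runs of word characters, Unicode-aware.
TOKEN_RE = re.compile(r"\w+", re.UNICODE)


def to_doc(sentence, corpus, lang):
    """Steps (2)/(3): build a JSON-ready dict, dropping purely numeric tokens."""
    tokens = [t.lower() for t in TOKEN_RE.findall(sentence) if not t.isdigit()]
    return {"corpus": corpus, "lang": lang, "text": sentence, "tokens": tokens}


def bulk_payload(docs, index="sentences"):
    """Step (4): one action line plus one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


def only_permissible(docs, corpus, lang, permitted):
    """Step (6), client-side: sentences whose tokens all belong to `permitted`."""
    permitted = set(permitted)
    return [
        d["text"]
        for d in docs
        if d["corpus"] == corpus
        and d["lang"] == lang
        and d["tokens"]
        and set(d["tokens"]) <= permitted
    ]


docs = [
    to_doc("The cat sat.", "wikipedia", "en"),
    to_doc("The cat ate 2 fish.", "wikipedia", "en"),
    to_doc("El gato duerme.", "wikipedia", "es"),
]
print(only_permissible(docs, "wikipedia", "en", {"the", "cat", "sat"}))
# prints ['The cat sat.']
```

The string returned by bulk_payload(docs) is what I would expect to POST
to the _bulk endpoint; the open question is how to express the
only_permissible check as a query on the server side.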

Does this look like a good fit for Elasticsearch? Could you please
provide pointers on how to approach each of the steps above?
Thanks!
-- Hector

