Sentence corpus

Hector · May 9, 2012, 6:58pm

Hi all,
I have a collection of sentences from different sources (e.g.
Wikipedia) in different languages (e.g. English, Spanish), in other
words a multilingual corpus.

I'm currently using Python + MongoDB to store and query these
sentences, but would like to start exploring Elasticsearch. Here is
a description of my use case (preferably done in Python):

(1) Take a bunch of plain text files;
(2) Do some pre-processing on them to convert each sentence into a
JSON document that Elasticsearch can accept;
(3) Be able to control which words in each sentence are actually
indexed; for example, I don't need stemming, but I might use
lemmatization at some point, and I don't need to index numbers;
(4) Bulk load all these sentences into Elasticsearch;
(5) Create an index (if it's not already there);
(6) Then be able to search something like "from corpus C, in language
L, find the set of sentences S that only contain words which are in a
permissible set of words P".
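
To make steps (2), (3), and (6) more concrete, here is a minimal sketch in plain Python. It does not talk to Elasticsearch at all; it only illustrates the document shape and the query semantics I'm after. The field names (`corpus`, `lang`, `text`, `tokens`), the index name, and the naive regex-based splitter and tokenizer are all assumptions made for illustration. The only Elasticsearch-specific part is the newline-delimited layout of the bulk body (an action line followed by a source line), and depending on the version the action line may need more metadata than shown.

```python
import json
import re

# Letters only: digits and underscores are excluded, so numbers are never
# emitted as tokens (a mixed token like "abc123" yields just "abc").
TOKEN_RE = re.compile(r"[^\W\d_]+")


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def to_doc(sentence, corpus, lang):
    """Step (2)/(3): one JSON-ready dict per sentence, with the kept tokens."""
    tokens = [t.lower() for t in TOKEN_RE.findall(sentence)]
    return {"corpus": corpus, "lang": lang, "text": sentence, "tokens": tokens}


def to_bulk_body(docs, index_name):
    """Step (4): newline-delimited action/source pairs, ending with a newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"


def only_permitted(docs, corpus, lang, permitted):
    """Step (6): sentences from corpus/lang whose tokens are all in `permitted`."""
    permitted = set(permitted)
    return [
        d for d in docs
        if d["corpus"] == corpus
        and d["lang"] == lang
        and set(d["tokens"]) <= permitted
    ]
```

For example, with `docs = [to_doc(s, "wiki", "en") for s in split_sentences(text)]`, calling `only_permitted(docs, "wiki", "en", {"the", "cat", "sat"})` keeps only the sentences made up entirely of those three words. What I'd like to know is how to express this subset condition as an Elasticsearch query.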

Does this look like a good use case for Elasticsearch? Could you
please provide pointers on how to approach the steps above?
Thanks!
-- Hector
