Capping memory usage during the tokenise phase, and other queries

Hi Everybody,

As part of a University project in bioinformatics, I'm investigating the
use of ElasticSearch for searching for k-mers in complete genomes, and for
performing certain types of queries over them. For the investigation we are
using the latest version on GitHub (v0.90.0-RC2).

We have come up with a number of issues and would like some advice on
possible workarounds or solutions.

1. We have written our own tokeniser for the genome data we are processing,
and are able to tokenise smaller genomes of roughly 150,000-250,000 base
pairs (bp). (A base pair is a single letter in a DNA sequence, e.g. A, G,
C, T.) We split the DNA sequence into k-mers (a k-mer is a sub-sequence or
sub-string of a DNA sequence, of length k) and store these in ES. A DNA
sequence of 100,000 base pairs generates approximately 1,500,000
terms/k-mers/tokens during the tokenisation phase, and uses about 400-600MB
of RAM while it is being tokenised.
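
To give an idea of what the tokeniser does, here is a minimal sketch of its
core, stripped down to a single fixed k for brevity (the class, variable
and parameter names are just illustrative, not our real code). It slides a
window of length k over the character stream and emits one term per
starting position, recording character offsets, using the Lucene 4.x
Tokenizer API that ES builds on:

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Minimal k-mer tokeniser sketch (single fixed k for brevity).
// Slides a window of length k over the character stream and emits
// one term per starting position, with character offsets recorded.
public final class KmerTokenizer extends Tokenizer {

    private final int k;
    private final char[] window;
    private int filled = 0;   // valid characters currently in the window
    private int start = 0;    // start offset of the next k-mer

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

    public KmerTokenizer(Reader input, int k) {
        super(input);
        this.k = k;
        this.window = new char[k];
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        // Top the window up to k characters; stop when the sequence ends.
        while (filled < k) {
            int c = input.read();
            if (c == -1) {
                return false;          // no complete k-mer left
            }
            window[filled++] = (char) c;
        }
        termAtt.copyBuffer(window, 0, k);
        offsetAtt.setOffset(correctOffset(start), correctOffset(start + k));
        // Slide the window one character to the right for the next call.
        System.arraycopy(window, 1, window, 0, k - 1);
        filled = k - 1;
        start++;
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        filled = 0;
        start = 0;
    }
}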

Our issue is that the DNA sequences (bacteria) we are interested in have
between 1,000,000 (1 million) and 10,000,000 (10 million) base pairs, which
works out to roughly 150M terms to be tokenised and indexed for a single
document (each term is roughly 5-20 characters in length). We did try to
run these files through ES, but hit OutOfMemoryError (Java heap space) very
quickly, and soon realised that ES was trying to process the entire DNA
sequence in RAM before committing anything to disk. By our calculations ES
would consume approximately 40-60GB of RAM each time we run the
index/tokenise process over a single sequence, and unfortunately we don't
have full-time access to hardware with enough RAM to hold all of this in
memory at once. (While we can get access to machines with 256GB+ RAM per
node, our access is very time-limited.) Our initial dataset is
approximately 4000 DNA sequences, but that is secondary to the topic.

We thought about splitting the documents into segments and processing them
piecemeal; however, this interferes with the tokenisation process, because
the terms/k-mers being extracted overlap previous terms/k-mers. So we need
either a way to flush the current items to disk so we don't hit an
out-of-RAM condition, or a clean way to merge two documents into one that
handles duplicate term_vectors correctly (taking into account that a single
offset into the document/field may point to different terms/k-mers).
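
To illustrate the splitting idea, a hypothetical helper like the one below
would chunk a sequence with k-1 characters of overlap, so no k-mer is lost
at a chunk boundary; but the k-mer offsets inside each later chunk are then
relative to that chunk rather than to the whole sequence, which is exactly
where the term_vector merging problem above comes from:

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split a sequence into chunks of at most chunkSize
// characters, overlapping by k-1 characters so boundary k-mers survive.
// Each chunk's k-mer offsets are relative to its own start, not to the
// whole sequence, so they would need correcting before any merge of
// term_vectors. Requires chunkSize >= k.
public final class SequenceChunker {

    public static List<String> chunk(String sequence, int chunkSize, int k) {
        List<String> chunks = new ArrayList<String>();
        int step = chunkSize - (k - 1);   // advance so chunks overlap by k-1
        for (int from = 0; from < sequence.length(); from += step) {
            int to = Math.min(from + chunkSize, sequence.length());
            chunks.add(sequence.substring(from, to));
            if (to == sequence.length()) {
                break;
            }
        }
        return chunks;
    }
}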

I've looked through the API documentation and the source of some of the
publicly available plugins, and can't see a way to flush the current state
of the tokeniser/index to disk so we can free up some RAM, or to do a
merge. Is there some modification we can make to ES to ensure we don't hit
out-of-memory errors during the tokenisation process? (Our development
desktop machines have between 4 and 8GB of RAM, and ideally we would like
this to work with ES using only 2GB.)

  2. We also need to get the offset(s) at which a term/k-mer occurs in the
    DNA sequence as part of our query results. I've seen that Ryan S (
    https://groups.google.com/d/topic/elasticsearch/1bHW9-wuPPk/discussion )
    and a few others have had similar questions, and Ryan even implemented
    his own solution. Is this now available in ES (in the master branch on
    GitHub, e.g. v0.90.0-RC2), or would I have to use Ryan's modifications
    to get access to the term_vectors? (Our mapping uses "term_vector":
    "with_positions_offsets" to ensure they are stored; a sketch of the
    mapping is included after question 4 below.) We are using a mix of the
    Java API and the RESTful API for different components, if that matters
    at all.

  3. Using the current filters/facets, is it possible to build a query that
    returns documents where "termA" is within "x" distance of "termB",
    where "x" is either the character offset as stored in the term_vector,
    or the number of tokens between the two terms? For example, can you
    show me the documents that contain "Children" and "TV" with fewer than
    30 characters (i.e. raw offset distance) between the two terms, or
    alternatively the documents that contain "Children" and "TV" with fewer
    than 5 words/terms between them? (The span query sketch after question
    4 below shows the kind of thing we mean for the token case.)

  3. Is the "Similarity API" of Lucene 4.x exposed by ES yet? (Can't find
    anything in the documentation, but admittedly have only starting looking
    for this aspect). We have a use case of being able to generate a Gram
    matrix of all DNA sequences stored with ES, based on how similar two
    sequences are to each other. If it is a case we have a query that only
    compares 2 documents at a time and provides a score, then that's
    acceptable, but we love to have a single query that generates this
    information for us. (eg Show me how similar this 1 document is to all other
    documents in ES, where we can either just provide a new document, or give
    an ID of an existing document in ES). This would be limited to documents of
    the same type within the same index for the sake of simplifying our own use
    case.
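
For reference on question 2, this is roughly how the mapping is set up
through the Java API (the type name "genome", the field name "sequence" and
the analyser name are placeholders rather than our real names); the
resulting builder is what goes into preparePutMapping on the indices admin
client:

import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;

// Sketch of the mapping referred to in question 2: the "sequence" field
// stores term vectors with positions and offsets. Type, field and
// analyser names are placeholders.
public final class GenomeMapping {

    public static XContentBuilder build() throws Exception {
        return XContentFactory.jsonBuilder()
            .startObject()
                .startObject("genome")                        // type name
                    .startObject("properties")
                        .startObject("sequence")              // the DNA sequence field
                            .field("type", "string")
                            .field("analyzer", "kmer_analyzer")  // our custom analyser
                            .field("term_vector", "with_positions_offsets")
                        .endObject()
                    .endObject()
                .endObject()
            .endObject();
    }
}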
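
And for the token-distance half of question 3, the closest construct we can
see is a span_near query; a sketch via the Java API follows (field and
index names are placeholders, and the slop of 5 matches the example above).
As far as we can tell, slop counts intervening token positions, so this
does not cover the raw character-offset case:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.SpanNearQueryBuilder;

// Sketch: find documents where "children" and "tv" occur within 5 token
// positions of each other in the "body" field (field and index names are
// placeholders; terms are lowercased assuming a lowercasing analyser).
public final class ProximityQueryExample {

    public static SearchResponse run(Client client) {
        SpanNearQueryBuilder query = QueryBuilders.spanNearQuery()
            .clause(QueryBuilders.spanTermQuery("body", "children"))
            .clause(QueryBuilders.spanTermQuery("body", "tv"))
            .slop(5)            // at most 5 intervening positions
            .inOrder(false);    // either term may come first

        return client.prepareSearch("articles")   // placeholder index name
            .setQuery(query)
            .execute()
            .actionGet();
    }
}

The equivalent on the REST side would be the span_near query with clauses,
slop and in_order.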

So far our team has been very impressed with ES, and we hope you keep up
the fantastic work.

I look forward to hearing any responses to my queries.

Kind Regards,
Darran.
