Using ElasticSearch analyzers outside of ElasticSearch

Just thinking aloud, have not tried to implement anything, but I
probably will soon...

I am currently using span queries for the bulk of my queries.
Unfortunately, span queries only support term queries, which means no
analysis happens on the query terms. My current approach uses a Lucene
analyzer to analyze the terms passed to the SpanTermQueries.
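
To make the "analyze first, then build term queries" step concrete, here
is a minimal sketch in plain Java. It is an illustration only, not the
actual code: SpanTerm is a hypothetical stand-in for Lucene's
SpanTermQuery, and simpleAnalyze (lowercase plus whitespace split) is a
hypothetical stand-in for whatever a real Lucene Analyzer would do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

public class SpanTermSketch {

    // Hypothetical stand-in for Lucene's SpanTermQuery: one exact term on one field.
    public static final class SpanTerm {
        public final String field;
        public final String term;

        public SpanTerm(String field, String term) {
            this.field = field;
            this.term = term;
        }

        @Override
        public String toString() {
            return field + ":" + term;
        }
    }

    // Hypothetical stand-in for an analyzer: lowercases and splits on whitespace.
    // In real code this role would be played by a Lucene Analyzer.
    public static List<String> simpleAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Runs the raw query text through the analyzer, then emits one
    // term clause per resulting token. The terms are never analyzed again.
    public static List<SpanTerm> toSpanTerms(String field, String text,
                                             Function<String, List<String>> analyzer) {
        List<SpanTerm> clauses = new ArrayList<>();
        for (String token : analyzer.apply(text)) {
            clauses.add(new SpanTerm(field, token));
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<SpanTerm> clauses =
                toSpanTerms("body", "Quick  Brown Fox", SpanTermSketch::simpleAnalyze);
        System.out.println(clauses);
    }
}
```

The analyzer is passed in as a function so the query-building code does
not care where the tokens come from. If the function used at query time
differs from the analyzer used at index time, the term clauses will not
match the indexed terms.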

Using both a custom Lucene analyzer and an ElasticSearch analyzer (via
elasticsearch.yml) has numerous issues: two systems to support,
potential mismatches between them, duplication of effort, etc. The
analyze API is useful, but the cost of a network hop to analyze each
term would be too high.

My current thinking is to create a local embedded ElasticSearch
instance whose sole purpose would be to fulfill analyze requests. The
existing TransportClient would continue to communicate with the actual
cluster, while this new NodeClient would only exist within the JVM.
The embedded node would use the same analyzer definitions found in the
cluster's elasticsearch.yml file ("include" config files would be
extremely useful).

Questions/thoughts:

  1. I have never created an embedded ElasticSearch server, but I assume
    I can construct one that does not interfere with the existing cluster
    and/or the other middle boxes in the network. Correct?

  2. Performance-wise, I expect that using the analyze API locally
    would be identical to using a Lucene analyzer.

  3. How heavyweight is an embedded ElasticSearch instance?

Cheers,

Ivan

To answer my own question:

I tried creating an embedded instance and quickly ran into the issue
that a custom analyzer is not created unless it is tied to an index.
I then simply followed the code used by the analysis unit tests
(AnalysisModuleTests) and built an analyzer using the various
ElasticSearch modules. Works like a charm.

Cheers,

Ivan

On Wed, Aug 1, 2012 at 11:07 AM, Ivan Brusic ivan@brusic.com wrote:
