Using ElasticSearch analyzers outside of ElasticSearch

Ivan · August 1, 2012, 6:07pm

Just thinking aloud, have not tried to implement anything, but I
probably will soon...

I am currently using span queries for the bulk of my queries.
Unfortunately, span queries only support term queries, which means no
analysis happens on the query terms. My current approach uses a
Lucene analyzer to analyze the terms used by the SpanTermQueries.
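
To illustrate why the query terms need the same analysis as the indexed text, here is a minimal sketch in plain Java with no Lucene or Elasticsearch dependency. ToyAnalyzer, analyze and termMatches are hypothetical names made up for this example; a real setup would use an actual Lucene Analyzer, and the tokenizing rule below (split on non-alphanumerics, then lowercase) is only an assumption standing in for a real analysis chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for an analysis chain; NOT a Lucene or Elasticsearch class.
public class ToyAnalyzer {

    // Splits on anything that is not a letter or digit, then lowercases each
    // token, roughly what a simple tokenizer plus a lowercase filter would do.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty()) { // split() can yield a leading empty string
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // A term-level lookup compares the query term to the indexed terms verbatim,
    // which is the behaviour a term-based query gives you: no analysis at all.
    public static boolean termMatches(List<String> indexedTerms, String queryTerm) {
        return indexedTerms.contains(queryTerm);
    }

    public static void main(String[] args) {
        // What ends up in the index after analysis: [the, quick, brown, fox]
        List<String> indexed = analyze("The Quick-Brown Fox");

        String userTerm = "Quick";

        // Raw user input does not match the lowercased indexed term.
        System.out.println(termMatches(indexed, userTerm)); // false

        // Running the same analysis on the query side makes it match.
        System.out.println(termMatches(indexed, analyze(userTerm).get(0))); // true
    }
}
```

The point of the sketch is only that the query-side analysis has to be the same chain as the index-side one; any difference between the two definitions shows up as silent misses like the first lookup.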

Using both a custom Lucene analyzer and an ElasticSearch analyzer (via
elasticsearch.yml) has numerous issues: two systems to support,
potential mismatches, duplicated effort, etc. The analysis API is
useful, but the cost of a network hop to analyze each term would be
too high.

My current thinking is to create a local embedded ElasticSearch
instance whose sole purpose would be to fulfill analyze requests. The
existing TransportClient would continue to communicate with the actual
cluster, while this new NodeClient would only exist within the JVM.
The embedded node would use the same analyzer definitions found in the
cluster's elasticsearch.yml file ("include" config files would be
extremely useful).

Questions/thoughts:

I have never created an embedded ElasticSearch server, but I assume
I can construct one that does not interfere with the existing cluster
and/or the other middle boxes in the network. Correct?

Performance-wise, I expect that using the analyze API locally would
perform identically to using a Lucene analyzer directly.

How heavyweight is an embedded ElasticSearch instance?

Cheers,

Ivan

Ivan · August 1, 2012, 11:39pm

To answer my own question:

I tried creating an embedded instance and quickly ran into the issue
that a custom analyzer is not created unless it is tied to an index.
I then simply followed the code used by the analysis unit tests
(AnalysisModuleTests) and built an analyzer using the various
Elasticsearch modules. Works like a charm.

Cheers,

Ivan

