Hi,
I'd like to learn how you use the Java client API for Elasticsearch and
what your experiences are so far.
My scenario is a web app (*.war or similar) running on an app server (e.g.
Glassfish, JBoss, etc.), installed as a front-end to Elasticsearch (for
security and query translation). The cluster can be remote (but need not
be).
I need robust access, that is, each query or indexing request must be
reliably answered by a success or failure event, and of course I need fast
response times.
There are at least three variants:
a) a Node in client mode
b) a TransportClient
c) the REST API (HTTP port 9200)
Let's discuss some of the pros (+) & cons (-) from my naive view as an app
developer:
a) + zero-conf, out-of-the-box cluster discovery
+ automatic failover
+ fluent API interface
- overhead of an internal node joining the cluster(?)
- no way to set up the network interface
- unreadable binary protocol over port 9300
b) + almost zero-conf, configurable network interface setup
+ automatic failover
+ fluent API interface
- slight overhead of the TransportClient when connecting to the cluster
(finding the currently reachable nodes)
- needs an additional "sniff" mode for automatic node (re)discovery
- unreadable binary protocol over port 9300
c) + readable protocol over port 9200 (HTTP)
- no zero-conf cluster discovery; failover only with an external load
balancer
- overhead of JSON/Java serialization and deserialization
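For reference, here is a minimal sketch of how I instantiate a) and b) with
the Java API as I understand it; cluster name, host, and port are
placeholders:

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.node.Node;
import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class ClientVariants {
    public static void main(String[] args) {
        // a) internal node in client mode: joins the cluster, holds no data
        Node node = nodeBuilder()
                .clusterName("mycluster") // placeholder
                .client(true)
                .node();
        Client nodeClient = node.client();
        // ... use nodeClient ...
        node.close();

        // b) TransportClient: connects to one or more known nodes on port 9300
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "mycluster") // placeholder
                .put("client.transport.sniff", true) // node (re)discovery
                .build();
        TransportClient transportClient = new TransportClient(settings)
                .addTransportAddress(
                        new InetSocketTransportAddress("es-host", 9300));
        // ... use transportClient ...
        transportClient.close();
    }
}

With sniff enabled, the TransportClient is supposed to discover the other
cluster nodes by itself; without it, only the listed addresses are used.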
Right now, I have decided to go with b).
My assumption was that I could manage a TransportClient singleton as a
long-lived object. I struggled with connections apparently being dropped
(after a certain period of inactivity?), so subsequent client operations
failed with "no node available", and I found no API method to refresh the
connection. It is a challenge to understand how keep-alive connections can
be configured with the TransportClient; after a period of 5000ms by
default, the communication seems to time out. Closing and re-opening a
TransportClient in a web app environment looks like an expensive operation,
because in the background extra threads are running to watch the
connection, but unfortunately that is what I do: with each query, I open a
new TransportClient object. This works reliably but adds to the overall
turnaround of a request/response cycle, so I am afraid it will not scale.
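If the 5000ms I am observing is the transport ping timeout, it looks like
it can be tuned via the client settings. A sketch; I am assuming the
setting names client.transport.ping_timeout and
client.transport.nodes_sampler_interval, please correct me if they are
wrong:

import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;

Settings settings = ImmutableSettings.settingsBuilder()
        .put("cluster.name", "mycluster") // placeholder
        // wait longer for ping responses before a node is considered gone
        .put("client.transport.ping_timeout", "10s")
        // interval at which the list of connected nodes is re-sampled
        .put("client.transport.nodes_sampler_interval", "10s")
        .build();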
I am aware kimchy is working hard to improve the TransportClient internals,
but I am curious to learn about the optimal management of the life cycle of
a TransportClient object (singleton or not), and whether sharing a single
TransportClient between multiple threads is recommended.
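What I would like to do instead is manage one client per web app
deployment, created at startup and closed at undeployment, along these
lines (a sketch, assuming the TransportClient can be safely shared between
request threads):

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class ElasticsearchClientListener implements ServletContextListener {

    @Override
    public void contextInitialized(ServletContextEvent event) {
        // create the client once per web app deployment
        TransportClient client = new TransportClient(
                ImmutableSettings.settingsBuilder()
                        .put("cluster.name", "mycluster") // placeholder
                        .put("client.transport.sniff", true)
                        .build())
                .addTransportAddress(
                        new InetSocketTransportAddress("es-host", 9300));
        // make the client available to servlets via the servlet context
        event.getServletContext().setAttribute("esClient", client);
    }

    @Override
    public void contextDestroyed(ServletContextEvent event) {
        // close the client (and its background threads) on undeployment
        TransportClient client = (TransportClient)
                event.getServletContext().getAttribute("esClient");
        if (client != null) {
            client.close();
        }
    }
}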
Additionally, I use the admin client via the TransportClient to issue a
cluster health check command, so that in case of a "red" state,
querying/indexing can be interrupted at the app layer. This adds some more
overhead to each access via the Java client API, but it is more robust,
because the web app can then report a cluster availability problem to the
user.
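The check itself looks roughly like this (a sketch; the exact getter names
may differ between versions):

import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthStatus;
import org.elasticsearch.client.Client;

// called before each query/index operation
static void ensureClusterNotRed(Client client) {
    ClusterHealthResponse health = client.admin().cluster()
            .prepareHealth()
            .execute().actionGet();
    if (health.getStatus() == ClusterHealthStatus.RED) {
        // interrupt at the app layer and report to the user
        throw new IllegalStateException("cluster state is red");
    }
}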
So what are your experiences? Are my assumptions valid? Am I missing
something? Is a), b), or c) preferable for a web app front-end scenario? Do
you have any advice on best practices?
Best regards,
Jörg