Questions about Non-Java (C) Elasticsearch Native Clients

I was talking to someone recently about my attempts to drive indexing performance on a single node higher, and their recommendation was to basically use C to write the ingestion code (for the sake of discussion and everyone's sanity, let's say that JNI is out of the question). My understanding is that ES only offers Java-based clients for native connection. So I have a couple of questions about that:

(1) Are there any non-Java native ES clients, or any future plans for them? What's the rationale either way?

(2) Supposing a program could be written in C to generate Lucene indices (there's at least one project out there that appears to be attempting this), what problems would remain in making that data available for search in ES?

(3) Can HTTP-based clients obtain the same maximum indexing rate that current Java native clients can?

Thanks for any responses!

  1. Elasticsearch is written in Java and uses a binary protocol between cluster nodes (the transport layer). There is not much sense in re-implementing this protocol in non-Java code, since it is optimized for Java and Netty. For non-Java, you have to use HTTP clients (see the sketch after this list).

  2. You cannot create native Lucene indices and inject them into Elasticsearch; you have to index the data again through the Elasticsearch API.

  3. No, HTTP is slower by 10-15%, but that is not critical, since the payload is compressed and transported concurrently.
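
For what it's worth, here is a rough sketch of what the HTTP route from point 1 can look like from C, using libcurl. The host, index name, type, and document fields are placeholders for illustration only, and the _bulk body is newline-delimited JSON with a required trailing newline:

    /* Minimal sketch: POST a small _bulk request to a local node.
     * Build with: gcc bulk.c -lcurl
     * Host, index ("logs"), type, and fields are made-up placeholders. */
    #include <stdio.h>
    #include <curl/curl.h>

    int main(void)
    {
        /* _bulk takes NDJSON: an action line, then a source line, and a
         * trailing newline at the very end. "_type" matters on the 2.x-era
         * versions discussed here; later releases drop document types. */
        const char *bulk_body =
            "{\"index\":{\"_index\":\"logs\",\"_type\":\"event\"}}\n"
            "{\"message\":\"hello from C\",\"value\":1}\n"
            "{\"index\":{\"_index\":\"logs\",\"_type\":\"event\"}}\n"
            "{\"message\":\"hello again\",\"value\":2}\n";

        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        if (!curl)
            return 1;

        /* newer versions expect this content type for _bulk */
        struct curl_slist *headers =
            curl_slist_append(NULL, "Content-Type: application/x-ndjson");

        curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:9200/_bulk");
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, bulk_body);

        CURLcode res = curl_easy_perform(curl);
        if (res != CURLE_OK)
            fprintf(stderr, "bulk request failed: %s\n", curl_easy_strerror(res));

        curl_slist_free_all(headers);
        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return 0;
    }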

Your project has a limited scope, since the data transport speed from client to cluster does not depend on choosing Java or non-Java. For example, Elasticsearch uses Netty with low latency, massive buffering, concurrent connections, and compression algorithms to gain performance between nodes. Your suggestion does not address how you would replace the power of Netty.

You should reconsider your approach of increasing performance on just one single node, since it will not get you very far. Elasticsearch was built to scale over many nodes, and performance (in the sense of maximum load) can only be increased by using multiple nodes.

Jörg, thanks for the thoughtful response. I've interpreted your points inline below; please correct me if I'm way off.

My understanding is that attempting to do this would be an exercise in reverse engineering / heavy interpretation of the ES and Netty transport code. Not impossible, but a lot of effort with limited payoff.

What I'm reading here is that there are at least two steps to get an index available in ES; one step is the creation of the Lucene index, and another step is additional processing within ES itself.

This reference here seems to suggest that someone has done this using Pig. Maybe this is a recipe for disaster, but it seems to show that it might be conceptually possible.

http://www.poudro.com/blog/building-an-elasticsearch-index-offline-using-hadoop-pig/

My reading of this is that there shouldn't be any real difference in indexing performance between using a non-Java-based HTTP client and the Java native client, both because the selected implementation (Netty) should be pretty efficient, and because ultimately both approaches use it?

My reading here is that: once Elasticsearch stops vertically scaling, it's time to add another box; attempts to tweak it more might glean a little more performance here or there at the cost of performance there or here, but ultimately, if I'm hitting a wall, the box is probably tapped out.

Thanks again for your feedback.

A nightmare. The binary protocol changes with every release. It is even allowed to change in patch releases. Each node knows what version is on the other side of the binary pipe, so we use if (out.version().onOrAfter(Version.V_2_3_1)) kinds of constructs all over the place. Maybe it could be done if you had a very limited scope, but, yikes.

It is more like "Elasticsearch plugs into Lucene at fairly deep levels". Depending on the version of Elasticsearch, it is hard to know exactly how deeply, because things sometimes move from Elasticsearch into Lucene, or Lucene develops a better alternative after learning from Elasticsearch's missteps and Elasticsearch cuts over to it.

I can't comment on the recipe because I don't have time to read it or the background in the technologies.

I've not run the numbers myself, so I'll defer to @jprante, but my understanding is that ingest performance is often not a matter of how fast the script doing the inserting runs. I mean, once you get the script to the point where it can saturate Elasticsearch, it doesn't matter whether you wrote it in C or Zend-PHP. Sure, the script'll run slower, but hardware on the ingestion side is usually comparatively cheap.
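
To make the "saturate Elasticsearch" point concrete, here is a rough sketch (mine, not from the thread) of keeping several bulk requests in flight from C with plain threads and libcurl; the host, index, payload, and thread counts are all placeholder assumptions:

    /* Sketch: a "slow" sender can still saturate the cluster by running
     * several bulk senders in parallel. Each thread owns its CURL handle.
     * Build with: gcc senders.c -lcurl -lpthread */
    #include <pthread.h>
    #include <curl/curl.h>

    #define NUM_SENDERS 4        /* placeholder; tune for your hardware */
    #define BULKS_PER_SENDER 100

    static const char *BULK_BODY =
        "{\"index\":{\"_index\":\"logs\",\"_type\":\"event\"}}\n"
        "{\"message\":\"placeholder document\"}\n";

    static void *sender(void *arg)
    {
        (void)arg;
        CURL *curl = curl_easy_init();           /* one handle per thread */
        if (!curl)
            return NULL;

        struct curl_slist *headers =
            curl_slist_append(NULL, "Content-Type: application/x-ndjson");
        curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:9200/_bulk");
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, BULK_BODY);

        for (int i = 0; i < BULKS_PER_SENDER; i++)
            curl_easy_perform(curl);             /* reuses the connection */

        curl_slist_free_all(headers);
        curl_easy_cleanup(curl);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_SENDERS];

        curl_global_init(CURL_GLOBAL_DEFAULT);   /* once, before any threads */
        for (int i = 0; i < NUM_SENDERS; i++)
            pthread_create(&threads[i], NULL, sender, NULL);
        for (int i = 0; i < NUM_SENDERS; i++)
            pthread_join(threads[i], NULL);
        curl_global_cleanup();
        return 0;
    }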

At some point folks are able to saturate the Elasticsearch cluster and they use techniques like:

  • Set refresh_interval to a larger value (or -1) during the load (there's a rough sketch of this after the list)
  • Use larger bulk requests (increase until it doesn't help any more, but don't go over Xmb, where X is a number I don't know but it is on the order of 40mb or something. It is a moving target, sorry.)
  • Hot/Warm Architecture
  • Finally, get more boxes and use more shards. Remember that a replica costs about the same indexing work as a primary shard.
  • Don't use traps like _ttl or deleting large portions of your index. It is faster to recreate the index if you are deleting a significant portion of it (no, I don't know what portion, sorry).
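
Here is a rough sketch of the refresh_interval part of that list from C, again with libcurl; the index name "logs" and host are placeholders, and the values (-1 during the load, 1s afterwards) are just the common ones people use:

    /* Sketch: disable periodic refresh for a bulk load, then restore it,
     * using the index settings API. Build with: gcc settings.c -lcurl */
    #include <curl/curl.h>

    static int put_settings(const char *body)
    {
        CURL *curl = curl_easy_init();
        if (!curl)
            return 1;

        struct curl_slist *headers =
            curl_slist_append(NULL, "Content-Type: application/json");

        curl_easy_setopt(curl, CURLOPT_URL,
                         "http://localhost:9200/logs/_settings");
        curl_easy_setopt(curl, CURLOPT_CUSTOMREQUEST, "PUT");
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

        CURLcode res = curl_easy_perform(curl);
        curl_slist_free_all(headers);
        curl_easy_cleanup(curl);
        return res == CURLE_OK ? 0 : 1;
    }

    int main(void)
    {
        curl_global_init(CURL_GLOBAL_DEFAULT);

        /* stop the index refreshing on a timer while the bulk load runs */
        put_settings("{\"index\":{\"refresh_interval\":\"-1\"}}");

        /* ... run the bulk load here ... */

        /* restore the default so newly indexed documents become searchable */
        put_settings("{\"index\":{\"refresh_interval\":\"1s\"}}");

        curl_global_cleanup();
        return 0;
    }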

Also, different data sets are different. Large documents behave differently from small documents. A change came in fairly recently for very small documents, for example; it is probably not a big deal for large documents, but small ones (think sensor data) got a 10-15% bump. The ideal _bulk size is likely different for different kinds of documents, etc.

Thanks again for the deep response, Nik.

It sounds like the ES binary protocol is under constant change, and trying to target it would be risky and/or require a lot of ongoing maintenance.

It also sounds like ES and Lucene are similarly intimately intertwined, creating additional problems for any software trying to create an external index.

For small (<16k), well-structured records, are there any known problems where the ingestion script is actually the bottleneck?

Are there any levers to raise this particular limit, or is it a hard wall? With this, it looks like there's a range of indexing rates between around 40000 EPS (1K events) and 2.5K EPS (16K events) (if those are 1 KB and 16 KB event sizes, both work out to roughly the same ~40 MB/s of raw document data).