Elasticsearch modules' releases and repositories

Hello,

We have an embedded node in our Java app, with 5 modules loaded:

  • Netty4
  • ParentJoin
  • CommonAnalysis
  • Painless
  • Reindex

We recently wanted to upgrade the ES libraries from 7.10 to 7.16. We found the main Elasticsearch jar on Maven Central, but we could not find the latest version of some modules. Because the package path for XContent has moved from org.elasticsearch.common to org.elasticsearch, we can't combine version 7.16.2 of the main jar with version 7.10.2 of the modules.

On https://search.maven.org we can find the 7.10.2 versions of the modules, but there are no 7.16 versions.

Is there a specific Maven repository where we can find the releases of all these modules?
What is the policy for publishing module artifacts?

Thank you for your answers.

Bruno Thomas

Running Elasticsearch as an embedded node hasn't been supported for a very long time, see this blog post for instance. It's split into modules for development purposes but the individual modules aren't intended to be used separately as part of a different application. You should run Elasticsearch in a separate process and use one of the client libraries to drive it instead.

Thanks for your answer. I had read the post you mentioned, and I was dreading this response.

Here is our use case: we maintain an open-source Java app that can be installed locally on a laptop to search through documents. As it is for personal use, we don't need all the security and multi-user features. We only need an API similar to the ES one we have in server mode (with Elasticsearch running in a separate cluster).

At the very beginning we used Docker to install our app, and ES ran in a separate container. The issue with that is that there are 2 JVMs running, and as you know a JVM is very greedy with memory and CPU. So our partners who had small setups with 8GB or even 4GB were crashing all the time.

With embedded ES we can share memory and CPU, which solved the previous issue. We know several journalists who use the app on their laptops. It brings value.

At the same time, we understand that Elastic is pushing secure server usage, and a "we don't support embedded mode" stance would be acceptable. But it is hard to accept that a valuable use case, made possible by the versatile architecture of the source code, is blocked only by the lack of a release pipeline that pushes the module jars to Maven Central.

To keep our application up to date we need the latest module versions.
Would it be possible for ES to publish them to central repositories?

This would help hundreds of journalists work on an up-to-date platform.
Thanks

How long ago did you decide that the problems with running a proper Elasticsearch node alongside your app were insurmountable? There have been significant improvements to its behaviour in resource-constrained environments so it seems worth revisiting that decision. If it still doesn't work today in a supported configuration then that sounds like a bug, and one which we would be interested in fixing, so I encourage you to please report any problems you encounter with that architecture.

Unfortunately we can't commit to maintaining the things needed to support running in embedded mode. It isn't just a case of flipping a switch in the release pipeline to re-enable module publication: there are various licensing and other concerns that also come into play, and there's also the need to maintain tests to avoid breaking this workflow in future, and likely a bunch of other things too. It all adds up to a nontrivial amount of work.

Note that you have access to the source code and build system so you can in principle build everything your app needs yourself. I don't recommend that path at all, but I think it worth noting that the fact that the modules aren't being published isn't a blocker here. In particular I don't understand the legal consequences of doing this, licensing or otherwise, although I am sure there is a way to construct a special agreement with Elastic that permits this if a special agreement is needed.

It was in 2018 that we took the decision.

  1. Do you mean that we could tweak JVM args like Xmx, Xms and so forth?

  2. Do you have examples of open-source desktop applications that install/use Elasticsearch with 2 JVM processes in resource-constrained environments, so we can see how they tackle this?

  3. Unless sharing memory is considered bad practice for ES, how do you tackle the fact that the two processes don't use the resources at the same time and won't need large amounts of memory at the same time?

Yes (or stick it in a container with the appropriate memory limits and let ES work out how to configure itself).

No, but we have a bunch more experience of running smaller ES nodes today than we did in 2018. I doubt you could do anything with 500MB of heap back then but today that's a standard node size on https://cloud.elastic.co/. If you're careful you can probably get away with even less.

Not sure I understand exactly what you're asking here. The OS is responsible for sharing resources fairly, by scheduling threads and swapping out unused memory if it needs the RAM for something else and so on.
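
To illustrate the first answer above, here is a minimal Java sketch of how an app might prepare a standalone node with a capped heap. It is only a sketch under assumptions: the install path, the 512m figure (borrowed from the node size mentioned above), and the class and method names are hypothetical. ES_JAVA_OPTS and -E are the usual ways to pass JVM options and settings to the Elasticsearch startup script; on Windows the script would be elasticsearch.bat instead.

```java
import java.nio.file.Path;
import java.util.List;

public class StandaloneEsLauncher {

    // Builds (but does not start) a ProcessBuilder for a standalone Elasticsearch node.
    // esHome and heap are hypothetical parameters chosen for this example.
    public static ProcessBuilder esProcess(Path esHome, String heap) {
        Path script = esHome.resolve("bin").resolve("elasticsearch");
        ProcessBuilder pb = new ProcessBuilder(List.of(
                script.toString(),
                "-Ediscovery.type=single-node")); // single local node, no cluster discovery
        // Same initial and maximum heap, so the node's memory footprint is predictable.
        pb.environment().put("ES_JAVA_OPTS", "-Xms" + heap + " -Xmx" + heap);
        pb.inheritIO(); // forward the node's logs to the parent's console
        return pb;
    }

    public static void main(String[] args) {
        // Assumed install location; calling pb.start() would actually launch the node.
        ProcessBuilder pb = esProcess(Path.of("/opt/elasticsearch"), "512m");
        System.out.println(pb.command() + " ES_JAVA_OPTS=" + pb.environment().get("ES_JAVA_OPTS"));
    }
}
```

Keeping the heap setting in one place like this lets the app pick a value based on the machine it is installed on, which is one possible answer to the tuning concern, not a recommendation from Elastic.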

The Datashare JVM performs these tasks:

  • using Tika/Tesseract OCR to extract text from documents
  • using CoreNLP (or other Java NLP frameworks) to extract named entities from the text
  • indexing the texts and named entities into ES
  • running user search queries in ES against the previously indexed documents

The first 2 tasks are very resource-intensive and need a large amount of memory.

The Elasticsearch node creates indices and runs searches (which is also resource-intensive, but not at the same time).

As you say, the OS frees and distributes resources as much as it can. But the JVM garbage collector is very complex, and it doesn't release memory as quickly as a C program does. We can assume that when all the code runs in the same memory space, the GC does a better/finer job than when 2 GCs run in separate spaces.

Finally, for us it is more complicated to provide a simple user install with 2 processes than with one self-executable jar (containing ES and Datashare).

To sum up:

  • it requires a lot of effort to provide a nice UX for installing the software with 2 jars on Mac/Windows/Linux (and packaging) instead of one
  • the same goes for handling the lifecycle of 2 services (starting, connecting, shutdown, errors)
  • it is harder to tweak memory settings to optimize the software's performance and reduce runtime errors, because of the variety of usage we encounter on both services (we lose the single JVM that shares and manages all the memory)
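
To make the lifecycle point concrete, here is a rough Java sketch of the extra plumbing a two-process setup needs: waiting for the node to accept requests, and stopping it cleanly. Everything here is hypothetical (the class and method names, and the idea of probing an HTTP endpoint such as http://localhost:9200/_cluster/health on a node without security enabled); it only shows the shape of the work, not how Datashare does or should do it.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class EsLifecycle {

    // Polls the probe until it reports ready, or gives up after maxAttempts polls.
    public static boolean waitUntilReady(BooleanSupplier probe, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (probe.getAsBoolean()) return true;
            if (attempt == maxAttempts) break; // no point sleeping after the last poll
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    // Probe that treats an HTTP 200 from the given URI as "node is ready".
    public static BooleanSupplier httpProbe(URI uri) {
        HttpClient client = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(2)).build();
        HttpRequest request = HttpRequest.newBuilder(uri).timeout(Duration.ofSeconds(2)).GET().build();
        return () -> {
            try {
                return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode() == 200;
            } catch (IOException e) {
                return false; // node is not accepting connections yet
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        };
    }

    // Asks the child process to stop, then force-kills it if it is still alive after the grace period.
    public static void stop(Process es, long graceSeconds) {
        es.destroy();
        try {
            if (!es.waitFor(graceSeconds, TimeUnit.SECONDS)) es.destroyForcibly();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            es.destroyForcibly();
        }
    }

    public static void main(String[] args) {
        // Simulated probe so the sketch runs without Elasticsearch: ready on the third poll.
        int[] polls = {0};
        boolean ready = waitUntilReady(() -> ++polls[0] >= 3, 5, 10);
        System.out.println("ready=" + ready + " after " + polls[0] + " polls");
    }
}
```

In a real app, httpProbe would be passed to waitUntilReady after starting the node, and stop would be registered in a shutdown hook so the node does not outlive the app.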

We will continue to look into it. We'll keep you posted.
Thanks

I think it best not to assume this sort of thing. If you can demonstrate that ES really doesn't work for you as a standalone process then that's a different matter, and one that's worth investigation, but it's tricky to draw correct conclusions from theoretical arguments about performance of this nature.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.