Handling data in Elasticsearch with Docker

I'm looking at setting up Elasticsearch within Docker and am trying to figure out the best method for inserting new data. Currently, after running docker-compose up, I run a series of curl commands to set up the index and bulk import data:

curl -XDELETE http://localhost:9200/my_index
curl -XPUT http://localhost:9200/my_index -H 'Content-Type: application/json' -d@/index.json
curl -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary "@/bulk_import.json"
curl -XPOST 'localhost:9200/my_index/_refresh'
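In practice I just run these one after the other; something like the following wrapper script (seed.sh is only a placeholder name, and the paths and index name are the ones above) would do the same thing in one go:

#!/usr/bin/env bash
# Placeholder seed script: recreate the index and bulk-load the data.
set -euo pipefail

ES=http://localhost:9200

# Drop any existing index, recreate it from the mapping, bulk import, then refresh.
curl -XDELETE "$ES/my_index"
curl -XPUT "$ES/my_index" -H 'Content-Type: application/json' -d@/index.json
curl -H 'Content-Type: application/x-ndjson' -XPOST "$ES/_bulk" --data-binary '@/bulk_import.json'
curl -XPOST "$ES/my_index/_refresh"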

This works fine, but I have to run these commands every time I restart the container, because the index and data are gone with it. I'm definitely missing something (hopefully obvious), so I'd love it if someone could point me in the right direction here.

Edit: My docker-compose.yml if it helps:

version: '3'

services:
  es:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.1.1
    container_name: es
    environment:
      - discovery.type=single-node
    ports:
      - 9200:9200
      - 9300:9300
    volumes:
      - esdata1:/usr/share/elasticsearch/data

volumes:
  esdata1:
    driver: local

Hm, that's a bit odd...

Note that this is essentially a Docker question rather than an Elasticsearch question, since all you have to make sure of is that /usr/share/elasticsearch/data contains the data from the last time you indexed. Normally that would be the case by default.

Could you please provide the output of "docker volume ls" and, if a volume called "docker_esdata1" exists, "docker volume inspect docker_esdata1"?
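For reference, that is just:

docker volume ls
# The exact volume name depends on your compose project (usually the directory name),
# hence the guess "docker_esdata1".
docker volume inspect docker_esdata1
# In the inspect output, the "Mountpoint" field is the host path where the
# Elasticsearch data directory actually lives.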

Thanks @ftr, I think I figured it out after reading your response. I believe this is just a derp on my part: I have been running docker-compose down -v to destroy my containers, and the -v flag also destroys the volumes containing the Elasticsearch data and index.
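So the difference that bit me is simply:

# Removes containers and networks, but leaves named volumes (and the ES data) intact:
docker-compose down

# Also removes the named volumes declared in the compose file - this is what was wiping my index:
docker-compose down -v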

This brings me back to my initial question of the best way to insert data and retain it. The scenario being: I start up an ES container and populate it with an initial bulk JSON file. I want to add a new entry, so I send an HTTP request to do so. I now want to spin up a new container on a separate machine with that initial bulk data plus the newly added entry. What is the best way to do this? If I accidentally ran docker-compose down -v in prod, wouldn't I just lose all my new data? I suppose I could make sure to append the data to my initial bulk JSON file every time a new entry is added, but that feels very hacky.

Is this not a very common scenario? I assumed it would be, given that data gets added to the prod environment and you'd want to replicate that in a dev/test environment.

Well, it depends exactly how you configure your named volumes in docker-compose. What docker-compose down -v removes are the named volumes themselves: with a plain local volume (as in your current file) the data goes with them, but if the volume is only a bind to a host directory, the underlying data survives. There are many solutions to the rather typical problem you are facing, but one easy one is the following volumes configuration in your docker-compose file, assuming your Elasticsearch service uses esdata1 as the data volume (i.e. mounted to /usr/share/elasticsearch/data in the container):

volumes:
  esdata1:
    driver: local
    driver_opts:
      type: none
      device: /reallybigdatadrive/elasticsearchdata/
      o: bind

What this says is simply to bind an already existing directory (I'm assuming you are on *nix here; I have no idea how this works on Windows) to the Docker volume, which is then used by the Elasticsearch service. Then, even if you take down the volume, your data is still safe and sound in /reallybigdatadrive/elasticsearchdata/, and it will be picked up again by simply running docker-compose up.
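Roughly, the workflow then becomes the following (the directory is the one from the example; you may also need to make it writable for the container user, which is uid 1000 in the official image):

# Create the host directory up front, otherwise the bind will fail:
mkdir -p /reallybigdatadrive/elasticsearchdata

docker-compose up -d
# ... index some data, then take everything down, volumes included:
docker-compose down -v

# The data is still sitting in /reallybigdatadrive/elasticsearchdata,
# so bringing the stack back up finds the existing index again:
docker-compose up -d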

NB: You mention that you want to let another machine continue working on the index. As long as you mount the data directory onto /reallybigdatadrive/elasticsearchdata/ on the second machine (or some other place, and adjust the docker-compose file accordingly) this will work. However, remember that disk I/O to the index is usually crucial for the performance of the cluster, so make certain that wherever you place the data is a fast disk, and that you have plenty of network I/O to and from the machines running the Docker containers.
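For example, a minimal way to move the data over (assuming you can stop Elasticsearch on both ends first; the remote host name is just a placeholder) would be:

# Never copy a live data directory - stop the stack first:
docker-compose down

# Copy the bind-mounted directory to the same path on the second machine:
rsync -a /reallybigdatadrive/elasticsearchdata/ user@second-machine:/reallybigdatadrive/elasticsearchdata/

# Then docker-compose up on the second machine picks up the copied index.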

Also, do not start up two containers on two different machines trying to use the same data for their indices; you will be unhappy with the results. Honestly, the usual solution here is to make certain that the nodes running the database (here: Elasticsearch) have local access to the data, and then to talk to Elasticsearch through its REST interface or something similar. That practice is not confined to Elasticsearch; it's generally a good principle from data mining...
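For example, adding a new entry from anywhere is just a REST call against the node that holds the data (the field name here is only a placeholder):

curl -XPOST 'http://localhost:9200/my_index/_doc' -H 'Content-Type: application/json' -d '{"some_field": "some value"}'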
