I am using Elasticsearch 6.6.2 to index a CSV file (about 150 MB, roughly a million documents) from my Java application. I read the file with a BufferedReader and use a JSON builder to add each document to a BulkRequest.

When the CSV file is around 10 MB I am able to index it successfully. When the file is larger, I execute the bulk requests in batches; I have tried batch sizes of 100, 1,000, 10,000 and 100,000. The first batch is indexed very quickly (2-4 seconds), but the time taken for subsequent batches grows to minutes, and after the 4th or 5th batch I get the following exception, after which Elasticsearch goes down and I have to restart the service:
```
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][127.0.0.1:9300][cluster:monitor/nodes/liveness] request_id [34] timed out after [5003ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1010) ~[elasticsearch-6.7.1.jar:6.7.1]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.7.1.jar:6.7.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_201]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_201]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{YLojUlzuTtmZ7pAmHvWeMw}{localhost}{127.0.0.1:9300}]]
    at org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(TransportClientNodesService.java:352)
    at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:248)
    at org.elasticsearch.client.transport.TransportProxyClient.execute(TransportProxyClient.java:60)
    at org.elasticsearch.client.transport.TransportClient.doExecute(TransportClient.java:388)
    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:403)
    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:391)
    at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.execute(AbstractClient.java:1262)
    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:46)
```
The transport client is created once at application startup and held in a static field. The JVM heap is 1 GB.

Please tell me how I should index such large files into Elasticsearch.
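For reference, the batching loop described above can be sketched as follows. This is a minimal, client-agnostic sketch: `BatchedCsvReader`, `indexInBatches` and the flush callback are hypothetical names, and the comment marks where a new BulkRequest would be built and executed for each batch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: reads a CSV line by line and hands off fixed-size batches.
final class BatchedCsvReader {

    // Returns the total number of lines processed; invokes flush once per batch.
    static int indexInBatches(BufferedReader reader, int batchSize,
                              Consumer<List<String>> flush) throws IOException {
        List<String> batch = new ArrayList<>();
        int total = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            batch.add(line);
            if (batch.size() == batchSize) {
                // Here you would build a NEW BulkRequest from this batch and execute it.
                flush.accept(batch);
                total += batch.size();
                batch = new ArrayList<>(); // fresh list (and fresh bulk request) per batch
            }
        }
        if (!batch.isEmpty()) {          // flush the final partial batch
            flush.accept(batch);
            total += batch.size();
        }
        return total;
    }
}
```

The key point the sketch illustrates is that the batch (and the bulk request built from it) is recreated after every flush, rather than reusing one request object that keeps growing across batches.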
Please format your code, logs or configuration files using the </> icon, as explained in this guide, and not the citation button. It will make your post more readable.
Or use markdown style like:
```
CODE
```
This is the icon to use if you are not using markdown format:
There's a live preview panel for exactly this reason.
Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.
Please update your post.
Are you sure you are creating a new bulk request after each bulk has been executed?
Are you using the bulk processor class?
I have a file containing keywords that I need to search for in the index. It is a CSV file with three headers: keyword, brandName, bucketName.
If brandName and bucketName are '*', I need to search for the keyword in all fields of the index; otherwise I need to search for the keyword only where the document's brandName and bucketName match those values. My index contains many fields, including brandName and bucketName. Please advise which search query I should use. I cannot use a multi_match query, since the number of fields per document is not fixed.
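One possible shape for this, as a sketch only: the index name `my-index` and the example values `Acme` and `shoes` are placeholders, and the `term` filters assume brandName and bucketName are keyword-mapped (or have a `.keyword` subfield). In 6.x, a `query_string` query with no explicit fields searches all fields by default, which covers the '*' case:

```
POST my-index/_search
{
  "query": {
    "query_string": { "query": "keyword" }
  }
}
```

For the other case, the same keyword query can be combined with filters on brandName and bucketName in a `bool` query:

```
POST my-index/_search
{
  "query": {
    "bool": {
      "must":   { "query_string": { "query": "keyword" } },
      "filter": [
        { "term": { "brandName": "Acme" } },
        { "term": { "bucketName": "shoes" } }
      ]
    }
  }
}
```

This reads your requirement as "restrict the keyword search to documents whose brandName/bucketName match"; if you instead mean something else by "the fields whose value matches", please clarify.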