GET all JSON docs in an index without using ID


(Kunal Goyal) #1

Hi,

I am trying to get all the JSON documents in an index and insert them in a different location using the bulk API. But I can't find a way to get all the JSON documents without knowing their IDs, as they are UUIDs created by Elasticsearch. How can we code this in Java?

Thanks


(Magnus Bäck) #2

Just use an empty query and fetch all documents (you may want to use the scan and scroll API). Fetching documents one by one via their IDs is terribly inefficient anyway.


(Kunal Goyal) #3

How can we scan and scroll in Java?


(Magnus Bäck) #4

That is covered by the very first subtopic of the search API documentation:
https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/scrolling.html


(Kunal Goyal) #5

Hi,

I am trying to get all the documents in an index using scroll and bulk index them through a different client, but it takes a long time and then throws many exceptions such as:
[cluster:monitor/nodes/info] request_id [4] timed out after [5002ms]
WARNING: Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: GC overhead limit exceeded

A similar method works fine and quickly when I execute a normal query and bulk index the documents.
Can you please tell me what I am doing wrong here?

	QueryBuilder qb = QueryBuilders.matchAllQuery();
	SearchResponse scrollResp = sourceClient.prepareSearch("s_details")
			.setSearchType(SearchType.SCAN)
			.setScroll(new TimeValue(60000))
			.setQuery(qb)
			.setSize(100).execute().actionGet(); //100 hits per shard will be returned for each scroll
	//Scroll until no hits are returned

	BulkRequestBuilder bulkRequest = targetClient.prepareBulk();
	while (true) {

		for (SearchHit hit : scrollResp.getHits().getHits()) {
			//Handle the hit...
			bulkRequest.add(targetClient.prepareIndex("s_details", "spark").setSource(hit.getSource()));
		}
		scrollResp = sourceClient.prepareSearchScroll(scrollResp.getScrollId()).setScroll(new TimeValue(6000000)).execute().actionGet();
		
		//Break condition: No hits are returned
		if (scrollResp.getHits().getHits().length == 0) {
			break;
		}
	}
	BulkResponse bulkResponse = bulkRequest.execute().actionGet();

	if (bulkResponse.hasFailures()) {
		// process failures by iterating through each bulk response item
		System.out.println("Bulk Insertion Failed");
	}

(Colin Goodheart-Smithe) #6

You are storing up all the index requests and trying to send them in a single bulk request. This is almost certainly the source of your OutOfMemoryError. You should execute the bulk request in batches. So, for example, once the bulk request holds more than 1000 index operations, execute it, check for failures, and then create a new bulk request for the next batch.
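
The batching idea above can be sketched roughly as follows. This is a minimal, standalone illustration rather than Elasticsearch client code: `BatchingBuffer` and its sink callback are hypothetical names made up for this example, and the batch size is whatever you choose (1000 is only the figure suggested above).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Hypothetical helper: collects items and hands them to a sink in
 * fixed-size batches, so that no more than batchSize items are ever
 * held in memory at once.
 */
class BatchingBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private List<T> pending = new ArrayList<>();

    BatchingBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Adds one item and flushes automatically once the batch is full. */
    void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is pending to the sink; does nothing if empty. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = pending;
        pending = new ArrayList<>(); // start a fresh batch before handing off
        sink.accept(batch);
    }
}
```

In the scroll loop from the earlier post, each hit's source would be passed to `add(...)`, and `flush()` would be called once after the loop to send the final partial batch. The sink is where a fresh request from `targetClient.prepareBulk()` would be filled, executed, and checked with `hasFailures()` before the next batch starts.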


(Kunal Goyal) #7

Thanks Colin for the suggestion, but I am running the program on just 20 requests at most. That is why I don't understand why the program runs for 2-3 minutes and then starts giving [cluster:monitor/nodes/info] request_id [4] timed out after [5002ms] for different request IDs.

