GET all JSON docs in an index without using ID

kunalgoyal · July 20, 2015, 5:39pm

Hi,

I am trying to get all the JSON documents in an index and insert them into a different location using the bulk API. But I can't find a way to get all the JSON documents without knowing their IDs, as they are UUIDs created by Elasticsearch. How can we code this in Java?

Thanks

magnusbaeck · July 20, 2015, 5:53pm

Just use an empty query and fetch all documents (you may want to use the scan and scroll API). Fetching documents one by one via their IDs is terribly inefficient anyway.

kunalgoyal · July 20, 2015, 6:41pm

How can we Scan and Scroll in Java?

magnusbaeck · July 20, 2015, 9:23pm

That is covered by the very first subtopic of the search API documentation:
https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/scrolling.html

kunalgoyal · July 21, 2015, 9:23pm

Hi,

I am trying to get all the documents in an index using scroll and bulk index them with a different client, but it takes a long time and then throws many exceptions like:
[cluster:monitor/nodes/info] request_id [4] timed out after [5002ms]
WARNING: Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: GC overhead limit exceeded

A similar method works quickly and without errors when I execute a normal query and bulk index the files.
Can you please tell me what I am doing wrong here?

	QueryBuilder qb = QueryBuilders.matchAllQuery();
	SearchResponse scrollResp = sourceClient.prepareSearch("s_details")
			.setSearchType(SearchType.SCAN)
			.setScroll(new TimeValue(60000))
			.setQuery(qb)
			.setSize(100).execute().actionGet(); //100 hits per shard will be returned for each scroll
	//Scroll until no hits are returned

	BulkRequestBuilder bulkRequest = targetClient.prepareBulk();
	while (true) {

		for (SearchHit hit : scrollResp.getHits().getHits()) {
			//Handle the hit...
			bulkRequest.add(targetClient.prepareIndex("s_details", "spark").setSource(hit.getSource()));
		}
		scrollResp = sourceClient.prepareSearchScroll(scrollResp.getScrollId()).setScroll(new TimeValue(6000000)).execute().actionGet();
		
		//Break condition: No hits are returned
		if (scrollResp.getHits().getHits().length == 0) {
			break;
		}
	}
	BulkResponse bulkResponse = bulkRequest.execute().actionGet();

	if (bulkResponse.hasFailures()) {
		// process failures by iterating through each bulk response item
		System.out.println("Bulk Insertion Failed");
	}

colings86 · July 22, 2015, 7:38am

You are storing up all the index requests and trying to send them in a single bulk request. This is almost certainly the source of your OutOfMemoryError. You should execute the bulk request in batches. So maybe when the bulk request has more than 1000 index operations in it, execute the request, check for failures, and then create a new bulk request for the next batch.
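A minimal sketch of that batching pattern, written in plain Java with no Elasticsearch dependency so it can be run on its own. The class `BatchedCopy`, the method `copyInBatches`, and the `sink` callback are hypothetical names used only for illustration: each "page" stands in for one scroll response, and the sink stands in for executing one bulk request and checking it for failures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class BatchedCopy {

    /**
     * Drains pages of documents and hands them to the sink in batches of at
     * most batchSize, so the buffer never holds more than batchSize items.
     * Returns the number of batches that were flushed.
     */
    public static <T> int copyInBatches(Iterator<List<T>> pages,
                                        int batchSize,
                                        Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> buffer = new ArrayList<>(batchSize);
        int flushed = 0;
        while (pages.hasNext()) {
            for (T doc : pages.next()) {
                buffer.add(doc);
                if (buffer.size() >= batchSize) {
                    // Hand over a copy, then start a fresh batch.
                    sink.accept(new ArrayList<>(buffer));
                    buffer.clear();
                    flushed++;
                }
            }
        }
        // Flush whatever is left after the last page.
        if (!buffer.isEmpty()) {
            sink.accept(new ArrayList<>(buffer));
            flushed++;
        }
        return flushed;
    }

    public static void main(String[] args) {
        // Three fake "scroll pages" holding seven documents in total.
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e"),
                Arrays.asList("f", "g"));
        List<Integer> batchSizes = new ArrayList<>();
        int batches = copyInBatches(pages.iterator(), 3,
                batch -> batchSizes.add(batch.size()));
        System.out.println(batches + " batches, sizes " + batchSizes);
    }
}
```

In the code from the question, the sink would presumably be the place to call `bulkRequest.execute().actionGet()`, check `hasFailures()`, and then obtain a new builder from `targetClient.prepareBulk()` before continuing with the next scroll page.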

kunalgoyal · July 22, 2015, 3:28pm

Thanks Collin for the suggestion, but I am running the program on just 20 requests at most. That is why I don't understand why the program runs for 2-3 minutes and then starts giving [cluster:monitor/nodes/info] request_id [4] timed out after [5002ms] for different request IDs.
