Read text from pdf file

arefin · March 8, 2021, 10:14am

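import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

import org.apache.http.HttpHost;
import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

String filePath = "C://x.pdf";
String encodedfile = null;
try
{
	// Read the whole file; a single InputStream.read() call is not guaranteed to fill the buffer.
	byte[] bytes = Files.readAllBytes(Paths.get(filePath));
	encodedfile = Base64.getEncoder().encodeToString(bytes);
}
catch (IOException e)
{
	e.printStackTrace();
}

Map<String, Object> jsonMap = new HashMap<>();
jsonMap.put("Name", "samanvi");
jsonMap.put("postDate", new Date());
jsonMap.put("hra", encodedfile); // Base64-encoded PDF bytes
// source("field", jsonMap) nests the whole map under a top-level key named "field".
IndexRequest request = new IndexRequest("index")
		.id("56")
		.source("field", jsonMap);

// try-with-resources closes the client once the request has been sent.
try (RestHighLevelClient restHighLevelClient = new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http"))))
{
	IndexResponse response = restHighLevelClient.index(request, RequestOptions.DEFAULT);
	System.out.println(response.getResult());
}
catch (ElasticsearchException | IOException e)
{
	e.printStackTrace();
}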

I am indexing the file this way. Now I want to read the text from this PDF file. Can anyone please describe how I could read the text with an Elasticsearch query?

dadoonet · March 8, 2021, 10:16am

This is not going to work unless you use the ingest attachment plugin to extract the text from your file.

See: Ingest Attachment Processor Plugin (Elasticsearch Plugins and Integrations, 7.11)
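
As a rough illustration of that suggestion, here is a minimal sketch of the three requests involved: defining an ingest pipeline with the attachment processor, indexing the Base64 field through that pipeline, and then searching the extracted text. It assumes the ingest-attachment plugin is installed on the node and that the Base64 data sits in a top-level field. The field name "hra", the index name "index", the document id 56 and the localhost:9200 address are taken from the question. The pipeline name "pdf-attachment", the class name and the helper methods are hypothetical and exist only for this example. It uses the JDK's java.net.http client (Java 11+) and hand-built JSON rather than the Elasticsearch Java client, purely to stay self-contained.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AttachmentPipelineSketch {
    // "hra" and "index" come from the question; "pdf-attachment" is a made-up pipeline name.
    static final String BASE_URL = "http://localhost:9200";
    static final String PIPELINE = "pdf-attachment";

    /** Body for PUT _ingest/pipeline/{name}: run the attachment processor on one field. */
    static String pipelineBody(String field) {
        return "{\"description\":\"Extract text from a Base64 field\","
             + "\"processors\":[{\"attachment\":{\"field\":\"" + field + "\"}}]}";
    }

    /** Body for the document: the raw file bytes, Base64-encoded, under the given field. */
    static String documentBody(String field, byte[] fileBytes) {
        String encoded = Base64.getEncoder().encodeToString(fileBytes);
        return "{\"" + field + "\":\"" + encoded + "\"}";
    }

    /** Body for _search: match on the text the processor writes to attachment.content. */
    static String searchBody(String text) {
        // Minimal escaping of backslashes and double quotes only.
        String escaped = text.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"query\":{\"match\":{\"attachment.content\":\"" + escaped + "\"}}}";
    }

    /** Builds a request with a JSON body for the given HTTP method and path. */
    static HttpRequest jsonRequest(String method, String path, String body) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + path))
                .header("Content-Type", "application/json")
                .method(method, HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in bytes; a real run would read the PDF with Files.readAllBytes.
        byte[] fileBytes = "stand-in for PDF bytes".getBytes(StandardCharsets.UTF_8);
        HttpRequest[] requests = {
            jsonRequest("PUT", "/_ingest/pipeline/" + PIPELINE, pipelineBody("hra")),
            jsonRequest("PUT", "/index/_doc/56?pipeline=" + PIPELINE, documentBody("hra", fileBytes)),
            jsonRequest("POST", "/index/_search", searchBody("some words"))
        };
        if (args.length > 0 && args[0].equals("--send")) {
            // Only contacts the cluster when explicitly asked to.
            HttpClient client = HttpClient.newHttpClient();
            for (HttpRequest request : requests) {
                System.out.println(client.send(request, HttpResponse.BodyHandlers.ofString()).body());
            }
        } else {
            // Dry run: list the requests that would be sent.
            for (HttpRequest request : requests) {
                System.out.println(request.method() + " " + request.uri());
            }
        }
    }
}
```

Run without arguments, it only lists the requests. With --send it issues them against the local node. By default the processor writes its output under the attachment field, so the extracted text ends up in attachment.content, which is what the search targets. A search sent immediately after indexing may come back empty until the index has refreshed.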
