How to index text files (pdf, doc, txt...) in Java?

Hello everyone, some time ago I started working with Elasticsearch.
Now I am faced with the task of indexing various text files from a given directory so that they can later be searched by their contents. Since Elasticsearch uses Apache Tika to parse various file formats, I think this is what I need. At this stage it is enough to store each file with 4 fields: name, file type, absolute path and file content.
I have already read similar posts (How to store .pdf or .txt file and search in it using Java; How to index and store pdf file in elastic search using spring boot?) and the article Attachment processor | Elasticsearch Guide [8.5] | Elastic. But it is still not entirely clear to me how to write such an indexing request for a file in Java.
If anyone has run into a similar problem and has experience solving it, I would be glad of any help. Thank you in advance.

For a given directory:

  1. Get a list of files
  2. For each file,
    2.1 get the metadata such as file name, file path, file type
    2.2 get file content (for PDF, extract the text)
    2.3 index the data

or
-> create a JSON doc with four fields: file_name, file_path, file_type, file_content
-> then index this JSON doc (using raw JSON data)
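Something like this, as a minimal sketch (assuming the Elasticsearch Java API client 8.x and Apache Tika are on the classpath; the index name "documents" and the class name DirectoryIndexer are just placeholders):

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import org.apache.tika.Tika;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DirectoryIndexer {

    private final ElasticsearchClient client;
    private final Tika tika = new Tika(); // detects the file type and extracts text for many formats

    public DirectoryIndexer(ElasticsearchClient client) {
        this.client = client;
    }

    public void indexDirectory(Path dir) throws Exception {
        // 1. Get a list of files
        List<Path> paths;
        try (Stream<Path> stream = Files.walk(dir)) {
            paths = stream.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        for (Path path : paths) {
            // 2.1 + 2.2: metadata and extracted content as a JSON doc with four fields
            Map<String, Object> doc = new HashMap<>();
            doc.put("file_name", path.getFileName().toString());
            doc.put("file_path", path.toAbsolutePath().toString());
            doc.put("file_type", tika.detect(path.toFile()));
            doc.put("file_content", tika.parseToString(path.toFile()));

            // 2.3: index the JSON doc
            client.index(i -> i.index("documents").document(doc));
        }
    }
}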

Thanks, I have chosen the same steps to follow.
I already have a solution for this problem for PDF files: I am using PDFBox to get the content. But I would like an elegant solution that covers the large number of other file types. And as I understand it, Elasticsearch is able to parse various documents itself. Or am I wrong?

Not sure it helps, but you can also take a look at the fscrawler project. :blush:

I want to share the results; I think it will be useful to someone:

public Document parseFile(String filePath) throws IOException, TikaException, SAXException {
        Map<String, String> metaMap = new HashMap<>();

        File file = new File(filePath);
        // -1 removes BodyContentHandler's default 100,000 character write limit
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        // AutoDetectParser detects the file type and delegates to the matching Tika parser
        Parser parser = new AutoDetectParser();

        try (InputStream input = new ByteArrayInputStream(FileUtils.readFileToByteArray(file))) {
            parser.parse(input, handler, metadata, new ParseContext());
        }

        String content = handler.toString();
        // Collect the Tika metadata; not mapped into the Document yet (see the note below)
        Arrays.stream(metadata.names()).forEach(n -> metaMap.put(n, metadata.get(n)));

        Document document = new Document();
        document.setName(file.getName());
        document.setPath(file.getAbsolutePath());
        document.setText(content);

        return document;
    }
public String indexDocument(String filePath) throws IOException, TikaException, SAXException {
        Document document = documentService.parseFile(filePath);

        ElasticsearchClient client = esRestClient.getElasticSearchClient();

        // Index the parsed document; the client serializes the POJO to JSON
        IndexResponse response = client.index(i -> i
                .index(DOCUMENT_INDEX)
                .document(document));

        logger.info("File {} located in the directory: {} was successfully indexed.", document.getName(), document.getPath());

        return response.result().toString();
    }
@GetMapping("/parse")
    public void indexDocuments() {
        List<String> documents = Arrays.asList(PDF_TEXT, PDF_IMAGE, DOC, DOCX, TXT);
        for (String doc : documents) {
            service.indexDocument(doc);
        }
    }

This is an intermediate version: I am not using the metadata information yet, and the document has only 3 fields so far (name, absolute path and content).
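For reference, the Document class used above is not shown; a minimal version matching the three fields the code sets could look like this (only a sketch, the real class may have more fields and mapping annotations):

public class Document {

    private String name; // file name
    private String path; // absolute path on disk
    private String text; // extracted file content

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public String getPath() { return path; }
    public void setPath(String path) { this.path = path; }

    public String getText() { return text; }
    public void setText(String text) { this.text = text; }
}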

Summing up, with the help of Apache Tika it was possible to get all the necessary information from various types of text files. It was not possible to get the contents of a PDF file created from images (that would require OCR). I will continue to look for a more elegant and flexible solution using the attachment processor.
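In case it helps someone going the same route: with the ingest attachment processor the parsing happens inside Elasticsearch instead of in the application. A rough, untested sketch with the Java API client might look like the snippet below; the pipeline id "attachment" and the "data" field name are only assumptions, DOCUMENT_INDEX, client and filePath are reused from the code above, and the processor's built-in Tika has the same limitation with image-only PDFs.

// create the ingest pipeline once
client.ingest().putPipeline(p -> p
        .id("attachment")
        .description("Extract text and metadata from the base64-encoded 'data' field")
        .processors(pr -> pr
                .attachment(a -> a
                        .field("data"))));

// index a file through the pipeline: the raw file bytes go in as base64
Map<String, Object> doc = new HashMap<>();
doc.put("data", Base64.getEncoder().encodeToString(Files.readAllBytes(Path.of(filePath))));

client.index(i -> i
        .index(DOCUMENT_INDEX)
        .pipeline("attachment")
        .document(doc));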

Also, thanks for mentioning the fscrawler project. A very interesting project that is worth studying.

Yeah. It does basically what you wrote... :wink:
