How to index text files (pdf, doc, txt...) in Java?

Hello everyone, some time ago I started working with Elasticsearch.
Now I am faced with the task of indexing various text files from a given directory so that they can later be searched by their contents. Since Elasticsearch uses Apache Tika to parse various file formats, I think this is what I need. At this stage it is enough to store each file with 4 fields: name, file type, absolute path and file content.
I have already read similar posts (How to store .pdf or .txt file and search in it using Java; How to index and store pdf file in elastic search using spring boot?) and the article Attachment processor | Elasticsearch Guide [8.5] | Elastic. But it is still not entirely clear to me how to write such an indexing request for a file in Java.
If anyone has run into a similar problem and has experience solving it, I would be glad of any help. Thank you in advance.

For a given directory:

  1. Get a list of files
  2. For each file,
    2.1 get the metadata such as file name, file path, file type
    2.2 get file content (for PDF, extract the text)
    2.3 index the data

or
-> create a JSON doc with four fields: file_name, file_path, file_type, file_content
-> then index this JSON doc (using raw JSON data)
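Something like this, as a minimal sketch (assuming the Elasticsearch Java API client 8.x and Apache Tika are on the classpath; the index name "documents" and the class name DirectoryIndexer are just placeholders):

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import org.apache.tika.Tika;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DirectoryIndexer {

    private final ElasticsearchClient client;
    private final Tika tika = new Tika(); // detects the file type and extracts text for many formats

    public DirectoryIndexer(ElasticsearchClient client) {
        this.client = client;
    }

    public void indexDirectory(Path dir) throws Exception {
        // 1. Get a list of files
        List<Path> paths;
        try (Stream<Path> stream = Files.walk(dir)) {
            paths = stream.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        for (Path path : paths) {
            // 2.1 + 2.2: metadata and extracted content as a JSON doc with four fields
            Map<String, Object> doc = new HashMap<>();
            doc.put("file_name", path.getFileName().toString());
            doc.put("file_path", path.toAbsolutePath().toString());
            doc.put("file_type", tika.detect(path.toFile()));
            doc.put("file_content", tika.parseToString(path.toFile()));

            // 2.3: index the JSON doc
            client.index(i -> i.index("documents").document(doc));
        }
    }
}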

Thanks, I have chosen the same steps to follow.
I already have a solution for this problem for PDF files: I am using PDFBox to get the content. But I would like an elegant solution that covers the large number of other file types. And as I understand it, Elasticsearch is able to parse various documents itself. Or am I wrong?

Not sure it helps, but you can also take a look at the fscrawler project. :blush:

I want to share the results; I think it will be useful to someone:

public Document parseFile(String filePath) throws IOException, TikaException, SAXException {
        Map<String, String> metaMap = new HashMap<>();

        File file = new File(filePath);
        // -1 removes BodyContentHandler's default 100,000 character write limit
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        // AutoDetectParser detects the file type and delegates to the matching Tika parser
        Parser parser = new AutoDetectParser();

        try (InputStream input = new ByteArrayInputStream(FileUtils.readFileToByteArray(file))) {
            parser.parse(input, handler, metadata, new ParseContext());
        }

        String content = handler.toString();
        // Collect the Tika metadata; not mapped into the Document yet (see the note below)
        Arrays.stream(metadata.names()).forEach(n -> metaMap.put(n, metadata.get(n)));

        Document document = new Document();
        document.setName(file.getName());
        document.setPath(file.getAbsolutePath());
        document.setText(content);

        return document;
    }
public String indexDocument(String filePath) throws IOException, TikaException, SAXException {
        Document document = documentService.parseFile(filePath);

        ElasticsearchClient client = esRestClient.getElasticSearchClient();

        // Index the parsed document; the client serializes the POJO to JSON
        IndexResponse response = client.index(i -> i
                .index(DOCUMENT_INDEX)
                .document(document));

        logger.info("File {} located in the directory: {} was successfully indexed.", document.getName(), document.getPath());

        return response.result().toString();
    }
@GetMapping("/parse")
    public void indexDocuments() {
        List<String> documents = Arrays.asList(PDF_TEXT, PDF_IMAGE, DOC, DOCX, TXT);
        for (String doc : documents) {
            service.indexDocument(doc);
        }
    }

This is an intermediate version: I am not using the metadata information yet, and the document has only 3 fields so far (name, absolute path and content).
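For reference, the Document class used above is not shown; a minimal version matching the three fields the code sets could look like this (only a sketch, the real class may have more fields and mapping annotations):

public class Document {

    private String name; // file name
    private String path; // absolute path on disk
    private String text; // extracted file content

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public String getPath() { return path; }
    public void setPath(String path) { this.path = path; }

    public String getText() { return text; }
    public void setText(String text) { this.text = text; }
}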

Summing up, with the help of Apache Tika it was possible to get all the necessary information from various types of text files. It was not possible to get the contents of a PDF file created from images (that would require OCR). I will continue to look for a more elegant and flexible solution using the attachment processor.
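In case it helps someone going the same route: with the ingest attachment processor the parsing happens inside Elasticsearch instead of in the application. A rough, untested sketch with the Java API client might look like the snippet below; the pipeline id "attachment" and the "data" field name are only assumptions, DOCUMENT_INDEX, client and filePath are reused from the code above, and the processor's built-in Tika has the same limitation with image-only PDFs.

// create the ingest pipeline once
client.ingest().putPipeline(p -> p
        .id("attachment")
        .description("Extract text and metadata from the base64-encoded 'data' field")
        .processors(pr -> pr
                .attachment(a -> a
                        .field("data"))));

// index a file through the pipeline: the raw file bytes go in as base64
Map<String, Object> doc = new HashMap<>();
doc.put("data", Base64.getEncoder().encodeToString(Files.readAllBytes(Path.of(filePath))));

client.index(i -> i
        .index(DOCUMENT_INDEX)
        .pipeline("attachment")
        .document(doc));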

Also, thanks for mentioning the fscrawler project. A very interesting project that is worth studying.

Yeah. It does basically what you wrote... :wink:
