Indexing plain text files

Adspectus · March 2, 2020, 5:54pm

Hello, I just installed Elasticsearch as a full-text engine for my Nextcloud installation. I noticed that plain text files such as .txt or .tex (LaTeX source code) are not getting indexed, or at least a search for text returns no results (only PDF files appear in the search results). There is an open bug report on the Nextcloud side that says Elasticsearch cannot deal with plain text files if the file has line breaks or tabs in it, which is not unusual IMHO. Actually, I can hardly believe that this is true. A full-text search engine that is not capable of indexing plain text files...? Or did I miss something?

dadoonet · March 2, 2020, 7:36pm

Elasticsearch requires you to send valid JSON documents.
Elasticsearch indexes JSON objects.

You need to transform your plain text into a JSON document.
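
As a rough illustration of that transformation, here is a minimal Python sketch. It assumes the raw text is simply wrapped in a JSON object; the function name `text_to_document` and the field names `filename` and `content` are hypothetical and not something Elasticsearch or Nextcloud prescribes. The point it shows is that a JSON serializer escapes line breaks and tabs, so they are not a problem once the text is encoded properly.

```python
import json


def text_to_document(text: str, filename: str) -> str:
    """Wrap raw text in a JSON object.

    json.dumps escapes newlines, tabs and backslashes, so the result
    is valid JSON even for multi-line files such as LaTeX sources.
    The field names are illustrative assumptions only.
    """
    return json.dumps({"filename": filename, "content": text})


# A small LaTeX snippet containing line breaks, a tab and backslashes.
raw = "\\documentclass{article}\n\\begin{document}\n\tHello\n\\end{document}\n"
body = text_to_document(raw, "example.tex")

# Round trip: decoding the JSON gives back the original text unchanged.
assert json.loads(body)["content"] == raw
```

The resulting `body` string is what would be sent as the request body when indexing the document.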

You could look at the FSCrawler project if you want a tool that can do that.

Adspectus · March 2, 2020, 8:03pm

Thanks for your answer. The FSCrawler documentation says "This crawler helps to index binary documents such as PDF, Open Office, MS Office." How does that fit? Moreover, I do not know how to integrate this with Nextcloud, as it only gives me the option to configure an Elasticsearch instance. I fear that this leads to a situation where the guys on the Elasticsearch side will say that the problem is on the Nextcloud side, and the Nextcloud guys will point in the opposite direction. At least I have the impression that this forum is more responsive...

dadoonet · March 2, 2020, 11:37pm

I don't know Nextcloud.

We do have a cloud offering, Elastic Cloud (Hosted Elasticsearch, Hosted Search).

Cloud by Elastic is one way to have access to all features, all managed by us. Think about what is already there, such as Security, Monitoring, Reporting, SQL, Canvas, APM, Logs UI, Infra UI, SIEM, Maps UI, AppSearch, and what is coming next...

It also works with plain text files.
