How to remove \n and \t chars when using FS Crawler?

Hello all,

I'm new with FS Crawler, so please be patient with me.

Using version 2.6, I'm crawling many files from a large Windows folder tree. There are PDF, XLS, DOC files, etc.

It's working, but the content attribute goes to Elasticsearch full of \t and \n characters.

  {
    "_index": "clgp-docs",
    "_type": "_doc",
    "_id": "c2c62de2f7d5af413ed074c845129751",
    "_score": 1.0,
    "_source": {
      "content": "\nHome\nHistória »\nSaudade\nSindipetro »\nDocumentos »\nSindicalize-se\n\n \n\nNotícias por base »\nEleição 2015 – 2018\nNotícias por assunto »\nBroncas do Petrolino\nAgenda da Diretoria\nTV Petroleira\n\nCategorizado | ACT, Agenda da Diretoria, Direitos, Petrobrás, Reuniões com RHs\n\nSindipetro-LP cobra soluções para problemas relacionados ao Benefício Educacional\n\nPostado em19 fevereiro 2016.\n\nO  Benefício  Educacional,  uma  das  conquistas  da  categoria  e  importante  ferramenta...  

Is it possible to change these special characters to spaces during the ingestion process? How do I do it?

I would appreciate it if you could share a _settings.json example file.

Thanks

Welcome!

I believe the only option for now is to use an ingest pipeline which does the removal at index time on the Elasticsearch side. See https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#using-ingest-node-pipeline
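For reference, FSCrawler's job settings let you point at a named ingest pipeline via the `elasticsearch.pipeline` option. Here is a minimal sketch of a _settings.json; the pipeline name `clean_whitespace` and the `url` path are placeholders you would replace with your own values:

```json
{
  "name": "clgp-docs",
  "fs": {
    "url": "C:\\path\\to\\your\\documents"
  },
  "elasticsearch": {
    "pipeline": "clean_whitespace"
  }
}
```

With this in place, every document FSCrawler sends is routed through the `clean_whitespace` pipeline before indexing.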

Maybe with this processor https://www.elastic.co/guide/en/elasticsearch/reference/7.3/gsub-processor.html
or this one https://www.elastic.co/guide/en/elasticsearch/reference/7.3/script-processor.html
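As a sketch of the gsub approach: the pipeline below (created with `PUT _ingest/pipeline/clean_whitespace` in Kibana Dev Tools; the name is just an example) collapses runs of newlines and tabs in the `content` field into a single space:

```json
{
  "description": "Replace tabs and newlines in content with spaces",
  "processors": [
    {
      "gsub": {
        "field": "content",
        "pattern": "[\\n\\t]+",
        "replacement": " "
      }
    }
  ]
}
```

Note the regex is JSON-escaped, so `\\n` and `\\t` here match literal newline and tab characters in the field value.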

Could you also open an issue in the FSCrawler project so I can think about adding a regex parser to FSCrawler itself in the future?

Thanks for the idea! I'll try it.

This would be a good improvement for future versions of FS Crawler. I'll open an issue there, as you suggested.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.