How to remove \n and \t chars when using FS Crawler?

Hello all,

I'm new with FS Crawler, so please be patient with me.

Using version 2.6, I'm crawling many files from a large Windows folder tree. There are PDF, XLS, DOC files, etc.

It's working, but the content attribute goes to Elasticsearch full of \t and \n characters.

  {
    "_index": "clgp-docs",
    "_type": "_doc",
    "_id": "c2c62de2f7d5af413ed074c845129751",
    "_score": 1.0,
    "_source": {
      "content": "\nHome\nHistória »\nSaudade\nSindipetro »\nDocumentos »\nSindicalize-se\n\n \n\nNotícias por base »\nEleição 2015 – 2018\nNotícias por assunto »\nBroncas do Petrolino\nAgenda da Diretoria\nTV Petroleira\n\nCategorizado | ACT, Agenda da Diretoria, Direitos, Petrobrás, Reuniões com RHs\n\nSindipetro-LP cobra soluções para problemas relacionados ao Benefício Educacional\n\nPostado em19 fevereiro 2016.\n\nO  Benefício  Educacional,  uma  das  conquistas  da  categoria  e  importante  ferramenta...  

Is it possible to change these special characters to spaces during the ingestion process? How do I do it?

I would appreciate it if you could share a _settings.json example file.

Thanks

Welcome!

I believe the only option for now is to use an ingest pipeline which does the removal at index time on the Elasticsearch side. See https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#using-ingest-node-pipeline
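For reference, FSCrawler's job settings let you point at a named ingest pipeline via the `elasticsearch.pipeline` option. Here is a minimal sketch of a _settings.json; the pipeline name `clean_whitespace` and the `url` path are placeholders you would replace with your own values:

```json
{
  "name": "clgp-docs",
  "fs": {
    "url": "C:\\path\\to\\your\\documents"
  },
  "elasticsearch": {
    "pipeline": "clean_whitespace"
  }
}
```

With this in place, every document FSCrawler sends is routed through the `clean_whitespace` pipeline before indexing.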

Maybe with this processor https://www.elastic.co/guide/en/elasticsearch/reference/7.3/gsub-processor.html
or this one https://www.elastic.co/guide/en/elasticsearch/reference/7.3/script-processor.html
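As a sketch of the gsub approach: the pipeline below (created with `PUT _ingest/pipeline/clean_whitespace` in Kibana Dev Tools; the name is just an example) collapses runs of newlines and tabs in the `content` field into a single space:

```json
{
  "description": "Replace tabs and newlines in content with spaces",
  "processors": [
    {
      "gsub": {
        "field": "content",
        "pattern": "[\\n\\t]+",
        "replacement": " "
      }
    }
  ]
}
```

Note the regex is JSON-escaped, so `\\n` and `\\t` here match literal newline and tab characters in the field value.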

Could you also open an issue in the FSCrawler project so I can think about adding a regex parser to FSCrawler itself in the future?

Thanks for the idea! I'll try it.

This would be a good improvement for future versions of FS Crawler. I'll open an issue there, as you suggested.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.