FSCrawler - Tika configuration for escaping quotes in TextAndCSVParser

When I ingest a CSV file using FSCrawler, I get the error "IOException reading next record: java.io.IOException: (line 131664) invalid char between encapsulated token and delimiter -> (line 131664) invalid char between encapsulated token and delimiter".
The error is caused by this line in the CSV file, where a quote opens before "Sachdev" and a second quote appears directly before "Sarojini" instead of before a comma:
NULL,M Udaybhai,"Sachdev colony,"Sarojini NAGAR,Ahmedabad,M-3-13-4,,
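
To make the failure concrete, here is a minimal sketch in plain Java (no Tika or Commons CSV involved) that looks for the situation the error message describes: a closing quote followed by something other than the delimiter or the end of the line. The class and method names are hypothetical, and the check is a simplification of what a real CSV lexer does, so treat it as an illustration only.

```java
class QuoteCheck {

    /**
     * Returns the index of the first character that follows a closing quote
     * but is neither the delimiter nor the end of the line, or -1 if the line
     * has no such character. A doubled quote ("") inside a quoted field is
     * treated as an escaped quote. Simplified on purpose: this is not the
     * Commons CSV algorithm, only an approximation of the rule it reports.
     */
    static int firstInvalidCharAfterQuote(String line, char delimiter, char quote) {
        boolean inQuotes = false;
        boolean fieldStart = true; // a quote only opens a field at its start
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == quote) {
                    if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                        i++; // escaped quote, still inside the quoted field
                    } else {
                        inQuotes = false; // closing quote
                        if (i + 1 < line.length() && line.charAt(i + 1) != delimiter) {
                            return i + 1; // text between closing quote and delimiter
                        }
                    }
                }
            } else if (c == delimiter) {
                fieldStart = true;
            } else if (c == quote && fieldStart) {
                inQuotes = true;
                fieldStart = false;
            } else {
                fieldStart = false;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String line = "NULL,M Udaybhai,\"Sachdev colony,\"Sarojini NAGAR,Ahmedabad,M-3-13-4,,";
        int idx = firstInvalidCharAfterQuote(line, ',', '"');
        if (idx >= 0) {
            System.out.println("Invalid char '" + line.charAt(idx) + "' at index " + idx);
        } else {
            System.out.println("Line looks well formed");
        }
    }
}
```

A check like this could be run over the file before crawling to list the lines that need cleaning at the source.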

Please find below the error message:

03:12:53,414 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(E:/Data/crawler_data/Test, E:/Data/crawler_data/Test/data/test) = /data/test
03:12:53,414 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [true], filename = [/data/test], includes = [[*/*.jpg, */*.jpeg, */*.png, */*.doc, */*.docx, */*.pdf, */*.txt, */*.sql, */*.csv]], excludes = [[*/*.zip, */*.rar, */*.exe, */*.mp4, */*.mp3]]
03:12:53,430 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/data/test], excludes = [[*/*.zip, */*.rar, */*.exe, */*.mp4, */*.mp3]]
03:12:53,430 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/data/test], includes = [[*/*.jpg, */*.jpeg, */*.png, */*.doc, */*.docx, */*.pdf, */*.txt, */*.sql, */*.csv]]
03:12:53,430 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/data/test], excludes = [[*/*.zip, */*.rar, */*.exe, */*.mp4, */*.mp3]]
03:12:53,430 DEBUG [f.p.e.c.f.FsParserAbstract] [/data/test] can be indexed: [true]
03:12:53,430 DEBUG [f.p.e.c.f.FsParserAbstract]   - folder: test
03:12:53,430 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [E:/Data/crawler_data/Test/data/test] content
03:12:53,430 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from E:/Data/crawler_data/Test/data/test
03:12:53,430 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 1 local files found
03:12:53,430 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(E:/Data/crawler_data/Test, E:/Data/crawler_data/Test/data/test/testing.txt) = /data/test/testing.txt
03:12:53,430 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/data/test/testing.txt], includes = [[*/*.jpg, */*.jpeg, */*.png, */*.doc, */*.docx, */*.pdf, */*.txt, */*.sql, */*.csv]], excludes = [[*/*.zip, */*.rar, */*.exe, */*.mp4, */*.mp3]]
03:12:53,430 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/data/test/testing.txt], excludes = [[*/*.zip, */*.rar, */*.exe, */*.mp4, */*.mp3]]
03:12:53,430 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/data/test/testing.txt], includes = [[*/*.jpg, */*.jpeg, */*.png, */*.doc, */*.docx, */*.pdf, */*.txt, */*.sql, */*.csv]]
03:12:53,430 DEBUG [f.p.e.c.f.FsParserAbstract] [/data/test/testing.txt] can be indexed: [true]
03:12:53,430 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /data/test/testing.txt
03:12:53,430 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Comparing file size [39.9mb] with current limit [50mb] -> under limit
03:12:53,430 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [E:/Data/crawler_data/Test/data/test],[testing.txt]
03:12:53,430 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(E:/Data/crawler_data/Test, E:/Data/crawler_data/Test/data/test/testing.txt) = /data/test/testing.txt
03:12:54,581 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(E:/Data/crawler_data/Test, E:/Data/crawler_data/Test/data/test/testing.txt) = /data/test/testing.txt
03:12:54,581 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [-1] characters of text for [E:/Data/crawler_data/Test/data/test/testing.txt]: exception parsing the csv -> IOException reading next record: java.io.IOException: (line 131664) invalid char between encapsulated token and delimiter -> (line 131664) invalid char between encapsulated token and delimiter
03:12:54,583 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [-1] characters of text for [E:/Data/crawler_data/Test/data/test/testing.txt]
org.apache.tika.exception.TikaException: exception parsing the csv
        at org.apache.tika.parser.csv.TextAndCSVParser.parse(TextAndCSVParser.java:245) ~[tika-parser-text-module-2.9.1.jar:2.9.1]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-core-2.9.1.jar:2.9.1]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-core-2.9.1.jar:2.9.1]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203) ~[tika-core-2.9.1.jar:2.9.1]
        at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:197) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:98) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:439) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:277) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:304) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:304) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:152) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.io.UncheckedIOException: IOException reading next record: java.io.IOException: (line 131664) invalid char between encapsulated token and delimiter
        at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:150) ~[commons-csv-1.10.0.jar:1.10.0]
        at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:160) ~[commons-csv-1.10.0.jar:1.10.0]
        at org.apache.tika.parser.csv.TextAndCSVParser.parse(TextAndCSVParser.java:215) ~[tika-parser-text-module-2.9.1.jar:2.9.1]
        ... 11 more
Caused by: java.io.IOException: (line 131664) invalid char between encapsulated token and delimiter
        at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:369) ~[commons-csv-1.10.0.jar:1.10.0]
        at org.apache.commons.csv.Lexer.nextToken(Lexer.java:290) ~[commons-csv-1.10.0.jar:1.10.0]
        at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:770) ~[commons-csv-1.10.0.jar:1.10.0]
        at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:148) ~[commons-csv-1.10.0.jar:1.10.0]
        at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:160) ~[commons-csv-1.10.0.jar:1.10.0]
        at org.apache.tika.parser.csv.TextAndCSVParser.parse(TextAndCSVParser.java:215) ~[tika-parser-text-module-2.9.1.jar:2.9.1]
        ... 11 more
03:12:54,588 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(E:/Data/crawler_data/Test, E:/Data/crawler_data/Test/data/test/testing.txt) = /data/test/testing.txt
03:12:54,589 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Indexing fs-dark-2024-001/40d1e35ea2ccd52b1a193cccaf23e5c?pipeline=null
03:12:54,590 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Going to execute new bulk composed of 1 actions
03:12:54,591 DEBUG [f.p.e.c.f.c.ElasticsearchEngine] Sending a bulk request of [1] documents to the Elasticsearch service
03:12:54,592 DEBUG [f.p.e.c.f.c.ElasticsearchClient] bulk a ndjson of 655 characters
03:12:54,601 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Executed bulk composed of 1 actions

To solve this error, I want to escape the quotes, so I added the following Tika config file:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <service-loader dynamic="true"/>
  <service-loader loadErrorHandler="IGNORE"/>
  <parsers> 
      <parser class="org.apache.tika.parser.csv.TextAndCSVParser">
		<params>
			<param name="escape" type="character">\</param>
		</params>
	  </parser>
  </parsers>
</properties>

The following error is displayed after adding the custom Tika config:


03:09:56,195 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/data/test/test1.txt], excludes = [[*/*.zip, */*.rar, */*.exe, */*.mp4, */*.mp3]]
03:09:56,195 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/data/test/test1.txt], includes = [[*/*.jpg, */*.jpeg, */*.png, */*.doc, */*.docx, */*.pdf, */*.txt, */*.sql, */*.csv]]
03:09:56,195 DEBUG [f.p.e.c.f.FsParserAbstract] [/data/test/test1.txt] can be indexed: [true]
03:09:56,195 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /data/test/test1.txt
03:09:56,195 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Comparing file size [39.9mb] with current limit [50mb] -> under limit
03:09:56,195 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [E:/Data/crawler_data/Test/data/test],[test1.txt]
03:09:56,195 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(E:/Data/crawler_data/Test, E:/Data/crawler_data/Test/data/test/test1.txt) = /data/test/test1.txt
03:09:56,195 INFO  [f.p.e.c.f.t.TikaInstance] Using custom tika configuration from [E:/Elastic/fscrawler/tikaConfig.xml].
03:09:56,195 ERROR [f.p.e.c.f.t.TikaInstance] Can not configure Tika: java.lang.ClassNotFoundException: character
03:09:56,195 DEBUG [f.p.e.c.f.t.TikaInstance] Fullstack trace error for Tika
org.apache.tika.exception.TikaConfigException: java.lang.ClassNotFoundException: character
        at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:744) ~[tika-core-2.9.1.jar:2.9.1]
        at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:755) ~[tika-core-2.9.1.jar:2.9.1]
        at org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:681) ~[tika-core-2.9.1.jar:2.9.1]
        at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:176) ~[tika-core-2.9.1.jar:2.9.1]
        at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:156) ~[tika-core-2.9.1.jar:2.9.1]
        at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:148) ~[tika-core-2.9.1.jar:2.9.1]
        at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:116) ~[tika-core-2.9.1.jar:2.9.1]
        at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:112) ~[tika-core-2.9.1.jar:2.9.1]
        at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.initParser(TikaInstance.java:98) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.initTika(TikaInstance.java:86) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:194) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:98) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:439) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:277) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:304) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:304) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:152) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: character
        at org.apache.tika.config.Param.classFromType(Param.java:279) ~[tika-core-2.9.1.jar:2.9.1]
        at org.apache.tika.config.Param.setTypeString(Param.java:336) ~[tika-core-2.9.1.jar:2.9.1]
        at org.apache.tika.config.Param.load(Param.java:164) ~[tika-core-2.9.1.jar:2.9.1]
        at org.apache.tika.config.TikaConfig$XmlLoader.getParams(TikaConfig.java:853) ~[tika-core-2.9.1.jar:2.9.1]
        at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:742) ~[tika-core-2.9.1.jar:2.9.1]
        ... 17 more
Caused by: java.lang.ClassNotFoundException: character
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641) ~[?:?]
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) ~[?:?]
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:526) ~[?:?]
        at java.base/java.lang.Class.forName0(Native Method) ~[?:?]
        at java.base/java.lang.Class.forName(Class.java:421) ~[?:?]
        at java.base/java.lang.Class.forName(Class.java:412) ~[?:?]
        at org.apache.tika.config.Param.classFromType(Param.java:277) ~[tika-core-2.9.1.jar:2.9.1]
        at org.apache.tika.config.Param.setTypeString(Param.java:336) ~[tika-core-2.9.1.jar:2.9.1]
        at org.apache.tika.config.Param.load(Param.java:164) ~[tika-core-2.9.1.jar:2.9.1]
        at org.apache.tika.config.TikaConfig$XmlLoader.getParams(TikaConfig.java:853) ~[tika-core-2.9.1.jar:2.9.1]
        at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:742) ~[tika-core-2.9.1.jar:2.9.1]
        ... 17 more
03:09:56,195 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(E:/Data/crawler_data/Test, E:/Data/crawler_data/Test/data/test/test1.txt) = /data/test/test1.txt
03:09:56,195 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [-1] characters of text for [E:/Data/crawler_data/Test/data/test/test1.txt]: Cannot invoke "org.apache.tika.config.TikaConfig.getMediaTypeRegistry()" because "config" is null
03:09:56,195 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [-1] characters of text for [E:/Data/crawler_data/Test/data/test/test1.txt]
java.lang.NullPointerException: Cannot invoke "org.apache.tika.config.TikaConfig.getMediaTypeRegistry()" because "config" is null
        at org.apache.tika.parser.AutoDetectParser.<init>(AutoDetectParser.java:92) ~[tika-core-2.9.1.jar:2.9.1]
        at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.initParser(TikaInstance.java:104) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.initTika(TikaInstance.java:86) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:194) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:98) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:439) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:277) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:304) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:304) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:152) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
03:09:56,195 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(E:/Data/crawler_data/Test, E:/Data/crawler_data/Test/data/test/test1.txt) = /data/test/test1.txt
03:09:56,195 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Indexing fs-dark-2024-001/50b7f0d9383bc4f7a970db8a7bcc644a?pipeline=null
03:09:56,195 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Going to execute new bulk composed of 1 actions
03:09:56,195 DEBUG [f.p.e.c.f.c.ElasticsearchEngine] Sending a bulk request of [1] documents to the Elasticsearch service
03:09:56,195 DEBUG [f.p.e.c.f.c.ElasticsearchClient] bulk a ndjson of 585 characters
03:09:56,210 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Executed bulk composed of 1 actions
03:09:56,210 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [E:/Data/crawler_data/Test/data/test]..

Please let me know how to escape the quotes in a CSV file using the Tika config file.

Where did you find this configuration?

I don't see anything like this in the TextAndCSVParser class.

I figured out the parser name from this line: "org.apache.tika.parser.csv.TextAndCSVParser.parse(TextAndCSVParser.java:245) ~[tika-parser-text-module-2.9.1.jar:2.9.1]".
I also learned that I need to escape the quotes, but I am not sure how to do this. If my configuration is wrong, please guide me on how to resolve it.

This line is:

throw new TikaException("exception parsing the csv", e);

How could you guess from this line that you could set an escape parameter for this Tika Parser?

After changing the quote character to null in the Java file, it works.

This issue was raised with Tika back in 2020. Following their suggestion, I modified that line and ran FSCrawler. I will test again with more files tomorrow.
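
For readers wondering what a null quote character changes: with no quote character, the parser presumably stops treating " as special and splits on the delimiter only, so stray quotes stay in the field values as literal text. The sketch below shows that effect in plain Java; it is not the Tika or Commons CSV code, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class NoQuoteSplit {

    /**
     * Splits a line on the delimiter only. Quote characters get no special
     * treatment and remain part of the field text. The -1 limit keeps
     * trailing empty fields.
     */
    static List<String> splitIgnoringQuotes(String line, char delimiter) {
        String regex = Pattern.quote(String.valueOf(delimiter));
        return Arrays.asList(line.split(regex, -1));
    }

    public static void main(String[] args) {
        String line = "NULL,M Udaybhai,\"Sachdev colony,\"Sarojini NAGAR,Ahmedabad,M-3-13-4,,";
        List<String> fields = splitIgnoringQuotes(line, ',');
        for (int i = 0; i < fields.size(); i++) {
            System.out.println(i + ": [" + fields.get(i) + "]");
        }
    }
}
```

The trade-off is that correctly quoted fields containing a comma are then split into two fields as well, so this approach suits files where quoting cannot be trusted anyway.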
