When I ingest a CSV file using FSCrawler, I get the error "IOException reading next record: java.io.IOException: (line 131664) invalid char between encapsulated token and delimiter -> (line 131664) invalid char between encapsulated token and delimiter".
The error is caused by the following line in the CSV file, where the quote after "Sachdev colony," is followed directly by text instead of a delimiter:
NULL,M Udaybhai,"Sachdev colony,"Sarojini NAGAR,Ahmedabad,M-3-13-4,,
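To illustrate what goes wrong on that line, here is a minimal sketch in Python, independent of Tika and Commons CSV. It only approximates the parser's rule that a closing quote must be followed by a delimiter (optionally after whitespace) or the end of the line. The function name `find_bad_quote` is hypothetical, and the sketch assumes comma delimiters, double-quote encapsulation, and no quoted fields spanning multiple lines.

```python
def find_bad_quote(line, delimiter=",", quote='"'):
    """Return the 0-based column of the first closing quote that is not
    followed by a delimiter or end of line, or -1 if none is found.

    Rough approximation only: assumes one record per line.
    """
    line = line.rstrip("\r\n")
    n = len(line)
    in_quotes = False
    at_field_start = True
    i = 0
    while i < n:
        ch = line[i]
        if in_quotes:
            if ch == quote:
                # A doubled quote inside a quoted field is an escaped quote.
                if i + 1 < n and line[i + 1] == quote:
                    i += 2
                    continue
                # Otherwise this quote closes the field: skip trailing
                # blanks, then expect a delimiter or the end of the line.
                j = i + 1
                while j < n and line[j] in " \t":
                    j += 1
                if j < n and line[j] != delimiter:
                    return i
                in_quotes = False
                at_field_start = False
                i = j
                continue
        else:
            if ch == quote and at_field_start:
                in_quotes = True
                at_field_start = False
            else:
                at_field_start = (ch == delimiter)
        i += 1
    return -1


if __name__ == "__main__":
    bad = 'NULL,M Udaybhai,"Sachdev colony,"Sarojini NAGAR,Ahmedabad,M-3-13-4,,'
    col = find_bad_quote(bad)
    print("offending quote at column", col, "followed by", repr(bad[col + 1]))
```

Running a check like this over the whole file (one call per line) should show whether line 131664 is the only malformed record or whether more lines need to be cleaned before crawling.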
Here is the full debug log, including the stack trace:
03:12:53,414 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(E:/Data/crawler_data/Test, E:/Data/crawler_data/Test/data/test) = /data/test
03:12:53,414 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [true], filename = [/data/test], includes = [[*/*.jpg, */*.jpeg, */*.png, */*.doc, */*.docx, */*.pdf, */*.txt, */*.sql, */*.csv]], excludes = [[*/*.zip, */*.rar, */*.exe, */*.mp4, */*.mp3]]
03:12:53,430 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/data/test], excludes = [[*/*.zip, */*.rar, */*.exe, */*.mp4, */*.mp3]]
03:12:53,430 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/data/test], includes = [[*/*.jpg, */*.jpeg, */*.png, */*.doc, */*.docx, */*.pdf, */*.txt, */*.sql, */*.csv]]
03:12:53,430 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/data/test], excludes = [[*/*.zip, */*.rar, */*.exe, */*.mp4, */*.mp3]]
03:12:53,430 DEBUG [f.p.e.c.f.FsParserAbstract] [/data/test] can be indexed: [true]
03:12:53,430 DEBUG [f.p.e.c.f.FsParserAbstract] - folder: test
03:12:53,430 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [E:/Data/crawler_data/Test/data/test] content
03:12:53,430 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from E:/Data/crawler_data/Test/data/test
03:12:53,430 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 1 local files found
03:12:53,430 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(E:/Data/crawler_data/Test, E:/Data/crawler_data/Test/data/test/testing.txt) = /data/test/testing.txt
03:12:53,430 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/data/test/testing.txt], includes = [[*/*.jpg, */*.jpeg, */*.png, */*.doc, */*.docx, */*.pdf, */*.txt, */*.sql, */*.csv]], excludes = [[*/*.zip, */*.rar, */*.exe, */*.mp4, */*.mp3]]
03:12:53,430 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/data/test/testing.txt], excludes = [[*/*.zip, */*.rar, */*.exe, */*.mp4, */*.mp3]]
03:12:53,430 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/data/test/testing.txt], includes = [[*/*.jpg, */*.jpeg, */*.png, */*.doc, */*.docx, */*.pdf, */*.txt, */*.sql, */*.csv]]
03:12:53,430 DEBUG [f.p.e.c.f.FsParserAbstract] [/data/test/testing.txt] can be indexed: [true]
03:12:53,430 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /data/test/testing.txt
03:12:53,430 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Comparing file size [39.9mb] with current limit [50mb] -> under limit
03:12:53,430 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [E:/Data/crawler_data/Test/data/test],[testing.txt]
03:12:53,430 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(E:/Data/crawler_data/Test, E:/Data/crawler_data/Test/data/test/testing.txt) = /data/test/testing.txt
03:12:54,581 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(E:/Data/crawler_data/Test, E:/Data/crawler_data/Test/data/test/testing.txt) = /data/test/testing.txt
03:12:54,581 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [-1] characters of text for [E:/Data/crawler_data/Test/data/test/testing.txt]: exception parsing the csv -> IOException reading next record: java.io.IOException: (line 131664) invalid char between encapsulated token and delimiter -> (line 131664) invalid char between encapsulated token and delimiter
03:12:54,583 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [-1] characters of text for [E:/Data/crawler_data/Test/data/test/testing.txt]
org.apache.tika.exception.TikaException: exception parsing the csv
at org.apache.tika.parser.csv.TextAndCSVParser.parse(TextAndCSVParser.java:245) ~[tika-parser-text-module-2.9.1.jar:2.9.1]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203) ~[tika-core-2.9.1.jar:2.9.1]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:197) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:98) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:439) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:277) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:304) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:304) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:152) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.io.UncheckedIOException: IOException reading next record: java.io.IOException: (line 131664) invalid char between encapsulated token and delimiter
at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:150) ~[commons-csv-1.10.0.jar:1.10.0]
at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:160) ~[commons-csv-1.10.0.jar:1.10.0]
at org.apache.tika.parser.csv.TextAndCSVParser.parse(TextAndCSVParser.java:215) ~[tika-parser-text-module-2.9.1.jar:2.9.1]
... 11 more
Caused by: java.io.IOException: (line 131664) invalid char between encapsulated token and delimiter
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:369) ~[commons-csv-1.10.0.jar:1.10.0]
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:290) ~[commons-csv-1.10.0.jar:1.10.0]
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:770) ~[commons-csv-1.10.0.jar:1.10.0]
at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:148) ~[commons-csv-1.10.0.jar:1.10.0]
at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:160) ~[commons-csv-1.10.0.jar:1.10.0]
at org.apache.tika.parser.csv.TextAndCSVParser.parse(TextAndCSVParser.java:215) ~[tika-parser-text-module-2.9.1.jar:2.9.1]
... 11 more
03:12:54,588 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(E:/Data/crawler_data/Test, E:/Data/crawler_data/Test/data/test/testing.txt) = /data/test/testing.txt
03:12:54,589 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Indexing fs-dark-2024-001/40d1e35ea2ccd52b1a193cccaf23e5c?pipeline=null
03:12:54,590 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Going to execute new bulk composed of 1 actions
03:12:54,591 DEBUG [f.p.e.c.f.c.ElasticsearchEngine] Sending a bulk request of [1] documents to the Elasticsearch service
03:12:54,592 DEBUG [f.p.e.c.f.c.ElasticsearchClient] bulk a ndjson of 655 characters
03:12:54,601 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Executed bulk composed of 1 actions
To solve this error, I want to escape the quotes, so I added the following Tika config file:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<service-loader dynamic="true"/>
<service-loader loadErrorHandler="IGNORE"/>
<parsers>
<parser class="org.apache.tika.parser.csv.TextAndCSVParser">
<params>
<param name="escape" type="character">\</param>
</params>
</parser>
</parsers>
</properties>
The following error is displayed after adding the custom Tika config:
03:09:56,195 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/data/test/test1.txt], excludes = [[*/*.zip, */*.rar, */*.exe, */*.mp4, */*.mp3]]
03:09:56,195 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/data/test/test1.txt], includes = [[*/*.jpg, */*.jpeg, */*.png, */*.doc, */*.docx, */*.pdf, */*.txt, */*.sql, */*.csv]]
03:09:56,195 DEBUG [f.p.e.c.f.FsParserAbstract] [/data/test/test1.txt] can be indexed: [true]
03:09:56,195 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /data/test/test1.txt
03:09:56,195 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Comparing file size [39.9mb] with current limit [50mb] -> under limit
03:09:56,195 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [E:/Data/crawler_data/Test/data/test],[test1.txt]
03:09:56,195 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(E:/Data/crawler_data/Test, E:/Data/crawler_data/Test/data/test/test1.txt) = /data/test/test1.txt
03:09:56,195 INFO [f.p.e.c.f.t.TikaInstance] Using custom tika configuration from [E:/Elastic/fscrawler/tikaConfig.xml].
03:09:56,195 ERROR [f.p.e.c.f.t.TikaInstance] Can not configure Tika: java.lang.ClassNotFoundException: character
03:09:56,195 DEBUG [f.p.e.c.f.t.TikaInstance] Fullstack trace error for Tika
org.apache.tika.exception.TikaConfigException: java.lang.ClassNotFoundException: character
at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:744) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:755) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:681) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:176) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:156) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:148) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:116) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:112) ~[tika-core-2.9.1.jar:2.9.1]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.initParser(TikaInstance.java:98) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.initTika(TikaInstance.java:86) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:194) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:98) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:439) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:277) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:304) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:304) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:152) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: character
at org.apache.tika.config.Param.classFromType(Param.java:279) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.config.Param.setTypeString(Param.java:336) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.config.Param.load(Param.java:164) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.config.TikaConfig$XmlLoader.getParams(TikaConfig.java:853) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:742) ~[tika-core-2.9.1.jar:2.9.1]
... 17 more
Caused by: java.lang.ClassNotFoundException: character
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641) ~[?:?]
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) ~[?:?]
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:526) ~[?:?]
at java.base/java.lang.Class.forName0(Native Method) ~[?:?]
at java.base/java.lang.Class.forName(Class.java:421) ~[?:?]
at java.base/java.lang.Class.forName(Class.java:412) ~[?:?]
at org.apache.tika.config.Param.classFromType(Param.java:277) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.config.Param.setTypeString(Param.java:336) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.config.Param.load(Param.java:164) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.config.TikaConfig$XmlLoader.getParams(TikaConfig.java:853) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:742) ~[tika-core-2.9.1.jar:2.9.1]
... 17 more
03:09:56,195 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(E:/Data/crawler_data/Test, E:/Data/crawler_data/Test/data/test/test1.txt) = /data/test/test1.txt
03:09:56,195 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [-1] characters of text for [E:/Data/crawler_data/Test/data/test/test1.txt]: Cannot invoke "org.apache.tika.config.TikaConfig.getMediaTypeRegistry()" because "config" is null
03:09:56,195 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [-1] characters of text for [E:/Data/crawler_data/Test/data/test/test1.txt]
java.lang.NullPointerException: Cannot invoke "org.apache.tika.config.TikaConfig.getMediaTypeRegistry()" because "config" is null
at org.apache.tika.parser.AutoDetectParser.<init>(AutoDetectParser.java:92) ~[tika-core-2.9.1.jar:2.9.1]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.initParser(TikaInstance.java:104) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.initTika(TikaInstance.java:86) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:194) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:98) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:439) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:277) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:304) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:304) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:152) ~[fscrawler-core-2.10-SNAPSHOT.jar:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
03:09:56,195 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(E:/Data/crawler_data/Test, E:/Data/crawler_data/Test/data/test/test1.txt) = /data/test/test1.txt
03:09:56,195 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Indexing fs-dark-2024-001/50b7f0d9383bc4f7a970db8a7bcc644a?pipeline=null
03:09:56,195 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Going to execute new bulk composed of 1 actions
03:09:56,195 DEBUG [f.p.e.c.f.c.ElasticsearchEngine] Sending a bulk request of [1] documents to the Elasticsearch service
03:09:56,195 DEBUG [f.p.e.c.f.c.ElasticsearchClient] bulk a ndjson of 585 characters
03:09:56,210 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Executed bulk composed of 1 actions
03:09:56,210 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [E:/Data/crawler_data/Test/data/test]..
Please let me know how to escape the quotes in the CSV file using the Tika config file.