I have collected a text corpus prepared from HTML documents scraped from a single website. The site and its text are in Hebrew. The collection is written to an Excel file whose columns are the title, the URL and the text. I'm loading the documents by iterating over the file in Python, but I am able to index only 123 of the 244 documents in the collection. I have cleaned the data of non-alphanumeric characters and used the HTML Strip Char Filter when indexing, trying the keyword, standard and icu_tokenizer tokenizers, but the result is the same. All documents are decoded and encoded as UTF-8, and they all look fine when printed to a file. Could there be another source of difference between the documents that is causing only some of them to be indexed properly? Are there other checks that I should be doing?
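For reference, the loading loop looks roughly like this (a minimal sketch using pandas and elasticsearch-py 8.x; the file name, index name and column names are placeholders):

```python
import pandas as pd
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # local single-node cluster

# The Excel file has three columns: title, url and text.
df = pd.read_excel("corpus.xlsx")

for i, row in df.iterrows():
    doc = {
        "title": row["title"],
        "url": row["url"],
        "text": row["text"],
    }
    # "hebrew_corpus" is a placeholder index name.
    es.index(index="hebrew_corpus", id=i, document=doc)
```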
You can check for errors in the Elasticsearch log files.
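You can also surface rejected documents on the client side with the bulk helper, which reports a per-document error for every item that fails instead of silently skipping it. A minimal sketch (index name and document fields are placeholders):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# docs is whatever iterable of dicts you build from the Excel rows.
docs = [{"title": "...", "url": "...", "text": "..."}]

def actions(docs):
    # One bulk action per document; "hebrew_corpus" is a placeholder index name.
    for i, doc in enumerate(docs):
        yield {"_index": "hebrew_corpus", "_id": i, "_source": doc}

# raise_on_error=False collects per-document failures instead of raising on the
# first one, so you can see exactly which documents are rejected and why.
ok, errors = helpers.bulk(es, actions(docs), raise_on_error=False, stats_only=False)
print(f"indexed: {ok}, failed: {len(errors)}")
for err in errors:
    print(err)
```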
Thanks mayya. The log file was not informative, but the following clean-up did the trick:
article = article.replace('"', '')   # drop double-quote characters
article = article.replace('.', ' ')  # replace periods with spaces
article = article.replace('\n', ' ') # replace newlines with spaces
I also used the following analyzer settings, with the html_strip char filter, to strip the HTML of hidden characters and tags:
{"tokenizer":"icu_tokenizer","char_filter":["html_strip"]}
but it was the cleanup of the " characters, and perhaps the \n characters, that resolved the issue.
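To see what those settings actually produce for a given document, the same char_filter/tokenizer combination can be posted to the _analyze API and the resulting tokens compared between a document that indexed and one that did not. A minimal sketch (the URL and sample text are placeholders, and icu_tokenizer requires the analysis-icu plugin):

```python
import requests

# Test the html_strip + icu_tokenizer combination directly against _analyze.
resp = requests.post(
    "http://localhost:9200/_analyze",
    json={
        "tokenizer": "icu_tokenizer",
        "char_filter": ["html_strip"],
        "text": '<p>טקסט לדוגמה עם "מרכאות"</p>',  # sample Hebrew text with quotes
    },
)
for token in resp.json()["tokens"]:
    print(token["token"])
```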