Documents not being indexed

ronyarmon · February 28, 2019, 1:37pm

I have collected a text corpus from HTML documents that were scraped from a single web site. The site and its text are in Hebrew. The collection is written to an Excel file with columns for the title, URL and text. I load the documents by iterating over the file in Python, but I am able to index only 123 of the 244 documents in the collection. I have cleaned the data of non-alphanumeric characters and used the HTML Strip Char Filter when indexing, trying keyword, standard and icu_tokenizer as tokenizers, but the result is the same. All documents are decoded and encoded as UTF-8, and all look fine when printed to a file. Could some other difference between the documents be causing only some of them to be indexed properly? Are there other checks that I should be doing?

mayya · March 8, 2019, 9:35pm

You can check for errors in the Elasticsearch log files.
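
Besides the server logs, the indexing response itself usually says why a document was rejected. A minimal sketch, assuming the documents are sent through the bulk API (the thread does not say how they are sent): the reply carries an `items` list, and each rejected item has an `error` entry. The helper name `failed_bulk_items` and the sample response are hypothetical, for illustration only.

```python
def failed_bulk_items(bulk_response):
    """Return (doc_id, status, reason) for every rejected item in a bulk reply."""
    failures = []
    for item in bulk_response.get("items", []):
        # Each item has a single key naming the action (index, create, update, delete).
        for action, result in item.items():
            if "error" in result:
                error = result["error"]
                reason = error.get("reason") if isinstance(error, dict) else str(error)
                failures.append((result.get("_id"), result.get("status"), reason))
    return failures


# Hypothetical response shaped like a bulk API reply; not taken from the thread.
sample_response = {
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "failed to parse"}}},
    ],
}

for doc_id, status, reason in failed_bulk_items(sample_response):
    print(doc_id, status, reason)
```

Comparing the number of failures against the number of documents sent should show whether documents are being rejected or are never reaching the index in the first place.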

ronyarmon · March 10, 2019, 6:09am

Thanks mayya. The log file was not informative, but the following clean-up did the trick:
article = article.replace('"','').replace('.',' ').replace('\n','\n')
I also used the following analyzer settings to strip hidden characters and tags from the HTML:
{"tokenizer":"icu_tokenizer","char_filter":["html_strip"]}
but it was the removal of " and perhaps \n that resolved the issue.
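
If removing " and \n is what fixed it, one possible explanation (an assumption, the thread does not show how the request body was built) is that the JSON body was assembled by string concatenation, so an unescaped quote or raw newline in the text made it invalid. A minimal sketch of the alternative, letting `json.dumps` do the escaping so the text can be kept intact. The helper name `build_doc_body` and the sample values are hypothetical; the field names follow the title, url and text columns described above.

```python
import json


def build_doc_body(title, url, text):
    """Serialize one document; json.dumps escapes quotes and newlines in the text."""
    # ensure_ascii=False keeps Hebrew characters readable instead of \uXXXX escapes.
    return json.dumps({"title": title, "url": url, "text": text}, ensure_ascii=False)


# Sample text containing the two characters that were stripped in the clean-up.
article = 'He said "shalom".\nSecond line'

# Naive concatenation: the inner quotes and the raw newline break the JSON.
naive = '{"text": "' + article + '"}'
try:
    json.loads(naive)
    naive_is_valid = True
except json.JSONDecodeError:
    naive_is_valid = False

# Serialized body: parses back to exactly the original text.
body = build_doc_body("Sample title", "http://example.com/page", article)
round_trip = json.loads(body)

print("naive body valid:", naive_is_valid)
print("text preserved:", round_trip["text"] == article)
```

With this approach the quotes and line breaks stay in the indexed text, which matters if they carry meaning in the articles.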

