UPDATE: I managed to create the index; see my UPDATEs in my response below.
"Loading Wikipedia's Search Index For Testing" was posted in early 2016. I have been trying (and failing) to load the Wikipedia search index using that guide as a reference, because I cannot find a more up-to-date alternative. I have tried Elasticsearch 2.4 and 5.5, but I receive error after error when attempting to create the index with the latest CirrusSearch settings and mapping.
@nik9000 It would be awesome if this guide were updated for late 2017.
Has anyone had any luck indexing Wikipedia as of late?
Does anyone know how I can perform a More Like This query against Wikipedia's pages so that I may extract the categories attached to the highest scoring pages without having to download and index Wikipedia locally?
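One possible route that avoids local indexing is to let Wikipedia's own CirrusSearch do the work through the public MediaWiki API, using the `morelike:` search keyword as a generator and asking for `prop=categories` on the results. The sketch below is only that: the helper names (`morelike_params`, `categories_by_rank`, `fetch_similar_categories`) and the User-Agent string are made up for illustration, it does not follow API continuation (so long category lists may come back incomplete), and `morelike:` takes a page title as its seed rather than free text.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def morelike_params(title, limit=5):
    """Build query parameters for a morelike: search that also returns categories."""
    return {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "generator": "search",            # use search results as the page set
        "gsrsearch": "morelike:" + title,  # CirrusSearch "more like this" keyword
        "gsrlimit": str(limit),
        "prop": "categories",
        "cllimit": "max",
        "clshow": "!hidden",               # skip hidden maintenance categories
    }


def categories_by_rank(response):
    """Return [(page title, [category titles])] ordered by search rank."""
    pages = response.get("query", {}).get("pages", [])
    ranked = sorted(pages, key=lambda p: p.get("index", 0))
    return [
        (p["title"], [c["title"] for c in p.get("categories", [])])
        for p in ranked
    ]


def fetch_similar_categories(title, limit=5):
    """Query the live API (network access required; no continuation handling)."""
    url = API_URL + "?" + urllib.parse.urlencode(morelike_params(title, limit))
    # Hypothetical User-Agent; replace with one that identifies your client.
    request = urllib.request.Request(
        url, headers={"User-Agent": "morelike-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as resp:
        return categories_by_rank(json.load(resp))
```

Calling `fetch_similar_categories("Elasticsearch")` would then return the highest-ranked similar pages with their visible categories, assuming the API responds in the shape described above.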
Environment: OS X 10.12.6 (Elasticsearch installed via Homebrew)
Traceback snippets from elasticsearch-py:
Elasticsearch 2.4:
TransportError(400, 'illegal_argument_exception', 'failed to create index')
The logs reveal the same issues as the tracebacks for Elasticsearch 5.5 below.
Elasticsearch 5.5:
TransportError(400, 'illegal_argument_exception', 'Custom Analyzer [plain] failed to find filter under name [preserve_original_recorder]')
TransportError(400, 'illegal_argument_exception', 'Custom Analyzer [plain] failed to find filter under name [preserve_original]')
UPDATE 2: Facepalming continues. Wikimedia's search-extra extension plugin might do the trick. If I actually get this working I'll post an updated guide myself.
I receive the same type of error for other analyzers (lowercase_keyword, short_text) that use preserve_original_recorder and preserve_original. After removing the problem filters from the settings, I received the following type of error for the mapping: TransportError(400, 'mapper_parsing_exception', 'Unknown Similarity type [arrays] for field [FIELD NAME]')
UPDATE 1: This is likely due to me not including the "similarity" setting. Let's see if I have committed any more mistakes meriting a facepalm.
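Both kinds of error above come down to a name being referenced without being defined anywhere Elasticsearch can see it: a token filter named by an analyzer, or a similarity named by a mapped field. A rough pre-flight check along these lines can list such names before the index is created. This is a sketch under assumptions: the function names are hypothetical, `settings` is assumed to be the index-level dict holding the `analysis` and `similarity` keys, and the `builtin` sets must be supplied by the caller for whatever the cluster and its installed plugins already provide.

```python
def undefined_filters(settings, builtin=frozenset()):
    """Map each filter name that is not defined to the analyzers that use it."""
    analysis = settings.get("analysis", {})
    defined = set(analysis.get("filter", {})) | set(builtin)
    missing = {}
    for analyzer_name, analyzer in analysis.get("analyzer", {}).items():
        filters = analyzer.get("filter", [])
        if isinstance(filters, str):  # a single filter may be given as a string
            filters = [filters]
        for filter_name in filters:
            if filter_name not in defined:
                missing.setdefault(filter_name, []).append(analyzer_name)
    return missing


def referenced_similarities(mapping):
    """Collect every string value stored under a "similarity" key in a mapping."""
    found = set()

    def walk(node):
        if isinstance(node, dict):
            similarity = node.get("similarity")
            if isinstance(similarity, str):
                found.add(similarity)
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)

    walk(mapping)
    return found


def undefined_similarities(settings, mapping, builtin=frozenset()):
    """Return similarity names used in the mapping but not defined in settings."""
    defined = set(settings.get("similarity", {})) | set(builtin)
    return referenced_similarities(mapping) - defined
```

With the CirrusSearch dump, a non-empty result from `undefined_filters` would point at filters that need a plugin (or removal), and a non-empty result from `undefined_similarities` would point at a missing "similarity" block in the settings.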