Update Loading Wikipedia Guide


(Panagis Alisandratos) #1

UPDATE: I managed to create the index; see my UPDATEs in my response below.

"Loading Wikipedia's Search Index For Testing" was posted in early 2016. I have been trying (and failing) to load the Wikipedia search index using that guide as a reference, because I cannot find a more up-to-date alternative. I have tried Elasticsearch 2.4 and 5.5, but I receive error after error when attempting to create the index with the latest CirrusSearch settings and mapping.

@nik9000 It would be awesome if this guide were updated for late 2017.

Has anyone had any luck indexing Wikipedia as of late?
Does anyone know how I can perform a More Like This query against Wikipedia's pages so that I may extract the categories attached to the highest scoring pages without having to download and index Wikipedia locally?
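For the second question, one possible route that avoids a local index is sketched below. It assumes the public MediaWiki API on en.wikipedia.org accepts CirrusSearch's `morelike:` search keyword and that `generator=search` can be combined with `prop=categories`. The function name `morelike_categories_url` and the default limit are hypothetical, and the sketch only builds the request URL; it does not send it.

```python
from urllib.parse import urlencode

# Assumption: the public MediaWiki API endpoint for English Wikipedia.
API_URL = "https://en.wikipedia.org/w/api.php"


def morelike_categories_url(title, limit=5):
    """Build a URL asking for the categories of pages similar to `title`.

    Assumes CirrusSearch's `morelike:` keyword is available through the
    search generator; `limit` caps the number of similar pages returned.
    """
    params = {
        "action": "query",
        "format": "json",
        "generator": "search",            # use search results as the page set
        "gsrsearch": "morelike:" + title,  # "more like this" on a page title
        "gsrlimit": limit,
        "prop": "categories",              # attach categories to each result
        "cllimit": "max",
    }
    return API_URL + "?" + urlencode(params)


url = morelike_categories_url("Elasticsearch")
```

Fetching that URL with any HTTP client should return JSON whose pages carry a `categories` list, though long category lists may require following the API's continuation parameters.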


(Mark Walkom) #2

Can you provide the errors you are seeing?


(Panagis Alisandratos) #3

On OS X 10.12.6 (Elasticsearch installed via Homebrew).

Traceback snippets from elasticsearch-py:

Elasticsearch 2.4:

TransportError(400, 'illegal_argument_exception', 'failed to create index')

The logs reveal the same issues as the Elasticsearch 5.5 traceback below.

Elasticsearch 5.5:

TransportError(400, 'illegal_argument_exception', 'Custom Analyzer [plain] failed to find filter under name [preserve_original_recorder]')
TransportError(400, 'illegal_argument_exception', 'Custom Analyzer [plain] failed to find filter under name [preserve_original]')

UPDATE 2: Facepalming continues. Wikimedia's search-extra extension plugin might do the trick. If I actually get this working I'll post an updated guide myself.

I receive the same type of error for other analyzers (lowercase_keyword, short_text) that use preserve_original_recorder and preserve_original. After removing the problem filters from the settings, I received errors like the following for the mapping:
TransportError(400, 'mapper_parsing_exception', 'Unknown Similarity type [arrays] for field [FIELD NAME]')

UPDATE 1: This is likely due to me not including the "similarity" setting. Let's see if I have committed any more mistakes meriting a facepalm.

I stopped at this point to start this topic.
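The filter-removal step described above can be sketched roughly as follows. This is a workaround under assumptions, not a recommended fix: UPDATE 2 suggests that installing the plugin that provides these filters may be the better route. The helper name `strip_filters` and the sample `analysis` dictionary are hypothetical; the real analysis section comes from the CirrusSearch settings.

```python
import copy

# The two filter names reported in the Elasticsearch 5.5 errors above.
PROBLEM_FILTERS = {"preserve_original_recorder", "preserve_original"}


def strip_filters(analysis, unwanted=PROBLEM_FILTERS):
    """Return a copy of an analysis section with unwanted filters removed.

    Only the filter lists of the analyzers are touched; the input
    dictionary is left unmodified.
    """
    cleaned = copy.deepcopy(analysis)
    for analyzer in cleaned.get("analyzer", {}).values():
        filters = analyzer.get("filter")
        if isinstance(filters, list):
            analyzer["filter"] = [f for f in filters if f not in unwanted]
    return cleaned


# Hypothetical, heavily reduced analysis section for illustration only.
sample = {
    "analyzer": {
        "plain": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["preserve_original_recorder", "lowercase", "preserve_original"],
        },
        "short_text": {"type": "custom", "tokenizer": "whitespace"},
    }
}
cleaned = strip_filters(sample)
```

Dropping filters changes how text is analyzed, so search results from an index built this way will likely differ from Wikipedia's own.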

Configuration:

{
  "settings": {
    "analysis": .content.page.index.analysis, // from below *settings*
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    .content // from below *mappings*
  }
}
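The template above could be assembled along these lines. This is a minimal sketch assuming both dumps have already been parsed from JSON into dictionaries and that the settings expose the paths used in the template. The `similarity` lookup follows UPDATE 1 and assumes it sits next to `analysis` in the settings; `build_index_body` and the sample dictionaries are hypothetical.

```python
def build_index_body(settings_dump, mapping_dump):
    """Combine the CirrusSearch settings and mapping dumps into one body.

    settings_dump: parsed JSON of the settings, read at content.page.index
    mapping_dump:  parsed JSON of the mappings, read at content
    """
    index_settings = settings_dump["content"]["page"]["index"]
    settings = {
        "analysis": index_settings["analysis"],
        "number_of_shards": 1,
        "number_of_replicas": 0,
    }
    # Assumption (per UPDATE 1): the mapping refers to custom similarities,
    # so copy them over when the settings provide them.
    if "similarity" in index_settings:
        settings["similarity"] = index_settings["similarity"]
    return {"settings": settings, "mappings": mapping_dump["content"]}


# Hypothetical, heavily reduced dumps for illustration only.
settings_dump = {"content": {"page": {"index": {
    "analysis": {"analyzer": {}},
    "similarity": {"arrays": {"type": "BM25"}},
}}}}
mapping_dump = {"content": {"page": {"properties": {}}}}
body = build_index_body(settings_dump, mapping_dump)
```

The resulting dictionary can then be passed as the request body when creating the index, for example through elasticsearch-py.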

settings

mappings


(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.