I was fairly involved with the embedded search deployed by the Wikimedia Foundation on all the wikipedias, wiktionaries, commons, wikidata, etc., so I have somewhat related experience. It's not exactly the same problem, because each wikipedia has its own list of pages, but there is some overlap.
We used the language analyzers built into Elasticsearch for every language that didn't have an active reviewer. For English and Italian we used custom analyzers that also did accent squashing. They were assembled from the analysis components built into Elasticsearch rather than the stock english and italian analyzers. Maintaining these isn't too hard because you rarely tweak them, but you need to be careful with testing.
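To make that concrete, here is a minimal sketch of a custom analyzer assembled from built-in Elasticsearch pieces with accent folding, using the Python client. This is not the actual WMF configuration; the index name, analyzer name, and filter choices are illustrative assumptions, and it assumes a 7.x-style client where `body=` is accepted.

```python
# Hedged sketch: a custom analyzer built from stock Elasticsearch components
# (standard tokenizer, lowercase, asciifolding, stemmer) instead of the
# prepackaged "italian" analyzer. Names here are made up for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="demo_wiki",  # hypothetical index name
    body={
        "settings": {
            "analysis": {
                "filter": {
                    "italian_stemmer_example": {"type": "stemmer", "language": "italian"}
                },
                "analyzer": {
                    "text_folded": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # lowercase, squash accents, then stem, all with built-in filters
                        "filter": ["lowercase", "asciifolding", "italian_stemmer_example"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {"text": {"type": "text", "analyzer": "text_folded"}}
        },
    },
)
```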
We used MediaWiki's built-in redirect functionality rather than synonyms. This let the wiki maintainers set up whatever redirects were appropriate. So every page has a title and a number of "redirect" titles, all of which are indexed and searched.
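A rough sketch of that idea, assuming a document that carries its redirect titles alongside the main title. The field and index names are hypothetical, not the real MediaWiki/CirrusSearch schema.

```python
# Hedged sketch: index each page with its title plus all redirect titles,
# then search both fields together, weighting the real title higher.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.index(
    index="demo_wiki",
    id=42,
    body={
        "title": "William Shakespeare",
        "redirect_titles": ["Shakespeare", "The Bard of Avon"],
        "text": "...",
    },
)

es.search(
    index="demo_wiki",
    body={
        "query": {
            "multi_match": {
                "query": "the bard of avon",
                "fields": ["title^3", "redirect_titles"],
            }
        }
    },
)
```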
We did predictive text using edge-ngrams. We didn't use the completion suggester because it was experimental at the time and didn't handle deleted documents. I believe they've moved to something much fancier since I left.
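For reference, here is a minimal edge-ngram setup for prefix ("search as you type") matching in the spirit of what's described above, not the exact WMF configuration. Index name, analyzer names, and gram sizes are arbitrary choices.

```python
# Hedged sketch: edge-ngram the titles at index time ("wiki" -> w, wi, wik, wiki)
# but leave the user's query un-ngrammed at search time.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="demo_titles",
    body={
        "settings": {
            "analysis": {
                "filter": {
                    "title_edge_ngram": {"type": "edge_ngram", "min_gram": 1, "max_gram": 20}
                },
                "analyzer": {
                    "title_prefix": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "title_edge_ngram"],
                    },
                    "title_prefix_search": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                    },
                },
            }
        },
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "title_prefix",
                    "search_analyzer": "title_prefix_search",
                }
            }
        },
    },
)

# Typing "willi" matches titles containing a word that starts with "willi".
es.search(index="demo_titles", body={"query": {"match": {"title": "willi"}}})
```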
Our goals were, in order:
1. Didn't fall over under the load.
2. Stable enough that you didn't have to worry about it paging you every night.
3. Updated quickly. The old system reindexed once every few days; the new one is a few seconds behind real time.
4. Reasonable response times.
5. Recall good enough that people thought it was good. Fairly unscientific.
Later work, after I left, was more about improving recall.
The system took about a year and a half to deliver fully. The first launch was six months in. In the year between the first launch and deploying to enwiki we:
- Implemented regex search for text, a super nice tool for article editors, if a little slow. Still way, way faster than a linear regex search.
- Scrounged up a lot more hardware.
- Actually bought hardware.
- Spent some time improving the job management system in general use at the WMF, particularly how it worked with our indexing jobs.
I expect you'd be able to get something reasonable together much, much faster than our year and a half. A lot of our trouble came from trying to do online updates with very little hardware, juggling hardware from the old system into the new one, and handling the traffic. Performance testing on real hardware was critical, and because it is fairly expensive to have enough hardware to support WMF's search load, the real production hardware was the only place we could test. A lot of what we did was react to performance problems and patch them, either in Elasticsearch itself or by building the things we needed.
Many thanks, Nik. All of those points sound familiar (esp 4 & 5!).
Working on this full-time with tests, I would expect to be fully done, with
GUI widgets in JS, within six months, but that's not counting the business
getting in the way.
International synonyms remain a mystery, as many loan words have acquired
meanings which differentiate them from their origins... I'm sure I'm not
the first to address the issue.
Yeah! To be honest I really like where we ended up. We gave the synonym power to the wiki maintainers, and they did the right thing with it because the setup was flexible. If I were you I'd push to do something similar. If you can find folks who speak the languages and can help, that is probably the best way to go. You don't have to maintain the redirects manually, but it is nice that you can. I think lots of redirects on wikipedia ultimately came from some script someone ran to create them.
One tip I have is that it is important to have an easy way to rebuild your Elasticsearch index from your catalog. You'll decide that you want to change some analysis or something, and you'll need to rebuild the index to pick up the changes. For that Elasticsearch has the _reindex API (my baby) and aliases. This (very old) blog post introduces the alias flip required. It is out of date in a bunch of respects (no reindex, and mappings have changed since then) but the theory is still sound. So it is important that you can kick this process off periodically.
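A hedged sketch of that rebuild flow: create a new index with the changed settings, copy documents across with _reindex, then flip an alias so readers switch atomically. The index and alias names are made up, and the settings/mappings bodies are left empty as placeholders.

```python
# Hedged sketch of the reindex + alias flip described above.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

OLD, NEW, ALIAS = "demo_wiki_v1", "demo_wiki_v2", "demo_wiki"

# 1. Create the new index with the updated analysis/mappings (omitted here).
es.indices.create(index=NEW, body={"settings": {}, "mappings": {}})

# 2. Copy everything from the old index into the new one server-side.
es.reindex(
    body={"source": {"index": OLD}, "dest": {"index": NEW}},
    wait_for_completion=True,
)

# 3. Atomically repoint the alias so searches switch without a gap.
es.indices.update_aliases(
    body={
        "actions": [
            {"remove": {"index": OLD, "alias": ALIAS}},
            {"add": {"index": NEW, "alias": ALIAS}},
        ]
    }
)
```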
I worked with Wikipedia redirects a few years ago — an excellent resource. And yes, programmatic re-indexing is a must! Everything programmatic is a must. Thanks for your input.
Sorry if I am mixing subjects, but it might be the same one, and your answer was pretty complete on that matter: it seems the wikimedia/highlighter is using ES 2.4.
Would you mind giving some help on this subject, language analyzers, as in my complementary post:
For that highlighter I'd open an issue over at the repository. I haven't been particularly involved in that project or highlighting in general for a year or so.