I was fairly involved with the embedded search that the Wikimedia Foundation deployed on all the Wikipedias, Wiktionaries, Commons, Wikidata, etc., so I have somewhat related experience. It isn't a perfect match because each wiki has its own list of pages, but there is some overlap.
We used the language analyzers built into Elasticsearch for all the languages that didn't have an active reviewer. We used custom analyzers for English and for Italian, and did accent squashing. They were assembled from the pieces of the analyzers built into Elasticsearch rather than being the stock english or italian analyzers. Maintaining these isn't too hard because you rarely tweak them, but you need to be careful with testing.
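As a rough illustration of that kind of custom analyzer, here is a minimal sketch of index settings that rebuild an English analyzer from Elasticsearch's built-in token filters and add `asciifolding` for accent squashing. This is not the WMF configuration: the analyzer name `folded_english`, the filter names, and the position of `asciifolding` in the chain are all my assumptions.

```python
import json

# Hypothetical names: "folded_english", "english_stop", "english_stemmer" and
# "english_possessive_stemmer" are illustrative labels, not the WMF settings.
ANALYSIS_SETTINGS = {
    "analysis": {
        "filter": {
            # Built-in filter types configured for English.
            "english_possessive_stemmer": {
                "type": "stemmer",
                "language": "possessive_english",
            },
            "english_stop": {"type": "stop", "stopwords": "_english_"},
            "english_stemmer": {"type": "stemmer", "language": "english"},
        },
        "analyzer": {
            "folded_english": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "english_possessive_stemmer",
                    "lowercase",
                    # Accent squashing: "café" is indexed as "cafe".
                    # Placing it here is an assumption, not a requirement.
                    "asciifolding",
                    "english_stop",
                    "english_stemmer",
                ],
            }
        },
    }
}


def index_body():
    """Return a body suitable for an index-creation request."""
    return {"settings": ANALYSIS_SETTINGS}


if __name__ == "__main__":
    print(json.dumps(index_body(), indent=2))
```

Because a change to the filter chain changes what ends up in the index, any tweak like this generally means reindexing, which is one reason the testing needs care.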
We used MediaWiki's inbuilt redirect functionality rather than synonyms. This let the wiki maintainers set up the redirects that were appropriate. So every page has a title and a number of "redirect" titles, all of which are indexed and searched.
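To make the shape of that concrete, here is a small sketch of a page document that carries its redirect titles, plus a `multi_match` query that searches both fields. The field names `title` and `redirect_titles` and the `^2` boost on the main title are assumptions for illustration, not the schema we actually used.

```python
def page_document(title, redirects):
    """Build the indexed form of a page: its title plus its redirect titles.

    The field names are hypothetical.
    """
    return {"title": title, "redirect_titles": list(redirects)}


def title_query(text):
    """Build a query that matches the page title or any redirect title."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                # The ^2 boost favouring the real title is an arbitrary choice.
                "fields": ["title^2", "redirect_titles"],
            }
        }
    }


if __name__ == "__main__":
    print(page_document("United States", ["USA", "US"]))
    print(title_query("usa"))
```

The appeal over a synonym file is that the list of alternate names lives with the page and is edited by the people who know the content, so the search side only has to index whatever they set up.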
We did predictive text using edge-ngrams. We didn't use the completion suggester because it was experimental at the time and didn't handle deleted documents. I believe they moved to something much fancier after I left.
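For anyone unfamiliar with the edge-ngram approach, here is a minimal sketch. The pure-Python `edge_ngrams` function shows what the token filter emits, and `PREFIX_SETTINGS` shows one way to wire it up with the built-in `edge_ngram` filter. The analyzer names, the gram sizes, and the toy `matches_prefix` helper are assumptions of mine, not what we ran.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the prefixes of token from min_gram to max_gram characters."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


# Hypothetical analyzer and filter names. Prefixes are generated at index
# time only; the search analyzer leaves the typed text whole.
PREFIX_SETTINGS = {
    "analysis": {
        "filter": {
            "prefix_grams": {"type": "edge_ngram", "min_gram": 1, "max_gram": 10}
        },
        "analyzer": {
            "prefix_index": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "prefix_grams"],
            },
            "prefix_search": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase"],
            },
        },
    }
}


def matches_prefix(typed, title):
    """Toy stand-in for the index lookup: every typed word must be an
    indexed prefix of some word in the title."""
    indexed = {g for word in title.lower().split() for g in edge_ngrams(word)}
    return all(word in indexed for word in typed.lower().split())


if __name__ == "__main__":
    print(edge_ngrams("wiki", 1, 3))
    print(matches_prefix("wik", "Wikimedia Foundation"))
```

The trade-off is index size: every word is stored once per prefix length, which is the price for getting ordinary index behaviour, including deletes, without a separate suggester structure.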
Our goals were, in order:
- Didn't fall over under the load.
- Stable enough that you didn't have to worry about it paging you every night.
- Updated quickly. The old system updated once every few days; the new one was a few seconds behind real time.
- Reasonable response times.
- Recall good enough that people thought it was good. Fairly unscientific.
Work done after I left focused more on improving recall.
The system took about a year and a half to deliver fully. The first launch was six months in. In the year between the first launch and deploying to enwiki we:
- Contributed a few things to Elasticsearch
- Wrote our own highlighter
- Implemented regex search for text, a super nice tool for article editors if a little slow. Still way, way faster than a linear regex search.
- Scrounged up a lot more hardware.
- Actually bought hardware.
- Spent some time improving the job management system in general use at the WMF, particularly how it worked with our indexing jobs.
I expect you'd be able to get something reasonable together much, much faster than our year and a half. A lot of our trouble was due to trying to get online updates with very little hardware, trying to juggle hardware from an old system into the new one, and the traffic. Performance testing on real hardware was critical, and because it is fairly expensive to have enough hardware to support the search load for WMF, the real hardware was all we had to test with. A lot of what we did was react to performance problems, either by patching Elasticsearch or by building the things we needed.