Wish me/us luck. Over the next few months we'll be doubling the number of
documents indexed and doubling the update rate and ramping up the query
rate by about an order of magnitude.
On 8 January 2014 at 21:46:20, Nikolas Everett (nik9000@gmail.com) wrote:
I've spent the past six months or so writing and deploying a replacement on-site search system, and we've finally decided it was time to blog about it. I figure this list might find this useful because everything is open source and public. It looks like we're averaging about 3000 queries per second and about 300 updates per second at the moment, which doesn't make us a very big installation, but we're excited in our own little way. If you have time, please give it a shot here or here or here or, if you don't mind somewhat uglier results, here. If you notice anything fishy, please let me know or file a bug. We're a pretty small team but we'll get to everything eventually.
I'll let you know when you can find Elasticsearch - Wikipedia with it but that might take
some time. There is way too much search traffic for us to be the default
there.
The search suggestion is a bit surprising - "he do works" but what a
difference to the old search Suche – Wikisource
Those suggestions are coming from the titles of books, and Wikisource doesn't
contain one called Hello World. Elasticsearch actually provides some
really metal suggestions:
hell works
hollow works
hell world
hell word
which are all "better" than hello world. I think there are lots of
religious texts in there.
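For anyone who wants to poke at this kind of suggestion themselves, here is a minimal sketch of a search body that asks Elasticsearch's phrase suggester for alternatives to "hello world". The field name `title.suggest` and the suggestion name `did_you_mean` are placeholders I made up for illustration; they are not the names used in the index discussed here, and the body would still need to be sent to a real cluster's `_search` endpoint.

```python
import json


def phrase_suggest_body(text, field="title.suggest"):
    """Build a search body that only asks for phrase suggestions.

    `field` and the suggestion name "did_you_mean" are hypothetical
    placeholders, not the names used by the real index.
    """
    return {
        "size": 0,  # no regular hits wanted, only the suggestions
        "suggest": {
            "text": text,  # the text to generate suggestions for
            "did_you_mean": {
                "phrase": {
                    "field": field,
                }
            },
        },
    }


body = phrase_suggest_body("hello world")
print(json.dumps(body, indent=2))
```

Since the suggestions are built from whatever is in that field, a field fed only with book titles can only suggest phrases that appear in book titles.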
Funny thing: the old suggester was surprisingly good. It took a lot of
effort to get the phrase suggester to the same quality. Honestly I'll need
to go back and look at the old one again some time and see if there are any
lessons in there I can move over. I wish I knew what license it had
though....
When I was first working on the project we were generating suggestions
using all titles of all articles so you'd search for "noble prize" thinking
you'd get "nobel prize" but alas, you'd get "none pipe". Templates have
funny names. You can use the clicky stuff to search the template namespace
or you can prefix the query with "template:" to search in the template
namespace. When you do that the search suggestions come from "non content"
pages. Or you could search on Commons, which has tons of images named Hello
World, by searching in the file namespace.
We started indexing en.wikipedia.org so the index contains the first 10
million pages or so and all pages that have been directly edited in the
past 22 hours. Elasticsearch - Search results - Wikipedia
Sadly, the highlighter seems to think that the best section to highlight is
the references because they contain the word elasticsearch packed tightly
together. I'm going to have to have a look at that.
I seem to be getting very good results and getting them quickly. But I do
notice some results like the following. Note that the query string I
entered is the title of a book that a good friend (and B-24 navigator
during WWII) lent me to read a couple of years ago:
Soldiers and Slaves https://en.wikipedia.org/wiki/Soldiers_and_Slaves
Soldiers and Slaves: American POWs Trapped by the Nazis' Final Gamble is
a 2005 history of World War II by New York Times reporter Roger ...
I am not sure why it says that there is no page, when the first response is
exactly that page. But still, very quick and a joy to use!
Some stuff:
You might just be using the old search. Right now we're deployed as a
"secondary" search engine so you have to keep re-adding the
srbackend=CirrusSearch URL parameter. Lame but I have an excuse: the point
of being a "secondary" search engine is to compare the results with the
original. Once the index is finished building we'll make it a BetaFeature
which means if you log in you can click the "beta" tab and enable it for
all your searches. Sorry I didn't mention it earlier.
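Until then, a small helper can save the retyping. This is only a sketch: the `srbackend=CirrusSearch` parameter is the one described above, but the example URL shape (`/w/index.php?search=...`) and the function name are my assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_cirrus_backend(url):
    """Return `url` with srbackend=CirrusSearch set, replacing any old value."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous srbackend value.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "srbackend"
    ]
    params.append(("srbackend", "CirrusSearch"))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Assumed example URL; only the srbackend parameter comes from this thread.
example = with_cirrus_backend(
    "https://en.wikipedia.org/w/index.php?search=Soldiers+and+Slaves"
)
print(example)
```

Running the helper on a URL that already carries the parameter leaves it with a single copy, so it is safe to apply to every search link.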
I couldn't tell you why the old search is claiming that the page doesn't
exist. It is more black boxy so I can't just issue my own queries against
it with something like Sense. That has been one of the real advantages of
the Elasticsearch solution because it lets me prototype new features against
the production index really easily and it helps a ton with debugging.
We hadn't indexed that page yet so I indexed it manually and now:
You can see that it is the first result and you can see that we don't
highlight "and" because it is a stop word. At some point soon I'll be
switching to matching on both my plain analyzed copy and my stemmed/stop
worded copy so stop words will start getting highlighted. It should look
something like:
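A toy illustration of the difference, in plain Python rather than anything from CirrusSearch or Elasticsearch: the stop word list, the `<b>` markup, and all function names are assumptions made up for this sketch, and no real stemming is done.

```python
import re

STOP_WORDS = {"and", "the", "of", "a"}  # tiny made-up list for the demo


def plain_tokens(text):
    """Lowercase word tokens, stop words kept."""
    return re.findall(r"\w+", text.lower())


def stopped_tokens(text):
    """Lowercase word tokens with stop words dropped (no stemming here)."""
    return [token for token in plain_tokens(text) if token not in STOP_WORDS]


def highlight(title, query, use_plain_copy):
    """Wrap each word of `title` that matches a query token in <b> tags."""
    wanted = set(stopped_tokens(query))
    if use_plain_copy:
        # Also match against the copy that still contains stop words.
        wanted |= set(plain_tokens(query))

    def mark(match):
        word = match.group(0)
        return f"<b>{word}</b>" if word.lower() in wanted else word

    return re.sub(r"\w+", mark, title)


stopped_only = highlight("Soldiers and Slaves", "soldiers and slaves", False)
both_copies = highlight("Soldiers and Slaves", "soldiers and slaves", True)
print(stopped_only)  # <b>Soldiers</b> and <b>Slaves</b>
print(both_copies)   # <b>Soldiers</b> <b>and</b> <b>Slaves</b>
```

Matching only the stop-worded copy leaves "and" bare, while matching both copies marks every word of the title.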