Very open Elasticsearch installation

I've spent the past six months or so writing and deploying a replacement
on-site search system, and we've finally decided it was time to blog about
it http://blog.wikimedia.org/2014/01/06/wikimedia-moving-to-elasticsearch/.
I figure this list might find this useful because
everything http://git.wikimedia.org/summary/?r=mediawiki/extensions/CirrusSearch.git is
open
source http://git.wikimedia.org/tree/operations%2Fpuppet.git/production/modules%2Felasticsearch and
public http://ganglia.wikimedia.org/latest/?c=Elasticsearch%20cluster%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2.
It looks like we're averaging about
3000 http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=es_queries&s=by+name&c=Elasticsearch+cluster+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 queries
per second and about
300 http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=es_indexes&s=by+name&c=Elasticsearch+cluster+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 updates
per second at the moment, which doesn't make us a very big
installation, but we're excited in our own little way. If you have time,
please give it a shot here https://it.wikipedia.org/wiki/Speciale:Ricerca or
here https://en.wikisource.org/wiki/Special:Search or
here https://www.mediawiki.org/wiki/Special:Search or, if you don't
mind somewhat uglier results,
here https://www.wikidata.org/wiki/Special:Search. If you notice
anything fishy, please let me know or file a
bug https://bugzilla.wikimedia.org/enter_bug.cgi?product=MediaWiki%20extensions&component=CirrusSearch.
We're a pretty small team but we'll get to everything eventually.

Wish me/us luck. Over the next few months we'll be doubling the number of
documents indexed and doubling the update rate and ramping up the query
rate by about an order of magnitude.

Thanks for reading,

Nik

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAPmjWd1RM7%2BZdJbUq%2B%2BAnTiz_4Hu1ei9OQ7c31w9%3Dgp248i-6A%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

This is really awesome Nik!
Congrats to your team.

I'm a bit disappointed that this search gives no result: https://en.wikisource.org/w/index.php?title=Special%3ASearch&profile=default&search=elasticsearch&fulltext=Search :)

Best

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr



On Wed, Jan 8, 2014 at 3:49 PM, David Pilato david@pilato.fr wrote:

This is really awesome Nik!
Congrats to your team.

I'm a bit disappointed that this search gives no result:
https://en.wikisource.org/w/index.php?title=Special%3ASearch&profile=default&search=elasticsearch&fulltext=Search
:)

Looks like we don't have any books about Elasticsearch. It does show up
here:

I can't read it. You can also find technical stuff about the
integration and our rollout plan over here:
Search results for "elasticsearch" - MediaWiki.

I'll let you know when you can find
Elasticsearch - Wikipedia with it but that might take
some time. There is way too much search traffic for us to be the default
there.


I can't put it any other way: your move to ES is a landmark.

Thank you, Nik, for making this public; it helps me a lot in spreading
the word for more openness...

https://en.wikisource.org/w/index.php?title=Special%3ASearch&profile=default&search=Hello+World&fulltext=Search

The search suggestion is a bit surprising - "he do works" :) but what a
difference compared to the old search https://de.wikisource.org/wiki/Spezial:Suche

Jörg


On Wed, Jan 8, 2014 at 4:22 PM, joergprante@gmail.com <joergprante@gmail.com> wrote:

I can't put it any other way: your move to ES is a landmark.

Thank you, Nik, for making this public; it helps me a lot in spreading
the word for more openness...

https://en.wikisource.org/w/index.php?title=Special%3ASearch&profile=default&search=Hello+World&fulltext=Search

The search suggestion is a bit surprising - "he do works" :) but what a
difference compared to the old search https://de.wikisource.org/wiki/Spezial:Suche

Those suggestions come from the titles of books, and Wikisource doesn't
contain one called Hello World. Elasticsearch actually provides some
really metal suggestions:
hell works
hollow works
hell world
hell word
which are all "better" than hello world. I think there are lots of
religious texts in there.

Funny thing: the old suggester was surprisingly good. It took a lot of
effort to get the phrase suggester to the same quality. Honestly I'll need
to go back and look at the old one again some time and see if there are any
lessons in there I can move over. I wish I knew what license it had
though....

When I was first working on the project we were generating suggestions
using all titles of all articles so you'd search for "noble prize" thinking
you'd get "nobel prize" but alas, you'd get "none pipe". Templates have
funny names. You can use the clicky stuff to search the template namespace
or you can prefix the query with "template:" to search in the template
namespace. When you do that the search suggestions come from "non content"
pages. Or you could search on commons which has tons of images named Hello
World by searching in the file namespace.
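
For anyone who wants to poke at this kind of suggestion themselves, here is a minimal sketch of a search request body that asks Elasticsearch's phrase suggester for corrections drawn from a title field. It is only an illustration, not the CirrusSearch implementation: the field name `title.suggest` and the suggestion name `title_suggestions` are made up for the example, and the real configuration likely sets more options.

```python
import json


def phrase_suggest_body(text, field="title.suggest", size=5):
    """Build a search body that requests phrase suggestions only.

    `field` is a hypothetical field name; use whatever field your own
    mapping indexes titles into.
    """
    return {
        "size": 0,  # we only want suggestions back, not search hits
        "suggest": {
            "text": text,  # the text the user typed
            "title_suggestions": {  # arbitrary label for this suggester
                "phrase": {
                    "field": field,
                    "size": size,  # how many candidate phrases to return
                },
            },
        },
    }


body = phrase_suggest_body("hello world")
print(json.dumps(body, indent=2))
```

Pointing `field` at a field built from content page titles versus one built from all titles would be one way to get the difference in suggestions described above.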

Nik


On Wed, Jan 8, 2014 at 4:01 PM, Nikolas Everett nik9000@gmail.com wrote:

On Wed, Jan 8, 2014 at 3:49 PM, David Pilato david@pilato.fr wrote:

I'm a bit disappointed that this search gives no result:
https://en.wikisource.org/w/index.php?title=Special%3ASearch&profile=default&search=elasticsearch&fulltext=Search
:)

I'll let you know when you can find
Elasticsearch - Wikipedia with it but that might take
some time. There is way too much search traffic for us to be the default
there.

We started indexing en.wikipedia.org so the index contains the first 10
million pages or so and all pages that have been directly edited in the
past 22 hours.
Elasticsearch - Search results - Wikipedia
Sadly, the highlighter seems to think that the best section to highlight is
the references because they contain the word elasticsearch packed tightly
together. I'm going to have to have a look at that.

Nik


Hi, Nik.

I seem to be getting very good results and getting them quickly. But I do
notice some results like the following. Note that the query string I
entered is the title of a book that a good friend (and B-24 navigator
during WWII) lent me to read a couple of years ago:

Search: soldiers and slaves

The page "Soldiers and slaves
https://en.wikipedia.org/w/index.php?title=Soldiers_and_slaves&action=edit&redlink=1"
does not exist. You can ask for it to be created
https://en.wikipedia.org/wiki/Wikipedia:Articles_for_creation, but
consider checking the search results below to see whether the topic is
already covered.

For search help, please visit Help:Searching https://en.wikipedia.org/wiki/Help:Searching.

I am not sure why it says that there is no page, when the first response is
exactly that page. But still, very quick and a joy to use!

Brian


On Tue, Jan 14, 2014 at 10:28 AM, InquiringMind brian.from.fl@gmail.com wrote:

Hi, Nik.

I seem to be getting very good results and getting them quickly. But I do
notice some results like the following. Note that the query string I
entered is the title of a book that a good friend (and B-24 navigator
during WWII) lent me to read a couple of years ago:

Search: soldiers and slaves

The page "Soldiers and slaves
https://en.wikipedia.org/w/index.php?title=Soldiers_and_slaves&action=edit&redlink=1"
does not exist. You can ask for it to be created
https://en.wikipedia.org/wiki/Wikipedia:Articles_for_creation, but
consider checking the search results below to see whether the topic is
already covered.

For search help, please visit Help:Searching https://en.wikipedia.org/wiki/Help:Searching.

I am not sure why it says that there is no page, when the first response
is exactly that page. But still, very quick and a joy to use!

Some stuff:
You might just be using the old search. Right now we're deployed as a
"secondary" search engine so you have to keep re-adding the
srbackend=CirrusSearch url parameter. Lame but I have an excuse: the point
of being a "secondary" search engine is to compare the results with the
original. Once the index is finished building we'll make it a BetaFeature
which means if you log in you can click the "beta" tab and enable it for
all your searches. Sorry I didn't mention it earlier.
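
If re-adding the parameter by hand gets tedious, something like the following Python sketch can append it to any search URL. Only the `srbackend=CirrusSearch` parameter comes from the description above; the helper name and the example URL are just for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_cirrus_backend(url):
    """Return `url` with srbackend=CirrusSearch set exactly once."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous srbackend value.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "srbackend"
    ]
    params.append(("srbackend", "CirrusSearch"))
    return urlunsplit(parts._replace(query=urlencode(params)))


url = with_cirrus_backend(
    "https://en.wikipedia.org/w/index.php"
    "?title=Special%3ASearch&search=soldiers+and+slaves"
)
print(url)
```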

I couldn't tell you why the old search is claiming that the page doesn't
exist. It is more black boxy so I can't just issue my own queries against
it with something like Sense. That has been one of the real advantages of
the Elasticsearch solution because it lets me prototype new features against
the production index really easily and it helps a ton with debugging.

We hadn't indexed that page yet so I indexed it manually and now:

You can see that it is the first result and you can see that we don't
highlight "and" because it is a stop word. At some point soon I'll be
switching to matching on both my plain analyzed copy and my stemmed/stop
worded copy so stop words will start getting highlighted. It should look
something like:
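
For readers who want to see why a stop word goes unhighlighted, here is a toy Python sketch. It is not CirrusSearch or Elasticsearch code: the stop word list is a tiny made-up sample, there is no stemming, and matches are simply wrapped in asterisks. It only contrasts matching against a stop-worded copy with matching against a plain copy.

```python
import re

# Tiny illustrative stop word list (an assumption, not a real analyzer's list).
STOP_WORDS = {"and", "the", "of", "a"}


def highlight(text, query, drop_stop_words):
    """Wrap each word of `text` that matches a query term in asterisks."""
    terms = {term.lower() for term in query.split()}
    if drop_stop_words:
        # Mimic a stop-worded field: stop words never become match terms.
        terms -= STOP_WORDS

    def mark(match):
        word = match.group(0)
        return f"*{word}*" if word.lower() in terms else word

    return re.sub(r"\w+", mark, text)


title = "Soldiers and Slaves"
stop_worded = highlight(title, "soldiers and slaves", drop_stop_words=True)
plain = highlight(title, "soldiers and slaves", drop_stop_words=False)
print(stop_worded)  # *Soldiers* and *Slaves*
print(plain)        # *Soldiers* *and* *Slaves*
```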

Nik
