Elasticsearch backend for MediaWiki


(Nik Everett) #1

Elasticsearch list,

So I've been working on a MediaWiki plugin that powers search using
Elasticsearch and you, yes you, can try it right now! All you've got to do
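is head over to our handy dandy test site right here:
https://test2.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=search+engine&fulltext=Search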

It is still in super alpha so I expect things to not work entirely
correctly but please let me know what you think.

The code lives here:
https://git.wikimedia.org/summary/?r=mediawiki/extensions/CirrusSearch.git
Right now it depends on code in the master branch of MediaWiki, and I've
not been keeping the installation instructions up to date, so installing
it yourself would be a pain. If you really want to, let me know and I'll
help you through it.

The goal of the project is to give anyone running MediaWiki something they
can install relatively easily that will give them the same search they'd
get on Wikipedia.

Thanks,

Nik Everett

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Brian Yoder) #2

Hi, Nik,

To try this out, I entered media and got 77 results. I then queried for video
media
and got 89 results. I expected fewer than 77, not more.

Then I queried for video and got 21 results. Aha! It returned documents
that contained either video or media.

Would have expected the default to be logical AND, not an OR. In other
words, video media should return documents that contain both video and
media.

The query for video AND media returned only 9 documents that mentioned
both, as expected. And the query for video OR media returned all 89 results
as expected.

Just a suggestion: The default of AND and not OR is more Google/Yahoo-like
if nothing is specified. Or is this a Lucene string query quirk?

But still, very cool! And, pretty fast too! And the highlighting is
correct... Awesome!!!! Thanks for sharing!

Brian



(Nik Everett) #3

On Wed, Sep 4, 2013 at 7:31 PM, InquiringMind brian.from.fl@gmail.com wrote:

Hi, Nik,

To try this out, I entered media and got 77 results. I then queried for
video media and got 89 results. I expected fewer than 77, not more.

Then I queried for video and got 21 results. Aha! It returned documents
that contained either video or media.

Would have expected the default to be logical AND, not an OR. In other
words, video media should return documents that contain both video and
media.

The query for video AND media returned only 9 documents that mentioned
both, as expected. And the query for video OR media returned all 89 results
as expected.

Just a suggestion: The default of AND and not OR is more Google/Yahoo-like
if nothing is specified. Or is this a Lucene string query quirk?

This is one of the first things that came up when we started testing and I
think you are right about it but it took me some time to come around to
it. I just tried Google tonight and it looked like AND was the default
operation, but I recall trying it a while ago and getting results from
Google that didn't include all the terms. I suppose my objections were
rooted in a love of the concept of dismax. Bah. I'll change it.
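
To make the operator question concrete, here is a minimal sketch, not the CirrusSearch code. It assumes the search is sent as a query_string query, whose default_operator parameter falls back to OR when omitted. The matches helper and the three sample documents are made up purely to show why OR returns more hits than AND.

```python
import json


def query_string_body(user_query, default_operator="AND"):
    """Build a search body around the query_string query.

    default_operator is a query_string parameter; when it is left out,
    terms are combined with OR, so "video media" matches either term.
    """
    return {
        "query": {
            "query_string": {
                "query": user_query,
                "default_operator": default_operator,
            }
        }
    }


def matches(doc_text, terms, operator):
    """Toy stand-in for matching: AND needs every term, OR needs any."""
    words = set(doc_text.lower().split())
    found = [term.lower() in words for term in terms]
    return all(found) if operator == "AND" else any(found)


# Illustrative documents only; they are not taken from the test wiki.
docs = ["video codecs", "media files", "video and media formats"]
terms = ["video", "media"]

or_hits = [d for d in docs if matches(d, terms, "OR")]
and_hits = [d for d in docs if matches(d, terms, "AND")]

print(json.dumps(query_string_body("video media"), indent=2))
print(len(or_hits), "hits with OR,", len(and_hits), "hits with AND")
```

The body printed above is what would be posted to a search endpoint; the toy matcher only mirrors the counting behaviour described in the thread.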

But still, very cool! And, pretty fast too! And the highlighting is
correct... Awesome!!!! Thanks for sharing!

Of course the highlighting is correct! Elasticsearch does highlighting
very nicely.

Brian



(David Pilato) #4

Very cool. Congrats Nikolas!
How did you implement the "did you mean" feature? Suggest API?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs



(Lukáš Vlček) #5

Hi,

I do not think users really care about AND vs. OR or total hits count as
long as the first hits are relevant and lead to the information they want.
Don't try to mimic Google, you never know how it really works.

Even if you use OR you should get hits with both terms first, no? If you
use AND you can miss a relevant document just because a single query term
was not analysed correctly (highly inflected languages).

Just my 2 cents.

Regards,
Lukáš


(David Pilato) #6

+1 with Lukáš: I prefer having the choice, as a user, to enter:
search engine
+search engine
search +engine
+search +engine

By default, I'm pretty sure I will use the first form 99% of the time, as long as the more relevant results come first.

My 0.01 cent in addition to Lukáš's 2 cents :wink:
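
A rough sketch of how those four forms could map onto a bool query: terms prefixed with + become must clauses and the rest become should clauses, so optional terms only affect ranking when at least one required term is present. The function name and the field name "text" are hypothetical. The query_string syntax already understands +, so this is only an illustration of the idea.

```python
def to_bool_query(user_query, field="text"):
    """Translate '+term' markers into a bool query.

    Terms with a leading '+' are required (must); the others are
    optional (should). The field name is a placeholder.
    """
    must, should = [], []
    for token in user_query.split():
        required = token.startswith("+")
        term = token.lstrip("+")
        if not term:
            continue  # a bare '+' carries no term
        clause = {"match": {field: term}}
        (must if required else should).append(clause)

    bool_query = {}
    if must:
        bool_query["must"] = must
    if should:
        bool_query["should"] = should
    return {"query": {"bool": bool_query}}


for example in ["search engine", "+search engine",
                "search +engine", "+search +engine"]:
    print(example, "->", to_bool_query(example))
```

With no + at all, every term is a should clause, which behaves like the OR default: documents matching both terms score higher and come first.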

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs



(simonw-2) #7

On Thursday, September 5, 2013 6:59:00 AM UTC+2, David Pilato wrote:

Very cool. Congrats Nikolas!
How did you implement the "did you mean" feature? Suggest API?

I can answer this I guess :slight_smile: - Yes that is the phrase suggester :slight_smile:



(Steve frank) #8

I would prefer this API for MediaWiki hosting: http://www.cloudways.com/en/managed-mediawiki-cloud-hosting.php
They are good at security, and scalability should be the utmost
priority. Go with Cloudways, as they provide a custom cloud environment.



(Nik Everett) #9

On Thu, Sep 5, 2013 at 4:43 AM, simonw simon.willnauer@elasticsearch.com wrote:

On Thursday, September 5, 2013 6:59:00 AM UTC+2, David Pilato wrote:

Very cool. Congrats Nikolas!
How did you implement the "did you mean" feature? Suggest API?

I can answer this I guess :slight_smile: - Yes that is the phrase suggester :slight_smile:

Yeah, that is the phrase suggester running in always suggest mode with a
high confidence. The other trick to making the phrase suggestions good is
to split the index into two parts - one for the content pages and one for
the other pages. That way searches that only search the content pages
(most of them) only get phrase suggestions that make sense in the content
pages. If you don't do that you get suggestions based on the names of
templates which is useless to most folks.
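
Roughly, the setup described above could look like the sketch below. The index names, the field name "suggest", and the confidence value are all placeholders, not the real CirrusSearch configuration. The phrase suggester options used (field, confidence, and a direct_generator with suggest_mode set to "always") are part of the suggest API.

```python
CONTENT_INDEX = "wiki_content"  # hypothetical name: content pages only
GENERAL_INDEX = "wiki_general"  # hypothetical name: templates, talk, etc.


def pick_index(content_only):
    """Choose the index so suggestions come from the pages being searched."""
    return CONTENT_INDEX if content_only else GENERAL_INDEX


def did_you_mean_body(text, field="suggest", confidence=2.0):
    """Build the suggest section of a search request.

    confidence=2.0 is an illustrative value for "high confidence";
    the field name is a placeholder.
    """
    return {
        "suggest": {
            "text": text,
            "did_you_mean": {
                "phrase": {
                    "field": field,
                    "confidence": confidence,
                    "direct_generator": [
                        # "always" proposes candidates even for terms
                        # that already exist in the index
                        {"field": field, "suggest_mode": "always"}
                    ],
                }
            },
        }
    }


print(pick_index(content_only=True))
print(did_you_mean_body("serach engine"))
```

Sending the body to the content index keeps template names out of the candidate pool, which is the point of splitting the index in two.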

Simonw has reviewed a few phrase suggester changes I've submitted that
should stop some of the odd responses it sometimes gives in our case.
We're not using those changes yet but will get them when we switch to
0.90.4. If I have time to do
https://github.com/elasticsearch/elasticsearch/issues/3482 then that'll
help even more but I doubt I'll find time to get it reviewed before you cut
0.90.4 so it'll have to wait.

The search implementation currently running on Wikipedia does something
pretty similar with suggestions. In many ways it was way ahead of its time
in the open source world.

Nik



(Nik Everett) #10

On Thu, Sep 5, 2013 at 2:32 AM, David Pilato david@pilato.fr wrote:

+1 with Lukáš: i prefer having the choice as a user to enter:
search engine
+search engine
search +engine
+search +engine

By default, I'm pretty sure I will use the first in 99% as long as the
more relevant come before.

My 0.01 cent in addition of Lukáš's 2 cents :wink:

Hmmm. You guys make a compelling argument. I think I'm going to continue
with switching to AND for now because that is the least change from what
exists now. I'll keep an issue for switching to OR in the case of not
enough results. Or maybe if the user asks for it. This is the bug for the
curious: https://bugzilla.wikimedia.org/show_bug.cgi?id=52904
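
The fallback idea could be sketched like this. It is an assumption about how such a fallback might work, not what the bug describes in detail: run the query with AND first and retry with OR only when too few hits come back. The toy_search function, the sample documents, and the min_hits threshold are invented for illustration.

```python
DOCS = ["video codecs", "media files", "video and media formats"]


def toy_search(user_query, operator):
    """Stand-in for a real search call: whole-word matching on DOCS."""
    terms = user_query.lower().split()
    combine = all if operator == "AND" else any
    return [d for d in DOCS
            if combine(t in d.lower().split() for t in terms)]


def search_with_fallback(run_search, user_query, min_hits=5):
    """Try AND first; fall back to OR when AND finds too few results."""
    hits = run_search(user_query, "AND")
    if len(hits) >= min_hits:
        return hits, "AND"
    return run_search(user_query, "OR"), "OR"


hits, used = search_with_fallback(toy_search, "video media", min_hits=2)
print(used, hits)
```

The second element of the return value records which operator produced the results, so the interface could tell the user when the search was widened.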


