Search when url is one of the terms


(Wiki) #1

Hi,

My index holds web metadata objects, with a mapping like:
{
  :id => { :type => 'string', :store => true },
  :title => { :type => 'string', :analyzer => 'snowball', :boost => 5 },
  :url => { :type => 'string', :index => 'not_analyzed', :store => true },
  :author => { :type => 'string', :index => 'not_analyzed', :store => true },
  :summary => { :type => 'string', :analyzer => 'snowball', :boost => 10 },
  :content => { :type => 'string', :analyzer => 'snowball', :boost => 8 },
  :published => { :type => 'date', :format => 'yyyy-MM-dd HH:mm:ss', :index => 'not_analyzed', :store => true },
  :updated => { :type => 'date', :format => 'yyyy-MM-dd HH:mm:ss', :index => 'not_analyzed', :store => true },
  :categories => { :type => 'string', :index => 'not_analyzed', :store => true },
  :image => { :type => 'string', :index => 'not_analyzed', :store => true },
  :site => { :type => 'string', :index => 'snowball', :store => true }
}

The id is the URL.

  1. What is the right indexing for the url/id fields, assuming I want to
    query by URL? Should I add an analyzer to these fields?

  2. I can't find a way to search for website data by its URL. What am
    I doing wrong? I also tried encoding the URL, but that doesn't help:
    curl -XGET http://localhost:9200/article-dev/_search -d '{
    "query" : { "term" : { "_id": "http://thenextweb.com/media/2012/01/02/uk-music-download-sales-grew-by-26-6-in-2011-but-the-industrys-still-in-decline/" } } }'

Thanks,

Viki


(Ivan Brusic) #2
  1. It all depends on what type of queries you want to execute. If the
    URL is simply a key, where you do not care to search inside the value,
    then the URL should not be analyzed. Analyzing URLs is difficult since
    they are not words. If you want to search inside the URLs, there are
    many different routes to take, all of which require decomposing the
    data on the client side into smaller tokens.

  2. Strings are analyzed by default in ElasticSearch, therefore your
    :id field will be analyzed when indexed. A term query is not
    analyzed, so your search will not work. Your :url field should work, however.
    Does searching against that field work? If not, gist an example
    document.
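
To make point 2 concrete, here is a toy Ruby sketch of the difference. The `crude_analyze` tokenizer is a hypothetical stand-in (lowercase, split on non-alphanumerics), not ElasticSearch's actual analyzer, and the shortened URL is made up for illustration. The only point is that an analyzed field is indexed as several small terms, so a term query for the whole URL finds nothing, while a not_analyzed field is indexed as one single term.

```ruby
# Toy model of analyzed vs. not_analyzed indexing.
# ASSUMPTION: crude_analyze is a simplified stand-in, NOT the real
# ElasticSearch analyzer; real tokenization rules differ in detail.
def crude_analyze(text)
  text.downcase.split(/[^a-z0-9]+/).reject(&:empty?)
end

# Terms that end up in the index for one field value.
def indexed_terms(value, analyzed)
  analyzed ? crude_analyze(value) : [value]
end

# A term query does not analyze its input: it looks for one exact term.
def term_query_matches?(terms, term)
  terms.include?(term)
end

# Hypothetical, shortened URL used only for this illustration.
url = 'http://thenextweb.com/media/2012/01/02/example/'

puts indexed_terms(url, true).inspect
puts term_query_matches?(indexed_terms(url, true), url)   # analyzed field
puts term_query_matches?(indexed_terms(url, false), url)  # not_analyzed field
```

With the analyzed field the whole-URL lookup fails because only the fragments are indexed; with the not_analyzed field it succeeds.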

--
Ivan



(Wiki) #3

Thanks for your response.

  1. I don't need to tokenize the URL; I just need to search for it as a
    single string.
  2. Searching against this field (:url) doesn't work:

{"_index":"aunticles-dev-thenextweb","_type":"article",
 "_id":"http://thenextweb.com/apple/2012/02/23/forgotten-apple-founder-takes-to-facebook-to-explain-his-decision-to-quit-after-12-days/",
 "_score":1.0,
 "_source" : {
   "id":"http://thenextweb.com/apple/2012/02/23/forgotten-apple-founder-takes-to-facebook-to-explain-his-decision-to-quit-after-12-days/",
   "title":"Forgotten Apple founder takes to Facebook to explain his decision to quit after 12 days",
   "summary":"Yesterday, third Apple founder Ron Wayne published an essay on Facebook about his decision to leave Apple Computer after only 12 days.",
   "image":"http://cdn.thenextweb.com/wp-content/blogs.dir/1/files/2012/02/Photoxpress_23083180-300x250.jpg",
   "categories":"",
   "published":"2012-03-14 13:24:43",
   "updated":null,
   "type":"article",
   "site":null}}

Thanks a lot!



(Shay Banon) #4

Can you gist a full curl recreation, including indexing the relevant data
and setting up the mapping? (See http://www.elasticsearch.org/help.)

A few notes:

  1. You don't need to explicitly store each field. By default, the whole
    _source JSON document you indexed is stored, so you end up storing things
    twice: once for each individual field, and once for the whole doc.

  2. If the url field is not analyzed, you can look it up with a term query,
    using exactly the same URL it was indexed with.
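
As a rough sketch of both notes, the Ruby snippet below builds a trimmed mapping without the per-field :store flags and a term query on the url field, then prints both as JSON request bodies. The type name `article` comes from the sample document earlier in the thread; the names `MAPPING` and `url_term_query`, the choice of three fields, and marking :id as not_analyzed are assumptions made for illustration, not a confirmed fix. A term query can only match documents that actually contain the field, and the sample document posted earlier shows no url key in its _source, which may be worth checking.

```ruby
require 'json'

# ASSUMPTION: a trimmed, illustrative mapping. The per-field :store flags
# are dropped because the whole _source document is stored by default.
MAPPING = {
  :article => {
    :properties => {
      :id    => { :type => 'string', :index => 'not_analyzed' },
      :url   => { :type => 'string', :index => 'not_analyzed' },
      :title => { :type => 'string', :analyzer => 'snowball', :boost => 5 }
    }
  }
}

# Hypothetical helper: builds a term query body for the url field.
# The URL must be exactly the string that was indexed.
def url_term_query(url)
  { :query => { :term => { :url => url } } }
end

# Print the request bodies; sending them (for example with curl -d) is
# left out so the sketch runs without a server.
puts JSON.generate(MAPPING)
puts JSON.generate(url_term_query(
  'http://thenextweb.com/media/2012/01/02/uk-music-download-sales-grew-by-26-6-in-2011-but-the-industrys-still-in-decline/'
))
```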


