Help with ElasticSearchException due to data format?


(utsharon) #1

We are using Elasticsearch 0.18.5 and Pyes 0.16, and we have successfully
indexed many objects. Before indexing, we strip HTML tags and "\r\n" from
the content. However, when we try to render the search results, we get
the error below. I'm hoping you can help me make sense of it. My guess is
that there is some other content I need to remove before saving to the
index. Or is there something that can be done on the retrieval end? Thanks
for any help you can provide! The error is triggered by the 3rd hit
(id = kns.ellington.story.157066); there is no error if I retrieve only
the first two hits listed below.
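For reference, the stripping step described above might look roughly like the following. This is only a sketch under stated assumptions: the poster's actual code is not shown in the thread, and `TAG_RE` and `strip_markup` are hypothetical names. It illustrates that removing tags and "\r\n" leaves any non-ASCII characters in the text untouched, and that sentences end up run together, as they are in the results below.

```python
import re

# Hypothetical pattern: anything between "<" and ">" is treated as a tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(text):
    """Remove HTML tags and CRLF pairs, leaving all other characters alone."""
    without_tags = TAG_RE.sub("", text)
    return without_tags.replace("\r\n", "")


# Made-up sample containing a right single quotation mark (U+2019).
sample = u"<p>He didn\u2019t flash a gun.</p>\r\n<p>The clerk complied.</p>"
cleaned = strip_markup(sample)
```

Here `cleaned` still contains U+2019, so a step like this would not remove curly quotes or dashes from the content before it is indexed.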

================

import pyes

conn = pyes.ES(['localhost:9200'])

conn.search(
    {
        'highlight': {
            'pre_tags': [''],
            'fragment_size': 250,
            'number_of_fragments': 1,
            'fields': {'content': {}, 'tease': {}},
            'post_tags': [''],
        },
        # Scale the text relevance score by log10 of the pub_date value.
        'query': {'custom_score': {
            'lang': 'mvel',
            'query': {'query_string': {'query': 'ford'}},
            'script': '_score*(log10(doc.pub_date.longValue))',
        }},
        'facets': {'type': {'terms': {'field': 'type', 'size': 10}}},
        'fields': ['title', 'tease', 'pub_date', 'django_ct', 'django_id',
                   'type', 'link_url', 'photo_url', 'thumbnail_url', 'author'],
    },
    indexes=['kns'],
    doc_types=['story', 'neighborhood', 'restaurant', 'photo', 'ugcstory',
               'gallery', 'video', 'flatpage', 'place'],
    size='3')

Traceback (most recent call last):
  File "", line 1, in ?
  File "/opt/local/btop_bundle/lib/python/pyes/es.py", line 851, in search
    return self._query_call("_search", body, indexes, doc_types, *query_params)
  File "/opt/local/btop_bundle/lib/python/pyes/es.py", line 261, in _query_call
    response = self._send_request('GET', path, body, querystring_args)
  File "/opt/local/btop_bundle/lib/python/pyes/es.py", line 230, in _send_request
    raise pyes.exceptions.ElasticSearchException(response.body, response.status, response.body)
pyes.exceptions.ElasticSearchException: {"took":23,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":8,"max_score":4.841233,"hits":[{"_index":"kns","_type":"story","_id":"kns.ellington.story.156955","_score":4.841233,"fields":{"author":"David
Moon","django_ct":"news.stories","title":"Moon: How would you like to be
Ford right now?
","pub_date":"2009-06-14T00:00:42","link_url":"/news/2009/jun/14/how-would-you-like-to-be-ford-right-now/","type":"story","django_id":"156955","tease":"Several
years ago, I served as treasurer of the Knoxville-Knox County Public
Building Authority. There was some discussion at the time about whether or
not the agency should offer construction and contract management services
in areas outside of Knox County, thereby placing this taxpayer-funded
entity in direct competition with private businesses that provided the same
services. "},"highlight":{"content":[" of the aspects of private
businesses' operations in the construction management industry.It was a
terrible plan, and I argued against it. It was morally wrong. It was bad
economics.How would you like to be Ford Motor Co. right
now?Ford eschewed crawling
up"]}},{"_index":"kns","_type":"story","_id":"kns.ellington.story.157182","_score":4.6320467,"fields":{"author":"","django_ct":"news.stories","title":"NASCAR
Sprint Cup-LifeLock 400 Results
","photo_url":"img/photos/2009/06/14/061409markmartinwin.jpg","pub_date":"2009-06-14T18:27:00","link_url":"/news/2009/jun/14/nascar-sprint-cup-lifelock-400-results/","type":"story","django_id":"157182","tease":"Results
from the Lifelock 400 in Michigan."},"highlight":{"content":["1. (32) Mark
Martin, Chevrolet, 200 laps, 113.7 rating, 190 points.2. (27) Jeff Gordon,
Chevrolet, 200, 100.3, 170.3. (14) Denny Hamlin, Toyota, 200, 108.9, 170.4.
(29) Carl Edwards, Ford, 200, 100.9, 165.5. (20) Greg
Biffle, Ford, 200, 124, 160.6"]}},
{"_index":"kns","_type":"story","_id":"kns.ellington.story.157066","_score":2.6895716,"fields":{"author":"News
Sentinel staff","django_ct":"news.stories","title":"Robber hits Kingston
Pike
Walgreens","pub_date":"2009-06-13T16:03:00","link_url":"/news/2009/jun/13/robber-hits-kingston-pike-walgreens/","type":"story","django_id":"157066","tease":"KNOXVILLE
— Police are searching for the man who robbed a West Knoxville drugstore
early today"},"highlight":{"content":[". Kenny Miller said. He didn’t flash
a gun but claimed to have one.The clerk complied, and the man drove away in
a silver, four-door Ford Escape headed south on Peters
Road. Police didn’t say what kind of drugs he got.Police described the
robber"]}}]}

,"facets":{"type":{"_type":"terms","missing":0,"total":8,"other":0,"terms":[{"term":"story","count":8}]}}}


(utsharon) #2

I think I see the problem, but I'm not sure how to address it. I'm hoping
someone can get me started on which layer to fix this in.

Taking pyes out of the loop, I ran:
curl -XGET 'http://localhost:9200/kns/_search?pretty=true' -d \
'{"query":{"term":{"content":"ford"}}}'

The first two results do not contain Unicode characters and come back
just fine, but the third one does, and it is the one that triggers the
exception from the original post. Here is the hit that breaks; it contains
Unicode characters (see, for example, didn\u2019t below):

{
"_index" : "kns",
"_type" : "story",
"_id" : "kns.ellington.story.157066",
"_score" : 0.37917396, "_source" : {"content": "KNOXVILLE \u2014
Police are searching for the man who robbed a West Knoxville drugstore
early today.The man walked into the Walgreens, 8950 Kingston Pike, just
after 2:10 a.m. and handed the clerk a note demanding drugs, Knoxville
Police Department Lt. Kenny Miller said. He didn\u2019t flash a gun but
claimed to have one.The clerk complied, and the man drove away in a silver,
four-door Ford Escape headed south on Peters Road. Police didn\u2019t say
what kind of drugs he got.Police described the robber as Hispanic, about 5
feet, 11 inches tall with a medium build and short, black hair. He wore a
blue hat and blue T-shirt, both with lettering, and white tennis
shoes.Police asked that anyone with information in the case call
865-215-7212.More details as they develop online and in Sunday\u2019s News
Sentinel.", "django_id": "157066", "title": "Robber hits Kingston Pike
Walgreens", "type": "story", "link_url":
"/news/2009/jun/13/robber-hits-kingston-pike-walgreens/", "author": "News
Sentinel staff", "sites": ["Knoxville News Sentinel"], "tease": "KNOXVILLE
\u2014 Police are searching for the man who robbed a West Knoxville
drugstore early today", "django_ct": "news.stories", "pub_date":
"2009-06-13T16:03:00", "id": "kns.ellington.story.157066", "categories":
["News/Local News/"]}
},
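One way to narrow down the layer, sketched here with only the Python standard library (no pyes calls), is to parse a hit like the one above and list the non-ASCII characters in each field. The sample string is an abbreviated stand-in for the real hit, and `non_ascii_chars` and `ascii_safe` are hypothetical helper names. If the JSON parses cleanly, that would suggest the response itself is well formed and that the problem lies in how the client or rendering code handles non-ASCII text; that inference is an assumption, not something confirmed in this thread.

```python
import json

# Abbreviated stand-in for the third hit; field values are shortened.
# The doubled backslashes keep the \uXXXX escapes in the JSON text itself,
# the same way they appear in the curl output.
raw = (
    '{"_id": "kns.ellington.story.157066", '
    '"_source": {"content": "He didn\\u2019t flash a gun", '
    '"tease": "KNOXVILLE \\u2014 Police are searching"}}'
)


def non_ascii_chars(text):
    """Return the sorted distinct characters in text with code points above 127."""
    return sorted(set(ch for ch in text if ord(ch) > 127))


def ascii_safe(text):
    """Report whether text can be encoded as ASCII without raising."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


hit = json.loads(raw)  # parsing turns \u2019 into the actual character
for field, value in sorted(hit["_source"].items()):
    codes = [hex(ord(ch)) for ch in non_ascii_chars(value)]
    print("%s: ascii_safe=%s non_ascii=%s" % (field, ascii_safe(value), codes))
```

Any field reported as not ASCII-safe would need an explicit encoding such as UTF-8 wherever the text is written out or converted to bytes.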

I'm not sure at what layer to fix this. Any insight would be a great help.
Thank you!
Sharon

On Wednesday, August 1, 2012 10:49:17 AM UTC-4, Sharon wrote:

