Update scripting to automate REST call to external analytic service?


(Josh Harrison) #1

Say I have a tool that takes a shortened URL like
{"url":"http://goo.gl/TTX8j5"} and returns the expanded URL, like {"url":
"http://goo.gl/TTX8j5", "resolved_url": "http://www.elasticsearch.org/"}. It
will even do this when the link is routed through, say, goo.gl to bit.ly to
tinyurl before reaching the final page. The "expanded_url" field present in
the Twitter stream only seems to go one hop.

Say I have a couple hundred million records already loaded that I'd like to
go back and expand the URLs for. I was all set to start pulling records out
with Python (my preferred language), updating them, and bulk loading them
back in. But then I realized that you can script with the bulk update
function.

Can I, with a data structure like the following, send every
entities.urls.url to my REST call at localhost:8888/urlexpansion and insert
the resulting resolved_url into the entities.urls structure with the update
scripting function?

Further, is there a practical way for me to extract the IDs of all the
records that have objects in entities.urls?

"_source": {
  "filter_level": "medium",
  "contributors": null,
  "text": "",
  "geo": null,
  "retweeted": false,
  "in_reply_to_screen_name": null,
  "truncated": false,
  "lang": "und",
  "entities": {
    "symbols": [],
    "urls": [{
      "url": "https://t.co/XdXRudPXH5",
      "expanded_url": "https://blog.twitter.com/2013/rich-photo-experience-now-in-embedded-tweets-3",
      "display_url": "blog.twitter.com/2013/rich-phot\u2026",
      "indices": [80, 103]
    }],
    "hashtags": [],
    "user_mentions": []
  },
  "in_reply_to_status_id_str": null,
  "id":,
  "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone",
  "in_reply_to_user_id_str": null,
  "favorited": false,
  "in_reply_to_status_id": null,
  "retweet_count": 0,
  "created_at": "Tue Nov 26 01:03:06 +0000 2013",
  "in_reply_to_user_id": null,
  "favorite_count": 0,
  "id_str": "123",
  "place": null,
  "user": {
    ...
  },
  "coordinates": null
}
}
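
For the second question (finding the IDs of records that have something in entities.urls), one possible approach is an "exists" condition on entities.urls.url with the source turned off, so only hit metadata comes back. The sketch below only builds the search body and reads IDs out of a response; the function names and the page size are made up for illustration. It assumes the "exists" query of more recent Elasticsearch releases; on the 0.90/1.x releases the same condition was written as an exists filter. For a couple hundred million records the search would also need to be paged (for example with the scroll API) rather than fetched in one request.

```python
import json


def build_ids_query(field="entities.urls.url", page_size=1000):
    """Search body matching documents where `field` has at least one value.

    Assumption: query-DSL `exists` syntax of newer Elasticsearch releases.
    """
    return {
        "size": page_size,
        "_source": False,  # only the hit metadata (_id) is needed
        "query": {"exists": {"field": field}},
    }


def extract_ids(response):
    """Pull the document IDs out of a parsed search response."""
    return [hit["_id"] for hit in response["hits"]["hits"]]


if __name__ == "__main__":
    # Print the body that would be sent to the _search endpoint.
    print(json.dumps(build_ids_query(), indent=2))
```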

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/cf62a331-4e50-46df-a5cd-befff58fbe19%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Alexander Reelsen) #2

Hey,

Just check
https://github.com/elasticsearch/elasticsearch/blob/master/src/test/java/org/elasticsearch/update/UpdateByNativeScriptTests.java

You can simply add the update request to a bulk request and have a native
script do the HTTP lookups, and you are done.

But to be honest, I would not do this inside Elasticsearch. I would rather
run the lookup in an external system (or a Python script) and then call the
update API to update only the field that holds the URL. The lookup is a
blocking call that has to wait for an HTTP request to return, and I would
not delegate that to Elasticsearch, as it is a search engine.
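
A rough sketch of that external approach in Python, under several assumptions: the expander at localhost:8888/urlexpansion is assumed to accept a JSON POST of {"url": ...} and answer with a "resolved_url" key (the thread shows the payloads but not the HTTP method), and the helper names resolve and bulk_update_lines are invented here. Because a partial-document update replaces arrays rather than merging them, the whole entities.urls array is sent back with resolved_url added to each entry; the other keys under entities are left alone by the merge.

```python
import json
import urllib.request

EXPANDER = "http://localhost:8888/urlexpansion"  # endpoint named in the question


def resolve(url, endpoint=EXPANDER):
    """Ask the expander service for the final URL.

    Assumption: the service takes a JSON POST; the thread does not say.
    """
    req = urllib.request.Request(
        endpoint,
        data=json.dumps({"url": url}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["resolved_url"]


def bulk_update_lines(hit, resolver=resolve):
    """Return the two newline-delimited JSON lines of one bulk `update` action."""
    # Copy each url entry and add the resolved target to the copy.
    urls = [dict(u, resolved_url=resolver(u["url"]))
            for u in hit["_source"]["entities"]["urls"]]
    # Older releases also expect "_type"; it is passed through when present.
    meta = {k: hit[k] for k in ("_index", "_type", "_id") if k in hit}
    action = {"update": meta}
    # Partial document: only entities.urls is rewritten.
    payload = {"doc": {"entities": {"urls": urls}}}
    return json.dumps(action) + "\n" + json.dumps(payload) + "\n"
```

The lines returned for each hit can be concatenated and sent to the _bulk endpoint in batches, which keeps the blocking HTTP lookups outside Elasticsearch.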

Hope this helps...

--Alex

On Tue, Dec 10, 2013 at 6:36 PM, Josh Harrison hijakk@gmail.com wrote:


