Unable to "import" json file into ES2.1


#1

Why does this not work in ES 2.1?

curl -XPOST "http://localhos:9200/_bulk" --data-binary @I:\ES\flu_tweet_file.json {"create": {"_index": "flu", "_type": "tweets", "_id": 1}} \n {"title": "flu tweets"} \n

(note: the json file is the direct result from a search of twitter, so it is genuine.)

The errors I got:

type: illegal_argument_exception / Malformed action/metadata line [1], expected START_OBJECT or END_OBJECT but found [VALUE_STRING], status 400
curl: (3) [globbing] unmatched brace in column 1
curl: (6) could not resolve host: flu,
curl: (6) could not resolve host: _type
curl: (6) could not resolve host: tweets,
curl: (6) could not resolve host: _id
curl: (3) [globbing] unmatched close brace/bracket in column 2
curl: (6) could not resolve host: \n
curl: (3) [globbing] unmatched close brace in column 1
curl: (3) [globbing] unmatched close brace/bracket in column 11
curl: (6) could not resolve host: \n


(David Pilato) #2

Why do you add JSON content after the filename?


#3

This curl:
c:>curl -X POST "http://localhost:9200/_bulk" -d @I:\ES\flu_tweet_file.json

results in this error:

{"error":{"root_cause":[{"type":"action_request_validation_exception","reason":"
Validation Failed: 1: no requests added;"}],"type":"action_request_validation_ex
ception","reason":"Validation Failed: 1: no requests added;"},"status":400}

and when I try this curl:
c:>curl -XPOST "http://localhost:9200/_bulk" --data-binary @I:\ES\flu_tweet_file.json

I get this error:

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Malformed action/metadata line [1], expected START_OBJECT or END_OBJECT but found [VALUE_STRING]"}],"type":"illegal_argument_exception","reason":"Malformed action/metadata line [1], expected START_OBJECT or END_OBJECT but found [VALUE_STRING]"},"status":400}

In short, I still can't get ES to "import" (index) a (bulk) json file.


(David Pilato) #4

What is in flu_tweet_file.json?


#5

Twitter API search result for "flu" resulted in json file format


#6

Excerpt:

{"contributors": null, "truncated": false, "text": "Vaccinate your child against flu. More at: https://t.co/3gHB3j5r0a #staywellthiswinter https://t.co/3XDXiAvFHU", "is_quote_status": false, "in_reply_to_status_id": null, "id": 669420613761671168, "favorite_count": 0, "source": "<a href="http://www.socialsignin.co.uk" rel="nofollow">SocialSignIn Application", "retweeted": false, "coordinates": null, "entities": {"symbols": [], "user_mentions": [], "hashtags": [{"indices": [68, 87], "text": "staywellthiswinter"}], "urls": [{"url": "https://t.co/3gHB3j5r0a", "indices": [44, 67], "expanded_url": "http://socsi.in/EBI6q", "display_url": "socsi.in/EBI6q"}], "media": [{"expanded_url": "http://twitter.com/west_lei_ccg/status/669420613761671168/photo/1", "display_url": "pic.twitter.com/3XDXiAvFHU", "url": "https://t.co/3XDXiAvFHU", "media_url_https": "https://pbs.twimg.com/media/CUpCgCuWUAAf0By.jpg", "id_str": "669420612872458240", "sizes": {"small": {"h": 226, "resize":


(David Pilato) #7

That's not an expected bulk format. Look at the docs.


#8

Following the docs,
this curl gives this error:
curl "http://localhost:9200/fluproj/tweets/4901" -d @I:\ES\flu_tweet_file.json

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}],"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"illegal_argument_exception","reason":"Malformed content, found extra data after parsing: START_OBJECT"}},"status":400}

and this one:
curl -XPOST "http://localhost:9200/fluproj/tweets/4901" --data-binary @I:\ES\flu_tweet_file.json

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}],"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"illegal_argument_exception","reason":"Malformed content, found extra data after parsing: START_OBJECT"}},"status":400}

Please point me to the appropriate docs you are referring to. Thanks.


(David Pilato) #9

Did you search for "bulk format" in docs?

If you did, you should have read https://www.elastic.co/guide/en/elasticsearch/guide/current/bulk.html which is clear enough I think.


#10

Yes, I did read that document. Several times. Please look at my posts again: I have made attempts to follow the document. It's when the attempts fail that I try other things.

Here are the issues as I see / understand them:

  1. When I use Kibana Sense (per ES 2.1 / Kibana 4.3), I get the same errors as when I use command-line curl.
  2. When I use curl, I have not found a way to add a new line and complete the {action} statements.

Based on the posts above and the documents you referred me to, which should be simple enough but haven't worked out that way for me:
a) What am I missing?
b) Is my JSON file in the wrong format? Is it pretty-printed? I didn't design it that way; I just used the search result as is.
c) Do you need to see the Sense statements and results?
d) What else can I do?
Thanks.


(David Pilato) #11

How can I know what you are doing without a concrete example?

curl -XPOST "http://localhos:9200/_bulk" --data-binary @I:\ES\flu_tweet_file.json {"create": {"_index": "flu", "_type": "tweets", "_id": 1}} \n {"title": "flu tweets"} \n

This is the first thing you posted and this is wrong.

{"contributors": null, "truncated": false, "text": "Vaccinate your child against flu. More at: https://t.co/3gHB3j5r0a #staywellthiswinter https://t.co/3XDXiAvFHU", "is_quote_status": false, "in_reply_to_status_id": null, "id": 669420613761671168, "favorite_count": 0, "source": "SocialSignIn Application", "retweeted": false, "coordinates": null, "entities": {"symbols": [], "user_mentions": [], "hashtags": [{"indices": [68, 87], "text": "staywellthiswinter"}], "urls": [{"url": "https://t.co/3gHB3j5r0a", "indices": [44, 67], "expanded_url": "http://socsi.in/EBI6q", "display_url": "socsi.in/EBI6q"}], "media": [{"expanded_url": "http://twitter.com/west_lei_ccg/status/669420613761671168/photo/1", "display_url": "pic.twitter.com/3XDXiAvFHU", "url": "https://t.co/3XDXiAvFHU", "media_url_https": "https://pbs.twimg.com/media/CUpCgCuWUAAf0By.jpg", "id_str": "669420612872458240", "sizes": {"small": {"h": 226, "resize":

This is also wrong.

This is the only thing that I can tell based on what you provided so far.

So please reproduce your error with a script you can share and share this here or on gist.github.com. Read https://www.elastic.co/help/ if you need details.

If we can't reproduce your error, we can't help.

Also, beware of curl on windows. It really sucks.
Yes you can consider SENSE. It supports bulk format.

Is my json file in the wrong format (is it pretty-printed)

Most likely yes. The BULK format states:

The lines cannot contain unescaped newline characters, as they would interfere with parsing. This means that the JSON must not be pretty-printed.
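In other words, each document must be preceded by its own single-line action/metadata object. A minimal sketch in Python (the index/type names are taken from the earlier posts; the tweets and the output filename flu_bulk_file.json are made up for illustration):

```python
import json

# Illustrative tweets standing in for real Twitter search results.
tweets = [
    {"id": 1, "text": "Vaccinate your child against flu."},
    {"id": 2, "text": "H3N2 cases rising this winter."},
]

with open("flu_bulk_file.json", "w") as out:
    for tweet in tweets:
        # Action/metadata line: tells ES what to do with the next line.
        action = {"index": {"_index": "flu", "_type": "tweets", "_id": tweet["id"]}}
        out.write(json.dumps(action) + "\n")
        # Source line: the document itself, compact, on a single line.
        out.write(json.dumps(tweet) + "\n")
```

The resulting file can then be sent with `curl -XPOST "http://localhost:9200/_bulk" --data-binary @flu_bulk_file.json`, with nothing on the command line after the @filename.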


#12

Thank you for your ongoing interest in helping me out.
Task: generate a JSON file from a Twitter search on a topic, to be indexed later in ES. The Twitter search is performed in Python, and the resulting JSON file is "new_tweet_file.json".

Python code is posted below. The search term is "H3N2".

I am open to other ways of performing the same task, so long as it yields a correct-format JSON file "importable" into ES for indexing.
Thanks in advance.

from __future__ import division, print_function 
import twitter  # work with Twitter APIs
import json  # methods for working with JSON data
windows_system = True  # set to True if this is a Windows computer
if windows_system:
    line_termination = '\r\n'  # Windows line termination
else:
    line_termination = '\n'  # Unix/Linux/Mac line termination

json_filename = 'new_tweet_file.json'

full_text_filename = 'new_tweet_review_file.txt'  

partial_text_filename = 'new_tweet_text_file.txt'  

def oauth_login():

    CONSUMER_KEY = ''
    CONSUMER_SECRET = ''
    OAUTH_TOKEN = ''
    OAUTH_TOKEN_SECRET = ''
    
    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)
    
    twitter_api = twitter.Twitter(auth=auth)
    return twitter_api


def twitter_search(twitter_api, q, max_results=200, **kw):
  
    search_results = twitter_api.search.tweets(q=q, count=100, **kw)
    
    statuses = search_results['statuses']
    
    max_results = min(1000, max_results)
    
    for _ in range(10): # 10*100 = 1000
        try:
            next_results = search_results['search_metadata']['next_results']
        except KeyError:  # no more results when next_results doesn't exist
            break

        kwargs = dict([kv.split('=')
                       for kv in next_results[1:].split("&")])

        search_results = twitter_api.search.tweets(**kwargs)
        statuses += search_results['statuses']

        if len(statuses) > max_results:
            break

    return statuses

twitter_api = oauth_login()   
print(twitter_api)  # verify the connection

q = "*H3N2*"  # one of many possible search strings
results = twitter_search(twitter_api, q, max_results = 200)  # limit to 200 tweets

print('\n\ntype of results:', type(results)) 
print('\nnumber of results:', len(results)) 
print('\ntype of results elements:', type(results[0]))

item_count = 0  # initialize count of objects dumped to file
with open(json_filename, 'w') as outfile:
    for dict_item in results:
        json.dump(dict_item, outfile, encoding = 'utf-8')
        item_count = item_count + 1
        if item_count < len(results):
             outfile.write(line_termination)  # new line between items
                     
item_count = 0  # initialize count of objects dumped to file
with open(full_text_filename, 'w') as outfile:
    for dict_item in results:
        outfile.write('Item index: ' + str(item_count) +\
             ' -----------------------------------------' + line_termination)
        # indent for pretty printing
        outfile.write(json.dumps(dict_item, indent = 4))  
        item_count = item_count + 1
        if item_count < len(results):
             outfile.write(line_termination)  # new line between items  
        
item_count = 0  # initialize count of objects dumped to file
with open(partial_text_filename, 'w') as outfile:
    for dict_item in results:
        outfile.write(json.dumps(dict_item['text']))
        item_count = item_count + 1
        if item_count < len(results):
             outfile.write(line_termination)  # new line between text items  

Next step is to index the result in json file format in ES:

curl -XPOST "http://localhost:9200/fluproj/_bulk" -d @I:\ES\new_tweet_file.json

The errors and difficulties I have been posting come from using the bulk method as above.


(David Pilato) #13

What is in new_tweet_file.json?

I am open to other ways of performing the same task, so long as it yields a correct-format JSON file "importable" into ES for indexing.

Use Logstash and its twitter input. You won't need to write and debug your own code...

I wrote a blog post about it: http://david.pilato.fr/blog/2015/06/01/indexing-twitter-with-logstash-and-elasticsearch/


#14

new_tweet_file.json is the output / result of the Twitter search per Python code above.
Thanks for the new link. I am looking into it.


(David Pilato) #15

Yeah I mean that I'm not going to execute your code so I was asking if you can share this file on gist.github.com for example.


#16

Got the 627 KB file ready. Not sure how to share it on gist.github.com; I'm getting a "server not found" browser error. How else can I upload it?
Thanks.

In the meantime, I need to read up on Logstash. I'd like to use the script you supplied.

BTW: Where can I buy the book which I am certain you have published on ES?


(David Pilato) #17

https://gist.github.com/


(David Pilato) #20

It's incorrect and does not respect the bulk format.

Again, read the documentation: https://www.elastic.co/guide/en/elasticsearch/guide/current/bulk.html
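Concretely, a sketch of how the per-line tweet file could be converted to bulk format, assuming new_tweet_file.json holds one compact JSON object per line as the Python script above writes it (the sample tweets and the output name new_bulk_file.json are made up for illustration):

```python
import json

# For demonstration, write a two-tweet sample standing in for
# new_tweet_file.json (one compact JSON object per line).
sample = [
    {"id": 100, "text": "H3N2 strain detected."},
    {"id": 101, "text": "Flu season update."},
]
with open('new_tweet_file.json', 'w') as f:
    for t in sample:
        f.write(json.dumps(t) + '\n')

# Prepend the action/metadata line the _bulk endpoint requires
# before every document.
with open('new_tweet_file.json') as infile, \
        open('new_bulk_file.json', 'w') as outfile:
    for line in infile:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        action = {"index": {"_index": "fluproj", "_type": "tweets",
                            "_id": tweet["id"]}}
        outfile.write(json.dumps(action) + '\n')
        outfile.write(json.dumps(tweet) + '\n')
```

The result can then be sent with `curl -XPOST "http://localhost:9200/_bulk" --data-binary @new_bulk_file.json`; the index is already named on each action line, so the bare _bulk endpoint works.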


#21

Great reference; I've been using it. I ordered another one on the same subject that's not yet released (Dec 8, I'm told by Amazon), but I think it is aimed at developers.
We need one for real beginners and non-developers, who may be data scientists or data enthusiasts.
Analogy: teaching what a driver can do on the freeway, race course, or obstacle course (a lot of spectacular things) versus teaching how to get onto the freeway / race course / obstacle course in the first place.
Thanks for your patience. If you have other books, I'm certain to get them.