Raw Twitter river date format


(Chinch Pokli) #1

Hi,

I am trying to index raw Twitter data. Twitter returns dates in the format "Sun Sep 13 19:12:56 +0000 2015", so I PUT a mapping for my index that declares the field as a date with the format "EEE MMM dd HH:mm:ss +ZZZZZ yyyy". But I am getting this error:

at org.elasticsearch.index.mapper.core.DateFieldMapper.parseStringValue(DateFieldMapper.java:621)
        at org.elasticsearch.index.mapper.core.DateFieldMapper.innerParseCreateField(DateFieldMapper.java:549)
        at org.elasticsearch.index.mapper.core.NumberFieldMapper.parseCreateField(NumberFieldMapper.java:235)
        at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:406)
        ... 12 more
Caused by: java.lang.IllegalArgumentException: Invalid format: "Sun Sep 13 19:12:56 +0000 2015" is malformed at "0000 2015"
        at org.elasticsearch.common.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:754)
        at org.elasticsearch.index.mapper.core.DateFieldMapper.parseStringValue(DateFieldMapper.java:615)
        ... 15 more
[2015-09-13 15:12:56,866][WARN ][river.twitter            ] There was failures while executing bulk

I also tried "EEE MMM dd HH:mm:ss ZZZZZ yyyy", "date", "dateOptionalTime" but nothing works.

Been stuck on this for a while now! I'd appreciate any help.

Thanks!


(Mark Walkom) #2

Don't use rivers, as they will no longer be supported in the very near future. Switch to Logstash with the twitter input instead.


(Chinch Pokli) #3

Regardless of whether I use the river or Logstash, the question is still the same: what should the mapping be to successfully ingest the date returned by Twitter? With the truncated data, which is the default, this is taken care of. However, when I want the entire stream (the "raw" or "full_tweet" option), the date is ingested as a string.


(Mark Harwood) #4

I successfully used the river before and it uses this elasticsearch mapping:

{
  "my_twitter_river": {
    "mappings": {
      "status": {
        "properties": {
          "created_at": {
            "type": "date",
            "format": "dateOptionalTime"
          }....

The river uses Twitter4J.
Twitter4J returns a java.util.Date [1] so perhaps the question should be more about how Twitter4J parses whatever Twitter provides in its API?

[1] https://github.com/elastic/elasticsearch-river-twitter/blob/master/src/main/java/org/elasticsearch/river/twitter/TwitterRiver.java#L640


(Chinch Pokli) #5

Thanks Mark. I tried "dateOptionalTime", "date_optional_time", "EEE MMM dd HH:mm:ss ZZZZZ yyyy", "EEE MMM dd HH:mm:ss +ZZZZ yyyy", "date" but nothing works!!

I also looked at the code for Twitter4J, and it seems they use "EEE MMM dd HH:mm:ss z yyyy", but even that didn't work!

I've been trying all these for 2 days now but nothing has worked so far.


(Mark Harwood) #6

I think the difference is that Twitter4J uses Java's built-in date parsing while we use Joda, which handles timezones differently [1].
If you work with raw Twitter records instead of going through the Twitter river/Twitter4J/Java date parsing call stack, you run into this discrepancy over timezones.

[1] http://stackoverflow.com/questions/4498274/why-joda-datetimeformatter-cannot-parse-timezone-names-z
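
One way to sidestep the parser discrepancy described above is to normalize the date on the client before the document reaches Elasticsearch, so the plain "dateOptionalTime" mapping from post #4 can be kept. The following is a minimal sketch of that idea, not something proposed in this thread. The helper names `twitter_date_to_iso` and `normalize_tweet` are hypothetical, and the code assumes an English (C) locale for the day and month names.

```python
from datetime import datetime, timezone

# Layout of Twitter's raw created_at value, e.g. "Sun Sep 13 19:12:56 +0000 2015".
# %a and %b are locale-dependent; an English (C) locale is assumed here.
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def twitter_date_to_iso(created_at: str) -> str:
    """Convert a raw Twitter date string to an ISO 8601 string in UTC."""
    parsed = datetime.strptime(created_at, TWITTER_FORMAT)
    return parsed.astimezone(timezone.utc).isoformat()


def normalize_tweet(tweet: dict) -> dict:
    """Return a shallow copy of the tweet with a top-level created_at in ISO 8601.

    Hypothetical helper: nested dates (for example inside "user") are not touched.
    """
    normalized = dict(tweet)
    if "created_at" in normalized:
        normalized["created_at"] = twitter_date_to_iso(normalized["created_at"])
    return normalized


if __name__ == "__main__":
    print(twitter_date_to_iso("Sun Sep 13 19:12:56 +0000 2015"))
```

The raw stream carries more than one date field, so a real ingest step would probably need to walk nested objects as well. This sketch only handles the top-level field.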


(Chinch Pokli) #7

OK, so it seems to be indexing now after I set the format in the mapping to "EEE MMM dd HH:mm:ss +0000 yyyy".
One question I have now: will I need to search in the same format? That would be useless, right?
I tried searching for "2015-09-14" in the field, but it gave me an error:

"error": "SearchPhaseExecutionException[Failed to execute phase [query], all shards failed; shardFailures {[5mltrJcdSMmx-qQon8_m7A][twittertool_untruncated][0]: SearchParseException[[twittertool_untruncated][0]: from[-1],size[-1]: 
Parse Failure [Failed to parse source [{\n  \"query\": {\n    \"match\": {\n      \"created_at\": \"2015-09-14\"\n    }\n  }\n}\n]]]; nested: ElasticsearchParseException[failed to parse date field [2015-09-14], tried both date format [EEE MMM dd HH:mm:ss +0000 yyyy], and timestamp number]; nested: IllegalArgumentException[Invalid format: \"2015-09-14\"]; }
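
The error above shows that the query value is parsed with the only format the mapping declares. Elasticsearch date mappings accept several formats separated by "||", so one possible fix is to list an ISO-style format next to the Twitter one. The sketch below builds such a mapping body. It is untested against the cluster in this thread and rests on two assumptions: that a single "Z" in a Joda pattern matches a numeric offset such as "+0000", and that the mapping type is called "status" as in post #4. The function name `build_created_at_mapping` is hypothetical.

```python
import json

# Joda-style pattern for "Sun Sep 13 19:12:56 +0000 2015".
# Assumption: a single "Z" matches a numeric offset without a colon.
TWITTER_JODA_FORMAT = "EEE MMM dd HH:mm:ss Z yyyy"


def build_created_at_mapping(type_name: str = "status") -> dict:
    """Build a mapping body whose created_at field accepts two date formats.

    The type name defaults to "status" (taken from post #4); adjust it to
    whatever type the index actually uses.
    """
    return {
        type_name: {
            "properties": {
                "created_at": {
                    "type": "date",
                    # "||" separates alternative formats for the same field.
                    "format": TWITTER_JODA_FORMAT + "||dateOptionalTime",
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent in a PUT mapping request.
    print(json.dumps(build_created_at_mapping(), indent=2))
```

With both formats declared, a query value such as "2015-09-14" should be tried against "dateOptionalTime" when the Twitter pattern does not match. The format of an existing field generally cannot be changed in place, so this would likely mean creating a new index and reindexing.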

#8

I have the same problem as you and nothing works. I tried the mapping you posted, but it does not work either. Can you post your entire mapping so I can check that we are doing the same thing?
Thanks

