Output to indices based on field in message


(Lm Yj Q) #1

I have some log files with names like access_20170427.log, and their content looks like this:
[{user_id:123, act:play, timestamp:2017-04-23 08:00:00},{user_id:456, act:play, timestamp:2017-04-23 08:00:00}]
I want to put the records into different Elasticsearch indices based on the value of user_id.
The index names would be like user_123, user_456, user_789.
I have two questions:

  1. How can I put records into the right index based on the value of user_id? I guess adding a field and using an if clause should work, but I'm really not familiar with the syntax.
  2. My set of user_id values is large, in the millions. Is it bad design to have that many indices? Would indexing by act be better, with index names like play, like, share, fav?

Thank you.


(Aaron Mildenstein) #2
index => "%{user_id}-%{+YYYY.MM.dd}"

This would make daily indices named after user_id. More info on sprintf format in the documentation.
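
As a minimal sketch of where that setting goes, the line above would sit inside an elasticsearch output block. The hosts value is an assumption for a local test node. The conditional and the fallback index name are also assumptions added for illustration: when an event has no user_id field, the sprintf reference is left as the literal text %{user_id} in the index name, so such events are routed elsewhere here.

```
output {
  if [user_id] {
    elasticsearch {
      hosts => ["localhost:9200"]            # assumption: local test node
      index => "%{user_id}-%{+YYYY.MM.dd}"   # one index per user_id per day
    }
  } else {
    elasticsearch {
      hosts => ["localhost:9200"]            # assumption: local test node
      index => "unparsed-%{+YYYY.MM.dd}"     # hypothetical fallback index name
    }
  }
}
```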

Having explained how to do it, I will now tell you not to do anything that creates millions of unique indices with few records in them. That's a fast track to watching your Elasticsearch cluster fail.

If the records are all similar, and there is no security requirement to isolate them, why separate them at all? Why not just have daily or weekly indices with all of the similar records in them? You can use the new _rollover API (also available in Curator) to roll over indices when they reach a certain age and/or have a set number of documents in them.
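
A sketch of that alternative, assuming a local test node and a hypothetical index prefix of "access": every record goes into one daily index, and user_id and act stay ordinary fields that can be filtered on at query time.

```
output {
  elasticsearch {
    hosts => ["localhost:9200"]          # assumption: local test node
    index => "access-%{+YYYY.MM.dd}"     # hypothetical prefix; one index per day for all users
  }
}
```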


(Lm Yj Q) #3

Thank you Aaron.
For my first question: before I can use that index setting, all of the log content is in a single message, and I'm not sure how to parse it yet.
I think I could use the grok filter to do something like this, with one of the 120 grok patterns:
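filter {
  grok {
    match => { "message" => "%{USERNAME:user_id}" }
    # Write the prefixed name to a separate field; add_field on the
    # existing user_id field would append a second value to it.
    add_field => { "user_index" => "user_%{user_id}" }
  }
}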
But in fact, several fields have a similar pattern, for example {user_id1:123456, user_id2:456123, video_id:123456, act:play}, which means user1 plays user2's video.
Now my input is JSON from Redis, like this:

2017-04-23T16:20:31+08:00 cv.product.access.mobile {"race":"album","video_id":43633036,"ip":"117.177.78.48","cdn":"cdn-web-qn.colorv.cn","act":"update","ad_type":"AdExchange","agent":"Mozilla/5.0 (Linux; Android 5.1.1; vivo X6SPlus D Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043128 Safari/537.36 MicroMessenger/6.5.7.1041 NetType/WIFI Language/zh_CN","author_zone":0,"author_registered_at":"2015-08-02 13:06:05","post_id":0,"published_at":"","referer":"","reference_id":0,"sessid":"7e620e610447414bafe5091470f8b0b2","duration":144,"author_is_priest":0,"download_type":"myapp","author_udid":"d850a4a042d6382","status_404":"","author_version":"and-3.6.13-gdt","mold_id":10006,"url":"http://video.colorv.cn/play/43633036?from=timeline&isappinstalled=0&from=share","author_os":"and","page_kind":"mini","request_id":"ff6925b4c86b4d34be534a6609edfa2d","referrer_id":"","author_id":3934438,"play_time":60,"method":"GET","published":0}

How can I distinguish them and extract each key into a field?

And for my second question,

  1. We will have two applications: one for BI in Kibana, the other to build a recommender system.
  2. Our log data for each user will be large. Regarding "millions of unique indices with few records in them": I mean that we log every click and request in our app (for making and sharing short videos).

So our plan is to create an index for each user and each video, and to use the timestamp field as a filter when querying in BI or collecting training samples for the recommender system.
You mean this plan will perform badly. But if we index by date, we will have many users, videos, and acts in a single date index. Will it be hard for ES to find the records we need?
So what is the trade-off here, and how should we decide the index structure for our problem?


(Aaron Mildenstein) #4

Parsing the message should be in its own topic. In fact, there are a lot of grok help threads that may already answer this question for you.
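
As a rough starting point only, assuming each line has the shape of the sample posted above (an ISO 8601 timestamp, a tag, then a JSON object), grok can split off the two leading tokens and the json filter can expand the rest into fields. The field names log_time, log_tag, and payload are made up for illustration.

```
filter {
  grok {
    # Split "<timestamp> <tag> <json>" into three fields (hypothetical field names).
    match => { "message" => "^%{TIMESTAMP_ISO8601:log_time} %{NOTSPACE:log_tag} %{GREEDYDATA:payload}$" }
  }
  json {
    # Expand the JSON object into top-level event fields.
    source => "payload"
    # Drop the raw JSON string once it has been parsed successfully.
    remove_field => ["payload"]
  }
  date {
    # Use the timestamp from the log line as the event's @timestamp.
    match => ["log_time", "ISO8601"]
  }
}
```

With a filter along these lines, keys such as author_id, video_id, and act become fields of their own, so they can be referenced as %{act} or [author_id] elsewhere in the pipeline.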

As stated, one index per user or video is a Bad Idea™ if you have millions of unique user_id values. That's just too many indices to be able to manage.

No, it isn't hard. Elasticsearch can find a record in a date-based index very easily, and within milliseconds.

The only real trade-off here is one that must be made. You simply cannot architect your system the way you were originally planning and have it be functional.

How to decide on the index structure is the million dollar question, now, isn't it? This is also not related to the original topic, so I suggest you start a thread in the Elasticsearch forum area here. Many people there have already received good answers about mapping their data properly (which is what you just asked for here).


(Lm Yj Q) #6

That's really helpful, thanks a lot!

