Output to indices based on field in message

LmYjQ · April 26, 2017, 2:39pm

I have some log files with names like access_20170427.log, and their content looks like this:
[{user_id:123, act:play, timestamp:2017-04-23 08:00:00},{user_id:456, act:play, timestamp:2017-04-23 08:00:00}]
I want to put the records into different indices in Elasticsearch based on the value of user_id.
The index names will be like user_123, user_456, user_789.
I have two questions:

1. How can I put records into the right index based on the value of user_id? I guess adding a field and using an if clause should work, but I'm really not familiar with the syntax.
2. My set of user_id values is large, in the millions. Is it bad design to have that many indices? Would indexing by act be better, with index names like play, like, share, fav?

Thank you.

theuntergeek · April 26, 2017, 2:58pm

index => "%{user_id}-%{+YYYY.MM.dd}"

This would make daily indices named after user_id. More info on sprintf format in the documentation.
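As a minimal sketch of how that setting could sit inside a full output block: the hosts value, the user_ prefix (taken from the index names asked for above), and the surrounding conditional are assumptions for illustration, not part of the original answer.

```
output {
  # Only route to a per-user index when the event actually has a user_id;
  # an unresolved reference would leave the literal text "%{user_id}" in the index name.
  if [user_id] {
    elasticsearch {
      hosts => ["localhost:9200"]                # assumed cluster address
      index => "user_%{user_id}-%{+YYYY.MM.dd}"  # date part comes from the event's @timestamp
    }
  }
}
```

Index names in Elasticsearch must be lowercase, so this only works as written when the field value contains no uppercase characters.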

After explaining how to do it, now I will tell you to not do something that will create millions of unique indices with few records in them. That's a fast track to watching your Elasticsearch cluster fail.

If the records are all similar, and there is no security requirement to isolate them, why separate them at all? Why not just have daily or weekly indices with all of the similar records in them? You can use the new _rollover API (also available in Curator) to roll over indices when they reach a certain age and/or have a set number of documents in them.
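A sketch of the daily-index alternative, assuming a single cluster at localhost:9200 and a hypothetical index prefix of access (neither comes from the thread):

```
output {
  elasticsearch {
    hosts => ["localhost:9200"]       # assumed cluster address
    index => "access-%{+YYYY.MM.dd}"  # hypothetical prefix; one index per day for all users
  }
}
```

With this layout user_id stays an ordinary field on every document, so looking at one user becomes a filter on that field rather than a choice of index.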

LmYjQ · April 27, 2017, 3:20am

Thank you Aaron.
For my first question: before I can use this, all of the log content is in a single message field, and I'm not sure how to parse it yet.
I think I could use the grok filter to do something like this, with one of the 120 grok patterns:
filter {
  grok {
    match => { "message" => "%{USERNAME:user_id}" }
    add_field => { "user_id" => "user_%{user_id}" }
  }
}
But in fact, several fields have a similar pattern. For example, {user_id1:123456, user_id2:456123, video_id:123456, act:play} means that user1 plays user2's video.
My input is JSON from Redis, like this:

2017-04-23T16:20:31+08:00 cv.product.access.mobile {"race":"album","video_id":43633036,"ip":"117.177.78.48","cdn":"cdn-web-qn.colorv.cn","act":"update","ad_type":"AdExchange","agent":"Mozilla/5.0 (Linux; Android 5.1.1; vivo X6SPlus D Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043128 Safari/537.36 MicroMessenger/6.5.7.1041 NetType/WIFI Language/zh_CN","author_zone":0,"author_registered_at":"2015-08-02 13:06:05","post_id":0,"published_at":"","referer":"","reference_id":0,"sessid":"7e620e610447414bafe5091470f8b0b2","duration":144,"author_is_priest":0,"download_type":"myapp","author_udid":"d850a4a042d6382","status_404":"","author_version":"and-3.6.13-gdt","mold_id":10006,"url":"http://video.colorv.cn/play/43633036?from=timeline&isappinstalled=0&from=share","author_os":"and","page_kind":"mini","request_id":"ff6925b4c86b4d34be534a6609edfa2d","referrer_id":"","author_id":3934438,"play_time":60,"method":"GET","published":0}

How can I distinguish them and extract each key into a field?

And for my second question,

We will have two applications: one for BI in Kibana, the other to build a recommender system.
You mentioned "millions of unique indices with few records in them", but our log data for each user will be large. I mean every click and request in our app (for making and sharing short videos).

So our plan is to create an index for each user and each video, and to use the timestamp field as a filter when querying in BI or collecting training samples for the recommender system.
You mean this plan will perform badly, but if we index by date we will have many users, videos, and acts in a single "date index". Is it hard for ES to find the record we need?
So what is the trade-off here, and how should we decide the index structure for our problem?

theuntergeek · April 27, 2017, 4:10am

The parsing question should be in its own topic. In fact, there are a lot of grok help threads that may already answer it for you.
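For the sample Redis line earlier in the thread, one possible approach is sketched here. It assumes the whole line (timestamp, tag, then JSON) arrives in the message field; the field names log_time, log_tag, and payload are hypothetical names chosen for illustration.

```
filter {
  # Split the line into its three space-separated parts:
  # an ISO8601 timestamp, a tag, and the JSON document.
  grok {
    match => { "message" => "^%{TIMESTAMP_ISO8601:log_time} %{NOTSPACE:log_tag} %{GREEDYDATA:payload}$" }
  }
  # Parse the JSON text; each key (author_id, video_id, act, ...) becomes
  # its own top-level field on the event.
  json {
    source => "payload"
    remove_field => ["payload"]  # dropped only if parsing succeeds
  }
}
```

Because the json filter keeps the original key names, fields that share a value pattern, such as author_id and video_id, stay distinct without needing a separate grok pattern for each.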

As stated, an index per user and per video is a Bad Idea™ if you have millions of unique values. That's just too many indices to be able to manage.

No, it isn't hard for Elasticsearch to find a record in a date-based index. It can find it very easily, and within milliseconds.

The only real trade-off here is one that must be made. You simply cannot architect your system the way you were originally planning and have it be functional.

How to decide the index structure is the million dollar question, now, isn't it? This is also not related to the original topic, so I suggest you start a thread in the Elasticsearch forum area here. There have been many who have already received good answers about mapping their data properly (which is what you just asked for here).

LmYjQ · April 27, 2017, 5:34am

That's really helpful, thanks a lot!

system · May 25, 2017, 5:48am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
