Parsing non-repetitive text to form structured data

Hello. May I ask what is the best way to parse unstructured, non-repetitive text files in Logstash? I have many text files that I need to process (extracting just small parts of the messages), and I have put an example of one of the text files below. The problem is that these text files do not contain the same information format. The information is random and I cannot predict where the information I need will appear.

Hey.
Can you please give the expected output with an example of your data?
What is the format of your data? JSON, CSV, raw text, CEF, etc?
Is your data new line delimited?
Is there any logical way to accomplish the separation you want using regular expressions?


My data is in the form of raw text. For output I would want JSON or any other format that I can feed into HDFS. My plan is to extract features, merge those features with the labels that I already have, and construct a machine learning model. That is why I would like to push the data into HDFS and use Spark on it afterwards.

Here is an example of my data:

0 2016/11/17 09:29:27.710192 6.9105 0 ECU1 DLTD INTM 2033 log info verbose 1 Daemon launched. Starting to output traces...

20 2016/11/17 09:29:27.765977 1.3535 0 ECU1 ADB 750 549 log info verbose 2 0300 761 avCoreProfiler: time=1.353s, CPU=14+8ms 1872kb 6+2cs main

43 2016/11/17 09:29:27.766710 1.3786 9 ECU1 ADB 761 549 log info verbose 2 a200 av_adsp_IDiranaComFactory_createInstanceCom (DIRANA3_HIFI2><SPI) callCounter=2
44 2016/11/17 09:29:27.766722 1.3788 10 ECU1 ADB 761 549 log info verbose 2 a200 av_adsp_IDiranaComFactory_createInstanceCom: CSpiDevice(DIRANA3_HIFI2)=0x28f178 created, R=0x27a350, W=0x28f17c

45 2016/11/17 09:29:27.766735 1.3796 11 ECU1 ADB 761 549 log info verbose 2 a800 CSpiAccess::openDevice: DIRANA3_HIFI2, fd=6, dev=spidev2.0, devPath=/dev/spidev2.0

I decided to give many lines of the text:

99 2016/11/17 09:29:27.768430 1.5350 47 ECU1 ADB 761 549 log info verbose 2 a600 CPinControllerGpio::run - Interrupt occurred for GPIO(203) - val=high --> Inform Observer(0x27a6d4)!
100 2016/11/17 09:29:27.768714 1.5363 48 ECU1 ADB 761 549 log info verbose 2 a600 CPinControllerGpio::run - Interrupt occurred for GPIO(203) - val=high --> Inform Observer(0x27a6d4)!
101 2016/11/17 09:29:27.768745 1.5375 49 ECU1 ADB 761 549 log info verbose 2 a600 CPinControllerGpio::run - Interrupt occurred for GPIO(203) - val=high --> Inform Observer(0x27a6d4)!
102 2016/11/17 09:29:27.768759 1.5388 50 ECU1 ADB 761 549 log info verbose 2 a600 CPinControllerGpio::run - Interrupt occurred for GPIO(203) - val=high --> Inform Observer(0x27a6d4)!
103 2016/11/17 09:29:27.769140 1.5390 10 ECU1 ADB 750 549 log info verbose 2 0300 755 avCoreProfiler: time=1.538s, CPU=24+38ms 2144kb 94+8cs start - loading patches...
104 2016/11/17 09:29:27.769158 1.5395 51 ECU1 ADB 761 549 log info verbose 2 a600 CPinControllerGpio::run - Interrupt occurred for GPIO(203) - val=high --> Inform Observer(0x27a6d4)!
105 2016/11/17 09:29:27.769179 1.7138 52 ECU1 ADB 761 549 log info verbose 2 a600 CPinControllerGpio::run - Interrupt occurred for GPIO(203) - val=high --> Inform Observer(0x27a6d4)!
106 2016/11/17 09:29:27.769196 1.7927 11 ECU1 ADB 750 549 log info verbose 2 0300 755 avCoreProfiler: time=1.792s, CPU=26+60ms 2144kb 531+9cs end - loading patches
107 2016/11/17 09:29:27.769219 1.7927 12 ECU1 ADB 750 549 log info verbose 2 0300 755 avCoreProfiler: time=1.792s, CPU=26+62ms 2144kb 531+9cs start - image loading...
108 2016/11/17 09:29:27.769237 1.7928 53 ECU1 ADB 761 549 log info verbose 2 a600 CPinControllerGpio::monitoring_start_gpio - Bootstate observation-Thread still exists!
109 2016/11/17 09:29:27.769251 1.7928 3 ECU1 ADB 755 549 log info verbose 2 0300 cLoad: prepare load cmd EPIC:05h offset:00h size:F004h
110 2016/11/17 09:29:27.769264 1.7930 54 ECU1 ADB 761 549 log info verbose 2 a800 CSpiAccess::spi_write: (7/7) Bytes written
111 2016/11/17 09:29:27.769276 1.7930 55 ECU1 ADB 761 549 log info verbose 2 a800 CSpiDevice::write: written 7 bytes
112 2016/11/17 09:29:27.769312 1.7934 56 ECU1 ADB 761 549 log info verbose 2 a600 CPinControllerGpio::run - Interrupt occurred for GPIO(203) - val=high --> Inform Observer(0x27a5c4)!
113 2016/11/17 09:29:27.769330 1.7935 4 ECU1 ADB 755 549 log info verbose 2 0300 cLoad: load pBinary:352200DCh Size:61444
114 2016/11/17 09:29:27.769344 1.8956 57 ECU1 ADB 761 549 log info verbose 2 a800 CSpiAccess::spi_write: (61444/61444) Bytes written
115 2016/11/17 09:29:27.769356 1.8956 58 ECU1 ADB 761 549 log info verbose 2 a800 CSpiDevice::write: written 61444 bytes
116 2016/11/17 09:29:27.769368 1.8976 59 ECU1 ADB 761 549 log info verbose 2 a600 CPinControllerGpio::run - Interrupt occurred for GPIO(203) - val=high --> Inform Observer(0x27a5c4)!
117 2016/11/17 09:29:27.769383 1.8977 5 ECU1 ADB 755 549 log info verbose 2 0300 cLoad: prepare load cmd EPIC:05h offset:0Fh size:1004h
118 2016/11/17 09:29:27.769396 1.8979 60 ECU1 ADB 761 549 log info verbose 2 a800 CSpiAccess::spi_write: (7/7) Bytes written
119 2016/11/17 09:29:27.769408 1.8979 61 ECU1 ADB 761 549 log info verbose 2 a800 CSpiDevice::write: written 7 bytes

Yes, my data is newline delimited. Each line of my data starts with an index followed by a date and three other things that are uniform in each line, but I do not care about these; what I care about is the last portion at the end of each line. Yet this last portion is big and has no pattern whatsoever. It also differs from text to text.

Yes, I can write regular expressions to extract the information I want. But after I get that tiny part of information, I do not know what to do with the rest of the data.

You need to use grok in a filter section.

An example:

108 2016/11/17 09:29:27.769237 1.7928 53 ECU1 ADB 761 549 log info verbose 2 a600 CPinControllerGpio::monitoring_start_gpio - Bootstate observation-Thread still exists!

^%{DATA:line_number} %{DATA:year}/%{DATA:month}/%{DATA:day} %{DATA:hmsss} %{GREEDYDATA:rest}$

{
"rest": "1.7928 53 ECU1 ADB 761 549 log info verbose 2 a600 CPinControllerGpio::monitoring_start_gpio - Bootstate observation-Thread still exists!",
"hmsss": "09:29:27.769237",
"month": "11",
"year": "2016",
"line_number": "108",
"day": "17"
}

Your log lines are not too complex to write a generic grok pattern for in a reasonable time.
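As a sketch, the pattern above could be dropped into a filter section like this (field names are taken from the example output above; adjust them to taste):

```conf
filter {
  grok {
    # Anchored pattern: index, date, time, then everything else as "rest"
    match => { "message" => "^%{DATA:line_number} %{DATA:year}/%{DATA:month}/%{DATA:day} %{DATA:hmsss} %{GREEDYDATA:rest}$" }
  }
}
```

You can then apply further grok or mutate filters on the `rest` field to extract the pieces you care about.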

OK, thank you. But it brings me to another question: how exactly does the program you made for me work?

I read the link you sent me, and what I understood is that Logstash needs to be told what to do with each chunk of the data, from the beginning of the line to the very end. Also, I cannot predict on which line my data will be found.

If I knew the algorithm here and what it does to the rest of the data, that would be very helpful.

Logstash is divided into three sections:

  • input - where you load your data into memory
  • filter - where you do the processing, using multiple threads
  • output - where you send the processed data to the destination you want; also multithreaded

Once your log line reaches the filter section, it is processed by your grok filter patterns and all the other plugins. Data from input to filter is passed as a "message". The "message" is the line you want to match. You feed that line into the grok expression as your input and apply your regular expressions. If the regex matches, the "message" is split into the new fields you have selected to map your data to.
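Putting the three sections together, a minimal end-to-end pipeline might look like the sketch below. The file paths are hypothetical placeholders; since you want to feed HDFS, one simple option is writing JSON lines to a file and ingesting that separately:

```conf
input {
  file {
    path => "/var/logs/ecu/*.txt"      # hypothetical input path
    start_position => "beginning"
  }
}

filter {
  grok {
    match => { "message" => "^%{DATA:line_number} %{DATA:year}/%{DATA:month}/%{DATA:day} %{DATA:hmsss} %{GREEDYDATA:rest}$" }
  }
}

output {
  # Write each event as one JSON object per line, ready for later HDFS ingestion
  file {
    path => "/var/output/parsed.json"  # hypothetical output path
    codec => "json_lines"
  }
}
```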

More info: https://logz.io/blog/logstash-grok/
or just google grok tutorials.

The most important thing you need to know is that you should anchor your regular expressions with `^` and `$`, and use the `break_on_match` option in the grok config, for better processing performance and overall safety of your node.
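For instance, a grok block with anchored patterns and `break_on_match` set explicitly might look like this sketch (the two patterns are illustrative; `break_on_match` defaults to true and stops at the first pattern that matches):

```conf
filter {
  grok {
    break_on_match => true   # stop after the first matching pattern
    match => { "message" => [
      "^%{INT:line_number} %{GREEDYDATA:rest}$",   # preferred, more specific pattern
      "^%{GREEDYDATA:rest}$"                        # fallback catch-all
    ] }
  }
}
```

Anchoring with `^` and `$` keeps a non-matching line from forcing the regex engine to retry the pattern at every position in the string.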


OK, thank you so much.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.