Ingesting & indexing text files while keeping formatting

Greetings - I have a use case where

  • We have plain text files with very specific formatting
  • These files are in a directory
  • I want to do a search in ES (or another front end) for a specific ID in those files
  • I want to return the actual file - or the exact formatting of the text

i.e.
we have several million text files, all with the same formatting, which look basically like a fancy invoice (but plain text)
A user wants to search for a string and get back the text file

Can Logstash or ES do this?
I have ES, LS & Kibana 5.2 on a 4-node physical cluster

Thank you

I don't think Elasticsearch will preserve formatting of plain text. However, if those files were PDF, then the ingest attachment plugin would, I believe, allow you to do what you desire.
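
For reference, wiring up the ingest attachment plugin looks roughly like this (the plugin has to be installed first; the pipeline name and the "data" field name here are just examples, untested):

    curl -XPUT "http://<host>:9200/_ingest/pipeline/attachment" -d @- <<EOF
    {
      "description" : "Extract text from base64-encoded files",
      "processors" : [
        { "attachment" : { "field" : "data" } }
      ]
    }
    EOF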

Yes, I saw that. It seems the file actually has to be in binary format, though. I'm wondering if I can convert them to PDF... I'll check on that. I appreciate the reply.

Possibly someone else has an idea for this?

In all honesty, there is no other way to preserve formatting in Elasticsearch. Whatever you send becomes the _source field, and it's a single field. Whether it preserves newlines, extra spaces, and tabs may depend on what tool sends it to Elasticsearch and what tool reads it back out. You should be able to convert to PDF at the command line, however. The problem with many command-line formatters is that they, too, may ignore the custom formatting. There may be a lot of trial and error involved.
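
For example, one route that might work, assuming enscript and Ghostscript are installed (filenames are just illustrative):

    # enscript renders plain text to PostScript in a monospaced layout; -B drops page headers
    enscript -B -p invoice.ps invoice.txt
    # Ghostscript's ps2pdf then turns the PostScript into a PDF
    ps2pdf invoice.ps invoice.pdf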

Logstash will not preserve formatting.

How big are the files?

I mean, you can put an entire file into one field (you may have to tweak the field settings so it is fully searchable), and that data will be stored untouched, exactly as you insert it, as long as it is ASCII data.

If it is binary, you could Base64-encode the document, then create meta fields holding extracted data that represents what's in the file. But at that point you might as well keep your file on local storage and just provide an HTTP link to it; that way you keep your Elastic data cluster small.
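
A rough sketch of that link-only idea (index, type, field names, and URLs are all made up for illustration):

    curl -XPUT "http://<host>:9200/invoices/invoice/1" -d @- <<EOF
    {
      "account" : "1234567",
      "link" : "http://<fileserver>/invoice.txt"
    }
    EOF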

Thanks so much -

the file is 13k

The file copies and pastes easily from Windows to Linux and keeps all the formatting.
How can I tell what encoding the file is?
I just tried the hardcopy feature of vim, but the file wouldn't open in Windows - was just testing. I may have to try to bring it in with Logstash and the PDF/binary plugin, just an idea...

Not sure what you mean; if it is an ASCII file then you don't need to convert it to PDF. Since you want the whole file untouched, you just have to come up with a relatively simple shell script to insert it.
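
(On the encoding question: on Linux, the file command will usually tell you.)

    file -i invoice.txt
    # typically prints something like: invoice.txt: text/plain; charset=us-ascii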

Mind you, since I am not sure of all that you're trying to accomplish, you're going to have to figure out what is best for you. Elastic is a search engine, not a CMS. While it could be used to do that, you may want to review what you're actually trying to do and match the technologies.

I mean, https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
#!/bin/bash
filename="$1"
file=$(cat "$filename")              # note: raw content still needs JSON-escaping
date=$(date +"%Y-%m-%dT%T.%3N")
curl -XPOST "http://<host>:<port>/invoices/invoice" -d @- <<EOF
{
  "filename" : "$filename",
  "post_date" : "$date",
  "message" : "$file"
}
EOF

This is just quick pseudo code, I have not tested it, and you may have issues with a Bash variable not being able to hold 15K, but it should work without problems on smaller files. You will probably have to go to a programming language to do it better. Oh, and you will have to change the mapping on the "message" field so that more than the first 1K is searchable:

			"message" : {
			"type" : "string",
			"index" : "analyzed",
			"ignore_above": 32766
		}
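
If you create the index up front, that mapping could be applied something like this (index and type names assumed from the sketch above, untested):

    curl -XPUT "http://<host>:9200/invoices" -d @- <<EOF
    {
      "mappings" : {
        "invoice" : {
          "properties" : {
            "message" : { "type" : "string", "index" : "analyzed" }
          }
        }
      }
    }
    EOF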

but now that I think about it,

I really think you should change how you're thinking about this.

Instead of pulling in the whole file, I think you should just parse the file, create separate fields for all the invoice data, and then just have a URL for the user to download the original file.

So your metadata might look like this (just assuming, since you said it was invoice-type data):

    {
      "account" : "1234567",
      "fname" : "Ed",
      "lname" : "Perry",
      "balance" : 12.50,
      "30day" : 1.00,
      "90day" : 10.00,
      "interest" : 1.50,
      "lineitems" : [ "book 123432", "shoes 1234444" ],
      "invoice" : "http://<blah>/invoice.txt"
    }

I appreciate the idea. I have to step away for several hours. I'll be back to working on this soon.

More info - there are about 200+ fields in this file. It's not all visually sequential - some of it is sequential, some linear.

something like

line 1 - city state zip
blank crlf
line 2 fname lname mname

line 3 loc1 loc2 loc3 loc4 ....
...
line 7 id1 id2 id3 id4 ....
...

blank crlf
line 200 history1 history2 history3 ...

So the metadata is complex - but visually pleasant. Probably hard to work with

I'm thinking converting to PDF may be better due to the complexity - I'm guessing...

200 fields is not really that complex, but since I don't know your goal, converting to PDF might be just as good. Don't know.

I don't know how Logstash could map all the fields, horizontal & vertical, whereas I believe LS would be able to build an index on the PDF file itself rather than converting and mapping fields. There is a new attachment plugin that can index binary files... at least I hope it works that way.
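
From what I read in the docs, you'd base64 the file and push it through a pipeline like the one above, something like this (untested; names assumed from the earlier examples):

    b64=$(base64 -w 0 invoice.pdf)
    curl -XPUT "http://<host>:9200/invoices/invoice/1?pipeline=attachment" -d @- <<EOF
    { "data" : "$b64" }
    EOF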

My data in the file is like a checkerboard: some blocks have the reference, like a name, with the value below it; some have the reference on the left and the value on the right. I think it would be hard to parse... but I'm not that skilled :slight_smile:

Re-reading your initial question, display and search would be two different ideas.

Elasticsearch would have no problem holding the data and making it searchable. Getting it in there can be done with Logstash, or even just simple CLI commands (or any language you like), parsed or unparsed.
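
For example, just looping the earlier curl sketch over the directory (assuming it's saved as index_invoice.sh, a hypothetical name):

    # path is just an example
    for f in /path/to/invoices/*.txt; do
      bash index_invoice.sh "$f"
    done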

Display would not be available right out of the box, meaning you would have to write a plugin for Kibana, or something in HTML/JavaScript/PHP/language of choice, to display the data in whatever format you wanted. Kibana mostly deals with key=value pairs and aggregation of data.

While we could make this work, the end result would most likely be something custom.

Now thinking outside the box,

You may want to look at a wiki, maybe Confluence, MediaWiki, or really any wiki with search capability (some do plug into Elasticsearch). There you would make each invoice a different wiki page (or multiple on the same page for the same person), and when you search for something it would give you your page options. Seems like what you're actually looking for.

Heck, you could just upload them to a private GitHub or Bitbucket repo; that would give you everything with very little work: search, file storage, and display of the data in a raw format. It would not be as pretty as a wiki, but it would be secure, with a web GUI for people to use.

IDK, hope this helps.

I think the wiki idea is excellent and would save all kinds of coding and massaging, and it has a built-in search - very good

3 security concerns

  • Needs read-only access - no one can change files
  • Need authentication to access the site
  • Ideally would need some kind of auditing

But this would be an internal website.
I'm thinking the files get loaded into a directory and we use the built-in search of the wiki software - or do I need something else?

I'm going to do some googling on this. I really appreciate your ideas...

MediaWiki could work well enough for your needs as I understand them. It is PHP-based, with MySQL/Postgres.

As for loading the files: they would have to be loaded into the database, at least with the wikis I use; it's the only way they can be searched. But there should be an API to auto-create the different pages.
It has user levels, so you can not only have login but also, I believe, levels of security.
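
For example, MediaWiki's api.php can create pages from a script. Very roughly (a real setup needs a login and CSRF token first; the anonymous "+\" token shown here only works if the wiki allows anonymous edits, and the page title is just made up):

    curl -XPOST "http://<wiki>/api.php" \
      --data-urlencode "action=edit" \
      --data-urlencode "title=Invoice:1234567" \
      --data-urlencode "text=$(cat invoice.txt)" \
      --data-urlencode "token=+\\" \
      --data-urlencode "format=json"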

Of course there are Git/SVN and other types of wikis that may fit your needs, but for search the files will need to be loaded into either a database or something like Elastic. There are none that I know of that are directory-based and provide that functionality, but who knows, someone might have created one.

Glad I could point you in the right direction; hit me up if you have more questions.

OK, thanks. I'm playing around with it now - PmWiki looks like it has a lot of features I need. I'm deploying that on top of a Tomcat web server. I'll keep you posted.
