Ingesting & indexing text files while keeping formatting

Greetings - I have a use case where

  • We have plain text files with very specific formatting
  • These files are in a directory
  • I want to do a search in ES (or another front end) for a specific ID in those files
  • I want to return the actual file - or the exact formatting of the text

i.e.
we have several million text files, all with the same formatting, which look basically like a fancy invoice (but plain text)
A user wants to search for a string and get back the text file

Can Logstash or ES do this?
I have ES, LS & Kibana 5.2 on a 4-node physical cluster

Thank you

I don't think Elasticsearch will preserve formatting of plain text. However, if those files were PDF, then the ingest attachment plugin would, I believe, allow you to do what you desire.
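
For reference, wiring up the ingest attachment plugin looks roughly like this (the plugin has to be installed first; the pipeline name and the "data" field name here are just examples, untested):

    curl -XPUT "http://<host>:9200/_ingest/pipeline/attachment" -d @- <<EOF
    {
      "description" : "Extract text from base64-encoded files",
      "processors" : [
        { "attachment" : { "field" : "data" } }
      ]
    }
    EOF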

Yes, I saw that. It seems the file actually has to be in binary format, though. I'm wondering if I can convert them to PDF... I'll check on that. I appreciate the reply.

Possibly someone else has an idea for this?

In all honesty, there is no other way to preserve formatting in Elasticsearch. Whatever you send becomes the _source field, and it's a single field. Whether it preserves newlines, extra spaces, and tabs may depend on what tool sends it to Elasticsearch and what tool reads it back out. You should be able to convert to PDF at the command line, however. The problem with many command-line formatters is that they, too, may ignore the custom formatting. There may be a lot of trial and error involved.
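
For example, one route that might work, assuming enscript and Ghostscript are installed (filenames are just illustrative):

    # enscript renders plain text to PostScript in a monospaced layout; -B drops page headers
    enscript -B -p invoice.ps invoice.txt
    # Ghostscript's ps2pdf then turns the PostScript into a PDF
    ps2pdf invoice.ps invoice.pdf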

Logstash will not preserve formatting.

How big are the files?

I mean, you can put an entire file into one field (you may have to tweak the field settings so it is fully searchable), and that data will be stored untouched, exactly as you insert it, as long as it is ASCII data.

If it is binary, you could Base64-encode the document, then create meta fields holding extracted data that represents what's in the file. But at that point you might as well keep your file on local storage and just provide an HTTP link to it; that way you keep your Elastic data cluster small.
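
A rough sketch of that link-only idea (index, type, field names, and URLs are all made up for illustration):

    curl -XPUT "http://<host>:9200/invoices/invoice/1" -d @- <<EOF
    {
      "account" : "1234567",
      "link" : "http://<fileserver>/invoice.txt"
    }
    EOF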

Thanks so much -

the file is 13k

The file copies and pastes easily from Windows to Linux and keeps all the formatting.
How can I tell what encoding the file is?
I just tried the hardcopy feature of vim, but the file wouldn't open in Windows - was just testing. I may have to try to bring it in with Logstash and the PDF/binary plugin, just an idea...

Not sure what you mean; if it is an ASCII file then you don't need to convert it to PDF. Since you want the whole file untouched, you just have to come up with a relatively simple shell script to insert it.
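
(On the encoding question: on Linux, the file command will usually tell you.)

    file -i invoice.txt
    # typically prints something like: invoice.txt: text/plain; charset=us-ascii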

Mind you, since I am not sure of all that you're trying to accomplish, you're going to have to figure out what is best for you. Elastic is a search engine, not a CMS. While it could be used to do that, you may want to review what you're actually trying to do and match the technologies.

I mean, https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
#!/bin/bash
filename="$1"
file=$(cat "$filename")              # note: raw content still needs JSON-escaping
date=$(date +"%Y-%m-%dT%T.%3N")
curl -XPOST "http://<host>:<port>/invoices/invoice" -d @- <<EOF
{
  "filename" : "$filename",
  "post_date" : "$date",
  "message" : "$file"
}
EOF

This is just quick pseudo code, I have not tested it, and you may have issues with a Bash variable not being able to hold 15K, but it should work without problems on smaller files. You will probably have to go to a programming language to do it better. Oh, and you will have to change the mapping on the "message" field so that more than the first 1K is searchable:

			"message" : {
			"type" : "string",
			"index" : "analyzed",
			"ignore_above": 32766
		}
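
If you create the index up front, that mapping could be applied something like this (index and type names assumed from the sketch above, untested):

    curl -XPUT "http://<host>:9200/invoices" -d @- <<EOF
    {
      "mappings" : {
        "invoice" : {
          "properties" : {
            "message" : { "type" : "string", "index" : "analyzed" }
          }
        }
      }
    }
    EOF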

but now that I think about it,

I really think you should change how you're thinking about this.

Instead of pulling in the whole file, I think you should just parse the file, create separate fields for all the invoice data, and then just have a URL for the user to download the original file.

So your metadata might look like this (just assuming, since you said it was invoice-type data):

    {
      "account" : "1234567",
      "fname" : "Ed",
      "lname" : "Perry",
      "balance" : 12.50,
      "30day" : 1.00,
      "90day" : 10.00,
      "interest" : 1.50,
      "lineitems" : [ "book 123432", "shoes 1234444" ],
      "invoice" : "http://<blah>/invoice.txt"
    }

I appreciate the idea. I have to step away for several hours. I'll be back to working on this soon.

More info - there are about 200+ fields in this file. It's not all visually sequential - some of it is sequential, some linear.

something like

line 1 - city state zip
blank crlf
line 2 fname lname mname

line 3 loc1 loc2 loc3 loc4 ....
...
line 7 id1 id2 id3 id4 ....
...

blank crlf
line 200 history1 history2 history3 ...

So the metadata is complex - but visually pleasant. Probably hard to work with

I'm thinking converting to PDF may be better due to the complexity - I'm guessing...

200 fields is not really that complex, but since I don't know your goal, converting to PDF might be just as good. Don't know.

I don't know how Logstash could map all the fields, horizontal & vertical, whereas I believe LS would be able to build an index on the PDF file itself rather than converting and mapping fields. There is a new attachment plugin that can index binary files... at least I hope it works that way.
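
From what I read in the docs, you'd base64 the file and push it through a pipeline like the one above, something like this (untested; names assumed from the earlier examples):

    b64=$(base64 -w 0 invoice.pdf)
    curl -XPUT "http://<host>:9200/invoices/invoice/1?pipeline=attachment" -d @- <<EOF
    { "data" : "$b64" }
    EOF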

My data in the file is like a checkerboard: some blocks have the reference, like a name, with the value below it; some have the reference on the left and the value on the right. I think it would be hard to parse... but I'm not that skilled :slight_smile:

Re-reading your initial question, display and search would be two different ideas.

Elasticsearch would have no problem holding the data and making it searchable. Getting it in there can be done with Logstash, or even just simple CLI commands (or any language you like), parsed or unparsed.
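
For example, just looping the earlier curl sketch over the directory (assuming it's saved as index_invoice.sh, a hypothetical name):

    # path is just an example
    for f in /path/to/invoices/*.txt; do
      bash index_invoice.sh "$f"
    done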

Display would not be available right out of the box, meaning you would have to write a plugin for Kibana, or something in HTML/JavaScript/PHP/language of choice, to display the data in whatever format you wanted. Kibana mostly deals with key=value pairs and aggregation of data.

While we could make this work, the end result would most likely be something custom.

Now thinking outside the box,

You may want to look at a wiki, maybe Confluence, MediaWiki, or really any wiki with search capability (some do plug into Elasticsearch). There you would make each invoice a different wiki page (or multiple on the same page for the same person), and when you search for something it would give you your page options. Seems like what you're actually looking for.

Heck, you could just upload them to a private GitHub or Bitbucket repo; that would give you everything with very little work: search, file storage, and display of the data in a raw format. It would not be as pretty as a wiki, but it would be secure, with a web GUI for people to use.

IDK, hope this helps.

I think the wiki idea is excellent and would save all kinds of coding and massaging, and it has a built-in search - very good

3 security concerns

  • Needs read-only access - no one can change files
  • Need authentication to access the site
  • Ideally would need some kind of auditing

But this would be an internal website.
I'm thinking the files get loaded into a directory and we use the built-in search of the wiki software - or do I need something else?

I'm going to do some googling on this. I really appreciate your ideas...

MediaWiki could work well enough for your needs as I understand them. It is PHP-based, with MySQL/Postgres.

As for loading the files: they would have to be loaded into the database, at least with the wikis I use; it's the only way they can be searched. But there should be an API to auto-create the different pages.
It has user levels, so you can not only have login but also, I believe, levels of security.
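
For example, MediaWiki's api.php can create pages from a script. Very roughly (a real setup needs a login and CSRF token first; the anonymous "+\" token shown here only works if the wiki allows anonymous edits, and the page title is just made up):

    curl -XPOST "http://<wiki>/api.php" \
      --data-urlencode "action=edit" \
      --data-urlencode "title=Invoice:1234567" \
      --data-urlencode "text=$(cat invoice.txt)" \
      --data-urlencode "token=+\\" \
      --data-urlencode "format=json"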

Of course there are Git/SVN and other types of wikis that may fit your needs, but for search the files will need to be loaded into either a database or something like Elastic. There are none that I know of that are directory-based and provide that functionality, but who knows, someone might have created one.

Glad I could point you in the right direction; hit me up if you have more questions.

OK, thanks. I'm playing around with it now - PmWiki looks like it has a lot of features I need. I'm deploying that on top of a Tomcat web server. I'll keep you posted.
