Debugging "Failed Documents" from Data Visualizer CSV Ingestion

I have a CSV file I'm trying to convert into an index via Kibana.

I go to Machine Learning/Data Visualizer and upload my file. The "explanations" look good to me, as do all the guesses for the column types, so I hit "import". About 1,500 of 3,000 CSV lines could not be imported. The message says "This could be due to lines not matching the Grok pattern." and beneath that are the details of the error.

The data is proprietary, so I can't share it. But say that there is a "Notes" column that Kibana correctly determines to be of type text. The problematic lines are all in that column and of the form "Lorem ipsum; 99-9999, sit amet". The error message for each is "unable to convert [99-9999] to long".

My specific question is why is Kibana trying to convert this string to a numeric type when it appears in a text field? My more general question is how do I go about debugging this?

I thought I could look at the Pipeline that gets generated and find some error to fix, but that just appears to import the "Notes" column as text. It's not doing anything wrong.

What version are you on?

Is that a single column or multiple columns?

Are all the fields double-quote (") delimited?

What IS the delimiter?

Perhaps you have inconsistent quoting or delimiters.

Can you give us more of the "shape" of the data? What do a couple of columns and rows look like?

You can change all of these settings in the Advanced tab.

This parsed fine for me on version 8.4, except that I changed the ID column to an integer; it was originally detected as a keyword.

 1234,"Lorem ipsum; 99-9999, sit amet","Code"
 1234,"Lorem ipsum; 99-9999, sit amet","Code"
 1234,"Lorem ipsum; 99-9999, sit amet","Code"

Version 8.4.

That field is a single column. In the CSV, the columns to its left and right look like this.

999.9999,"Lorem ipsum; 99-9999, sit amet",x

I was thinking this might be because there's a comma in the middle of the text, but in the CSV the whole cell is double-quoted and the advanced box in the Kibana UI had " as the quote delimiter.

I would look through your data. I suspect some rows have that column quoted and others do not. Or it may not be that column at all: one of your numeric columns may sometimes contain a number and sometimes not. That could be another explanation.
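One way to check both suspicions outside Kibana is a small script that flags rows whose field count is off or whose numeric column does not parse. This is only a sketch: the function name `find_suspect_rows`, the three-column layout, and the sample rows are assumptions for illustration, not anything Data Visualizer produces.

```python
import csv
import io


def find_suspect_rows(text, expected_columns, numeric_columns=()):
    """Return (line_number, reason) pairs for rows that look inconsistent."""
    suspects = []
    reader = csv.reader(io.StringIO(text), delimiter=",", quotechar='"')
    for row in reader:
        line_no = reader.line_num
        # An unquoted comma inside a text cell shows up as an extra field.
        if len(row) != expected_columns:
            suspects.append(
                (line_no, f"expected {expected_columns} fields, got {len(row)}")
            )
            continue
        # A numeric column that sometimes holds text fails type conversion.
        for idx in numeric_columns:
            value = row[idx].strip()
            if value == "":
                continue
            try:
                float(value)
            except ValueError:
                suspects.append((line_no, f"column {idx} is not numeric: {value!r}"))
    return suspects


# Hypothetical rows: one well-formed, one missing its quotes, one with text
# in the numeric column.
sample = (
    '999.9999,"Lorem ipsum; 99-9999, sit amet",x\n'
    '999.9999,Lorem ipsum; 99-9999, sit amet,x\n'
    'n/a,"Lorem ipsum",x\n'
)

suspects = find_suspect_rows(sample, expected_columns=3, numeric_columns=[0])
for line_no, reason in suspects:
    print(line_no, reason)
```

For a real file, read its contents into `text` (or adapt the function to take an open file) and list the indexes of the columns that Data Visualizer mapped to numeric types.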

You can still import it and then check the rows, etc.

Everything it's doing is shown right there in the Advanced tab.

It seems like the best thing to do is start with a very small data set and write my own pipeline.

I authored a pipeline that just works on these three columns. Now I want to send one line of my CSV file through it. How do I do that?

I don't see how to tell the data file visualizer to use my custom pipeline. (I don't think that's its purpose.) I see documentation online like this blog post that describes how to convert your CSV into JSON documents outside of Kibana and then index them in the usual manner.
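For reference, the convert-outside-Kibana approach can be sketched roughly as below: each CSV row becomes a JSON document, paired with an action line in the newline-delimited format that the Elasticsearch `_bulk` API expects. The function name `csv_to_bulk_ndjson`, the index name `my-csv-demo`, and the column names are all assumptions for illustration. Values are left as strings, so type conversion is left to the index mapping or an ingest pipeline.

```python
import csv
import io
import json


def csv_to_bulk_ndjson(text, index_name, fieldnames):
    """Turn CSV text into a newline-delimited JSON body for the _bulk API."""
    lines = []
    # fieldnames is passed explicitly, so the first row is treated as data.
    reader = csv.DictReader(io.StringIO(text), fieldnames=fieldnames)
    for row in reader:
        # Action line, then the document source on the following line.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(row))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample rows and column names.
sample = (
    '999.9999,"Lorem ipsum; 99-9999, sit amet",x\n'
    '123.4567,"Dolor sit",y\n'
)
bulk_body = csv_to_bulk_ndjson(sample, "my-csv-demo", ["value", "notes", "code"])
print(bulk_body)
```

The resulting text would be sent as the body of a `POST _bulk` request with the `application/x-ndjson` content type.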

Is there an out-of-the-box way to ingest CSVs? Should I be tinkering with data file visualizer? Or should I just write my own CSV-to-JSON conversion script?

What kind of pipeline? An ingest pipeline?

Can you show me?

You can just cut and paste it into the Advanced tab.

You could also create your own pipeline and use Filebeat, if you plan to repeat this many times.

Reading online a bit more, is this ingestion stuff what Logstash is for?

Let me back up... I know how NoSQL databases work, and a few years ago I played with the development UI in Kibana, so I understand the core of the product. But now I'm trying to actually do something with it, so I need a better overview.

I'm a data scientist. I work with lots of different data sets. A lot of times they come in as CSV files. I need an easy way to ingest and analyze that data.

What I want to do is a demo for my colleagues where I open up some piece of software, load in a CSV file, push a few buttons, make a few pretty graphs, and then say, "See how easy this is? Now can we please stop writing our own dashboard software and concentrate on the things we're good at?"

Ideally I'd like to sit quietly in the back of a planning meeting where people are discussing how we're going to spend months writing visualizations and then have the visualizations done by the end of the meeting.

I think this is possible with Elasticsearch/Kibana, but I have to figure out which parts of the system to completely master in order to do it.

I suspect creating my own pipeline is the way to go.

Awesome, totally!

There are Logstash pipelines, which run within Logstash. They work great and are powerful, but they take a little setup work and a little coding.

There are ingest pipelines that run inside Elasticsearch (what Data Visualizer uses) and can be called from the REST API, Filebeat, or Data Visualizer.

If you want to do the above, I would write an ingest pipeline, then just cut and paste it into the Data Visualizer or use Filebeat. With Filebeat, you would just use a basic setup and then set the pipeline in the output section.

Thanks. It helps to have the names of things to go look at tutorials of. Would you say that Visualizer is the out-of-box solution and Filebeat is the next step up for customization?

Yes... I think you ran into something a bit "wonky" (tech term :) ).
Visualizer usually works pretty well for simple files.

Writing an ingest pipeline can be a pretty quick dev cycle. Look up the _simulate API.
You use it with some sample docs, and you can run the dev-test cycle very quickly.
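As a rough sketch of what such a request looks like, the script below builds a body for `POST _ingest/pipeline/_simulate` that pushes one CSV line through an inline pipeline. The pipeline itself is an assumption, not the one from this thread: it uses the `csv`, `convert`, and `remove` processors on a `message` field, with made-up column names `value`, `notes`, and `code`.

```python
import json


def build_simulate_request(csv_line):
    """Build a _simulate request body that runs one CSV line through a pipeline."""
    # Hypothetical three-column pipeline; swap in your own processors.
    pipeline = {
        "description": "Hypothetical three-column CSV pipeline",
        "processors": [
            {
                "csv": {
                    "field": "message",
                    "target_fields": ["value", "notes", "code"],
                    "separator": ",",
                    "quote": '"',
                }
            },
            {"convert": {"field": "value", "type": "double", "ignore_missing": True}},
            {"remove": {"field": "message"}},
        ],
    }
    # Each sample document carries the raw CSV line in its _source.
    return {"pipeline": pipeline, "docs": [{"_source": {"message": csv_line}}]}


body = build_simulate_request('999.9999,"Lorem ipsum; 99-9999, sit amet",x')
print("POST _ingest/pipeline/_simulate")
print(json.dumps(body, indent=2))
```

The printed output can be pasted into the Kibana Dev Tools console. The response shows the transformed document for each sample, or the processor error for a line that fails, which makes it easier to see which field triggers a message such as "unable to convert [99-9999] to long".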
