How to map unstructured data in elastic


(Lucy) #1

Hi, I’m new to Elasticsearch and was wondering how it handles unstructured data. Is there a way to have a phone number in a format like 0412 345 678 treated as a single token, rather than split into multiple tokens at the spaces?
Thanks


(Kévin Masseix) #2

It depends. With Elasticsearch you can use either a static or a dynamic mapping.

With a static mapping, you may want to define the phone number field as the keyword datatype, so the whole value is indexed as a single term instead of being analyzed.


(Lucy) #3

Is there a way to have Elasticsearch look at all the incoming data and treat only the phone numbers as the keyword data type, keeping the rest as the text data type? I ask because the unstructured data I am indexing has been extracted from PDFs, and I want to be able to apply other analyzers to it.


(Junaid) #4

@Lucyj ES expects data in JSON format, so whatever you write to it will be a set of key/value pairs. There is a concept of field mappings, so each key can be given a specific expected data type, either statically or dynamically.

So for the above case, you'll have a number of different key/value pairs in your JSON document. One of the keys will probably be phone_number, and its value will be the phone number, i.e. 0412 345 678 here. You can define the phone_number mapping type as keyword and keep the rest of the field mappings as text, or anything else your requirements call for.
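
As a rough sketch of what such a mapping could look like, here is the request body built in Python. The index name `pdf_docs` and the `content` field are made-up names for illustration, and the layout assumes Elasticsearch 7 or later (no mapping types):

```python
import json

# Hypothetical index name, for illustration only.
INDEX_NAME = "pdf_docs"

# Static mapping: phone_number is indexed as one exact term (keyword),
# while content (a hypothetical field holding the PDF text) stays
# analyzed full text so other analyzers can still be applied to it.
index_body = {
    "mappings": {
        "properties": {
            "phone_number": {"type": "keyword"},
            "content": {"type": "text"},
        }
    }
}

# Print the JSON that would be sent when creating the index.
print(json.dumps(index_body, indent=2))
```

This body would be sent when creating the index, e.g. with `PUT /pdf_docs`. It only helps if the phone number already arrives in its own field, though.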

For more details about the mappings, you can refer to the documentation.


(Kévin Masseix) #5

If the content is extracted from a PDF, then I guess only one content field is extracted along with a few metadata fields, and you would like to generate a full document with separate fields?

If so, you might want to use Logstash and its regex-based plugin, grok, to extract those phone numbers from any field and put them in another field.

You can see an example here that generates a document from each syslog line.
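
To illustrate the idea without a Logstash setup, here is a minimal sketch of the same extraction step in plain Python with the `re` module. It is not grok itself. The pattern assumes numbers always look like 0412 345 678 (a leading 0, then groups of 3, 3 and 3 digits separated by single spaces), and the function and field names are made up for the example:

```python
import re

# Assumed format: "0412 345 678". Adjust the pattern for other layouts.
PHONE_RE = re.compile(r"\b0\d{3} \d{3} \d{3}\b")

def split_out_phone_numbers(content: str) -> dict:
    """Build a document with the raw text plus any phone numbers found in it."""
    return {
        "content": content,                        # stays full text
        "phone_number": PHONE_RE.findall(content)  # each match kept whole
    }

doc = split_out_phone_numbers("Call Lucy on 0412 345 678 before Friday.")
print(doc["phone_number"])
```

The resulting `phone_number` values could then be indexed into a keyword field while `content` stays a text field. A grok filter in Logstash would do the equivalent inside the ingestion pipeline.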