Parsing resume for relevance against job description


I have a use case where I have to compare resume documents (MS Word, PDF) against a job description. Since resumes are highly unstructured documents, I am struggling to clean them up, remove invalid characters, and create JSON from them.

Then I came across the Ingest Attachment plugin. My questions are:

  1. Is it possible to ingest the attachment directly from a physical drive location?
  2. Does the attachment to be ingested always have to be Base64 encoded? How should I query the encoded attachment data with a non-encoded query string?

Any help is appreciated.

Have a look at the FSCrawler project. It might help you.


Thanks David. This tool was helpful, though the resume text in the output JSON has \n sequences embedded in it (understandably):

\nManaged global compliance program for a Fortune 100 company with an emphasis on creating an effective and cost-efficient program, talent development, and internal reporting and investigation procedures. \n\n· Allocated resources efficiently through the use of risk assessment protocols to identify, mitigate, and monitor compliance risks\n\n· Built cross-functional teams to develop compliance policies and communications, controls, audit plans, and training strategies that improve compliance program effectiveness\n\n· Reduced costs and employee time requirements of compliance training while improving employee awareness using internally developed training and communications resources\n·

How can I avoid those? My concern is that a skill keyword might end up right next to a \n and then not be matched in search.

I should probably add an option for that. Could you open an issue and if possible share a document I could use as a test case input?

In the meantime I think you can use a char filter in a custom analyzer to remove those characters.
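
A minimal sketch of that idea, assuming a `pattern_replace` char filter is acceptable for this case. The names `newline_to_space` and `resume_text` are made up for illustration, and the pattern is written to cover both a real newline and a literal backslash followed by `n`, since it is not clear from the sample which one ends up in the indexed field. The helper `preview_char_filter` only approximates the filter locally with Python's `re`, so the pattern can be checked before creating an index.

```python
import json
import re

# Regex matching either a literal backslash followed by "n" or a real newline.
NEWLINE_PATTERN = r"\\n|\n"

# Index settings sketch. "newline_to_space" and "resume_text" are
# hypothetical names chosen for this example.
index_settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "newline_to_space": {
                    "type": "pattern_replace",
                    "pattern": NEWLINE_PATTERN,
                    "replacement": " ",
                }
            },
            "analyzer": {
                "resume_text": {
                    "type": "custom",
                    "char_filter": ["newline_to_space"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}


def preview_char_filter(text: str) -> str:
    """Local approximation of the char filter, for checking the pattern only."""
    return re.sub(NEWLINE_PATTERN, " ", text)


if __name__ == "__main__":
    # Body to send when creating the index.
    print(json.dumps(index_settings, indent=2))
    print(preview_char_filter("compliance risks\n\n· Built cross-functional teams"))
```

The analyzer would then need to be set on the mapping of the field that holds the extracted resume text, and the index recreated or reindexed for it to take effect.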

Sure, will do that
