The most common CV/resume format is MS Word. Despite being easy for humans to read and understand, it is quite difficult for a computer to interpret.
Extracting data and interpreting its meaning is surprisingly difficult for a computer because:
- Language is infinitely varied. There are hundreds of ways to write down a date, for example, and countless ways to write what you did in your last job. A resume parsing tool captures all these different ways of writing the same thing through complex rules and statistical algorithms.
- Language is ambiguous. The same word or phrase can mean different things in different contexts.
• “MD” can mean a variety of things. “Medical Doctor” is one reading, but if you are in the UK you may immediately think of “Managing Director,” and if you are more familiar with the Mid-Atlantic region of the U.S., “Maryland” may spring to mind.
• A 4-digit number can be part of a telephone number, a home address, part of a social security number, a Swiss zip code, a year or a version of a software package.
• The term “Project Manager” may indicate that the writer was indeed a project manager, but it means something quite different in a context like “I used to report to the Project Manager.”
The only way CV parsing software can resolve these ambiguities is by understanding and analyzing the context in which they are used. A good CV parser uses complex rules and statistical algorithms to be “Intelligent.”
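To make the idea of context concrete, here is a minimal sketch of how a rule-based step might label 4-digit numbers by looking at the words just before them. The function name, the cue words, and the three-word window are all assumptions chosen for illustration; they are not the rules our parser actually uses, and a real system would combine far richer rules with statistical models.

```python
import re

FOUR_DIGITS = re.compile(r"\b\d{4}\b")

# Illustrative cue words only (assumed for this sketch). Labels are checked
# in this order, so "phone" wins if several cues appear in the window.
CUES = {
    "phone": ("tel", "phone", "mobile"),
    "year": ("since", "from", "until", "graduated", "in"),
    "version": ("version", "office", "release"),
}


def classify_four_digit_numbers(text, window=3):
    """Label each 4-digit number using the `window` words that precede it."""
    results = []
    for match in FOUR_DIGITS.finditer(text):
        # Collect the alphabetic words before the number and keep the last few.
        words = re.findall(r"[a-z]+", text[:match.start()].lower())[-window:]
        label = "unknown"
        for candidate, cues in CUES.items():
            if any(word in cues for word in words):
                label = candidate
                break
        results.append((match.group(), label))
    return results


for sample in ("Graduated in 2015", "Used MS Office 2010", "8004 Zurich"):
    print(sample, "->", classify_four_digit_numbers(sample))
```

The same digits get different labels depending only on the neighbouring words, and a number with no recognised cue (such as the Swiss postal code) stays "unknown" rather than being guessed.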
We used a combination of keyword-based, statistical, and grammar-based parsers. The parsing process works as follows:
- Convert the CV file from common file types such as DOC, DOCX, and PDF into plain text. At this point we also extract any photos embedded in the document.
- Pass the text to our various parsers.
- Inside the parsers, the data undergoes many transformations and is split into sections. Language detection is also performed at this stage.
- Finally, wrap the result in JSON and return it as output.
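The steps above can be sketched roughly as follows. This is a simplified illustration, not our actual implementation: it assumes the file has already been converted to plain text, it omits photo extraction and language detection, and the names `split_sections`, `keyword_parser`, `parse_cv`, and the list of section headings are hypothetical. Only a toy keyword-based parser is shown; the statistical and grammar-based parsers would plug in at the same point.

```python
import json
import re

# Assumed section headings for this sketch; real CVs use many more variants.
SECTION_HEADINGS = ("education", "experience", "skills")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def split_sections(text):
    """Group non-empty lines under the most recently seen heading."""
    sections = {"header": []}
    current = "header"
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        key = stripped.lower().rstrip(":")
        if key in SECTION_HEADINGS:
            current = key
            sections.setdefault(current, [])
        else:
            sections[current].append(stripped)
    return sections


def keyword_parser(sections):
    """Toy keyword-based parser: find an email and a comma-separated skill list."""
    match = EMAIL.search(" ".join(sections.get("header", [])))
    skills = []
    for line in sections.get("skills", []):
        skills.extend(s.strip() for s in line.split(",") if s.strip())
    return {"email": match.group() if match else None, "skills": skills}


def parse_cv(plain_text):
    """Section the text, run the parser, and wrap the result in JSON."""
    sections = split_sections(plain_text)
    result = {"sections": sections}
    result.update(keyword_parser(sections))
    return json.dumps(result, indent=2)


cv_text = """Jane Doe
jane.doe@example.com

Skills:
Python, SQL

Experience
Project Manager at Acme
"""
print(parse_cv(cv_text))
```

Keeping sectioning separate from field extraction means each parser only has to look at the part of the document relevant to it, which also narrows the context needed to resolve ambiguous terms.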