This is called PDF mining, and it is very hard because:

PDF is a document format designed to be printed, not to be parsed. Text is in no particular order (unless order is important for printing), and most of the time the original text structure is lost: letters may not be grouped as words, and words may not be grouped into sentences.

There are tons of programs generating PDFs, and many are defective.

Tools like PDFminer use heuristics to group letters and words again based on their position on the page. I agree, the interface is pretty low level, but it makes more sense when you know what problem they are trying to solve: in the end, what matters is choosing how close to its neighbors a letter, word, or line has to be in order to be considered part of a paragraph.

An expensive alternative (in terms of time and computing power) is to generate an image of each page and feed it to OCR. That may be worth a try if you have a very good OCR engine.
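To make the "how close to its neighbors" idea concrete, here is a minimal sketch of position-based grouping. It is not PDFminer's actual algorithm or API: the `Word` class, the function names, the top-down `y` convention, and the two thresholds (`y_tolerance`, `line_gap`) are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for a positioned text fragment on a page."""
    text: str
    x: float  # left edge of the word
    y: float  # baseline, measured from the top of the page (assumed convention)


def group_into_lines(words, y_tolerance=2.0):
    """Put words whose baselines are within y_tolerance of each other on one line."""
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(lines[-1][-1].y - word.y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, restore left-to-right order.
    return [sorted(line, key=lambda w: w.x) for line in lines]


def group_into_paragraphs(lines, line_gap=15.0):
    """Merge consecutive lines whose vertical distance is at most line_gap."""
    paragraphs = []
    previous_y = None
    for line in lines:
        y = line[0].y
        if previous_y is not None and y - previous_y <= line_gap:
            paragraphs[-1].append(line)
        else:
            paragraphs.append([line])
        previous_y = y
    return [" ".join(w.text for line in p for w in line) for p in paragraphs]


def extract_paragraphs(words, y_tolerance=2.0, line_gap=15.0):
    """Rebuild paragraphs from words given in arbitrary order."""
    return group_into_paragraphs(group_into_lines(words, y_tolerance), line_gap)


# Words arrive in no particular order, as they might inside a PDF.
sample = [
    Word("paragraph", 40, 150),
    Word("world", 50, 100.5),
    Word("again", 10, 112),
    Word("Hello", 10, 100),
    Word("New", 10, 150),
]
print(extract_paragraphs(sample))  # ['Hello world again', 'New paragraph']
```

Changing `line_gap` from 15 to 10 would split "again" into its own paragraph, which is exactly the kind of threshold choice the low-level interface leaves to you.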