private void tryImageExtraction(PDPage page, IDocearPdfImageHandler imageHandler) {
    CSImageExtractor ocrExtractor = new CSImageExtractor(imageHandler);
    CSDeviceBasedInterpreter interpreter = new CSDeviceBasedInterpreter(null, ocrExtractor);
    interpreter.process(page.getContentStream(), page.getResources());
}
private void extractText(PDPageTree pageTree, StringBuilder sb) {
    for (Iterator<?> it = pageTree.getKids().iterator(); it.hasNext();) {
        PDPageNode node = (PDPageNode) it.next();
        if (node.isPage()) {
            // Leaf node: interpret the page's content stream and append its text.
            try {
                CSTextExtractor extractor = new CSTextExtractor();
                PDPage page = (PDPage) node;
                AffineTransform pageTx = new AffineTransform();
                PDFGeometryTools.adjustTransform(pageTx, page);
                extractor.setDeviceTransform(pageTx);
                CSDeviceBasedInterpreter interpreter = new CSDeviceBasedInterpreter(null, extractor);
                interpreter.process(page.getContentStream(), page.getResources());
                sb.append(extractor.getContent());
            } catch (CSException e) {
                // A failing page is logged and skipped; extraction continues with the next node.
                e.printStackTrace();
            }
        } else {
            // Intermediate node: recurse into the nested page tree.
            extractText((PDPageTree) node, sb);
        }
    }
}
gctx.fill(rect);
CSContent content = page.getContentStream();
private TreeMap<PdfTextEntity, StringBuilder> tryTextExtraction(PDPage page) {
    CSFormatedTextExtractor extractor = new CSFormatedTextExtractor();
    AffineTransform pageTx = new AffineTransform();
    PDFGeometryTools.adjustTransform(pageTx, page);
    extractor.setDeviceTransform(pageTx);
    CSDeviceBasedInterpreter interpreter = new CSDeviceBasedInterpreter(null, extractor);
    interpreter.process(page.getContentStream(), page.getResources());
    TreeMap<PdfTextEntity, StringBuilder> map = extractor.getMap();
    uniqueHash = extractor.getHash();
    return map;
}