org.apache.tika.parser.pdf.PDFParserConfig.setExtractUniqueInlineImagesOnly Java code examples

@Field
void setExtractUniqueInlineImagesOnly(boolean extractUniqueInlineImagesOnly) {
  defaultConfig.setExtractUniqueInlineImagesOnly(extractUniqueInlineImagesOnly);
}

setExtractInlineImages(
    getBooleanProp(props.getProperty("extractInlineImages"),
        getExtractInlineImages()));
setExtractUniqueInlineImagesOnly(
    getBooleanProp(props.getProperty("extractUniqueInlineImagesOnly"),
        getExtractUniqueInlineImagesOnly()));
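
The snippet above reads each boolean setting from a Properties object and falls back to the current value when the key is missing. The following is a minimal sketch of that pattern using only the JDK. The class BooleanPropSketch is hypothetical, and the parsing rules in its getBooleanProp (case-insensitive "true"/"false", anything else keeps the fallback) are an assumption for illustration, not a copy of Tika's implementation.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical stand-in for a config class; this is not Tika's PDFParserConfig.
class BooleanPropSketch {
    // The Javadoc for the real setting states that the default is true.
    private boolean extractUniqueInlineImagesOnly = true;

    // Assumed parsing rules: "true"/"false" (any case) override the fallback;
    // a missing or unrecognized value keeps the fallback.
    static boolean getBooleanProp(String value, boolean fallback) {
        if (value == null) {
            return fallback;
        }
        String v = value.trim().toLowerCase(Locale.ROOT);
        if (v.equals("true")) {
            return true;
        }
        if (v.equals("false")) {
            return false;
        }
        return fallback;
    }

    // Mirrors the shape of the snippet: read the property, default to the current value.
    void load(Properties props) {
        extractUniqueInlineImagesOnly = getBooleanProp(
                props.getProperty("extractUniqueInlineImagesOnly"),
                extractUniqueInlineImagesOnly);
    }

    boolean getExtractUniqueInlineImagesOnly() {
        return extractUniqueInlineImagesOnly;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("extractUniqueInlineImagesOnly", "false");
        BooleanPropSketch config = new BooleanPropSketch();
        config.load(props);
        System.out.println(config.getExtractUniqueInlineImagesOnly()); // prints false
    }
}
```

Passing the current value as the fallback means a properties file only needs to list the settings it wants to change.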

pdfParserConfig.setExtractUniqueInlineImagesOnly((Boolean) extractUniqueInlineImagesOnly);

/**
 * Create a new extractor, which will OCR images by default if Tesseract is available locally, extract inline
 * images from PDF files and OCR them and use PDFBox's non-sequential PDF parser.
 */
public Extractor() {
  // Calculate the SHA256 digest by default.
  setDigestAlgorithms(DigestAlgorithm.SHA256);
  // Run OCR on images contained within PDFs and not on pages.
  pdfConfig.setExtractInlineImages(true);
  pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
  // By default, only the object IDs are used for determining uniqueness.
  // In scanned documents under test from the Panama registry, different embedded images had the same ID, leading to incomplete OCRing when uniqueness detection was turned on.
  pdfConfig.setExtractUniqueInlineImagesOnly(false);
  // Set a long OCR timeout by default, because Tika's is too short.
  setOcrTimeout(Duration.ofDays(1));
  ocrConfig.setEnableImageProcessing(0); // See TIKA-2167. Image processing causes OCR to fail.
  // English text recognition by default.
  ocrConfig.setLanguage("eng");
}
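
The constructor above disables uniqueness detection because distinct images shared an object ID. One possible follow-up, sketched here under assumptions, is to extract every image and then de-duplicate downstream by content hash rather than by ID. The class ContentHashDedupSketch and its methods are hypothetical and use only the JDK; how the image bytes are obtained from the extractor is left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical downstream filter; not part of Tika or of the Extractor class above.
class ContentHashDedupSketch {
    // Hex-encoded SHA-256 digest of the given bytes.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Keeps the first occurrence of each distinct byte content, in input order.
    static List<byte[]> uniqueByContent(List<byte[]> images) {
        Set<String> seen = new HashSet<>();
        List<byte[]> unique = new ArrayList<>();
        for (byte[] image : images) {
            if (seen.add(sha256Hex(image))) {
                unique.add(image);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        // Placeholder byte arrays standing in for extracted image data.
        List<byte[]> images = Arrays.asList(
                "scan-A".getBytes(StandardCharsets.UTF_8),
                "scan-B".getBytes(StandardCharsets.UTF_8),
                "scan-A".getBytes(StandardCharsets.UTF_8));
        System.out.println(uniqueByContent(images).size()); // prints 2
    }
}
```

Hashing compares the bytes themselves, so two different scans that happen to share an object ID are both kept, while byte-identical copies are collapsed.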

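import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
TesseractOCRConfig config = new TesseractOCRConfig();
// tPath is defined by the caller: the directory that contains the tesseract executable.
config.setTesseractPath(tPath);
PDFParserConfig pdfConfig = new PDFParserConfig();
// Inline images must be extracted for the uniqueness setting to have any effect.
pdfConfig.setExtractInlineImages(true);
// false: extract an image every time it appears, even when occurrences share a PDF object ID.
pdfConfig.setExtractUniqueInlineImagesOnly(false);
ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
// Register the parser itself so that embedded images are parsed recursively.
parseContext.set(Parser.class, parser);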

Javadoc

Multiple pages within a PDF file might refer to the same underlying image. If #extractUniqueInlineImagesOnly is set to false, the parser will call the EmbeddedExtractor each time the image appears on a page. This might be desired for some use cases. However, to avoid duplication of extracted images, set this to true. The default is true.

Note that uniqueness is determined only by the underlying PDF COSObject id, not by file hash or similar equality metric. If the PDF actually contains multiple copies of the same image -- all with different object ids -- then all images will be extracted.

For this parameter to have any effect, #extractInlineImages must be set to true.

Because of TIKA-1742 -- to avoid infinite recursion -- no matter the setting of this parameter, the extractor will only pull out one copy of each image per page. This parameter tries to capture uniqueness across the entire document.
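
The interaction between the per-page rule and the document-wide setting described above can be sketched as a small model. This is a simplified illustration of the documented behavior only, not Tika's implementation: the class UniqueImageModel is hypothetical, pages are represented as lists of integer object IDs, and the returned list stands for the images handed to the embedded extractor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model of the Javadoc's description; not Tika's code.
class UniqueImageModel {
    // pages: each inner list holds the object IDs of the images drawn on one page.
    // Returns the object IDs that would be extracted, in document order.
    static List<Integer> extracted(List<List<Integer>> pages, boolean uniqueOnly) {
        List<Integer> out = new ArrayList<>();
        Set<Integer> seenInDocument = new HashSet<>();
        for (List<Integer> page : pages) {
            Set<Integer> seenOnPage = new HashSet<>();
            for (int id : page) {
                // One copy per page, regardless of the setting.
                if (!seenOnPage.add(id)) {
                    continue;
                }
                // Document-wide uniqueness applies only when uniqueOnly is true.
                if (uniqueOnly && !seenInDocument.add(id)) {
                    continue;
                }
                out.add(id);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Page 1 draws image 7 twice and image 9 once; page 2 draws images 7 and 12.
        List<List<Integer>> pages = Arrays.asList(
                Arrays.asList(7, 7, 9),
                Arrays.asList(7, 12));
        System.out.println(extracted(pages, true));  // prints [7, 9, 12]
        System.out.println(extracted(pages, false)); // prints [7, 9, 7, 12]
    }
}
```

In this model the only difference between the two settings is the second occurrence of image 7 on page 2; the repeated draw of image 7 on page 1 is dropped in both cases.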

Popular methods of PDFParserConfig

setExtractInlineImages
If true, extract inline embedded OBXImages. Beware: some PDF documents of modest size (~4MB) can cont
<init>
Loads properties from InputStream and then tries to close InputStream. If there is an IOException, t
setOcrStrategy
Which strategy to use for OCR
setSuppressDuplicateOverlappingText
If true, the parser should try to remove duplicated text over the same region. This is needed for so
configure
Configures the given pdf2XHTML.
setEnableAutoSpace
If true (the default), the parser should estimate where spaces should be inserted between words. For
setExtractAcroFormContent
If true (the default), extract content from AcroForms at the end of the document. If an XFA is found
setExtractAnnotationText
If true (the default), text in annotations will be extracted.
setSortByPosition
If true, sort text tokens by their x/y position before extracting text. This may be necessary for so
getAccessChecker
getAverageCharTolerance
getBooleanProp

