Multiple pages within a PDF file might refer to the same underlying image.
If #extractUniqueInlineImagesOnly is set to false, the parser will call the
EmbeddedExtractor each time the image appears on a page. This might be desired
for some use cases. However, to avoid duplication of extracted images, set this
to true. The default is true.
Note that uniqueness is determined only by the underlying PDF COSObject id, not
by a file hash or similar equality metric. If the PDF actually contains multiple
copies of the same image, each with a different object id, then every copy will
be extracted.
For this parameter to have any effect, #extractInlineImages must be set to true.
Regardless of this setting, the extractor will only pull out one copy of each
image per page; this guards against the infinite recursion reported in
TIKA-1742. This parameter tries to capture uniqueness across the entire document.
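
The interaction between the per-page rule and the document-wide setting can be illustrated with a small model. This is only a sketch under stated assumptions, not the parser's actual code: the class InlineImageDedupSketch, the method countExtractions, and the representation of a page as a list of numeric object ids are all hypothetical and exist only for this example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not Tika code: each page is a list of image object ids,
// standing in for the COSObject ids the description refers to.
final class InlineImageDedupSketch {

    // Counts how many times an image would be handed to the extractor.
    // uniqueOnly mirrors the extractUniqueInlineImagesOnly setting.
    static int countExtractions(List<List<Long>> pages, boolean uniqueOnly) {
        Set<Long> seenInDocument = new HashSet<>();
        int extracted = 0;
        for (List<Long> page : pages) {
            Set<Long> seenOnPage = new HashSet<>();
            for (Long objectId : page) {
                // Per-page rule: at most one copy of each id per page,
                // no matter how uniqueOnly is set.
                if (!seenOnPage.add(objectId)) {
                    continue;
                }
                // Document-wide rule: applied only when uniqueOnly is true.
                if (uniqueOnly && !seenInDocument.add(objectId)) {
                    continue;
                }
                extracted++;
            }
        }
        return extracted;
    }
}
```

For example, with pages [[7, 7, 9], [7], [12]], calling countExtractions(pages, true) counts ids 7, 9 and 12 once each, while countExtractions(pages, false) also counts id 7 again on the second page. Two byte-identical images stored under different ids would be counted separately in both modes, because only the id is compared.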