Javadoc
Keeps only the largest
TextBlock (by number of words). If more than one block has
the same number of words, the first block is chosen.
All discarded blocks are marked "not content" and flagged as
DefaultLabels#MIGHT_BE_CONTENT.
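The selection rule described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the library's implementation: the `Block` class, its fields, and `keepLargest` are hypothetical stand-ins for TextBlock and the filter's processing step, and the word count is assumed to be precomputed.

```java
import java.util.List;

public class LargestBlockSelectionSketch {

    /** Hypothetical stand-in for a TextBlock; not the library's class. */
    static final class Block {
        final String name;
        final int numWords;             // assumed to be precomputed
        boolean isContent = true;
        boolean mightBeContent = false; // stands in for the MIGHT_BE_CONTENT label

        Block(String name, int numWords) {
            this.name = name;
            this.numWords = numWords;
        }
    }

    /**
     * Keeps only the block with the most words; the first block wins on ties.
     * Returns the kept block, or null if the list is empty.
     */
    static Block keepLargest(List<Block> blocks) {
        Block largest = null;
        for (Block b : blocks) {
            // Strict '>' means an earlier block is never displaced by an equal one.
            if (largest == null || b.numWords > largest.numWords) {
                largest = b;
            }
        }
        for (Block b : blocks) {
            if (b != largest) {
                b.isContent = false;     // discarded: marked "not content"
                b.mightBeContent = true; // and flagged as a possible candidate
            }
        }
        return largest;
    }
}
```

With blocks of 12, 40, and 40 words, the sketch keeps the first 40-word block and flags the other two.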
Unlike
KeepLargestBlockFilter, the number of words is
computed using
HeuristicFilterBase#getNumFullTextWords(TextBlock), which only counts
words that occur in text elements with at least 9 words and are thus believed to be full text.
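To illustrate the counting heuristic described above, here is a minimal sketch. It assumes a block can be viewed as a list of text-element strings and that words are separated by whitespace; both are assumptions made for illustration, and the actual HeuristicFilterBase implementation may compute the value differently. The class and method names are hypothetical.

```java
import java.util.List;

public class FullTextWordCountSketch {

    /** Threshold taken from the description: elements with at least 9 words count as full text. */
    static final int MIN_WORDS = 9;

    /** Counts whitespace-separated words (an assumed tokenization). */
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Sums the words of only those elements that reach the threshold. */
    static int numFullTextWords(List<String> textElements) {
        int total = 0;
        for (String element : textElements) {
            int n = countWords(element);
            if (n >= MIN_WORDS) {
                total += n; // short elements (menus, captions) are ignored entirely
            }
        }
        return total;
    }
}
```

Under these assumptions, a three-word navigation line contributes nothing, while a nine-word sentence contributes all nine of its words.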
NOTE: Without language-specific fine-tuning (i.e., when running the default instance), this filter
may lead to suboptimal results. Consider using
KeepLargestBlockFilter instead, which
works at the level of number-of-words instead of text densities.