org.apache.tika.detect.TextStatistics.countSafeControl Java code examples

/**
 * Counts "safe" (i.e. seven-bit non-control) ASCII characters.
 *
 * @see #countControl()
 * @return count of safe ASCII characters
 */
public int countSafeAscii() {
  return count(0x20, 128) + countSafeControl();
}

/**
 * Counts control characters (i.e. < 0x20, excluding tab, CR, LF,
 * form feed and escape).
 * <p>
 * This definition of control characters is based on section 4 of the
 * "Content-Type Processing Model" Internet-draft
 * (<a href="http://webblaze.cs.berkeley.edu/2009/mime-sniff/mime-sniff.txt"
 * >draft-abarth-mime-sniff-01</a>).
 * <pre>
 * +-------------------------+
 * | Binary data byte ranges |
 * +-------------------------+
 * | 0x00 -- 0x08            |
 * | 0x0B                    |
 * | 0x0E -- 0x1A            |
 * | 0x1C -- 0x1F            |
 * +-------------------------+
 * </pre>
 *
 * @see <a href="https://issues.apache.org/jira/browse/TIKA-154">TIKA-154</a>
 * @return count of control characters
 */
public int countControl() {
  return count(0, 0x20) - countSafeControl();
}
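
The snippets on this page call countSafeControl() without showing its body. As a minimal sketch, assuming the byte statistics live in a 256-entry histogram and that count(from, to) covers the half-open range [from, to) as its uses above suggest, the five exclusions named in the Javadoc could be tallied as follows. The class name SafeControlSketch, the counts array and the add helper are hypothetical and are not taken from Tika.

```java
// Hypothetical stand-in for a byte histogram; not the Tika implementation.
final class SafeControlSketch {
    private final int[] counts = new int[256];

    // Record every byte of the buffer in the histogram.
    void add(byte[] data) {
        for (byte b : data) {
            counts[b & 0xff]++;
        }
    }

    // Number of recorded bytes whose value lies in the half-open range [from, to).
    int count(int from, int to) {
        int sum = 0;
        for (int i = from; i < to; i++) {
            sum += counts[i];
        }
        return sum;
    }

    // Tab, LF, form feed, CR and escape: the control bytes excluded by the Javadoc.
    int countSafeControl() {
        return counts['\t'] + counts['\n'] + counts[0x0c]
                + counts['\r'] + counts[0x1b];
    }

    // Remaining bytes below 0x20, i.e. the "binary data" ranges in the table.
    int countControl() {
        return count(0, 0x20) - countSafeControl();
    }

    public static void main(String[] args) {
        SafeControlSketch stats = new SafeControlSketch();
        stats.add(new byte[] {'a', '\t', 'b', '\r', '\n', 0x00, 0x07});
        System.out.println("safe control: " + stats.countSafeControl());  // 3
        System.out.println("other control: " + stats.countControl());     // 2
    }
}
```

The four ranges in the table are what remains of 0x00 to 0x1F once 0x09, 0x0A, 0x0C, 0x0D and 0x1B are taken out, which is why countControl() can be written as a subtraction.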

/**
 * Checks whether at least one byte was seen and that the bytes that
 * were seen were mostly plain text (i.e. < 2% control, > 90% ASCII range).
 *
 * @see <a href="https://issues.apache.org/jira/browse/TIKA-483">TIKA-483</a>
 * @see <a href="https://issues.apache.org/jira/browse/TIKA-688">TIKA-688</a>
 * @return <code>true</code> if the seen bytes were mostly safe ASCII,
 *         <code>false</code> otherwise
 */
public boolean isMostlyAscii() {
  int control = count(0, 0x20);
  int ascii = count(0x20, 128);
  int safe = countSafeControl();
  return total > 0
      && (control - safe) * 100 < total * 2
      && (ascii + safe) * 100 > total * 90;
}
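
Both comparisons avoid floating point by cross-multiplying: x * 100 < total * 2 is the integer form of "x is less than 2% of total". The sketch below restates the test with the counts passed in as plain parameters so the thresholds can be tried on concrete numbers. The class MostlyAsciiSketch and its method signature are illustrative assumptions, not part of Tika.

```java
// Illustrative restatement of the threshold test; names are assumptions.
final class MostlyAsciiSketch {
    // total: all bytes seen; control: bytes below 0x20;
    // safe: tab, LF, form feed, CR, escape; ascii: bytes in 0x20..0x7F.
    static boolean isMostlyAscii(int total, int control, int safe, int ascii) {
        return total > 0
                && (control - safe) * 100 < total * 2   // under 2% unsafe control bytes
                && (ascii + safe) * 100 > total * 90;   // over 90% printable or safe
    }

    public static void main(String[] args) {
        // 1000 bytes: 950 printable, 40 tabs/newlines, 10 NULs (1% unsafe control).
        System.out.println(isMostlyAscii(1000, 50, 40, 950));  // true
        // 1000 bytes: 950 printable, 20 newlines, 30 NULs (3% unsafe control).
        System.out.println(isMostlyAscii(1000, 50, 20, 950));  // false
    }
}
```

Because both inequalities are strict, an input with exactly 2% unsafe control bytes or exactly 90% ASCII fails the check.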

/**
 * Checks whether the observed byte stream looks like UTF-8 encoded text.
 *
 * @since Apache Tika 1.3
 * @return <code>true</code> if the seen bytes look like UTF-8,
 *         <code>false</code> otherwise
 */
public boolean looksLikeUTF8() {
  int control = count(0, 0x20);
  int utf8 = count(0x20, 0x80);
  int safe = countSafeControl();
  int expectedContinuation = 0;
  int[] leading = new int[] {
      count(0xc0, 0xe0), count(0xe0, 0xf0), count(0xf0, 0xf8) };
  for (int i = 0; i < leading.length; i++) {
    utf8 += leading[i];
    expectedContinuation += (i + 1) * leading[i];
  }
  int continuation = count(0x80, 0xc0);
  return utf8 > 0
      && continuation <= expectedContinuation
      && continuation >= expectedContinuation - 3
      && count(0xf8, 0x100) == 0  // bytes 0xF8-0xFF never occur in UTF-8
      && (control - safe) * 100 < utf8 * 2;
}
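
The check leans on the byte layout of UTF-8: a lead byte in 0xC0-0xDF is followed by one continuation byte (0x80-0xBF), a lead byte in 0xE0-0xEF by two, and one in 0xF0-0xF7 by three. The sketch below tallies those classes for a byte array so the two totals can be compared. Utf8TallySketch and its tally helper are hypothetical names; only java.nio.charset.StandardCharsets from the JDK is used.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper that mirrors the continuation-byte bookkeeping.
final class Utf8TallySketch {
    // Returns {continuation bytes seen, continuation bytes implied by lead bytes}.
    static int[] tally(byte[] data) {
        int continuation = 0;
        int expected = 0;
        for (byte raw : data) {
            int b = raw & 0xff;
            if (b >= 0x80 && b < 0xc0) {
                continuation++;          // 10xxxxxx
            } else if (b >= 0xc0 && b < 0xe0) {
                expected += 1;           // 110xxxxx starts a 2-byte sequence
            } else if (b >= 0xe0 && b < 0xf0) {
                expected += 2;           // 1110xxxx starts a 3-byte sequence
            } else if (b >= 0xf0 && b < 0xf8) {
                expected += 3;           // 11110xxx starts a 4-byte sequence
            }
        }
        return new int[] {continuation, expected};
    }

    public static void main(String[] args) {
        // "naive" with a diaeresis plus a euro sign, encoded as UTF-8.
        int[] utf8 = tally("na\u00efve \u20ac".getBytes(StandardCharsets.UTF_8));
        System.out.println(utf8[0] + " seen, " + utf8[1] + " expected");
        // The same word in ISO-8859-1: 0xEF looks like a lead byte with no followers.
        int[] latin1 = tally("na\u00efve".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(latin1[0] + " seen, " + latin1[1] + " expected");
    }
}
```

For well-formed UTF-8 the two totals match. The slack of 3 in looksLikeUTF8() presumably tolerates a multi-byte sequence cut off at the end of the sampled data, since a truncated four-byte sequence can be missing up to three continuation bytes; that reading is an inference from the code, not something the Javadoc states.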
