Tika.parseToString

How to use the parseToString method in org.apache.tika.Tika

Best Java code snippets using org.apache.tika.Tika.parseToString

origin: apache/tika

  public static void main(String[] args) throws Exception {
    // Create a Tika instance with the default configuration
    Tika tika = new Tika();

    // Parse all given files and print out the extracted
    // text content
    for (String file : args) {
      String text = tika.parseToString(new File(file));
      System.out.print(text);
    }
  }
}
origin: apache/tika

public static String parseToStringExample() throws Exception {
  File document = new File("example.doc");
  String content = new Tika().parseToString(document);
  System.out.print(content);
  return content;
}
origin: apache/tika

/**
 * Example of how to use Tika's parseToString method to parse the content of a file,
 * and return any text found.
 * <p>
 * Note: Tika.parseToString() will extract content from the outer container
 * document and any embedded/attached documents.
 *
 * @return The content of a file.
 */
public String parseToStringExample() throws IOException, SAXException, TikaException {
  Tika tika = new Tika();
  try (InputStream stream = ParsingExample.class.getResourceAsStream("test.doc")) {
    return tika.parseToString(stream);
  }
}
origin: apache/tika

  public void indexDocument(File file) throws Exception {
    Document document = new Document();
    document.add(new TextField("filename", file.getName(), Store.YES));
    document.add(new TextField("fulltext", tika.parseToString(file), Store.NO));
    writer.addDocument(document);
  }
}
origin: apache/tika

/**
 * Parses the given document and returns the extracted text content.
 * The given input stream is closed by this method.
 * <p>
 * To avoid unpredictable excess memory use, the returned string contains
 * only up to {@link #getMaxStringLength()} first characters extracted
 * from the input document. Use the {@link #setMaxStringLength(int)}
 * method to adjust this limitation.
 * <p>
 * <strong>NOTE:</strong> Unlike most other Tika methods that take an
 * {@link InputStream}, this method will close the given stream for
 * you as a convenience. With other methods you are still responsible
 * for closing the stream or a wrapper instance returned by Tika.
 *
 * @param stream the document to be parsed
 * @return extracted text content
 * @throws IOException if the document can not be read
 * @throws TikaException if the document can not be parsed
 */
public String parseToString(InputStream stream)
    throws IOException, TikaException {
  return parseToString(stream, new Metadata());
}
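
The Javadoc above says that parseToString(InputStream) closes the stream it is given. When writing tests around code that hands streams to such a method, it can help to confirm that the close actually happened. The following is a minimal, JDK-only sketch of that idea. The names CloseTrackingDemo, CloseTrackingInputStream and readAllAndClose are hypothetical and not part of Tika; readAllAndClose is only a stand-in with the same "reads, then closes" contract.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class CloseTrackingDemo {

    /** Wraps a stream and remembers whether close() has been called. */
    static final class CloseTrackingInputStream extends FilterInputStream {
        private boolean closed;

        CloseTrackingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }

        boolean isClosed() {
            return closed;
        }
    }

    /**
     * Hypothetical stand-in for a method with the documented contract:
     * it reads the whole stream as UTF-8 text and closes the stream
     * before returning. Requires Java 9+ for readAllBytes().
     */
    static String readAllAndClose(InputStream stream) {
        try (InputStream in = stream) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CloseTrackingInputStream stream = new CloseTrackingInputStream(
                new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8)));
        String text = readAllAndClose(stream);
        // The caller never calls close(), yet the wrapper reports it as closed.
        System.out.println(text + " closed=" + stream.isClosed());
    }
}
```

In a real test, the same wrapper could presumably be passed to tika.parseToString(stream) in place of the stand-in, with an assertion on isClosed() afterwards.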
origin: apache/tika

/**
 * Parses the file at the given path and returns the extracted text content.
 * <p>
 * To avoid unpredictable excess memory use, the returned string contains
 * only up to {@link #getMaxStringLength()} first characters extracted
 * from the input document. Use the {@link #setMaxStringLength(int)}
 * method to adjust this limitation.
 *
 * @param path the path of the file to be parsed
 * @return extracted text content
 * @throws IOException if the file can not be read
 * @throws TikaException if the file can not be parsed
 */
public String parseToString(Path path) throws IOException, TikaException {
  Metadata metadata = new Metadata();
  InputStream stream = TikaInputStream.get(path, metadata);
  return parseToString(stream, metadata);
}
origin: apache/tika

/**
 * Parses the resource at the given URL and returns the extracted
 * text content.
 * <p>
 * To avoid unpredictable excess memory use, the returned string contains
 * only up to {@link #getMaxStringLength()} first characters extracted
 * from the input document. Use the {@link #setMaxStringLength(int)}
 * method to adjust this limitation.
 *
 * @param url the URL of the resource to be parsed
 * @return extracted text content
 * @throws IOException if the resource can not be read
 * @throws TikaException if the resource can not be parsed
 */
public String parseToString(URL url) throws IOException, TikaException {
  Metadata metadata = new Metadata();
  InputStream stream = TikaInputStream.get(url, metadata);
  return parseToString(stream, metadata);
}
origin: rnewson/couchdb-lucene

public void parse(final InputStream in, final String contentType, final String fieldName, final Document doc)
    throws IOException {
  final Metadata md = new Metadata();
  md.set(HttpHeaders.CONTENT_TYPE, contentType);
  try {
    // Add body text.
    doc.add(text(fieldName, tika.parseToString(in, md), false));
  } catch (final IOException e) {
    log.warn("Failed to index an attachment.", e);
    return;
  } catch (final TikaException e) {
    log.warn("Failed to parse an attachment.", e);
    return;
  }
  // Add DC attributes.
  addDublinCoreAttributes(md, doc);
}
origin: apache/tika

/**
 * Parses the given file and returns the extracted text content.
 * <p>
 * To avoid unpredictable excess memory use, the returned string contains
 * only up to {@link #getMaxStringLength()} first characters extracted
 * from the input document. Use the {@link #setMaxStringLength(int)}
 * method to adjust this limitation.
 *
 * @param file the file to be parsed
 * @return extracted text content
 * @throws IOException if the file can not be read
 * @throws TikaException if the file can not be parsed
 * @see #parseToString(Path)
 */
public String parseToString(File file) throws IOException, TikaException {
  Metadata metadata = new Metadata();
  @SuppressWarnings("deprecation")
  InputStream stream = TikaInputStream.get(file, metadata);
  return parseToString(stream, metadata);
}
origin: apache/tika

public TrecDocument summarize(File file) throws FileNotFoundException,
    IOException, TikaException {
  Tika tika = new Tika();
  Metadata met = new Metadata();
  String contents = tika.parseToString(new FileInputStream(file), met);
  return new TrecDocument(met.get(TikaCoreProperties.RESOURCE_NAME_KEY), contents,
      met.getDate(TikaCoreProperties.CREATED));
}
origin: stackoverflow.com

 private void compareXlsx(File expected, File result) throws IOException, TikaException {
   Tika tika = new Tika();
   String expectedText = tika.parseToString(expected);
   String resultText = tika.parseToString(result);
   assertEquals(expectedText, resultText);
}


<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.13</version>
  <scope>test</scope>
</dependency>
origin: org.onehippo.cms7/hippo-cms-api

private String doParse(final InputStream inputStream) {
  try {
    // tika parseToString already closes the inputStream
    return tika.parseToString(inputStream);
  } catch (TikaException e) {
    throw new IllegalStateException("Unexpected TikaException processing failure", e);
  } catch (IOException e) {
    throw new IllegalStateException("Unexpected IOException processing failure", e);
  }
}
origin: stackoverflow.com

public String parseToStringExample() throws IOException, SAXException, TikaException {
  Tika tika = new Tika();
  try (InputStream stream = ParsingExample.class.getResourceAsStream("test.pdf")) {
    return tika.parseToString(stream); // returns the text extracted from the PDF
  }
}
origin: stackoverflow.com

 File inputFile = ...;
Tika tika = new Tika();
String extractedText = tika.parseToString(inputFile);
origin: stackoverflow.com

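Tika tika = new Tika();
Metadata metadata = new Metadata();
// The resource name serves as a hint for type detection.
metadata.set(Metadata.RESOURCE_NAME_KEY, "myfile.name");
// Pass the metadata explicitly; parseToString(File) would create its own
// Metadata and ignore this one. parseToString closes the stream.
String text = tika.parseToString(new FileInputStream("myfile.name"), metadata);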
origin: org.xwiki.platform/xwiki-platform-search-lucene-api

  private String getContentAsText(XWikiDocument doc, XWikiContext context)
  {
    String contentText = null;

    try {
      XWikiAttachment att = doc.getAttachment(this.filename);

      LOGGER.debug("Start parsing attachment [{}] in document [{}]", this.filename, doc.getDocumentReference());

      Tika tika = new Tika();

      Metadata metadata = new Metadata();
      metadata.set(Metadata.RESOURCE_NAME_KEY, this.filename);

      contentText = StringUtils.lowerCase(tika.parseToString(att.getContentInputStream(context), metadata));
    } catch (Throwable ex) {
      LOGGER.warn("error getting content of attachment [{}] for document [{}]",
        new Object[] {this.filename, doc.getDocumentReference(), ex});
    }

    return contentText;
  }
}

Javadoc

Parses the given file and returns the extracted text content.

To avoid unpredictable excess memory use, the returned string contains only up to the first getMaxStringLength() characters extracted from the input document. Use the setMaxStringLength(int) method to adjust this limit.
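
To make the effect of such a length limit concrete, here is a small JDK-only sketch of a collector that keeps at most a fixed number of characters and discards the rest. It is an illustration of the truncation behavior described above, not Tika's actual implementation; BoundedTextDemo and BoundedTextCollector are hypothetical names introduced for this example.

```java
final class BoundedTextDemo {

    /** Collects text chunks but keeps at most maxLength characters. */
    static final class BoundedTextCollector {
        private final StringBuilder buffer = new StringBuilder();
        private final int maxLength;

        BoundedTextCollector(int maxLength) {
            if (maxLength < 0) {
                throw new IllegalArgumentException("maxLength must not be negative");
            }
            this.maxLength = maxLength;
        }

        /**
         * Appends as much of the chunk as still fits.
         *
         * @return true while there is room left, false once the limit is reached
         */
        boolean append(String chunk) {
            int remaining = maxLength - buffer.length();
            if (chunk.length() <= remaining) {
                buffer.append(chunk);
                return buffer.length() < maxLength;
            }
            // Keep only the part of the chunk that fits; drop the rest.
            buffer.append(chunk, 0, remaining);
            return false;
        }

        String text() {
            return buffer.toString();
        }
    }

    public static void main(String[] args) {
        BoundedTextCollector collector = new BoundedTextCollector(8);
        for (String chunk : new String[] {"Hello, ", "world", "!"}) {
            if (!collector.append(chunk)) {
                break; // limit reached, stop reading further input
            }
        }
        System.out.println(collector.text()); // prints "Hello, w"
    }
}
```

With Tika itself the limit is configured on the facade through setMaxStringLength(int), as the Javadoc states, so callers who need the full text of large documents should raise the limit before calling parseToString.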

Popular methods of Tika

  • <init>
    Creates a Tika facade using the given detector, parser, and translator instances.
  • detect
    Detects the media type of the given document. The type detection is based on the first few bytes of
  • parse
    Parses the file at the given path and returns the extracted text content. Metadata information extr
  • toString
  • getParser
    Returns the parser instance used by this facade.
  • setMaxStringLength
    Sets the maximum length of strings returned by the parseToString methods.
