DefaultExtractor

How to use DefaultExtractor in de.l3s.boilerpipe.extractors

Best Java code snippets using de.l3s.boilerpipe.extractors.DefaultExtractor (Showing top 4 results out of 315)

origin: sujitpal/hia-examples

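protected String parse(String rawText) {
  // Nothing to extract from null or empty input.
  if (StringUtils.isEmpty(rawText)) return null;
  try {
    // Run the shared DefaultExtractor instance over the raw HTML.
    return DefaultExtractor.INSTANCE.getText(rawText);
  } catch (BoilerpipeProcessingException e) {
    // Log the failure and signal it to the caller with null.
    LOGGER.error(e.getMessage(), e);
    return null;
  }
}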
origin: ViDA-NYU/ache

public TargetModelElasticSearch(TargetModelCbor model) {
  URL url = Urls.toJavaURL(model.url);
  String rawContent = (String) model.response.get("body");
  Page page = new Page(url, rawContent);
  page.setParsedData(new ParsedData(new PaginaURL(url, rawContent)));
  this.html = rawContent;
  this.url = model.url;
  this.retrieved = new Date(model.timestamp * 1000);
  this.words = page.getParsedData().getWords();
  this.wordsMeta = page.getParsedData().getWordsMeta();
  this.title = page.getParsedData().getTitle();
  this.domain = url.getHost();
  try {
    this.text = DefaultExtractor.getInstance().getText(page.getContentAsString());
  } catch (Exception e) {
    this.text = "";
  }
  InternetDomainName domainName = InternetDomainName.from(page.getDomainName());
  if (domainName.isUnderPublicSuffix()) {
    this.topPrivateDomain = domainName.topPrivateDomain().toString();
  } else {
    this.topPrivateDomain = domainName.toString();
  }
}
origin: org.apache.any23.plugins/apache-any23-html-scraper

private void loadDefaultRules() {
  addTextExtractor("default-extractor"      , PAGE_CONTENT_DE_PROPERTY , DefaultExtractor.getInstance());
  addTextExtractor("article-extractor"      , PAGE_CONTENT_AE_PROPERTY , ArticleExtractor.getInstance());
  addTextExtractor("large-content-extractor", PAGE_CONTENT_LCE_PROPERTY, LargestContentExtractor.getInstance());
  addTextExtractor("canola-extractor"       , PAGE_CONTENT_CE_PROPERTY , CanolaExtractor.getInstance());
}
origin: ViDA-NYU/ache

public TargetModelElasticSearch(Page page) {
  this.url = page.getURL().toString();
  this.retrieved = page.getFetchTime() > 0 ? new Date(page.getFetchTime()) : new Date();
  this.domain = page.getDomainName();
  this.responseHeaders = page.getResponseHeaders();
  this.topPrivateDomain = LinkRelevance.getTopLevelDomain(page.getDomainName());
  this.crawlerId = page.getCrawlerId();
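  // getTargetRelevance() may be null (it is null-checked below), so guard before dereferencing.
  this.isRelevant = (page.getTargetRelevance() != null && page.getTargetRelevance().isRelevant()) ? "relevant" : "irrelevant";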
  if (page.isHtml()) {
    String contentAsString = page.getContentAsString();
    this.html = contentAsString;
    ParsedData parsedData = page.getParsedData();
    if (parsedData != null) {
      this.words = parsedData.getWords();
      this.wordsMeta = parsedData.getWordsMeta();
      this.title = parsedData.getTitle();
    }
    if (page.getTargetRelevance() != null) {
      this.relevance = page.getTargetRelevance().getRelevance();
    }
    if (contentAsString != null) {
      try {
        this.text = DefaultExtractor.getInstance().getText(contentAsString);
      } catch (Exception e) {
        this.text = "";
      }
    }
  }
}
de.l3s.boilerpipe.extractors.DefaultExtractor

Javadoc

A quite generic full-text extractor.

Most used methods

  • getInstance
    Returns the singleton instance for DefaultExtractor.
  • getText
    Extracts the text from the given HTML string.
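
The snippets above share one pattern: call getText on the extractor, and fall back to null or an empty string when extraction fails. The following is a minimal sketch of that pattern, not code from any of the listed projects. To stay runnable without boilerpipe on the classpath, it uses a hypothetical TextExtractor interface and a toy tag-stripping lambda as stand-ins; the names SafeExtraction, TextExtractor, and extractOrEmpty are assumptions made for illustration.

```java
public class SafeExtraction {

    /**
     * Hypothetical stand-in for an extractor's getText(String) call.
     * Assumption: with boilerpipe available, a method reference such as
     * DefaultExtractor.getInstance()::getText should fit this shape.
     */
    @FunctionalInterface
    public interface TextExtractor {
        String getText(String html) throws Exception;
    }

    /**
     * Returns the extracted text, or "" when the input is null or empty,
     * the extractor returns null, or the extractor throws.
     */
    public static String extractOrEmpty(TextExtractor extractor, String html) {
        if (html == null || html.isEmpty()) {
            return "";
        }
        try {
            String text = extractor.getText(html);
            return text == null ? "" : text;
        } catch (Exception e) {
            // Same fallback as the constructor snippets above: swallow and return "".
            return "";
        }
    }

    public static void main(String[] args) {
        // Toy extractor for the demo only: strips anything that looks like a tag.
        TextExtractor stripTags = html -> html.replaceAll("<[^>]*>", "").trim();

        // An extractor that always fails, to exercise the fallback path.
        TextExtractor failing = html -> {
            throw new IllegalStateException("extraction failed");
        };

        System.out.println(extractOrEmpty(stripTags, "<p>Hello</p>"));
        System.out.println(extractOrEmpty(failing, "<p>Hello</p>").isEmpty());
    }
}
```

Returning an empty string keeps callers free of null checks, at the cost of hiding the failure; logging the exception before falling back, as the first snippet does, is a reasonable middle ground.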
