Capture/Grab Website in Java

I got a requirement in one of my products to capture a website and transform it into our custom-defined one. As a Java evangelist, I looked around for libraries on the web that could help me do that, and I chose HTMLParser to parse the web pages and then used it to capture a website in Java.

If you go through HTMLParser, you will come across the following concepts:

Extraction

  1. text extraction, for use as input for text search engine databases for example
  2. link extraction, for crawling through web pages or harvesting email addresses
  3. screen scraping, for programmatic data input from web pages
  4. resource extraction, collecting images or sound
  5. a browser front end, the preliminary stage of page display
  6. link checking, ensuring links are valid
  7. site monitoring, checking for page differences beyond simplistic diffs
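
As a quick illustration of the extraction side, here is a minimal, self-contained sketch that pulls all links out of a page using the stock LinkTag with a NodeClassFilter (the URL is just a placeholder):

    import org.htmlparser.Parser;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeList;

    public class LinkExtractor {
        public static void main(String[] args) throws Exception {
            // Parse the page and keep only the <a> tags
            Parser parser = new Parser("http://example.com/");
            NodeList links = parser.extractAllNodesThatMatch(
                    new NodeClassFilter(LinkTag.class));
            for (int i = 0; i < links.size(); i++)
                System.out.println(((LinkTag) links.elementAt(i)).getLink());
        }
    }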

Transformation

  1. URL rewriting, modifying some or all links on a page
  2. site capture, moving content from the web to local disk
  3. censorship, removing offending words and phrases from pages
  4. HTML cleanup, correcting erroneous pages
  5. ad removal, excising URLs referencing advertising
  6. conversion to XML, moving existing web pages to XML

During or after reading in a page, operations on the nodes can accomplish many transformation tasks "in place"; the modified page can then be output with the toHtml() method. Depending on the purpose of your application, you will probably want to look into node decorators, visitors, or custom tags in conjunction with the PrototypicalNodeFactory.
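
For instance, a minimal round trip (read a page into nodes, transform "in place", then serialize back with toHtml()) might look like the following sketch; the URL is a placeholder and the transformation step is left as a comment:

    import org.htmlparser.Parser;
    import org.htmlparser.util.NodeList;

    public class RoundTrip {
        public static void main(String[] args) throws Exception {
            Parser parser = new Parser("http://example.com/");
            // A null filter keeps every node in the page
            NodeList nodes = parser.parse(null);
            // ... modify the nodes "in place" here ...
            // then serialize the (possibly modified) tree back to HTML
            System.out.println(nodes.toHtml());
        }
    }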

Steps Required to Create Website Capturer Code

  1. First, get the index page
  2. Scan all stylesheets
  3. Scan all resource links in the page
  4. Scan all JavaScript links in the page
  5. For each stylesheet, download any resources referenced in its definitions (see the sketch after this list)
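
HTMLParser itself does not parse CSS, so for step 5 a plain regular expression is one way to find the url(...) references inside a downloaded stylesheet. This is a sketch of my own, not code from the project; the pattern and method name are assumptions:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CssResourceScanner {
        // Matches references such as url('images/bg.png') or url(bg.png)
        private static final Pattern CSS_URL =
                Pattern.compile("url\\(\\s*['\"]?([^'\")\\s]+)['\"]?\\s*\\)");

        public static List<String> extractCssResources(String css) {
            List<String> resources = new ArrayList<String>();
            Matcher m = CSS_URL.matcher(css);
            while (m.find())
                resources.add(m.group(1));
            return resources;
        }
    }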

I will present an example using HTMLParser as the utility to parse HTML.

Here are the high-level steps:

  • Initialize the parser and filters: create a Parser and a PrototypicalNodeFactory, then register custom "Local" tag subclasses that capture and relink each kind of resource (a sketch of one such subclass follows these steps)
    // Create the parser and a factory for our custom tags
    mParser = new Parser();
    factory = new PrototypicalNodeFactory();
    // Each "Local" tag notes the remote resource and rewrites its URL to a local path
    factory.registerTag(new LocalLinkTag());
    factory.registerTag(new LocalFrameTag());
    factory.registerTag(new LocalBaseHrefTag());
    factory.registerTag(new LocalImageTag());
    factory.registerTag(new LocalScriptTag());
    factory.registerTag(new LocalStyleTag());
    // Tell the parser to use our factory when creating nodes
    mParser.setNodeFactory(factory);
  • Get from the user the URL of the website to be captured and the directory where the captured website should be stored
  • For each page, do the following (generic download sketches follow these steps)
    // Download and process the page, relinking all links in the HTML to local links
    process(getFilter());
    // Download and store scripts
    while (0 != mScripts.size())
        copyScripts();
    // Download and store stylesheets; while a stylesheet is processed, any
    // resource references found in it are added to the download list
    while (0 != mStyleSheets.size())
        copyStyleSheets();
    // Download and store images
    while (0 != mImages.size())
        copyImages();
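
To make the first step concrete, here is a sketch of what one of the registered "Local" tag subclasses can look like. In HTMLParser's SiteCapturer sample, such subclasses override doSemanticAction() to queue the referenced resource for download and rewrite the attribute to a local path; the queue and the two helper methods below are simplified placeholders of my own:

    import java.util.ArrayList;
    import java.util.List;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.ParserException;

    class LocalLinkTag extends LinkTag {
        // Pages still waiting to be captured (simplified placeholder)
        static List<String> mPages = new ArrayList<String>();

        public void doSemanticAction() throws ParserException {
            String link = getLink();
            // Queue same-site pages for download (hypothetical helper)
            if (isToBeCaptured(link))
                mPages.add(link);
            // Rewrite the href so the saved page points at the local copy
            setLink(makeLocalLink(link));
        }

        // Hypothetical helper: capture only pages on the site being copied
        private boolean isToBeCaptured(String link) {
            return link.startsWith("http://example.com/");
        }

        // Hypothetical helper: map the remote URL onto a relative local path
        private String makeLocalLink(String link) {
            return link.replace("http://example.com/", "");
        }
    }

The copy*() loops in the last step all boil down to fetching a URL and writing the bytes under the capture directory; a generic version (the method name and buffer size are my choices, not the project's) could be:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URL;

    public class ResourceCopier {
        // Download one resource and store it at the given local target
        static void copyResource(String url, File target) throws IOException {
            target.getParentFile().mkdirs();
            try (InputStream in = new URL(url).openStream();
                 OutputStream out = new FileOutputStream(target)) {
                byte[] buffer = new byte[4096];
                int n;
                while (-1 != (n = in.read(buffer)))
                    out.write(buffer, 0, n);
            }
        }
    }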

You can download

https://linkwithweb.googlecode.com/svn/trunk/Web2Mobile

and run LinkwithwebSiteCapturer.java from Eclipse. The project is Mavenized, so if you run the following command you are good to go:

    mvn eclipse:clean eclipse:eclipse

Then open up your Eclipse and run LinkwithwebSiteCapturer.

Any queries, mail ashwin@linkwithweb.com
