jsoup HTMLParser and Parsing Dzone Links using CSS Selectors in Java

I was working on a task to parse some of Amazon web-services. There are lots of ways to parse it Using DOM/SAX/Stax .  All of them require some amount of coding. I wanted a quick fix and i finally landed on to JSoup an opensource HTML Parser ( Other html parser i like is HTMLParser) . In this article i’m going to explain how i’m going to parse DZone HTML links in java.

I’ll be retreiving description’s of all links in Dzone using the code

Note: This is not the best way to read links from Dzone ( You can use rss feed’s instead).  This tutorial is to take you through css selectors for Java

All DZone pagination queries looks like this

http://www.dzone.com/links/?type=html&p=2

i used an opensource java library to parse this and extract link text description (jsoup)

Here is sample tags we have in dzone response

<a name="link-613399">
</a>

<div class="linkblock frontpage " id="link-613399">
	<div id="thumb_613399" class="thumb">
		<a onmouseup="track(this, 'twitter4j_oauth_on_android', ''); "
			href="http://www.xoriant.com/blog/mobile-application-development/twitter4j-oauth-on-android.html">
			<img width="120" height="90"
				src="http://cdn.dzone.com/links/images/thumbs/120x90/613399-1307624607000.jpg"
				class="thumbnail" alt="Link 613399 thumbnail"
				onmouseover="return OLgetAJAX('/links/themes/reader/jsps/nodecoration/thumb-load.jsp?linkId=613399', OLcmdExT1,
 300, 'bigThumbBody');"
				onmouseout="OLclearAJAX(); nd(100);" />
		</a>
	</div>
	<div id="hidden_thumb_613399">

	</div>
	<div class="tools">
	</div>
	<div class="details">
		<div class="vwidget" id="vwidget-613399">
			<a id="upcount-613399" href="#" class="upcount"
				onclick="showLoginDialog(613399, null); return false">7</a>

			<a id="downcount-613399" href="#"
				onclick="showLoginDialog(613399, null); return false;" class="downcount">0</a>
		</div>
		<h3>
			<a onmouseup="track(this, 'twitter4j_oauth_on_android', ''); "
				href="http://www.xoriant.com/blog/mobile-application-development/twitter4j-oauth-on-android.html"
				rel="bookmark"> Twitter4j OAuth on Android</a>
		</h3>
		<p class="voteblock">
			<a href="/links/users/profile/811805.html">
				<img width="24" height="24"
					src="http://cdn.dzone.com/links/images/std/avatars/default_24.gif"
					class="avatar" alt="User 811805 avatar" />
			</a>
		</p>
		<p class="fineprint byline">
			<a href="/links/users/profile/811805.html">RituR</a>
			via
			<a href="/links/search.html?query=domain%3Axoriant.com">xoriant.com</a>
		</p>
		<p class="fineprint byline">
			<b>Promoted: </b>
			Jun 08 / 17:27. Views:
			520, Clicks: 266
		</p>
		<p class="description">
			OAuth is an open protocol
			which allows the users to share their private information and assets
			like photos, videos etc. with another site...&nbsp;
			<a href='/links/twitter4j_oauth_on_android.html'>more&nbsp;&raquo;
			</a>
		</p>
		<p class="fineprint stats">
			<a
				href="http://twitter.com/home?status=RT+%40DZone+%22Twitter4j+OAuth+on+Android%22+http%3A%2F%2Fdzone.com%2FTBxR"
				class="twitter">Tweet</a>
			<a href="/links/twitter4j_oauth_on_android.html" class="comment">0
				Comments</a>
			<span class="linkUnsaved" id="save-link-613399"
				onclick="showLoginDialog(613399); return false;">Save</span>
			<span class="linkUnshared" id="share-link-613399"
				onclick="showLoginDialog(613399); return false;">Share</span>
			Tags:
			<a href="/links/tag/mobile.html" class="tags" rel="tag">mobile</a>
			,
			<a href="/links/tag/standards.html" class="tags" rel="tag">standards</a>
		</p>

	</div>
</div>

 

To get description we have to get data from element “P” with class “description” which is actually present in DIV with class “details” Here is how we can do that in java

 

/**
 * 
 */
package com.linkwithweb.parser;

import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/****************************************************************
 * Description
 * jsoup elements support a CSS (or jquery) like selector syntax to find matching elements, that allows very powerful and robust queries.
 * 
 * The select method is available in a Document, Element, or in Elements. It is contextual, so you can filter by selecting from a specific element, or
 * by chaining select calls.
 * 
 * Select returns a list of Elements (as Elements), which provides a range of methods to extract and manipulate the results.
 * 
 * Selector overview
 * tagname: find elements by tag, e.g. a
 * ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
 * #id: find elements by ID, e.g. #logo
 * .class: find elements by class name, e.g. .masthead
 * [attribute]: elements with attribute, e.g. [href]
 * [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
 * [attr=value]: elements with attribute value, e.g. [width=500]
 * [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
 * [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
 * : all elements, e.g. *
 * Selector combinations
 * el#id: elements with ID, e.g. div#logo
 * el.class: elements with class, e.g. div.masthead
 * el[attr]: elements with attribute, e.g. a[href]
 * Any combination, e.g. a[href].highlight
 * ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
 * parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of
 * the body tag
 * siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
 * siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
 * el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
 * Pseudo selectors
 * :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
 * :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
 * :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
 * :has(seletor): find elements that contain elements matching the selector; e.g. div:has(p)
 * :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
 * :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
 * :containsOwn(text): find elements that directly contain the given text
 * :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
 * :matchesOwn(regex): find elements whose own text matches the specified regular expression
 * Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc
 * See the Selector API reference for the full supported list and details.
 * 
 * @author Ashwin Kumar
 * 
 */
public class HTMLParser {

	/**
	 * @param args
	 */
	public static void main(String[] args) {
		try {
			File input = new File("input/dZoneLinks.xml");
			Document doc = Jsoup.parse(input, "UTF-8",
					"http://www.dzone.com/links/?type=html&p=2");

			Elements descriptions = doc.select("div.details > p.description"); // get all description elements in this HTML file
			/*
			 * Elements pngs = doc.select("img[src$=.png]");
			 * // img with src ending .png
			 * 
			 * Element masthead = doc.select("div.masthead").first();
			 */
			// div with

			// Elements resultLinks = doc.select("h3.r > a"); // direct a after h3
			/**
			 * Iterate over all descriptions and display them
			 */
			for (Element element : descriptions) {
				System.out.println(element.ownText());
				System.out.println("--------------");
			}

		} catch (Exception e) {
			e.printStackTrace();
		}
	}

}

Mavenized code has been checked in to svn at following location

http://code.google.com/p/linkwithweb/source/browse/trunk/Utilities/HTMLParser

Njoy parsing anything easily using jsoup

 

Advertisements

PDF Generation Using Templates and OpenOffice and Itext in Java


Recently i had a task where in i had to create PDF dynamically by merging data and template. I have been using lots of pdf libraries since i started my programming to do such tasks. The previous one i admired a lot was Jasper Reports. But after researching some more time i found out that IText with Openoffice Draw is a simplest way we can generate PDF forms.

This visual tutorial first Explain’s how to Create PDF forms in Openoffice Draw

Save PDF to folder where you have your program or just change path for input source in program

Now here is java code which can convert this pdf to final pdf by filling in Values using IText

/**
 *
 */
package com.linkwithweb.reports;

import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;

import com.lowagie.text.DocumentException;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.PdfStamper;

/**
 * @author Ashwin Kumar
 *
 */
public class ITextStamperSample {

	/**
	 * @param args
	 * @throws IOException
	 * @throws DocumentException
	 */
	public static void main(String[] args) throws IOException,
			DocumentException {
		// createSamplePDF();
		generateAshwinFriends();
	}

	/**
	 *
	 */
	private static void generateAshwinFriends() throws IOException,
			FileNotFoundException, DocumentException {
		PdfReader pdfTemplate = new PdfReader("mytemplate.pdf");
		FileOutputStream fileOutputStream = new FileOutputStream("test.pdf");
		ByteArrayOutputStream out = new ByteArrayOutputStream();
		PdfStamper stamper = new PdfStamper(pdfTemplate, fileOutputStream);
		stamper.setFormFlattening(true);

		stamper.getAcroFields().setField("name", "Ashwin Kumar");
		stamper.getAcroFields().setField("id", "1\n2\n3\n");
		stamper.getAcroFields().setField("friendname",
				"kumar\nsirisha\nsuresh\n");
		stamper.getAcroFields().setField("relation", "self\nwife\nfriend\n");

		stamper.close();
		pdfTemplate.close();

	}

}

Here is sample pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.linkwithweb.reports</groupId>
  <artifactId>ReportTemplateTutorial</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>ReportTemplateTutorial</name>
  <description>ReportTemplateTutorial</description>
  <dependencies>
  	<dependency>
  		<groupId>com.lowagie</groupId>
  		<artifactId>itext</artifactId>
  		<version>2.1.7</version>
  	</dependency>
  </dependencies>
</project>

 

A pdf file with data filled in placeholders is created with name test.pdf


OSGI for Beginners Using Maven with Equinox(HowTo)

I have struggled to understand what OSGI really means for a long time. It has been around since a very long time but not many people are aware of it. It has been hyped as a very complex technology to understand. Here is my attempt to make it simple for any Java Developer. In my view if you understand concept of Interface , JNDI or EJB( Or registering some service in some form)  you will understand OSGI. In short OSGI is a Container where you can register your services through interfaces and those can be accessed any time. Another benefit of OSGI is that all these services can be Installed/Uninstalled/Started/Stopped at runtime (i.e Code can be hot deployed at runtime ) rather than normal requirement we have where we have restart J2EE server. Similar to J2ee containers (Tomcat, WebSphere, Jboss , Weblogic ) OSGI also has container like Equinox( Which is base for Eclipse), Apache Felix …etc

In this article i’m going to explain OSGI with Eclipse Equinox container. Anyone who has Eclipse IDE installed on their machine has OSGI container also installed in Eclipse plugin’s folder.

Name of OSGI container jar file looks like   org.eclipse.osgi_<version>.jar

You can start OSGI like this

java -jar org.eclipse.osgi_3.5.2.R35x_v20100126.jar -console

Attached is sample screenshot of how i started my OSGI container ( Its analogous to starting tomcat)

Now that we started OSGI , let me start creation a HelloWorld OSGI Application using Maven. Lemme show you my project structure first and then display my pom.xml


Now i’ll display my pom.xml. My pom.xml has 2 more profiles added to create 2 more new modules(MathService and MathServiceClient) which will be later explained in this article

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>com.linkwithweb.osgi</groupId>
	<artifactId>HelloWorld</artifactId>
	<version>0.0.1-SNAPSHOT</version>
	<name>HelloWorld</name>
	<dependencies>
		<dependency>
			<groupId>org.osgi</groupId>
			<artifactId>org.osgi.core</artifactId>
			<version>4.2.0</version>
		</dependency>
	</dependencies>

	<build>
		<finalName>HelloWorld-${version}</finalName>
		<plugins>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-compiler-plugin</artifactId>
				<version>2.3.1</version>
				<configuration>
					<source>1.5</source>
					<target>1.5</target>
				</configuration>
			</plugin>

			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-jar-plugin</artifactId>
				<configuration>
					<archive>
						<manifestFile>src/main/resources/META-INF/MANIFEST.MF</manifestFile>
					</archive>
				</configuration>
			</plugin>
		</plugins>
	</build>

	<profiles>
		<profile>
			<id>MathService</id>
			<build>
				<finalName>MathService-${version}</finalName>
				<plugins>
					<plugin>
						<groupId>org.apache.maven.plugins</groupId>
						<artifactId>maven-compiler-plugin</artifactId>
						<version>2.3.1</version>
						<configuration>
							<source>1.5</source>
							<target>1.5</target>
						</configuration>
					</plugin>

					<plugin>
						<groupId>org.apache.maven.plugins</groupId>
						<artifactId>maven-jar-plugin</artifactId>
						<configuration>

							<excludes>
								<exclude>**/*.xml</exclude>
								<exclude>**/*.bsh</exclude>
								<exclude>**/*.properties</exclude>
							</excludes>
							<archive>
								<manifestFile>src/main/resources/MathService/META-INF/MANIFEST.MF</manifestFile>
							</archive>
						</configuration>
					</plugin>

				</plugins>
			</build>
		</profile>
		<profile>
			<id>MathServiceClient</id>
			<build>
				<finalName>MathServiceClient-${version}</finalName>
				<plugins>
					<plugin>
						<groupId>org.apache.maven.plugins</groupId>
						<artifactId>maven-compiler-plugin</artifactId>
						<version>2.3.1</version>
						<configuration>
							<source>1.5</source>
							<target>1.5</target>
						</configuration>
					</plugin>

					<plugin>
						<groupId>org.apache.maven.plugins</groupId>
						<artifactId>maven-jar-plugin</artifactId>
						<configuration>

							<excludes>
								<exclude>**/*.xml</exclude>
								<exclude>**/*.bsh</exclude>
								<exclude>**/*.properties</exclude>
							</excludes>
							<archive>
								<manifestFile>src/main/resources/MathServiceClient/META-INF/MANIFEST.MF</manifestFile>
							</archive>
						</configuration>
					</plugin>

				</plugins>
			</build>
		</profile>
	</profiles>

</project>

Now if you observe pom.xml there are 3 different MANIFEST.MF defined for 3 different OSGI bundle’s we create.  Saying so lemme explain you that OSGI bundles are same as Java jar file with its configuration defined in Manifest file of standard jar. OSGI defined few entries in manifest file which are related to OSGi which are read by container to activate the bundle, thus avoiding learning of any new metadata format we generally have for other frameworks

Here is sample Manifest.MF i have defined for MathServiceClient

Manifest-Version: 1.0
Bundle-Name: MathServiceClient
Bundle-Activator: com.linkwithweb.osgi.service.client.MathServiceClientActivator
Bundle-SymbolicName: MathServiceClient
Bundle-Version: 1.0.0
Import-Package: org.osgi.framework,com.linkwithweb.osgi.service

If you observe above manifest file you can observer that all the entries except Manifest-Version are OSGI specific entries. These are the entries which define how to deploy a bundles what all it is dependent on and what are extension points it exposes for other services to consume.. etc

Having explained this lemme first explain a HelloWorld bundle with its MANIFEST.MF and Activator class and its installation into Equinox OSGI Container

/**
 * 
 */
package com.linkwithweb.osgi;

import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;

/**
 * @author Ashwin Kumar
 * 
 */
public class HelloActivator implements BundleActivator {
	public void start(BundleContext context) {
		System.out.println("Hello World");
	}

	public void stop(BundleContext context) {
		System.out.println("Goodbye All");
	}
}
Manifest-Version: 1.0
Bundle-Name: HelloWorld
Bundle-Activator: com.linkwithweb.osgi.HelloActivator
Bundle-SymbolicName: HelloWorld
Bundle-Version: 1.0.0
Import-Package: org.osgi.framework

Now run “mvn clean package” to build our bundle

It will create HelloWorld-0.0.1-SNAPSHOT.jar in target folder and we can install that into Equinox to test. Here is the image which shows how to install and start our HelloWorld into Equinox

install file:K:\Ashwin\OSGI\MavenProject\target\HelloWorld-0.0.1-SNAPSHOT.jar

If you observe above screenshot you can use install command to install the bundle and use the start command on bundle id to start the bundle

On start of Bundle start method of Activator class is called and while stopping bundle stop method of Activator is called and so you are seeing HelloWorld

Congratulations you have learned basics of OSGI and you have just deployed your first bundle.

Now lemme explain my second part of article which explains how to Expose and Consume services of modules

Exposing and Consuming Services

To explain this i’ll take a very simple example where i’ll publish a service which can add 2 numbers

Here is the code which does that

First we need to define an Interface which we are thinking of exposing to external bundles

/**
 * 
 */
package com.linkwithweb.osgi.service;

/**
 * @author Ashwin Kumar
 * 
 */
public interface MathService {

	/**
	 * @param a
	 * @param b
	 * @return
	 */
	public int add(int a, int b);
}

Now the implementation Class

/**
 * 
 */
package com.linkwithweb.osgi.service;

/**
 * @author Ashwin Kumar
 *
 */
public class MathServiceImpl implements MathService {

	/* (non-Javadoc)
	 * @see com.linkwithweb.osgi.service.MathService#add(int, int)
	 */
	public int add(int a, int b) {
		// TODO Auto-generated method stub
		return a+b;
	}

}

Now Activator class which register’s this service with OSGI container

/**
 * 
 */
package com.linkwithweb.osgi.service;

import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;

/**
 * @author Ashwin Kumar
 * 
 */
public class MathServiceActivator implements BundleActivator {
	/*
	 * (non-Javadoc)
	 * 
	 * @see org.osgi.framework.BundleActivator#start(org.osgi.framework.BundleContext)
	 */
	public void start(BundleContext context) {
		MathService service = new MathServiceImpl();
		// Third parameter is a hashmap which allows to configure the service
		// Not required in this example
		context.registerService(MathService.class.getName(), service, null);
		System.out.println("Math Service Registered");
	}

	/*
	 * (non-Javadoc)
	 * 
	 * @see org.osgi.framework.BundleActivator#stop(org.osgi.framework.BundleContext)
	 */
	public void stop(BundleContext context) {
		System.out.println("Goodbye From math service");
	}
}

Here is the manifest file which exposes service

Manifest-Version: 1.0
Bundle-Name: MathService
Bundle-Activator: com.linkwithweb.osgi.service.MathServiceActivator
Bundle-SymbolicName: MathService
Bundle-Version: 1.0.0
Import-Package: org.osgi.framework
Export-Package: com.linkwithweb.osgi.service

If you observe this we are Exporting some packages so that they can be consumed later. Also all the package that we are thinking of importign have to be defined here

Use the following command to build the jar file

mvn -PMathService package

and Here is image which displays how to Install and Run the service

Now lemme explain how to implement consumer

/**
 * 
 */
package com.linkwithweb.osgi.service.client;

import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceReference;

import com.linkwithweb.osgi.service.MathService;

/**
 * @author Ashwin Kumar
 * 
 */
public class MathServiceClientActivator implements BundleActivator {
	MathService service;
	private BundleContext context;

	/*
	 * (non-Javadoc)
	 * 
	 * @see org.osgi.framework.BundleActivator#start(org.osgi.framework.BundleContext)
	 */
	public void start(BundleContext context) {
		this.context = context;
		// Register directly with the service
		ServiceReference reference = context
				.getServiceReference(MathService.class.getName());
		service = (MathService) context.getService(reference);
		System.out.println(service.add(1, 2));
	}	/*
	 * (non-Javadoc)
	 * 
	 * @see org.osgi.framework.BundleActivator#stop(org.osgi.framework.BundleContext)
	 */
	public void stop(BundleContext context) {
		System.out.println(service.add(5, 6));
	}
}

And now manifest file

Manifest-Version: 1.0
Bundle-Name: MathServiceClient
Bundle-Activator: com.linkwithweb.osgi.service.client.MathServiceClientActivator
Bundle-SymbolicName: MathServiceClient
Bundle-Version: 1.0.0
Import-Package: org.osgi.framework,com.linkwithweb.osgi.service

Now here is how we create package and install

mvn -PMathServiceClient package

Source code have been checkedin to following location

https://linkwithweb.googlecode.com/svn/trunk/osgi-tutorials/MavenProject

 

Njoy creating OSGI bundles . Main benefit is you can redeploy your bundles/services at runtime.

Mail me incase you have any doubts