jsoup HTMLParser and Parsing Dzone Links using CSS Selectors in Java

June 9, 2011 | ashwinrayaprolu | Categories: CodeProject, Java, Javascript, JQuery, Maven, Utilities, XML | Tags: css selector, HTML parser, Java, jsoup, Maven

I was working on a task to parse the output of some Amazon web services. There are many ways to parse it, using DOM, SAX, or StAX, but all of them require a fair amount of code. I wanted a quick fix and finally landed on jsoup, an open-source HTML parser (another HTML parser I like is HTMLParser). In this article I explain how to parse DZone HTML links in Java.

I’ll be retrieving the descriptions of all links on DZone using the code below.

Note: This is not the best way to read links from DZone (you can use its RSS feeds instead). This tutorial is meant to take you through CSS selectors in Java.

All DZone pagination queries look like this:

http://www.dzone.com/links/?type=html&p=2
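To walk through several pages, only the `p` parameter needs to change. Here is a minimal sketch, assuming that pages are numbered from 1 and that the `type=html&p=N` pattern above holds for every page. The class and method names are my own, chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class DzonePageUrls {

    // Base query taken from the example URL above; only the page number changes.
    private static final String BASE = "http://www.dzone.com/links/?type=html&p=";

    /** Builds the URL for a single page. */
    public static String pageUrl(int page) {
        if (page < 1) {
            // Assumption: page numbering starts at 1.
            throw new IllegalArgumentException("page must be >= 1");
        }
        return BASE + page;
    }

    /** Builds the URLs for pages first..last, inclusive. */
    public static List<String> pageUrls(int first, int last) {
        List<String> urls = new ArrayList<>();
        for (int p = first; p <= last; p++) {
            urls.add(pageUrl(p));
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : pageUrls(1, 3)) {
            System.out.println(url);
        }
    }
}
```

Each URL produced this way can be saved to a file or handed to the parser as the base URI for that page.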

I used an open-source Java library, jsoup, to parse this response and extract the description text of each link.

Here is a sample of the tags in the DZone response:

<a name="link-613399">
</a>

<div class="linkblock frontpage " id="link-613399">
	<div id="thumb_613399" class="thumb">
		<a onmouseup="track(this, 'twitter4j_oauth_on_android', ''); "
			href="http://www.xoriant.com/blog/mobile-application-development/twitter4j-oauth-on-android.html">
			<img width="120" height="90"
				src="http://cdn.dzone.com/links/images/thumbs/120x90/613399-1307624607000.jpg"
				class="thumbnail" alt="Link 613399 thumbnail"
				onmouseover="return OLgetAJAX('/links/themes/reader/jsps/nodecoration/thumb-load.jsp?linkId=613399', OLcmdExT1,
 300, 'bigThumbBody');"
				onmouseout="OLclearAJAX(); nd(100);" />
		</a>
	</div>
	<div id="hidden_thumb_613399">

	</div>
	<div class="tools">
	</div>
	<div class="details">
		<div class="vwidget" id="vwidget-613399">
			<a id="upcount-613399" href="#" class="upcount"
				onclick="showLoginDialog(613399, null); return false">7</a>

			<a id="downcount-613399" href="#"
				onclick="showLoginDialog(613399, null); return false;" class="downcount">0</a>
		</div>
		<h3>
			<a onmouseup="track(this, 'twitter4j_oauth_on_android', ''); "
				href="http://www.xoriant.com/blog/mobile-application-development/twitter4j-oauth-on-android.html"
				rel="bookmark"> Twitter4j OAuth on Android</a>
		</h3>
		<p class="voteblock">
			<a href="/links/users/profile/811805.html">
				<img width="24" height="24"
					src="http://cdn.dzone.com/links/images/std/avatars/default_24.gif"
					class="avatar" alt="User 811805 avatar" />
			</a>
		</p>
		<p class="fineprint byline">
			<a href="/links/users/profile/811805.html">RituR</a>
			via
			<a href="/links/search.html?query=domain%3Axoriant.com">xoriant.com</a>
		</p>
		<p class="fineprint byline">
			<b>Promoted: </b>
			Jun 08 / 17:27. Views:
			520, Clicks: 266
		</p>
		<p class="description">
			OAuth is an open protocol
			which allows the users to share their private information and assets
			like photos, videos etc. with another site...&nbsp;
			<a href='/links/twitter4j_oauth_on_android.html'>more&nbsp;&raquo;
			</a>
		</p>
		<p class="fineprint stats">
			<a
				href="http://twitter.com/home?status=RT+%40DZone+%22Twitter4j+OAuth+on+Android%22+http%3A%2F%2Fdzone.com%2FTBxR"
				class="twitter">Tweet</a>
			<a href="/links/twitter4j_oauth_on_android.html" class="comment">0
				Comments</a>
			<span class="linkUnsaved" id="save-link-613399"
				onclick="showLoginDialog(613399); return false;">Save</span>
			<span class="linkUnshared" id="share-link-613399"
				onclick="showLoginDialog(613399); return false;">Share</span>
			Tags:
			<a href="/links/tag/mobile.html" class="tags" rel="tag">mobile</a>
			,
			<a href="/links/tag/standards.html" class="tags" rel="tag">standards</a>
		</p>

	</div>
</div>

To get the description, we have to read the text of the “p” element with class “description”, which sits directly inside the “div” with class “details”. Here is how we can do that in Java:

/**
 * 
 */
package com.linkwithweb.parser;

import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/****************************************************************
 * Description
 * jsoup elements support a CSS (or jquery) like selector syntax to find matching elements, that allows very powerful and robust queries.
 * 
 * The select method is available in a Document, Element, or in Elements. It is contextual, so you can filter by selecting from a specific element, or
 * by chaining select calls.
 * 
 * Select returns a list of Elements (as Elements), which provides a range of methods to extract and manipulate the results.
 * 
 * Selector overview
 * tagname: find elements by tag, e.g. a
 * ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
 * #id: find elements by ID, e.g. #logo
 * .class: find elements by class name, e.g. .masthead
 * [attribute]: elements with attribute, e.g. [href]
 * [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
 * [attr=value]: elements with attribute value, e.g. [width=500]
 * [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
 * [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
 * *: all elements, e.g. *
 * Selector combinations
 * el#id: elements with ID, e.g. div#logo
 * el.class: elements with class, e.g. div.masthead
 * el[attr]: elements with attribute, e.g. a[href]
 * Any combination, e.g. a[href].highlight
 * ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
 * parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of
 * the body tag
 * siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
 * siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
 * el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
 * Pseudo selectors
 * :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
 * :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
 * :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
 * :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
 * :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
 * :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
 * :containsOwn(text): find elements that directly contain the given text
 * :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
 * :matchesOwn(regex): find elements whose own text matches the specified regular expression
 * Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc
 * See the Selector API reference for the full supported list and details.
 * 
 * @author Ashwin Kumar
 * 
 */
public class HTMLParser {

	/**
	 * @param args
	 */
	public static void main(String[] args) {
		try {
			File input = new File("input/dZoneLinks.xml");
			Document doc = Jsoup.parse(input, "UTF-8",
					"http://www.dzone.com/links/?type=html&p=2");

			Elements descriptions = doc.select("div.details > p.description"); // get all description elements in this HTML file
			/*
			 * Other selector examples:
			 * Elements pngs = doc.select("img[src$=.png]"); // img with src ending in .png
			 * Element masthead = doc.select("div.masthead").first(); // first div with class "masthead"
			 * Elements resultLinks = doc.select("h3.r > a"); // a directly inside h3 with class "r"
			 */

			// Iterate over all descriptions and display them
			for (Element element : descriptions) {
				System.out.println(element.ownText());
				System.out.println("--------------");
			}

		} catch (Exception e) {
			e.printStackTrace();
		}
	}

}
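The same selectors can be tried without an input file by parsing an HTML string. The sketch below is not part of the checked-in code: the class name, the helper methods, and the trimmed-down `SAMPLE` markup are illustrative assumptions modeled on the DZone snippet shown earlier. It also shows why the loop above prints `ownText()`: that call leaves out the text of the nested “more” link, whereas `text()` would include it.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {

    // Hypothetical, shortened sample modeled on the DZone markup above.
    public static final String SAMPLE =
          "<div class=\"linkblock\"><div class=\"details\">"
        + "<h3><a href=\"http://www.xoriant.com/blog/mobile-application-development/twitter4j-oauth-on-android.html\""
        + " rel=\"bookmark\"> Twitter4j OAuth on Android</a></h3>"
        + "<p class=\"description\">OAuth is an open protocol... "
        + "<a href=\"/links/twitter4j_oauth_on_android.html\">more</a></p>"
        + "</div></div>";

    /** Description text of each link, without the nested anchor's text. */
    public static List<String> descriptions(Document doc) {
        List<String> out = new ArrayList<>();
        for (Element p : doc.select("div.details > p.description")) {
            out.add(p.ownText()); // ownText() skips text inside child elements
        }
        return out;
    }

    /** Title of each link, taken from the bookmark anchor inside the h3. */
    public static List<String> titles(Document doc) {
        List<String> out = new ArrayList<>();
        for (Element a : doc.select("div.details > h3 > a[rel=bookmark]")) {
            out.add(a.text());
        }
        return out;
    }

    public static void main(String[] args) {
        // The second argument is the base URI used to resolve relative links.
        Document doc = Jsoup.parse(SAMPLE, "http://www.dzone.com/links/?type=html&p=2");
        System.out.println(titles(doc));
        System.out.println(descriptions(doc));

        // absUrl() resolves the relative href against the base URI given to parse().
        Element more = doc.select("p.description > a[href]").first();
        System.out.println(more.absUrl("href"));
    }
}
```

Because `select` is contextual, the same helpers work on a `Document` parsed from a file, as in `HTMLParser` above.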

The Mavenized code has been checked in to SVN at the following location:

http://code.google.com/p/linkwithweb/source/browse/trunk/Utilities/HTMLParser

Enjoy parsing anything easily using jsoup.
