jsoup HTMLParser and Parsing Dzone Links using CSS Selectors in Java

I was working on a task that involved parsing some Amazon web services. There are many ways to do this, using DOM, SAX, or StAX, and all of them require a fair amount of code. I wanted a quick fix and finally landed on jsoup, an open-source HTML parser (another HTML parser I like is HTMLParser). In this article I'll explain how to parse DZone HTML links in Java.

I'll be retrieving the descriptions of all links on DZone using the code below.

Note: This is not the best way to read links from DZone (you can use its RSS feeds instead). This tutorial is meant to walk you through CSS selectors in Java.

All DZone pagination queries look like this:

http://www.dzone.com/links/?type=html&p=2
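Since only the p parameter changes from page to page, the URLs for a range of pages can be generated with a small helper. This is a minimal sketch: the class and method names (DzonePageUrls, pageUrl, pageUrls) are made up for illustration, and it assumes pages are numbered from 1, which the example URL above does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that builds DZone link-listing URLs for a range of pages. */
class DzonePageUrls {

    // Prefix copied from the example URL above; only the page number is appended.
    private static final String BASE = "http://www.dzone.com/links/?type=html&p=";

    /** Returns the listing URL for one page (assumption: numbering starts at 1). */
    static String pageUrl(int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1");
        }
        return BASE + page;
    }

    /** Returns the listing URLs for pages first..last, inclusive. */
    static List<String> pageUrls(int first, int last) {
        List<String> urls = new ArrayList<>();
        for (int page = first; page <= last; page++) {
            urls.add(pageUrl(page));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Print the URLs of the first three pages
        for (String url : pageUrls(1, 3)) {
            System.out.println(url);
        }
    }
}
```

Each generated URL can then be downloaded and handed to the parser one page at a time.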

I used an open-source Java library, jsoup, to parse this page and extract the description text of each link.

Here is a sample of the tags in the DZone response:

<a name="link-613399">
</a>

<div class="linkblock frontpage " id="link-613399">
	<div id="thumb_613399" class="thumb">
		<a onmouseup="track(this, 'twitter4j_oauth_on_android', ''); "
			href="http://www.xoriant.com/blog/mobile-application-development/twitter4j-oauth-on-android.html">
			<img width="120" height="90"
				src="http://cdn.dzone.com/links/images/thumbs/120x90/613399-1307624607000.jpg"
				class="thumbnail" alt="Link 613399 thumbnail"
				onmouseover="return OLgetAJAX('/links/themes/reader/jsps/nodecoration/thumb-load.jsp?linkId=613399', OLcmdExT1,
 300, 'bigThumbBody');"
				onmouseout="OLclearAJAX(); nd(100);" />
		</a>
	</div>
	<div id="hidden_thumb_613399">

	</div>
	<div class="tools">
	</div>
	<div class="details">
		<div class="vwidget" id="vwidget-613399">
			<a id="upcount-613399" href="#" class="upcount"
				onclick="showLoginDialog(613399, null); return false">7</a>

			<a id="downcount-613399" href="#"
				onclick="showLoginDialog(613399, null); return false;" class="downcount">0</a>
		</div>
		<h3>
			<a onmouseup="track(this, 'twitter4j_oauth_on_android', ''); "
				href="http://www.xoriant.com/blog/mobile-application-development/twitter4j-oauth-on-android.html"
				rel="bookmark"> Twitter4j OAuth on Android</a>
		</h3>
		<p class="voteblock">
			<a href="/links/users/profile/811805.html">
				<img width="24" height="24"
					src="http://cdn.dzone.com/links/images/std/avatars/default_24.gif"
					class="avatar" alt="User 811805 avatar" />
			</a>
		</p>
		<p class="fineprint byline">
			<a href="/links/users/profile/811805.html">RituR</a>
			via
			<a href="/links/search.html?query=domain%3Axoriant.com">xoriant.com</a>
		</p>
		<p class="fineprint byline">
			<b>Promoted: </b>
			Jun 08 / 17:27. Views:
			520, Clicks: 266
		</p>
		<p class="description">
			OAuth is an open protocol
			which allows the users to share their private information and assets
			like photos, videos etc. with another site...&nbsp;
			<a href='/links/twitter4j_oauth_on_android.html'>more&nbsp;&raquo;
			</a>
		</p>
		<p class="fineprint stats">
			<a
				href="http://twitter.com/home?status=RT+%40DZone+%22Twitter4j+OAuth+on+Android%22+http%3A%2F%2Fdzone.com%2FTBxR"
				class="twitter">Tweet</a>
			<a href="/links/twitter4j_oauth_on_android.html" class="comment">0
				Comments</a>
			<span class="linkUnsaved" id="save-link-613399"
				onclick="showLoginDialog(613399); return false;">Save</span>
			<span class="linkUnshared" id="share-link-613399"
				onclick="showLoginDialog(613399); return false;">Share</span>
			Tags:
			<a href="/links/tag/mobile.html" class="tags" rel="tag">mobile</a>
			,
			<a href="/links/tag/standards.html" class="tags" rel="tag">standards</a>
		</p>

	</div>
</div>

 

To get the description, we have to read the P element with class "description", which sits directly inside the DIV with class "details". Here is how we can do that in Java:

 

/**
 * 
 */
package com.linkwithweb.parser;

import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/****************************************************************
 * Description
 * jsoup elements support a CSS (or jquery) like selector syntax to find matching elements, that allows very powerful and robust queries.
 * 
 * The select method is available in a Document, Element, or in Elements. It is contextual, so you can filter by selecting from a specific element, or
 * by chaining select calls.
 * 
 * Select returns a list of Elements (as Elements), which provides a range of methods to extract and manipulate the results.
 * 
 * Selector overview
 * tagname: find elements by tag, e.g. a
 * ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
 * #id: find elements by ID, e.g. #logo
 * .class: find elements by class name, e.g. .masthead
 * [attribute]: elements with attribute, e.g. [href]
 * [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
 * [attr=value]: elements with attribute value, e.g. [width=500]
 * [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
 * [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
 * *: all elements, e.g. *
 * Selector combinations
 * el#id: elements with ID, e.g. div#logo
 * el.class: elements with class, e.g. div.masthead
 * el[attr]: elements with attribute, e.g. a[href]
 * Any combination, e.g. a[href].highlight
 * ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
 * parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of
 * the body tag
 * siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
 * siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
 * el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
 * Pseudo selectors
 * :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
 * :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
 * :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
 * :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
 * :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
 * :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
 * :containsOwn(text): find elements that directly contain the given text
 * :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
 * :matchesOwn(regex): find elements whose own text matches the specified regular expression
 * Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc
 * See the Selector API reference for the full supported list and details.
 * 
 * @author Ashwin Kumar
 * 
 */
public class HTMLParser {

	/**
	 * @param args
	 */
	public static void main(String[] args) {
		try {
			File input = new File("input/dZoneLinks.xml");
			Document doc = Jsoup.parse(input, "UTF-8",
					"http://www.dzone.com/links/?type=html&p=2");

			// Select every description paragraph that is a direct child of a details block
			Elements descriptions = doc.select("div.details > p.description");
			/*
			 * Other selector examples:
			 * Elements pngs = doc.select("img[src$=.png]"); // img with src ending in .png
			 * Element masthead = doc.select("div.masthead").first(); // div with class "masthead"
			 * Elements resultLinks = doc.select("h3.r > a"); // a elements directly under h3.r
			 */

			// Iterate over all descriptions and print them
			for (Element element : descriptions) {
				System.out.println(element.ownText());
				System.out.println("--------------");
			}

		} catch (Exception e) {
			e.printStackTrace();
		}
	}

}
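Two details of the program above may be easier to see on a tiny input: ownText() returns only the text that belongs directly to the paragraph, while text() also includes the text of nested elements such as the "more" link, and the base URL passed to the parser is what lets relative links be resolved. The following is a sketch only: the class name DescriptionSelectorDemo and the shortened HTML fragment are made up for illustration, modelled on the DZone markup shown earlier.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/** Hypothetical demo class; not part of the checked-in project. */
class DescriptionSelectorDemo {

    // Made-up, shortened fragment modelled on the DZone sample markup
    static final String HTML =
            "<div class=\"details\">"
          + "<h3><a href=\"http://example.com/post.html\">Twitter4j OAuth on Android</a></h3>"
          + "<p class=\"description\">OAuth is an open protocol "
          + "<a href=\"/links/twitter4j_oauth_on_android.html\">more</a></p>"
          + "</div>";

    /** Parses the fragment; the second argument is the base URL for relative links. */
    static Document parse() {
        return Jsoup.parse(HTML, "http://www.dzone.com/links/");
    }

    public static void main(String[] args) {
        Document doc = parse();

        // "parent > child" matches only p.description directly under div.details
        Element description = doc.select("div.details > p.description").first();

        // Own text: the paragraph's direct text, without the nested link text
        System.out.println(description.ownText());

        // Full text: the paragraph's text plus the text of the nested link
        System.out.println(description.text());

        // Contextual select: search for links only inside the description
        Element more = description.select("a[href]").first();
        System.out.println(more.attr("href"));   // the href exactly as written
        System.out.println(more.absUrl("href")); // the href resolved against the base URL
    }
}
```

This is why the program above prints element.ownText(): it drops the trailing "more" link text from each description.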

The Mavenized code has been checked in to SVN at the following location:

http://code.google.com/p/linkwithweb/source/browse/trunk/Utilities/HTMLParser

Enjoy parsing anything easily using jsoup!

 


