jsoup HTMLParser and Parsing Dzone Links using CSS Selectors in Java

I was working on a task to parse some of Amazon’s web services. There are lots of ways to parse them using DOM, SAX, or StAX, but all of them require a fair amount of coding. I wanted a quick fix and finally landed on jsoup, an open-source HTML parser (another HTML parser I like is HTMLParser). In this article I’m going to explain how I parse DZone HTML links in Java.

With the code below, I’ll be retrieving the descriptions of all the links on DZone.

Note: This is not the best way to read links from DZone (you can use the RSS feeds instead). This tutorial is meant to take you through CSS selectors in Java.

All DZone pagination queries look like this:

http://www.dzone.com/links/?type=html&p=2
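A minimal sketch of how these page URLs could be generated, assuming the only thing that changes between pages is the `p` parameter and that pages are numbered from 1 (neither is confirmed here). The class and method names (`DzonePageUrls`, `pageUrl`, `pageUrls`) are hypothetical and not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds DZone listing URLs by varying the "p" query
// parameter seen in the example URL. Not part of the original project.
class DzonePageUrls {

    private static final String BASE = "http://www.dzone.com/links/?type=html&p=";

    // URL for a single page; assumes pages are numbered from 1.
    static String pageUrl(int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1");
        }
        return BASE + page;
    }

    // URLs for pages first..last, inclusive.
    static List<String> pageUrls(int first, int last) {
        List<String> urls = new ArrayList<>();
        for (int p = first; p <= last; p++) {
            urls.add(pageUrl(p));
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : pageUrls(1, 3)) {
            System.out.println(url);
        }
    }
}
```

Each generated URL could then be saved to a file or handed to the parser one page at a time.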

I used an open-source Java library (jsoup) to parse this page and extract the description text of each link.

Here is a sample of the tags we get in the DZone response:

<a name="link-613399">
</a>

<div class="linkblock frontpage " id="link-613399">
	<div id="thumb_613399" class="thumb">
		<a onmouseup="track(this, 'twitter4j_oauth_on_android', ''); "
			href="http://www.xoriant.com/blog/mobile-application-development/twitter4j-oauth-on-android.html">
			<img width="120" height="90"
				src="http://cdn.dzone.com/links/images/thumbs/120x90/613399-1307624607000.jpg"
				class="thumbnail" alt="Link 613399 thumbnail"
				onmouseover="return OLgetAJAX('/links/themes/reader/jsps/nodecoration/thumb-load.jsp?linkId=613399', OLcmdExT1,
 300, 'bigThumbBody');"
				onmouseout="OLclearAJAX(); nd(100);" />
		</a>
	</div>
	<div id="hidden_thumb_613399">

	</div>
	<div class="tools">
	</div>
	<div class="details">
		<div class="vwidget" id="vwidget-613399">
			<a id="upcount-613399" href="#" class="upcount"
				onclick="showLoginDialog(613399, null); return false">7</a>

			<a id="downcount-613399" href="#"
				onclick="showLoginDialog(613399, null); return false;" class="downcount">0</a>
		</div>
		<h3>
			<a onmouseup="track(this, 'twitter4j_oauth_on_android', ''); "
				href="http://www.xoriant.com/blog/mobile-application-development/twitter4j-oauth-on-android.html"
				rel="bookmark"> Twitter4j OAuth on Android</a>
		</h3>
		<p class="voteblock">
			<a href="/links/users/profile/811805.html">
				<img width="24" height="24"
					src="http://cdn.dzone.com/links/images/std/avatars/default_24.gif"
					class="avatar" alt="User 811805 avatar" />
			</a>
		</p>
		<p class="fineprint byline">
			<a href="/links/users/profile/811805.html">RituR</a>
			via
			<a href="/links/search.html?query=domain%3Axoriant.com">xoriant.com</a>
		</p>
		<p class="fineprint byline">
			<b>Promoted: </b>
			Jun 08 / 17:27. Views:
			520, Clicks: 266
		</p>
		<p class="description">
			OAuth is an open protocol
			which allows the users to share their private information and assets
			like photos, videos etc. with another site...&nbsp;
			<a href='/links/twitter4j_oauth_on_android.html'>more&nbsp;&raquo;
			</a>
		</p>
		<p class="fineprint stats">
			<a
				href="http://twitter.com/home?status=RT+%40DZone+%22Twitter4j+OAuth+on+Android%22+http%3A%2F%2Fdzone.com%2FTBxR"
				class="twitter">Tweet</a>
			<a href="/links/twitter4j_oauth_on_android.html" class="comment">0
				Comments</a>
			<span class="linkUnsaved" id="save-link-613399"
				onclick="showLoginDialog(613399); return false;">Save</span>
			<span class="linkUnshared" id="share-link-613399"
				onclick="showLoginDialog(613399); return false;">Share</span>
			Tags:
			<a href="/links/tag/mobile.html" class="tags" rel="tag">mobile</a>
			,
			<a href="/links/tag/standards.html" class="tags" rel="tag">standards</a>
		</p>

	</div>
</div>

 

To get the description, we have to read the text of the “p” element with class “description”, which sits directly inside the “div” with class “details”. Here is how we can do that in Java:

 

/**
 * 
 */
package com.linkwithweb.parser;

import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/****************************************************************
 * Description
 * jsoup elements support a CSS (or jQuery) like selector syntax to find matching elements, which allows very powerful and robust queries.
 * 
 * The select method is available in a Document, Element, or in Elements. It is contextual, so you can filter by selecting from a specific element, or
 * by chaining select calls.
 * 
 * Select returns a list of Elements (as Elements), which provides a range of methods to extract and manipulate the results.
 * 
 * Selector overview
 * tagname: find elements by tag, e.g. a
 * ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
 * #id: find elements by ID, e.g. #logo
 * .class: find elements by class name, e.g. .masthead
 * [attribute]: elements with attribute, e.g. [href]
 * [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
 * [attr=value]: elements with attribute value, e.g. [width=500]
 * [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
 * [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
 * *: all elements, e.g. *
 * Selector combinations
 * el#id: elements with ID, e.g. div#logo
 * el.class: elements with class, e.g. div.masthead
 * el[attr]: elements with attribute, e.g. a[href]
 * Any combination, e.g. a[href].highlight
 * ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
 * parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of
 * the body tag
 * siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
 * siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
 * el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
 * Pseudo selectors
 * :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
 * :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
 * :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
 * :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
 * :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
 * :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
 * :containsOwn(text): find elements that directly contain the given text
 * :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
 * :matchesOwn(regex): find elements whose own text matches the specified regular expression
 * Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc
 * See the Selector API reference for the full supported list and details.
 * 
 * @author Ashwin Kumar
 * 
 */
public class HTMLParser {

	/**
	 * @param args
	 */
	public static void main(String[] args) {
		try {
			File input = new File("input/dZoneLinks.xml");
			Document doc = Jsoup.parse(input, "UTF-8",
					"http://www.dzone.com/links/?type=html&p=2");

			Elements descriptions = doc.select("div.details > p.description"); // get all description elements in this HTML file
			/*
			 * Other selector examples:
			 * Elements pngs = doc.select("img[src$=.png]"); // img with src ending in .png
			 * Element masthead = doc.select("div.masthead").first(); // div with class "masthead"
			 * Elements resultLinks = doc.select("h3.r > a"); // direct a after h3
			 */

			// Iterate over all descriptions and display them
			for (Element element : descriptions) {
				System.out.println(element.ownText());
				System.out.println("--------------");
			}

		} catch (Exception e) {
			e.printStackTrace();
		}
	}

}
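To try the selector without the input file, here is a small sketch that runs the same `div.details > p.description` query against an inline string. The HTML is a trimmed-down version of the sample markup above, and the class and method names (`DescriptionSelectorDemo`, `descriptions`, `fullText`) are made up for illustration. It also shows why the HTMLParser class above prints `ownText()`: `text()` would include the text of the nested “more” link as well.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Illustrative only: same selector as HTMLParser, applied to an inline string.
class DescriptionSelectorDemo {

    // Trimmed-down version of the DZone markup shown earlier.
    static final String HTML =
          "<div class=\"linkblock\" id=\"link-613399\">"
        + "<div class=\"details\">"
        + "<p class=\"fineprint byline\">RituR via xoriant.com</p>"
        + "<p class=\"description\">OAuth is an open protocol "
        + "<a href=\"/links/twitter4j_oauth_on_android.html\">more</a></p>"
        + "</div></div>";

    // Own text of each description paragraph (child elements excluded).
    static List<String> descriptions(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element p : doc.select("div.details > p.description")) {
            result.add(p.ownText().trim());
        }
        return result;
    }

    // Combined text of the first description paragraph, children included.
    static String fullText(String html) {
        Element p = Jsoup.parse(html).select("div.details > p.description").first();
        return p == null ? "" : p.text();
    }

    public static void main(String[] args) {
        System.out.println(descriptions(HTML)); // own text only
        System.out.println(fullText(HTML));     // also contains the link text
    }
}
```

The byline paragraph is skipped because it does not carry the “description” class, even though it is also a direct child of the “details” div.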

The Mavenized code has been checked in to SVN at the following location:

http://code.google.com/p/linkwithweb/source/browse/trunk/Utilities/HTMLParser
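The selector overview in the class comment mentions that `select` is contextual, so it can be called on a single element as well as on the whole document. As a rough sketch of that idea (not taken from the checked-in code), the snippet below walks each “linkblock” div from the sample markup and pulls out the title, the target URL, and the tags. The class name `LinkBlockSelectorDemo`, the method `summarize`, and the output format are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Illustrative only: contextual selects inside each link block.
class LinkBlockSelectorDemo {

    // Trimmed-down version of the DZone markup shown earlier.
    static final String HTML =
          "<div class=\"linkblock frontpage\" id=\"link-613399\">"
        + "<div class=\"details\">"
        + "<h3><a href=\"http://www.xoriant.com/blog/mobile-application-development/twitter4j-oauth-on-android.html\""
        + " rel=\"bookmark\"> Twitter4j OAuth on Android</a></h3>"
        + "<p class=\"fineprint stats\">Tags: "
        + "<a href=\"/links/tag/mobile.html\" class=\"tags\" rel=\"tag\">mobile</a>, "
        + "<a href=\"/links/tag/standards.html\" class=\"tags\" rel=\"tag\">standards</a></p>"
        + "</div></div>";

    // One "title | url | tags" line per link block.
    static List<String> summarize(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element block : doc.select("div.linkblock")) {
            // Select within this block only: the bookmark link directly under h3.
            Element title = block.select("h3 > a[rel=bookmark]").first();
            if (title == null) {
                continue; // block without a title link
            }
            List<String> tags = new ArrayList<>();
            for (Element tag : block.select("a.tags[rel=tag]")) {
                tags.add(tag.text());
            }
            lines.add(title.text() + " | " + title.attr("href") + " | " + String.join(",", tags));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : summarize(HTML)) {
            System.out.println(line);
        }
    }
}
```

Selecting from `block` rather than from `doc` keeps each title paired with the tags of its own link, which matters once a page contains many link blocks.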

Enjoy parsing anything easily with jsoup!

 
