HTML Web page Scraping Using JSOUP in ColdFusion

JSOUP is an open-source project distributed under the liberal MIT license. The source code is available on GitHub.

It is a Java library for working with HTML web pages. It provides a very convenient API for fetching URLs and for extracting and manipulating data, using DOM traversal, CSS selectors, and jQuery-like methods.
It is designed to deal with all varieties of HTML, from pristine and validating to invalid tag soup, and it creates a sensible parse tree in every case.

Overview

JSOUP can parse HTML from files, input streams, URLs, or even strings. It eases data extraction by offering Document Object Model (DOM) traversal methods along with CSS and jQuery-like selectors. The DOM is a language-independent, tree-structured representation of an HTML document's structure and content.
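As a minimal sketch of parsing from a string and selecting with CSS selectors, the snippet below assumes the JSOUP JAR is already on the application's classpath (see Environment Setup). The HTML fragment and the variable names are hypothetical and only serve the illustration.

```cfml
<cfset jsoup = createObject("java", "org.jsoup.Jsoup")>

<!--- Hypothetical HTML fragment, used only for illustration --->
<cfset html = "<ul id='menu'><li class='item'>Home</li><li class='item'>About</li></ul>">

<!--- Parse the string into a JSOUP Document --->
<cfset doc = jsoup.parse(html)>

<!--- CSS selector: li elements with class "item" directly under the ul with id "menu".
      The ## is doubled because a single pound sign is special inside CFML strings. --->
<cfset items = doc.select("ul##menu > li.item")>

<!--- Print the text of each matched element --->
<cfloop array="#items#" item="li">
	<cfoutput>#encodeForHTML(li.text())#<br></cfoutput>
</cfloop>
```

The same select() call works on a Document fetched from a URL or parsed from a file, so the selector logic does not depend on where the HTML came from.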

JSOUP can manipulate the content:

  1. the HTML element itself,
  2. its attributes, or
  3. its text.

It updates older content based on HTML or XHTML by converting deprecated tags to new versions. It can also do cleanup based on allowlists (whitelists), tidy HTML output, and complete unbalanced tags automagically.
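The three kinds of manipulation listed above, plus allowlist-based cleanup, might look like the following sketch. It assumes the JSOUP 1.8.3 JAR used later in this post is on the classpath; the sample markup is hypothetical. Note that org.jsoup.safety.Whitelist is the class name in older releases such as 1.8.3, while newer JSOUP releases rename it to Safelist, so adjust the class name to your JAR version.

```cfml
<cfset jsoup = createObject("java", "org.jsoup.Jsoup")>

<!--- Hypothetical markup, used only for illustration --->
<cfset doc = jsoup.parse("<p class='note'>Old text</p>")>
<cfset p = doc.select("p").first()>

<!--- 1. The element itself: change the tag from p to div --->
<cfset p.tagName("div")>
<!--- 2. Its attributes: replace the class attribute value --->
<cfset p.attr("class", "notice")>
<!--- 3. Its text: replace the text content --->
<cfset p.text("New text")>

<!--- Show the modified markup, escaped so the browser displays it as source --->
<cfoutput>#encodeForHTML(doc.body().html())#</cfoutput>

<!--- Cleanup: keep only the tags permitted by the "basic" allowlist.
      Assumption: Whitelist exists in your JSOUP version (use Safelist in newer ones). --->
<cfset whitelist = createObject("java", "org.jsoup.safety.Whitelist").basic()>
<cfset safeHtml = jsoup.clean("<p>Hi <script>alert(1)</script><b>there</b></p>", whitelist)>
<cfoutput><br>#encodeForHTML(safeHtml)#</cfoutput>
```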

Read more

  1. DOM navigation 
  2. Selector syntax 

Environment Setup

    1. Download the latest version of the JSOUP JAR file from https://jsoup.org/download
    2. Copy the JAR file to your ColdFusion project's LIB folder, or to any other appropriate location according to your project settings
    3. Use ColdFusion's javaSettings to add the JAR file to your application's classpath. Here we specify the JSOUP JAR path in Application.cfc as follows

Application.cfc

component {
	this.name = "JSOUP_Scrapper";
	// Load the JSOUP JAR file.
	// Adjust the path and file name to match the JAR you downloaded and your project's layout.
	this.javaSettings = { loadPaths = [ expandPath("./lib/jsoup-1.8.3.jar") ], reloadOnChange = false };

}

Once javaSettings points to your JSOUP JAR location, reload the application's classpath (reinitialize your application) so that the JSOUP classes are available to your application.
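One common way to reinitialize an application is a URL flag that calls applicationStop(). The sketch below is only one possible approach, not a required part of the setup: the flag name "reinit" is hypothetical, and in a real project it should be protected (for example, limited to administrators) so that visitors cannot restart the application.

```cfml
component {
	this.name = "JSOUP_Scrapper";
	this.javaSettings = { loadPaths = [ expandPath("./lib/jsoup-1.8.3.jar") ], reloadOnChange = false };

	public boolean function onRequestStart( required string targetPage ) {
		// "reinit" is a hypothetical URL flag, e.g. /test.cfm?reinit=1
		if ( structKeyExists( url, "reinit" ) ) {
			// Stop the application; it restarts on the next request
			// and reads this.javaSettings again
			applicationStop();
			// Redirect to the same page without the flag
			location( cgi.script_name, false );
		}
		return true;
	}
}
```

If the JSOUP classes still cannot be found after a reinit, restarting the ColdFusion service is a reasonable fallback.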

Parsing and traversing a Document

Once ColdFusion's Java settings have been reloaded (after the application is reinitialized), you can create an object of the JSOUP class as in the following example:

test.cfm

<cfset getJsoup = createObject("java", "org.jsoup.Jsoup")> <!--- Create a Java object that refers to the JSOUP class --->

<cfset target = "http://target_website_with-table.com/"> <!--- Set the target URL of the web page to scrape --->
<cfset getCurrentPageContent = getJsoup.connect(target).get()> <!--- Connect to the target URL and get the parsed HTML DOM --->
<cfset getBodyContent = getCurrentPageContent.body()> <!--- Get the HTML body from the parsed DOM of the target URL --->

<cfset TheTable = getBodyContent.select("##tableByID")> <!--- Select, by its id, the table that contains the data to scrape --->

<!--- Read each row of the selected table --->
<cfset rows = TheTable.select("tr")>
<cfloop array="#rows#" item="rowData">
	<!--- do your stuff --->
</cfloop>
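What goes inside the loop depends on your data. As one possible sketch, the snippet below collects the text of every cell into an array of rows. To stay self-contained it parses a small hypothetical table from a string instead of fetching a URL; the table contents and the variable names tableData, rowValues, and cell are assumptions made for illustration.

```cfml
<cfset jsoup = createObject("java", "org.jsoup.Jsoup")>

<!--- Hypothetical table markup standing in for the fetched page --->
<cfset doc = jsoup.parse("<table id='tableByID'><tr><th>Name</th><th>Qty</th></tr><tr><td>Apple</td><td>3</td></tr></table>")>

<!--- All rows inside the table with id "tableByID" --->
<cfset rows = doc.select("##tableByID tr")>

<cfset tableData = []>
<cfloop array="#rows#" item="rowData">
	<!--- Header and data cells of the current row --->
	<cfset cells = rowData.select("th, td")>
	<cfset rowValues = []>
	<cfloop array="#cells#" item="cell">
		<!--- text() returns the cell's text without any markup --->
		<cfset arrayAppend(rowValues, cell.text())>
	</cfloop>
	<cfset arrayAppend(tableData, rowValues)>
</cfloop>

<!--- Inspect the collected rows --->
<cfdump var="#tableData#">
```

Keeping the extracted values in a plain CFML array makes it straightforward to pass them on to a query, a JSON response, or a database insert.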

 

Further reading

  1. JSOUP HTML Parser
  2. DOM navigation 
  3. Selector syntax
