Using htmlparser without having it download the page
Jeffrey P. Bigham
htmlparser is an incredibly handy parser, but one thing I didn't like about it at first is that it seems to force you to use its built-in function for downloading the web page. That isn't always ideal: the built-in download isn't very flexible, and it keeps you from parsing a cached copy that lives in a database or some other location that isn't easily reachable as a file or URL. The setInputHTML method of the Parser object would seem to get around most of these complaints, but not quite. If you want to extract links, images, or other objects that are referenced relative to the base URL, you can't. You'll get URLs back that look like "/images/foo.gif", which you obviously won't be able to download directly. To get around this problem, you have to set the base URL yourself, which isn't exactly straightforward, and then all relative URLs will be resolved correctly. The fix takes two steps: first set the input HTML from a String, and then set the base URL from a provided String.
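To make the relative-URL problem concrete, here is a minimal sketch that uses only the standard java.net.URI class, not htmlparser itself. It shows what resolving a link like "/images/foo.gif" against a base URL should produce. The class name BaseUrlDemo, the method name resolve, and the example.com addresses are made up for illustration. In htmlparser, the equivalent two steps would presumably be parser.setInputHTML(html) followed by parser.getLexer().getPage().setBaseUrl(url); treat that call chain as an assumption and check it against the Javadoc for your version.

```java
import java.net.URI;

public class BaseUrlDemo {

    // Resolve a possibly relative link against the page's base URL.
    // Links that are already absolute come back unchanged.
    static String resolve(String baseUrl, String link) {
        return URI.create(baseUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        // Hypothetical base URL of the page whose HTML was cached locally.
        String baseUrl = "http://example.com/articles/index.html";

        // Links as they might appear in the cached HTML.
        String[] links = {
            "/images/foo.gif",
            "pics/bar.png",
            "http://other.example.org/logo.gif"
        };

        for (String link : links) {
            System.out.println(link + " -> " + resolve(baseUrl, link));
        }
        // The first line printed is:
        // /images/foo.gif -> http://example.com/images/foo.gif
    }
}
```

Without a base URL there is nothing to resolve against, which is why a parser fed only a String of HTML can hand back nothing better than the raw "/images/foo.gif".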
Obviously, you need to fill in the pieces that actually set the URL and HTML to the right values for your application, but that's how it works in a nutshell. Note: Be sure not to set the base URL before setting the input HTML. Doing so won't work for some reason. Happy coding!