An XML/HTML parser for Objective-C, inspired by Hpricot. Computer Science and Web Education, Training, and Entertainment | Topfunky Corporation

Parse block into UITableCell

I'm using Hpple to parse a website's HTML in my app. The parsing works, but instead of all the contents of a tr block ending up in one cell, each td element in the tr block becomes a table cell of its own. Here's what I mean. The tr block:

                    <td>down 1</td>
                    <td>Justin Bieber</td>
                    <td>What Do You Mean?</td>

What it looks like in the app: [screenshot]

What I want it to look like: [screenshot]

The code I'm using for the parsing looks like this:

- (void)loadSongs {
    // 1
    NSURL *tutorialsUrl = [NSURL URLWithString:@""];
    NSData *tutorialsHtmlData = [NSData dataWithContentsOfURL:tutorialsUrl];

    // 2
    TFHpple *tutorialsParser = [TFHpple hppleWithHTMLData:tutorialsHtmlData];

    // 3
    NSString *tutorialsXpathQueryString = @"//tr/td";
    NSArray *tutorialsNodes = [tutorialsParser searchWithXPathQuery:tutorialsXpathQueryString];

    // 4
    NSMutableArray *newTutorials = [[NSMutableArray alloc] initWithCapacity:0];
    for (TFHppleElement *element in tutorialsNodes) {
        // 5
        Tutorial *tutorial = [[Tutorial alloc] init];
        [newTutorials addObject:tutorial];

        // 6
        tutorial.title = [[element firstChild] content];
        tutorial.peakPosition = ???;
    }

    // 8
    _objects = newTutorials;
    [self.tableView reloadData];
}

Source: (StackOverflow)
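
One possible direction, sketched under assumptions: query `//tr` instead of `//tr/td`, so each result is a whole row, then read the td children of that row. The helper name `RowsFromHTMLData` is hypothetical, and the sketch assumes a version of Hpple whose TFHppleElement provides `childrenWithTagName:`.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"
#import "TFHppleElement.h"

// Hypothetical helper: returns one array of cell strings per <tr>,
// so that a single table view cell can display a whole row.
static NSArray<NSArray<NSString *> *> *RowsFromHTMLData(NSData *htmlData) {
    TFHpple *parser = [TFHpple hppleWithHTMLData:htmlData];
    NSMutableArray<NSArray<NSString *> *> *rows = [NSMutableArray array];

    // Select rows rather than individual cells.
    for (TFHppleElement *row in [parser searchWithXPathQuery:@"//tr"]) {
        NSMutableArray<NSString *> *cells = [NSMutableArray array];
        for (TFHppleElement *cell in [row childrenWithTagName:@"td"]) {
            // Same access pattern as the question: text of the first child node.
            NSString *text = [[cell firstChild] content];
            [cells addObject:(text ?: @"")];
        }
        // Skip rows without <td> children (for example header rows).
        if (cells.count > 0) {
            [rows addObject:cells];
        }
    }
    return rows;
}
```

Each inner array could then feed one model object, for example index 0 for the position change, index 1 for the artist, and index 2 for the title.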

How to parse HTML on iPhone using TouchXML or other libraries?

I have a dirty HTML code that is loaded from a foreign server (so I can't make a json file or clean the html code). My HTML's structure is like:


<div class="pic"> ... </div>

<div class="pic" id="pic311809">

<input type="hidden" class="pic_id" name="pic_id" value="311809" />

<!-- tylko -->
<div style="font-family: verdana, arial, helvetica, sans-serif; font-weight: bold; font-size: 9px;">
                                        <a rel='nofollow' href="pic/show_series/1">FFFUUU (rageman)</a>

<h1 class="picture">Kochana babcia</h1>

<div class="infobar">
    Wrzucone 15 października 2010 o 16:03       przez <a rel='nofollow' href="/user/Astraly">Astraly</a>
    <a rel='nofollow' href="">Skomentuj (23)</a>
    <!-- głosowanie przeniesione pod spód obrazka -->
</div><!-- .infobar -->

<div class="pic_image">
                <a rel='nofollow' href=""><img src="" class="pic" alt="Kochana babcia - Wnusiu, a ty jeszcze nie w szkole? Dziś mamy na 10 babciu Co ty tam majaczysz? Jesteś na wagarach!? już ja to powiem twojej mamie! Ale babciu.... Przynosisz nam wstyd! Myślisz, że nie wiem o tej ostatniej niedzieli, w której nie byłeś u komunii? ZAMKNIJ SIĘ KU**A!!!! .... Nie musisz tak krzyczeć! Powiem twojej mamie z jakim tonem odnosisz się do mnie! " /></a>          </div><!-- .pic_image -->

                <div class="source">Źródło: Kto mieszka z babcią, ten wie jak to jest ;)</div>

<!-- głosowanie i ocena -->

<div class="source">

    <div class="infobar center">


        <a rel='nofollow' href="/pic/vote/311809/up"
             onclick="votowanie(this); return false;"
             class="vote voteup iconlink"
            mocne ↑         </a>


        <a rel='nofollow' href="/pic/vote/311809/down"
             onclick="votowanie(this); return false;"
             class="vote votedown iconlink"
            słabe ↓         </a>



        <span class="points">
                                87% mocnych

        <span class="count">
                                z 1291 głosów

        <span class="vote_result"></span>

                    | <a rel='nofollow' href="/user/add_favorite/311809" class="favorite">Do ulubionych</a>

    </div><!-- .infobar -->

    <div style="text-align: center;">
        <fb:like rel='nofollow' href=""
                         style="width: 130px;">

    <!-- tylko -->
    <a rel='nofollow' href="" class="picbutton">Pokaż podobne komixxy</a>       <a rel='nofollow' href="" class="picbutton">Zrób własną wersję</a>
    <div style="clear: both;"></div>

</div><!-- .source -->

</div><!-- .pic -->

<div class="pic"> ... </div>

<div class="pic"> ... </div>

<div class="pic"> ... </div>

I want to select all <div class="pic" id="*"> by using xPath //div[@class='pic'][@id].

Here are two libraries that I used:

- Hpple
- TouchXML

Hpple is great, but I can't select the innerHTML of an element. TouchXML, which I use for parsing XML, is also great, but it can't parse dirty HTML: I get dozens of errors.

Is there a way to parse this HTML on iOS 5 using TouchXML? A different library would do, but I prefer that one.

I heard something about CTidy.h and I did as instructed but nothing's changed...

Source: (StackOverflow)

hpple : 'libxml/tree.h' file not found

When I build a project that uses hpple, I always get a "'libxml/tree.h' file not found" error.

I have set [Header Search Path] to "${SDKROOT}/usr/include/libxml2" and [Other Linker Flag] to "-lxml2".

Source: (StackOverflow)

TFHppleElement (Hpple), parsing HTML on iphone

I'm using Hpple and it's been great so far. I want to get all the divs inside another div, and that I can do. But I am unable to parse their contents any further (the innerHTML; in the source it is labelled innerHTML, not innerText): asking for an element's content returns nothing, because there is no text directly in that element, only child nodes/elements which in turn contain text.

What alternatives to Hpple are there for parsing HTML on the iPhone?

Source: (StackOverflow)

Parsing inner HTML iteratively using Hpple parser and NSXMLParser

I have been working on a school newspaper app for the iPad platform. I am using NSXMLParser to get the titles, brief descriptions, and links for each article. To get HTML items from each parsed link, I decided to use the Hpple parser. I think I am parsing and storing the RSS items correctly, but when I try to parse HTML items from each parsed link in a for loop, it tells me that I have an empty array of RSS items. However, I can display the content of the RSS item holder on the console, so it is NOT empty. Below are portions of my code and the console output. Please help me out. The due date for this project is soon. Thanks in advance.

Here is how I start loading my RSS parser (articleParser):

- (void)loadData {
    [self loadInitData];

    //[self loadDataWithLink];
}

- (void)loadInitData {
    if (sections == nil) {
        [activityIndicator startAnimating];


        Parser *articleParser = [[Parser alloc] init];
        [articleParser parseRssFeed:@"" withDelegate:self];
        [articleParser release];
    } else {



Below is how I store the received Article items in an NSMutableArray called "sections". I then use a for loop to iterate over the links of the parsed articles.

- (void)receivedArticleItems:(Article *)theArticle {
    if (sections == nil) {
        sections = [[NSMutableArray alloc] init];
    }
    [sections addObject:theArticle];

    NSLog(@"We recieved the article!");
    NSLog(@"Article: %@", theArticle);
    NSLog(@"What is in sections: %@", sections);

    for (int i = 1; i < 5; i++) {
        NSLog(@"articleItems: %@",[sections objectAtIndex:0]);
        NSLog(@"articleItems at index 0: %@",[[[sections objectAtIndex:0] articleItems] objectAtIndex:0]);

        [self loadDataWithLink:[[[[sections objectAtIndex:0] articleItems] objectAtIndex:0] objectForKey:@"link"]];
    }
    [activityIndicator stopAnimating];
}

Below is how I used the TFHpple parser to get HTML items from each parsed link:

- (void)loadDataWithLink:(NSString *)urlString{

 NSData *htmlData = [NSData dataWithContentsOfURL:[NSURL URLWithString:urlString]];

 // Create parser
 TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:htmlData];

 //Get all the cells main body
 htmlElements  = [xpathParser search:@"//div[@id='main']/div[@id='mainCol1']/div[@id='main-body']"];

 // Access the first cell
 TFHppleElement *htmlElement = [htmlElements objectAtIndex:0];

 // NSString *title = [htmlElement content];

 NSLog(@"What is in element: %@", htmlElement);

 [xpathParser release];
 //[htmlData release];
}

And this is what I am getting on the console:

2011-05-02 22:58:35.355 TheCalAggie[2443:207] Parsing started for article!
2011-05-02 22:58:35.356 TheCalAggie[2443:207] Adding story title: Students say, 'No time for books'
2011-05-02 22:58:35.356 TheCalAggie[2443:207] From the link:
2011-05-02 22:58:35.357 TheCalAggie[2443:207] Summary: The last book managerial economics major Kiyan Parsa read for fun was The Lord of the Rings. That was in high school.
2011-05-02 22:58:35.358 TheCalAggie[2443:207] Published on: Tue, 03 May 2011 00:00:00 -0700
2011-05-02 22:58:35.359 TheCalAggie[2443:207] Parsing started for article!
2011-05-02 22:58:35.360 TheCalAggie[2443:207] Adding story title: UC Davis craft center one of largest college crafting centers
2011-05-02 22:58:35.360 TheCalAggie[2443:207] From the link:
2011-05-02 22:58:35.361 TheCalAggie[2443:207] Summary: Hidden away in the South Silo, the UC Davis Craft Center offers 10 craft studios and more than a hundred classes for students looking to learn or perfect their crafting skills.
2011-05-02 22:58:35.362 TheCalAggie[2443:207] Published on: Mon, 02 May 2011 00:00:00 -0700
2011-05-02 22:58:35.362 TheCalAggie[2443:207] We recieved the article!
2011-05-02 22:58:35.363 TheCalAggie[2443:207] Article: *nil description*
2011-05-02 22:58:35.364 TheCalAggie[2443:207] What is in sections: (
2011-05-02 22:58:35.374 TheCalAggie[2443:207] articleItems: *nil description*
2011-05-02 22:58:35.375 TheCalAggie[2443:207] articleItems at index 0: {
    link = "\n";
    pubDate = "Tue, 03 May 2011 00:00:00 -0700";
    summary = "The announcement of Osama bin Laden's death sent a wave of patriotism across the nation and UC Davis. Bin Laden was the leader of al-Qaeda - the organization allegedly behind the Sept. 11, 2001 attacks that killed over 3,000 Americans.\n";
    title = "Peaceful rally held on campus after killing of bin Laden \n";
2011-05-02 22:59:35.376 TheCalAggie[2443:207] Unable to parse.
2011-05-02 22:59:35.379 TheCalAggie[2443:207] *** Terminating app due to uncaught exception 'NSRangeException', reason: '*** -[NSMutableArray objectAtIndex:]: index 0 beyond bounds for empty array'
*** Call stack at first throw:

Any help will be greatly appreciated. Thanks again.

Source: (StackOverflow)

Hpple parsing HTML with Objective-C

I followed the Ray Wenderlich tutorial to parse HTML nodes.

I get the content from an index.html file. I tried to use this method to fetch the background value.

+ (void)parseWithHTMLString:(NSString *)string {
  NSData *data = [string dataUsingEncoding:NSUTF8StringEncoding];
  TFHpple *parser = [TFHpple hppleWithData:data isXML:NO];

  NSString *XpathQueryString = @"//div[class='content']/div/div";
  NSArray *nodes = [parser searchWithXPathQuery:XpathQueryString];

  NSMutableArray *resultArray = [[NSMutableArray alloc] initWithCapacity:0];
  for (TFHppleElement *element in nodes) {
    Model *model = [[Model alloc] init];
    model.colorString = [element objectForKey:@"style"];
    [resultArray addObject:model];
  }
}

So the question is:

What have I done wrong?

Source: (StackOverflow)
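
One thing that may be worth checking in the query above: in XPath, a bare `class` inside a predicate names a child element, while an attribute test is written `@class`. The sketch below shows the attribute form; the helper name `StyleValuesFromHTMLData` is hypothetical, and whether this is the only problem in the original code is an assumption.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"
#import "TFHppleElement.h"

// Hypothetical helper: collects the style attribute of every div nested
// two levels below <div class="content">.
static NSArray<NSString *> *StyleValuesFromHTMLData(NSData *htmlData) {
    TFHpple *parser = [TFHpple hppleWithData:htmlData isXML:NO];

    // @class tests the attribute; a bare "class" would look for a child element.
    NSArray *nodes = [parser searchWithXPathQuery:@"//div[@class='content']/div/div"];

    NSMutableArray<NSString *> *styles = [NSMutableArray array];
    for (TFHppleElement *element in nodes) {
        NSString *style = [element objectForKey:@"style"];
        if (style != nil) {
            [styles addObject:style];
        }
    }
    return styles;
}
```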

NSString with some html tags, how can I search for tag and get the content of url

I have an NSString with some HTML tags. How can I search for a tag and get the content of its URL? I'm not sure whether I should use Hpple or a simple regex. Either way, could I see an example?

Source: (StackOverflow)

TFHpple only parse texts before

This is the HTML:

<p id="content">Every Sunday, our Chef proposes a buffet high in color.
A brunch either classic or on a theme for special events
Every Sunday at the restaurant

And Here is my code:

NSString *u = [[NSString alloc] initWithFormat:@"http//", _currentNews.url];
NSURL *url = [NSURL URLWithString:u];
NSData *paHtmlData = [NSData dataWithContentsOfURL:url];

TFHpple *paParser = [TFHpple hppleWithHTMLData:paHtmlData];

NSArray *array = [paParser searchWithXPathQuery:@"//p[@id='content']/text()"];
TFHppleElement *ele= [array objectAtIndex:0];

_currentBody.text = [ele text];
_currentTitle.text = _currentNews.title;

I want to parse all text between <p id="content"></p> without the <br>

Can anyone help me?

Source: (StackOverflow)
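
A sketch of one way to approach this, under assumptions: the query `//p[@id='content']/text()` yields one result per text fragment (one for each stretch of text between `<br>` tags), so reading only `objectAtIndex:0` returns just the first fragment. Joining all results gives the full text. The helper name `ParagraphTextFromHTMLData` is hypothetical.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"
#import "TFHppleElement.h"

// Hypothetical helper: joins every text fragment of <p id="content">,
// one fragment per line, skipping the <br> elements between them.
static NSString *ParagraphTextFromHTMLData(NSData *htmlData) {
    TFHpple *parser = [TFHpple hppleWithHTMLData:htmlData];
    NSArray *textNodes = [parser searchWithXPathQuery:@"//p[@id='content']/text()"];

    NSCharacterSet *blank = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSMutableArray<NSString *> *parts = [NSMutableArray array];
    for (TFHppleElement *node in textNodes) {
        // Trim each fragment and drop whitespace-only ones.
        NSString *piece = [[node content] stringByTrimmingCharactersInSet:blank];
        if (piece.length > 0) {
            [parts addObject:piece];
        }
    }
    return [parts componentsJoinedByString:@"\n"];
}
```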

XPath Text Search/Sibling Selection

This question might be a little specific, but the test program I'm writing uses XPath to find the data I need in HTML. This piece of HTML is what I'm trying to parse.

<table border="0" cellspacing="0" cellpadding="0">
        <td class="textSm" align="right">1.&nbsp;</td> <!-- Location of the number is here -->
        <td align="left" nowrap>
            <a rel='nofollow' href="/stats/individual_stats_player.jsp?c_id=sf&playerID=467055" class="textSm">P Sandoval</a> <!-- Player location is here of the number is here -->

My goal is to find a player's name by using the number that corresponds to him. This requires me to find a node by the specific text contained in "<td class="textSm" align="right">1. </td>", then find the sibling of that node, "<td align="left" nowrap>", and then find the child of that sibling, "<a rel='nofollow' href="/stats/individual_stats_player.jsp?c_id=sf&playerID=467055" class="textSm">P Sandoval</a>", to get the desired result. What kind of query could I use to find this? Any help is very much appreciated.

Source: (StackOverflow)
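
The three steps described above (match the td by its text, move to the next sibling td, take its a child) can be written as a single XPath 1.0 expression using `starts-with()` and the `following-sibling` axis. The following is a sketch, not a tested answer for the real page: the helper name `PlayerNameForRank` is hypothetical, and it assumes the rank cell always begins with the number followed by a period.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"
#import "TFHppleElement.h"

// Hypothetical helper: returns the link text in the cell that follows
// the cell whose text starts with "<rank>.".
static NSString *PlayerNameForRank(NSData *htmlData, NSInteger rank) {
    TFHpple *parser = [TFHpple hppleWithHTMLData:htmlData];

    // 1. td with class textSm whose text starts with e.g. "1."
    // 2. its first following sibling td
    // 3. the <a> child of that sibling
    NSString *query = [NSString stringWithFormat:
        @"//td[@class='textSm'][starts-with(normalize-space(.), '%ld.')]"
        @"/following-sibling::td[1]/a", (long)rank];

    TFHppleElement *link = [[parser searchWithXPathQuery:query] firstObject];
    return [[link firstChild] content];
}
```

Matching on "1." rather than "1" keeps rank 1 from also matching cells such as "10." or "11.".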

Get the href value using XPath/Hpple/Objective-C

I have tried everything that makes (and doesn't make) sense to me. I have the following HTML, which I am trying to parse with XPath in Objective-C:

<tr style="background-color: #eaeaea">
   <td class="content">
      <a rel='nofollow' href="index.php?cmd=search&id=foo">bar</a>

I get the "bar" via //tr/td[@class='content']/a/text().

But I have no idea how to get the index.php?cmd=search&id=foo.

It really drives me to despair :-(

Thank you so much for your help!

Source: (StackOverflow)
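
One way to read the attribute, sketched with a hypothetical helper name: select the `a` element itself rather than its `text()` node, then ask the resulting TFHppleElement for its `href` attribute with `objectForKey:`.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"
#import "TFHppleElement.h"

// Hypothetical helper: returns the href of every link inside
// a <td class="content"> cell.
static NSArray<NSString *> *ContentLinkTargets(NSData *htmlData) {
    TFHpple *parser = [TFHpple hppleWithHTMLData:htmlData];

    // Select the element, not its text node, so attributes are available.
    NSArray *anchors = [parser searchWithXPathQuery:@"//tr/td[@class='content']/a"];

    NSMutableArray<NSString *> *targets = [NSMutableArray array];
    for (TFHppleElement *anchor in anchors) {
        NSString *href = [anchor objectForKey:@"href"];
        if (href != nil) {
            [targets addObject:href];
        }
    }
    return targets;
}
```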

Hpple in Objective-C can't find a particular object (XML/HTML Parser)

For those veterans who haven't tried Hpple, it's great. It uses Xpath for searching through HTML/XML documents. It gets the job done and it's easy enough for a newbie like me to understand. However, I'm having trouble.

I have this chunk of HTML:

    <ul class="challengesList dailyChallengesList">

<div class="corner topLeft"></div>
<img id="ctl00_mainContent_dailyChallengesRepeater_ctl00_challengeImage" title="Gunslinger" src="/images/reachstats/challenges/0.png" alt="Gunslinger" style="border-width:0px;">
<div class="info">
<div class="rFloat">
<p id="ctl00_mainContent_dailyChallengesRepeater_ctl00_challengeExpiration" class="timeDisplay dailyExpirationCountdown"><span>0d</span><span>19h</span><span>9m</span><span class="seconds">37s</span></p>
<p class="description">Kill 150 enemies in multiplayer Matchmaking.</p>
<div class="reward">

<div id="ctl00_mainContent_dailyChallengesRepeater_ctl00_progressBox" class="barContainer">
<div id="ctl00_mainContent_dailyChallengesRepeater_ctl00_progressBar" class="bar" style="width:21%;"><span></span></div> 
<div class="clear"></div>

<div class="corner topLeft"></div>
<img id="ctl00_mainContent_dailyChallengesRepeater_ctl01_challengeImage" title="A Great Friend" src="/images/reachstats/challenges/0.png" alt="A Great Friend" style="border-width:0px;">
<div class="info">
<div class="rFloat">
<p id="ctl00_mainContent_dailyChallengesRepeater_ctl01_challengeExpiration" class="timeDisplay dailyExpirationCountdown"><span>0d</span><span>19h</span><span>9m</span><span class="seconds">37s</span></p>
<h5>A Great Friend</h5>
<p class="description">Earn 15 assists today in multiplayer Matchmaking.</p>
<div class="reward">

<div id="ctl00_mainContent_dailyChallengesRepeater_ctl01_progressBox" class="barContainer">
<div id="ctl00_mainContent_dailyChallengesRepeater_ctl01_progressBar" class="bar" style="width:40%;"><span></span></div> 
<div class="clear"></div>

<div class="corner topLeft"></div>
<img id="ctl00_mainContent_dailyChallengesRepeater_ctl02_challengeImage" title="Cannon Fodder" src="/images/reachstats/challenges/2.png" alt="Cannon Fodder" style="border-width:0px;">
<div class="info">
<div class="rFloat">
<p id="ctl00_mainContent_dailyChallengesRepeater_ctl02_challengeExpiration" class="timeDisplay dailyExpirationCountdown"><span>0d</span><span>19h</span><span>9m</span><span class="seconds">37s</span></p>
<h5>Cannon Fodder</h5>
<p class="description">Kill 50 infantry-class foes in the Campaign today.</p>
<div class="reward">

<div id="ctl00_mainContent_dailyChallengesRepeater_ctl02_progressBox" class="barContainer">
<div id="ctl00_mainContent_dailyChallengesRepeater_ctl02_progressBar" class="bar" style="width:0%;"><span></span></div> 
<div class="clear"></div>

<div class="corner topLeft"></div>
<img id="ctl00_mainContent_dailyChallengesRepeater_ctl03_challengeImage" title="Heroic Demon" src="/images/reachstats/challenges/3.png" alt="Heroic Demon" style="border-width:0px;">
<div class="info">
<div class="rFloat">
<p id="ctl00_mainContent_dailyChallengesRepeater_ctl03_challengeExpiration" class="timeDisplay dailyExpirationCountdown"><span>0d</span><span>19h</span><span>9m</span><span class="seconds">37s</span></p>
<h5>Heroic Demon</h5>
<p class="description">Kill 30 Elites in Firefight Matchmaking on Heroic or harder.</p>
<div class="reward">

<div id="ctl00_mainContent_dailyChallengesRepeater_ctl03_progressBox" class="barContainer">
<div id="ctl00_mainContent_dailyChallengesRepeater_ctl03_progressBar" class="bar" style="width:0%;"><span></span></div> 
<div class="clear"></div>


The nutty part is, I cannot get Hpple to "see" the <div class="reward">. I'm using the following to find it:

NSArray * rawProgress = [doc search:@"//ul[@class='challengesList']

This always returns an empty array. It's driving me nuts, as the same kind of thing worked for all of the other elements in this project...

Any help would be appreciated :)


This works:

NSArray * rawDescriptions = [doc search:@"//ul[@class='challengesList']

This doesn't:

NSArray * rawProgress = [doc search:@"//ul[@class='challengesList']

Furthermore, trying to list the child nodes of rFloat or reward produces a crash :(

Source: (StackOverflow)

Getting the HTML tags in hpple as well as text?

The code below takes all of the text from a certain div. Is it possible to take the HTML tags from the div as well as the text, so that all of the <p> </p>'s and <br> </br>'s are also added to the string, myString?

//trims string from previous page
NSString *trimmedString = [stringy stringByTrimmingCharactersInSet:
                              [NSCharacterSet whitespaceAndNewlineCharacterSet]];

NSData *data = [[NSString stringWithContentsOfURL:[NSURL URLWithString:trimmedString]] dataUsingEncoding:NSUTF8StringEncoding];
TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:data];
NSArray *elements  = [xpathParser searchWithXPathQuery:@"//div[@class='field-item even']"];
TFHppleElement *element = [elements lastObject]; //may need to change this number?!
NSString *mystring = [self getStringForTFHppleElement:element];

trimmedTextView.text = [trimmedTextView.text stringByAppendingString:mystring];

Method here:

- (NSString *)getStringForTFHppleElement:(TFHppleElement *)element
{
    NSMutableString *result = [NSMutableString new];

    // Iterate recursively through all children
    for (TFHppleElement *child in [element children])
        [result appendString:[self getStringForTFHppleElement:child]];

    // Hpple creates a <text> node when it parses texts
    if ([element.tagName isEqualToString:@"text"])
        [result appendString:element.content];

    return result;
}

Any ideas would be appreciated. Cheers.

Source: (StackOverflow)

Hpple Error - "_OBJC_CLASS_$_TFHpple"

My app uses Hpple. I've included, TFHpple.h, TFHpple.m, TFHppleElement.h, TFHppleElement.m, XPathQuery.h & XPathQuery.m. Also included ${SDKROOT}/usr/include/libxml2 and -lxml2.

I have this tiny bit of code:

NSData *data = [[NSData alloc] initWithContentsOfFile:@"example.html"];
TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:data];

When I try to run it, I receive this error:

"_OBJC_CLASS_$_TFHpple", referenced from: objc-class-ref in test.o
ld: symbol(s) not found for architecture armv7
clang: error: linker command failed with exit code 1 (use -v to see invocation)

I don't know how to solve this. Any ideas?

Source: (StackOverflow)

Hpple, getting text after

So I think this is my last Hpple question! I have found an entry in the HTML doc that I am parsing with Hpple. I have tried many different queries, but no luck. Here is a sample of the HTML.

I can get the text starting with "Today's project" with //div[@class = 'entry-content']/p. I can also get the next tag with //div[@class = 'entry-content']//a[@title]//* along with all the text after it. However, as you can see, there is still some text after "/span", and nothing that I have tried will retrieve it. I have tried looking at the children of the element, and I have tried //div[@class = 'entry-content']/p//text() and //div[@class = 'entry-content']/p//following::*; nothing works. If anyone has any ideas, I am all ears! Thanks again for all of your time.

EDIT #1 As I try different things I was looking at the HTML. Under the p tag is the text I need, "Today's project..." then there is a span changing the text color and including a link, followed by more text. What I need to do is jump over that span to continue reading the text. Maybe my question should be, how do you jump over a span? Thanks for looking.

EDIT #2 Well, I am going to start a bounty on this one. I really need some help. I have looked everywhere and have tried a ton of different things. But nothing is working for me. I can not get the text after that one closed span. And this format appears often. The author of the blog I am parsing this for the App sometimes changes the style of her words and I can not get the text after she changes the style. Any help would be appreciated. Thanks again for looking.

EDIT #3 Here is another screenshot, this time of the DOM tree. As you can see, I am parsing the div with class "entry-content". The text in question is exposed. It starts with "Today...", then comes the span that changes the color of the text; I can get that text. It is the text after that which I need, " It was one.....", right before the closing p tag.

I also placed the entire HTML in a gist. The line in question is 102, although the HTML did not copy that nicely. Thanks.

Source: (StackOverflow)

HTML parsing in iOS using Hpple search href

I want to find URL links in HTML source code. I am using Hpple to parse the HTML. I know that by giving a path we can find a URL in the HTML, but for different URLs the path must be changed, so I am unable to search for URL links in general.

Let me explain more clearly. For example, take a website that contains cricket, weather, sports, mail, news, etc. I want to find those links in the source code.

What I am doing here is simply giving the path and searching for "a href". If the path is correct, all URLs under that path are displayed. The remaining URLs on the same page (not under the same path) are not found. How should I do this?

TFHpple *htmlParser = [TFHpple hppleWithHTMLData:htmlData];

if (htmlData) {  //check that htmlData contains data
    //Enter your Xpath query here to obtain the data you want from the webpage
    //more info on Xpath queries can be found at
    NSString *content;
    NSArray *nodes = [htmlParser searchWithXPathQuery:@"//html/head/link"];
    //NSString *searchStr = @"a href";

    for (TFHppleElement *element in nodes) {
        NSString *href = [element attributes][@"href"];
        if ([href rangeOfString:@"href"].location != NSNotFound) {
            NSLog(@"got it");
        } else {
            NSLog(@"not found");
        }
        content = [element content];
        //NSLog(@"%@ -> %@", href, content);
        [urlsArray addObject:href];
    }   //searching for all h2 in document
    NSLog(@"urls array = %@", urlsArray);

    //Set the textView text on the view to the result of the HTML parser
    _displayTextView.text = [NSString stringWithFormat:@"%@", urlsArray];
} else {
    //Display an error if htmlData is not available. I.E no internet connection etc
    _displayTextView.text = @"Error - No data";
}

Thanks in advance.

Source: (StackOverflow)
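
A sketch of a path-independent query, with a hypothetical helper name: `//a[@href]` matches every `a` element that has an `href` attribute, wherever it sits in the document, so the query does not need to change from page to page. Note that the code above searches `//html/head/link`, which only matches `link` elements in the head (typically stylesheets), not the page's hyperlinks.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"
#import "TFHppleElement.h"

// Hypothetical helper: returns the href of every <a> in the document,
// regardless of where the element is nested.
static NSArray<NSString *> *AllLinkTargets(NSData *htmlData) {
    if (htmlData.length == 0) {
        return @[];  // nothing to parse, e.g. no network connection
    }
    TFHpple *parser = [TFHpple hppleWithHTMLData:htmlData];

    NSMutableArray<NSString *> *urls = [NSMutableArray array];
    for (TFHppleElement *anchor in [parser searchWithXPathQuery:@"//a[@href]"]) {
        NSString *href = [anchor objectForKey:@"href"];
        if (href != nil) {
            [urls addObject:href];
        }
    }
    return urls;
}
```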