Parse HTML in Swift instead of using regular expressions


yellow pillow

Here is the HTML code I want to parse in Swift:

<td class="pinyin">
<a href="rsc/audio/voice_pinyin_pz/yi1.mp3">
<span class="mpt1">yī</span></a> 
<a href="rsc/audio/voice_pinyin_pz/yan3.mp3">
<span class="mpt3">yǎn</span>
</a>
</td>

I've read that regular expressions are not a good way to parse HTML, but I've written one anyway that captures what I want (the letters between the spans): yī and yǎn.

Regular expression:

/pinyin.+<span.+>(.+)<\/.+<span.+>(.+)<\//Us

I don't know how to implement it so that I capture both yī and yǎn at the same time and save them to an array. Also, I wonder whether there is another way to do this without regex.

edit:

I ended up using TFHpple, as suggested by Rob. It took me a long time to figure out how to import it into Swift, so I thought it would be helpful to post the steps here:

1. Open the project and drag the TFHpple file into it

2. At this point, if the current project does not contain any Obj-C code, Xcode may prompt you to create a bridging header file. In this bridging header file you should add:

#import <Foundation/Foundation.h>
#import "TFHpple.h"
#import "TFHppleElement.h"

3. Select the target and, under "General", in "Linked Frameworks and Libraries" (scroll down in the "General" tab to see it), add libxml2.2.dylib and libxml2.dylib.

4. In Build Settings, under Header Search Paths, add $(SDKROOT)/usr/include/libxml2. WARNING: Make sure it's not User Header Search Paths, as they are not the same.

5. In Build Settings, under Other Linker Flags, add -lxml2

Enjoy!

Rob

You can use a typical iOS HTML parser such as TFHpple:

// Load the raw HTML and hand it to TFHpple
let data = NSData(contentsOfFile: path)
let doc = TFHpple(HTMLData: data)
// Select every <span> inside a link inside <td class="pinyin">
if let elements = doc.searchWithXPathQuery("//td[@class='pinyin']/a/span") as? [TFHppleElement] {
    for element in elements {
        println(element.content)
    }
}

Or you can use NDHpple:

let data = NSData(contentsOfFile: path)!
// NDHpple takes the HTML as a string, so decode the data first
let html = NSString(data: data, encoding: NSUTF8StringEncoding)!
let doc = NDHpple(HTMLData: html)
if let elements = doc.searchWithXPathQuery("//td/a/span") {
    for element in elements {
        // The text is on the span's first child, not on the span itself
        println(element.children?.first?.content)
    }
}

I have more mileage with TFHpple, so I'm personally more comfortable with it. NDHpple seems to work, although I'm not crazy about it personally (e.g. why does the HTMLData parameter take a string instead of NSData? Why do I have to navigate through the children to get the content of the //td/a/span result? The [@class='pinyin'] qualifier didn't work, etc.). However, try both and see which you prefer.

Both need a bridging header: TFHpple needs TFHpple.h in the bridging header, and NDHpple needs the libxml headers there. See each project's documentation for more information.
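
As for the first part of the question (capturing both syllables with a regular expression and saving them to an array), here is a minimal sketch using Foundation's NSRegularExpression. It is written in current Swift syntax (print rather than println), and the function name spanContents is hypothetical. NSRegularExpression has no equivalent of the PCRE U flag, so the sketch uses a lazy quantifier instead, and it matches each span separately rather than both in one pattern. Note that it picks up every span in the input, not only those inside td class="pinyin", so it is only reasonable for small, predictable fragments like this one; a parser remains the safer choice.

```swift
import Foundation

// Hypothetical helper: returns the text inside every <span>...</span> pair.
// Regex-based, so only suitable for small, predictable markup.
func spanContents(in html: String) -> [String] {
    // The lazy quantifier (.+?) stands in for the PCRE "U" flag;
    // .dotMatchesLineSeparators corresponds to the "s" flag.
    let pattern = "<span[^>]*>(.+?)</span>"
    guard let regex = try? NSRegularExpression(pattern: pattern,
                                               options: [.dotMatchesLineSeparators]) else {
        return []
    }
    let fullRange = NSRange(html.startIndex..<html.endIndex, in: html)
    // Collect capture group 1 of every match into an array of strings
    return regex.matches(in: html, options: [], range: fullRange).compactMap { match in
        Range(match.range(at: 1), in: html).map { String(html[$0]) }
    }
}

let html = """
<td class="pinyin">
<a href="rsc/audio/voice_pinyin_pz/yi1.mp3">
<span class="mpt1">yī</span></a>
<a href="rsc/audio/voice_pinyin_pz/yan3.mp3">
<span class="mpt3">yǎn</span>
</a>
</td>
"""

print(spanContents(in: html))  // prints ["yī", "yǎn"]
```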
