You’ll often be told not to parse HTML with RegEx – but what if you’re a rebel?
WHY YOU SHOULDN’T PARSE HTML WITH REGEX
WHY YOU COULD PARSE WITH REGEX
Parsing from static templates is pretty easy with RegEx and quite simple. The basic course of action is matching a line with what you’d want to match and either add grouping selectors in the RegEX or get your hands dirty and polish the data from that abhorrent line of HTML.
I made a successful RESTful service, Jikan.moe, using nothing but RegEx. This didn’t require any extra dependencies, libraries, yadda yadda. Neither was speed a concern since the parse was pretty quick.
What am I going on about?
With a terrible choice of a name, I began to simplify my repetitive tasks while parsing HTML using RegEx which consists of RegEx/pattern matching, loops, and so on.
Skraypar is an abstract PHP class which works by parsing by pattern matching, Iterators and Look Aheads’.
The parsing tasks split into 2.
- (Inception) Pattern matching & callback on the line of match – Iterators
- Additional pattern matching and callbacks within Iterators for dynamic HTML location – Look Aheads’
Think of it as the Iterator matching a table, and another Iterator matching the rows and the Look Aheads’ parsing the cells.
This is a pretty abstract and experimental project, I won’t blame you if you think I’ve gone mad. But heck – finding new ways to do things is one thing I like to do.
How does it work
1 – File Loading
Skraypar uses Guzzle as a dependency to fetch the HTML or if it’s a local file, it simply loads it. The file is loaded into an array, each line means each new index.
1B – Accepting & Rejecting
Fetching from the interwebs means you get to tell Skraypar which HTTP responses to allow and which ones to throw an exception at. By default, 200 (OK) and 303 (Forwarding) are accepted HTTP responses.
2 – Rules
When you extend a class with Skraypar, you’ve to set a method namely, loadRules, with added rules for Skraypar to remember when parsing.
Rules are patterns and callback functions for that pattern match. They loop at every line of code and if there’s a match and a callback executes – that particular rule is disabled.
3 – Iterators
Iterators are used inside of Rule Callbacks, by setting a breakpoint pattern and a callback pattern; the Iterator loops over each line executing a pattern match or Look Aheads until that breakpoint pattern is reached.
If breakpoint pattern is not found, Skraypar throws an exception that the parser failed by pointing to an unset offset in the array of lines from the file (since it increments)
There can be Iterators within Iterators.
4 – Look Aheads
Look Aheads are used inside Iterators. Usually, one could simply access a data on the next line given a pattern match for a line by incrementing the iterator count by 1. But in given cases, the data may not be available on the next line rather on the offset of 2 lines. This is a dynamic location for the data that is being parsed, hence a Look Ahead method basically looks for a pattern of that dynamically located data and parses it with a function callback.
5 – References
Everything is passed, controlled and set by references within the Iterator callables. You can pass a reference of the Iterator itself within it’s own callable to access setting responses or using the Look Ahead method of the Iterator Class or manually setting the iterator count property to an offset.
That’s pretty much it. This project is in development and is to be used as a dependency for the next major Jikan release. It’s not limited to Jikan, it can be used on any website or file.
No documentation is available at the moment.