Skraypar: Pattern parsing with Iterators and Look Aheads

You’ll often be told not to parse HTML with RegEx – but what if you’re a rebel?

WHY YOU SHOULDN’T PARSE HTML WITH REGEX

Clicky.

WHY YOU COULD PARSE WITH REGEX

Parsing static templates with RegEx is pretty simple. The basic course of action is to match the line you want, then either add capture groups to the RegEx or get your hands dirty and polish the data out of that abhorrent line of HTML.


I made a successful RESTful service, Jikan.moe, using nothing but RegEx. This didn’t require any extra dependencies, libraries, yadda yadda. Nor was speed a concern, since the parse was pretty quick.


 

 

What am I going on about?

Enter: Skraypar

Terrible choice of name aside, I set out to simplify the repetitive tasks of parsing HTML with RegEx: pattern matching, loops, and so on.

Skraypar is an abstract PHP class that parses using pattern matching, Iterators and Look Aheads.

The parsing tasks split into two:

  • (Inception) Pattern matching & callback on the line of match – Iterators
  • Additional pattern matching and callbacks within Iterators for dynamic HTML location – Look Aheads

 

Think of it as one Iterator matching a table, another Iterator matching the rows, and the Look Aheads parsing the cells.

This is a pretty abstract and experimental project, and I won’t blame you if you think I’ve gone mad. But heck – finding new ways to do things is one thing I like to do.

 

How does it work

1 – File Loading

Skraypar uses Guzzle as a dependency to fetch the HTML; if it’s a local file, it simply loads it. The file is loaded into an array, with each line at its own index.
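
To make the "one line per index" idea concrete, here is a minimal Python sketch. It is not Skraypar's PHP code: the function names are hypothetical, and the remote fetch (which Skraypar does through Guzzle) is left out.

```python
def lines_from_text(html: str) -> list[str]:
    """Split an HTML document into a list with one line per index."""
    return html.splitlines()


def lines_from_file(path: str) -> list[str]:
    """Load a local file and split it the same way."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()


# Every later step (rules, iterators, look aheads) works on this list.
sample_lines = lines_from_text("<table>\n<tr>\n<td>Naruto</td>\n</tr>\n</table>")
```

Keeping the document as a flat list of lines is what lets the later steps address data by line offset.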

1B – Accepting & Rejecting

Fetching from the interwebs means you get to tell Skraypar which HTTP responses to accept and which ones to throw an exception on. By default, 200 (OK) and 303 (See Other, a redirect) are accepted HTTP responses.
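
The accept/reject behaviour could look roughly like the Python sketch below. The exception and function names are made up for illustration; only the default accepted codes (200 and 303) come from the description above.

```python
class RejectedResponse(Exception):
    """Hypothetical exception for a status code that is not accepted."""


# Defaults taken from the text: 200 and 303 are accepted.
DEFAULT_ACCEPTED = frozenset({200, 303})


def check_status(status: int, accepted=DEFAULT_ACCEPTED) -> int:
    """Return the status if it is accepted, otherwise raise."""
    if status not in accepted:
        raise RejectedResponse(f"HTTP {status} is not an accepted response")
    return status
```

Passing a different `accepted` set is how a caller would allow, say, 404 pages to be parsed instead of rejected.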

2 – Rules

When you extend Skraypar, you have to define a method named loadRules, which adds the rules Skraypar should remember when parsing.


Rules pair a pattern with a callback function for that pattern’s match. Every active rule is tested against every line of the file; once a rule matches and its callback executes, that particular rule is disabled.
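
As a rough illustration of the "match once, then disable" behaviour, here is a Python sketch. The `run_rules` function and the `(pattern, callback)` tuple shape are assumptions for this example, not Skraypar's actual `loadRules` API.

```python
import re


def run_rules(lines, rules):
    """Test each (pattern, callback) rule against every line.

    A rule is removed from the active list after its first match,
    so each callback runs at most once.
    """
    active = [(re.compile(pattern), callback) for pattern, callback in rules]
    for index, line in enumerate(lines):
        for rule in list(active):  # copy, because we remove while looping
            pattern, callback = rule
            match = pattern.search(line)
            if match:
                callback(match, index)
                active.remove(rule)  # disable the rule after it fires


lines = ["<h1>Title A</h1>", "<p>text</p>", "<h1>Title B</h1>"]
found = []
run_rules(lines, [(r"<h1>(.*?)</h1>", lambda m, i: found.append((i, m.group(1))))])
```

Only the first heading is captured here: the rule is already disabled by the time the second `<h1>` line comes around.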


3 – Iterators

Iterators are used inside Rule callbacks. You set a breakpoint pattern and a callback; the Iterator then loops over each line, executing a pattern match or Look Aheads, until the breakpoint pattern is reached.

If the breakpoint pattern is never found, Skraypar throws an exception: the parser fails by pointing to an unset offset in the array of lines, since the line counter keeps incrementing past the end of the file.

There can be Iterators within Iterators.
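
Here is a small Python sketch of an iterator with a breakpoint pattern, including one nested inside another. The `iterate` function and its convention (a callback may return the index to resume from) are assumptions made for this example, not Skraypar's PHP interface.

```python
import re


def iterate(lines, start, breakpoint_pattern, callback):
    """Run callback(line, index) from `start` until the breakpoint matches.

    Returns the index of the breakpoint line. If the breakpoint never
    appears, indexing past the end of `lines` raises IndexError, much
    like the unset-offset failure described above.
    """
    stop = re.compile(breakpoint_pattern)
    index = start
    while not stop.search(lines[index]):
        next_index = callback(lines[index], index)
        index = next_index if next_index is not None else index + 1
    return index


lines = [
    "<table>",
    "<tr>", "<td>Naruto</td>", "<td>TV</td>", "</tr>",
    "<tr>", "<td>Akira</td>", "<td>Movie</td>", "</tr>",
    "</table>",
]
rows = []


def on_cell_line(line, index):
    match = re.search(r"<td>(.*?)</td>", line)
    if match:
        rows[-1].append(match.group(1))


def on_table_line(line, index):
    if "<tr>" in line:
        rows.append([])
        # Nested iterator: walk this row until its closing tag,
        # then resume the outer loop on the line after it.
        return iterate(lines, index + 1, r"</tr>", on_cell_line) + 1


iterate(lines, 1, r"</table>", on_table_line)
```

The outer iterator walks the table and the inner one walks each row, mirroring the table/rows picture from earlier.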

4 – Look Aheads

Look Aheads are used inside Iterators. Usually, once a line matches a pattern, you can simply access the data on the next line by incrementing the iterator count by 1. In some cases, though, the data isn’t on the next line but at an offset of, say, 2 lines. The location of the data is dynamic, so a Look Ahead method looks for a pattern matching that dynamically located data and parses it with a callback function.
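
A Look Ahead can be sketched in Python as a bounded forward search from the current line. The `look_ahead` name, its signature and the `limit` parameter are assumptions for illustration; the text above does not say how far Skraypar itself searches.

```python
import re


def look_ahead(lines, start, pattern, callback, limit=5):
    """Search the lines after `start` for `pattern`.

    Checks at most `limit` lines (an assumed safety bound). On the
    first match, returns callback(match); returns None if nothing
    matches within range.
    """
    compiled = re.compile(pattern)
    for offset in range(1, limit + 1):
        index = start + offset
        if index >= len(lines):
            break
        match = compiled.search(lines[index])
        if match:
            return callback(match)
    return None


lines = [
    '<span class="label">Score:</span>',
    "",
    '<span class="value">8.7</span>',
]
# The value sits two lines below the label, not on the next line.
score = look_ahead(lines, 0, r'class="value">([\d.]+)<', lambda m: float(m.group(1)))
```

Because the search is driven by a pattern rather than a fixed offset, the same call still works if the blank line disappears or another one is added.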

5 – References

Everything is passed, controlled and set by reference within the Iterator callables. You can pass a reference to the Iterator itself into its own callable to set responses, use the Look Ahead method of the Iterator class, or manually set the iterator count property to an offset.
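
PHP needs explicit references for this, whereas Python hands the callback the same object the caller holds, so the sketch below is only an analogy. The `LineIterator` class and its `position` and `response` attributes are hypothetical names; the point is that the callable receives the iterator itself and can both set the result and move the line counter.

```python
import re


class LineIterator:
    """Hypothetical iterator that passes itself to its own callback."""

    def __init__(self, lines, start, breakpoint_pattern, callback):
        self.lines = lines
        self.position = start  # the "iterator count"
        self.breakpoint = re.compile(breakpoint_pattern)
        self.callback = callback
        self.response = None   # set by the callback

    def run(self):
        while not self.breakpoint.search(self.lines[self.position]):
            before = self.position
            self.callback(self)
            if self.position == before:
                self.position += 1  # callback did not move us, so advance
        return self.response


lines = ["<dl>", "<dt>Episodes</dt>", "", "<dd>24</dd>", "</dl>"]


def on_line(iterator):
    if "<dt>Episodes</dt>" in iterator.lines[iterator.position]:
        iterator.position += 2  # manually jump to a known offset
        match = re.search(r"<dd>(\d+)</dd>", iterator.lines[iterator.position])
        iterator.response = int(match.group(1))


episodes = LineIterator(lines, 0, r"</dl>", on_line).run()
```

After the callback jumps ahead, the loop resumes from the line it jumped to rather than from where it started.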


That’s pretty much it. This project is in development and is to be used as a dependency for the next major Jikan release. It’s not limited to Jikan; it can be used on any website or file.

 

No documentation is available at the moment.



Jikan News & Updates – Mid-2018

Okay, this news is almost a month old. Here goes.


We’re already 5 months into 2018, and I have exciting news regarding Jikan. I wrote a post back in January laying out the road map of Jikan for the current year. I announced 4 more features that were to be done this year. I’ve completed 3 of them, with User Related scraping to be done by the release of REST 2.3.

 


 

Over the past year, Jikan has gained a lot of traction, both client- and development-wise. Here are the highlights of the past 6 months.

Jikan REST 2.2

With the release of REST 2.2 came many new features.

  1. More extended data for Anime and Manga (with the exception of reviews & recommendations – for now)
  2. Anime/Manga/People/Characters Search! This comes with advanced search filters and pagination support.
  3. Top Anime and Manga with advanced filters
  4. Season – To list the Anime airing this season and for other years/seasons.
  5. Schedule – Anime scheduling for the week for this season
  6. Meta – Experimental requests for getting usage stats for Jikan and the most requested links over daily, weekly & monthly periods.

 

And some service changes.

  1. Jikan has moved to the Jikan.moe domain. The previous domain (Jikan.me) has been discontinued.
  2. The Jikan REST API is now hosted in Tokyo (closer to MyAnimeList’s Tokyo server) by an awesome dude called Hibiki.

 

100% Jikan Open Source

That’s right. The entirety of Jikan has been open-sourced under MIT License. This includes the website, docs and REST API service.

This not only adds flexibility, but also makes the code easier to manage and deploy. Gone are the days of patches having to wait till the next REST version. Now the RESTful service is updated as soon as a new JikanPHP version is out – this of course will vary for major feature releases, as I have to set up the controllers on the REST service.

 

Usage Stats

This is the Meta feature I mentioned.

 

It works by logging requests made in Redis and increasing the respective counters for that request. Here are some interesting usage links.

You can read more about how to use it.
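
The counting idea can be sketched in Python as follows. An in-memory `Counter` stands in for Redis here, and the key layout is invented for this example; Jikan's real key names are not given in this post.

```python
from collections import Counter
from datetime import date

# Stand-in for Redis: each "+= 1" plays the role of an increment on that key.
counters = Counter()


def log_request(path: str, day: date) -> None:
    """Bump the daily, weekly and monthly counters for one request."""
    year, week, _ = day.isocalendar()
    counters[f"requests:daily:{day.isoformat()}:{path}"] += 1
    counters[f"requests:weekly:{year}-W{week:02d}:{path}"] += 1
    counters[f"requests:monthly:{day:%Y-%m}:{path}"] += 1


log_request("/anime/1", date(2018, 5, 14))
log_request("/anime/1", date(2018, 5, 15))
```

Keeping one counter per period means the "most requested" lists can be read straight from the keys for that period, without scanning a request log.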

 

Late 2018 Roadmap (REST 2.3)

So here are a few things that will definitely be completed before the end of 2018, perhaps in the upcoming months.

  • Top Characters/People
  • Anime/Manga Extended Data – Reviews & Recommendations
  • User Data – Profile, Watch History, Friends

 

Early 2019

This assumes MyAnimeList’s new API hasn’t been publicly released by then and people haven’t started ditching Jikan.

  • JikanPHP (Core) – Rewrite. This will introduce JikanPHP 2.X.
    • Separation of the parser as an abstraction class for Requests & RegEx parsing
    • Faster Parsing – Rework Extended Requests.
  • Jikan REST 3.0 – Prompted by the crazy amount of requests we’ve been getting. The main problem is rate limiting from MyAnimeList, since we’re making all these requests from one server, i.e. one IP address.
    • Rework Redis Database data caching
    • API Keys. Note: This won’t replace free, unmonitored GET requests. The current limit of 5,000 will be lowered to encourage app/project developers to get an API key that supports higher rate limits.
    • Rework Extended Requests as separate API calls. This is a bottleneck right now, as extended requests make 2 requests instead of one and merge the data for you into 1 response.
  • Relational data – Expand to other sites (maybe)