Skraypar: Pattern parsing with Iterators and Look Aheads

You’ll often be told not to parse HTML with RegEx – but what if you’re a rebel?

WHY YOU SHOULDN’T PARSE HTML WITH REGEX

Clicky.

WHY YOU COULD PARSE WITH REGEX

Parsing static templates with RegEx is pretty easy. The basic course of action is to match the line you want, then either add capture groups to the RegEx or get your hands dirty and polish the data out of that abhorrent line of HTML.


I made a successful RESTful service, Jikan.moe, using nothing but RegEx. This didn’t require any extra dependencies, libraries, yadda yadda. Speed wasn’t a concern either, since the parse was pretty quick.


 

 

What am I going on about?

Enter: Skraypar

Despite the terrible choice of name, I built it to simplify the repetitive tasks of parsing HTML with RegEx: pattern matching, loops, and so on.

Skraypar is an abstract PHP class that parses using pattern matching, Iterators and Look Aheads.

The parsing tasks split into two.

  • (Inception) Pattern matching & callback on the line of match – Iterators
  • Additional pattern matching and callbacks within Iterators for dynamic HTML location – Look Aheads

 

Think of it as one Iterator matching a table, another Iterator matching the rows, and the Look Aheads parsing the cells.

This is a pretty abstract and experimental project; I won’t blame you if you think I’ve gone mad. But heck – finding new ways to do things is one thing I like to do.

 

How does it work

1 – File Loading

Skraypar uses Guzzle as a dependency to fetch the HTML; if it’s a local file, it simply loads it. The file is loaded into an array, with each line at its own index.

1B – Accepting & Rejecting

Fetching from the interwebs means you get to tell Skraypar which HTTP responses to allow and which ones to throw an exception at. By default, 200 (OK) and 303 (See Other) are accepted HTTP responses.
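A minimal sketch of steps 1 and 1B, written in JavaScript rather than Skraypar’s PHP. The names toLines and assertAccepted are made up for illustration and are not Skraypar’s API; only the default accepted codes (200 and 303) come from the description above.

```javascript
// Hypothetical helpers; Skraypar's real (PHP) API may look different.
const DEFAULT_ACCEPTED = [200, 303];

// Split a fetched or loaded document into an array, one index per line.
function toLines(body) {
  return body.split(/\r?\n/);
}

// Throw unless the HTTP status is in the accepted list.
function assertAccepted(status, accepted = DEFAULT_ACCEPTED) {
  if (!accepted.includes(status)) {
    throw new Error(`Rejected HTTP status ${status}`);
  }
  return status;
}

const lines = toLines("<table>\n<tr>\n</table>");
assertAccepted(200); // passes; a 404 would throw
```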

2 – Rules

When you extend Skraypar, you have to define a method named loadRules that adds the rules Skraypar should remember when parsing.


A rule is a pattern plus a callback function for that pattern’s match. Rules are checked against every line of the file; once a rule matches and its callback executes, that particular rule is disabled.
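The rule behaviour can be sketched roughly like this. It is a JavaScript illustration, not Skraypar’s PHP code; runRules and the shape of a rule object are assumptions made for the example.

```javascript
// Hypothetical rule runner: every rule is tried on every line,
// and a rule is switched off once its callback has run.
function runRules(lines, rules) {
  const active = rules.map((rule) => ({ ...rule, enabled: true }));
  lines.forEach((line, index) => {
    for (const rule of active) {
      if (!rule.enabled) continue;
      const match = line.match(rule.pattern);
      if (match) {
        rule.callback(match, index);
        rule.enabled = false; // fired once, now disabled
      }
    }
  });
}

const found = [];
runRules(
  ["<h1>Cowboy Bebop</h1>", "<p>filler</p>", "<h1>Second heading</h1>"],
  [
    {
      pattern: /<h1>(.*?)<\/h1>/,
      callback: (match, index) => found.push({ title: match[1], line: index }),
    },
  ]
);
// `found` holds only the first heading, because the rule disabled itself.
```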


3 – Iterators

Iterators are used inside Rule callbacks. You set a breakpoint pattern and a callback; the Iterator loops over each line, executing a pattern match or Look Aheads, until the breakpoint pattern is reached.

If the breakpoint pattern is not found, Skraypar throws an exception saying the parser failed, pointing to an unset offset in the array of lines from the file (since the offset keeps incrementing).

There can be Iterators within Iterators.
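Here is a rough sketch of an Iterator with a breakpoint, including one nested inside another (table rows, then cells). It is JavaScript for illustration only; the iterate function, its arguments and the convention of returning an index to resume from are assumptions, not Skraypar’s actual interface.

```javascript
// Hypothetical iterator: walk `lines` from `start` until `breakpoint` matches.
// Returns the index of the breakpoint line so a caller can resume after it.
function iterate(lines, start, breakpoint, callback) {
  let i = start;
  while (true) {
    if (i >= lines.length) {
      // Mirrors the "unset offset" failure described above.
      throw new Error(`Breakpoint ${breakpoint} not found before end of file`);
    }
    if (breakpoint.test(lines[i])) return i;
    // The callback may return an index to jump to (e.g. after a nested iterator).
    const jumpTo = callback(lines[i], i);
    i = Number.isInteger(jumpTo) ? jumpTo + 1 : i + 1;
  }
}

const html = [
  "<table>",
  "<tr>", "<td>1</td>", "<td>Spike</td>", "</tr>",
  "<tr>", "<td>2</td>", "<td>Jet</td>", "</tr>",
  "</table>",
];

const rows = [];
iterate(html, 1, /<\/table>/, (line, i) => {
  if (!/<tr>/.test(line)) return;
  const cells = [];
  // Nested iterator: collect cells until the row closes.
  const rowEnd = iterate(html, i + 1, /<\/tr>/, (cellLine) => {
    const match = cellLine.match(/<td>(.*?)<\/td>/);
    if (match) cells.push(match[1]);
  });
  rows.push(cells);
  return rowEnd; // resume the outer iterator after </tr>
});
```

Returning the inner breakpoint index is just one way to keep the outer loop from re-reading the row’s lines; the post describes setting the iterator count property by reference instead.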

4 – Look Aheads

Look Aheads are used inside Iterators. Usually, once a line matches a pattern, you could simply access the data on the next line by incrementing the iterator count by 1. But in some cases the data isn’t on the next line; it might be 2 lines further down. The location of the data being parsed is dynamic, so a Look Ahead searches the following lines for the pattern of that dynamically located data and parses it with a callback function.
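A small sketch of the Look Ahead idea, again in JavaScript and not taken from Skraypar. The lookAhead function and its maxLines cut-off are assumptions added for the example; the post does not say how far Skraypar searches.

```javascript
// Hypothetical look-ahead: scan the lines after `from` for the first one
// matching `pattern`, giving up after `maxLines` lines (an assumed limit).
// Returns the callback's result, or null if nothing matched.
function lookAhead(lines, from, pattern, callback, maxLines = 5) {
  const limit = Math.min(lines.length, from + 1 + maxLines);
  for (let i = from + 1; i < limit; i++) {
    const match = lines[i].match(pattern);
    if (match) return callback(match, i);
  }
  return null;
}

const scorePattern = /<b>([\d.]+)<\/b>/;
const parseScore = (match) => Number(match[1]);

// Same data, one line away in the first page and two lines away in the second.
const near = ['<span class="label">Score</span>', "<b>8.78</b>"];
const far = ['<span class="label">Score</span>', "", "<b>8.78</b>"];

const nearScore = lookAhead(near, 0, scorePattern, parseScore);
const farScore = lookAhead(far, 0, scorePattern, parseScore);
// Both calls find the score without hard-coding an offset of 1 or 2.
```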

5 – References

Everything is passed, controlled and set by reference within the Iterator callables. You can pass a reference to the Iterator itself into its own callable to set responses, use the Look Ahead method of the Iterator class, or manually set the iterator count property to an offset.


That’s pretty much it. This project is in development and is to be used as a dependency for the next major Jikan release. It’s not limited to Jikan; it can be used on any website or file.

 

No documentation is available at the moment.




Jikan News & Updates – Mid-2018

Okay, this news is almost a month old. Here goes.


We’re already 5 months into 2018 and I have exciting news regarding Jikan. I wrote a post back in January laying out the road map of Jikan for the current year. I had announced 4 more features that were to be done this year. I’ve completed 3 of them, with user-related scraping to be done by the release of REST 2.3.

 


 

Over the past year, Jikan has gained huge traction, both client- and development-wise. Here are the highlights of the past 6 months.

Jikan REST 2.2

The release of REST 2.2 brought many new features.

  1. More extended data for Anime and Manga (with the exception of reviews & recommendations – for now)
  2. Anime/Manga/People/Characters Search! This comes with advanced search filters and pagination support.
  3. Top Anime and Manga with advanced filters
  4. Season – To list the Anime airing this season and for other years/seasons.
  5. Schedule – Anime scheduling for the week for this season
  6. Meta – Experimental requests for getting usage stats for Jikan and most requested links by daily, weekly & monthly periods.

 

And some service changes.

  1. Jikan has moved domain to Jikan.moe. The previous (Jikan.me) domain has been discontinued.
  2. Jikan REST API is now being hosted in Tokyo (closer to MyAnimeList’s Tokyo server) by an awesome dude called Hibiki.

 

100% Jikan Open Source

That’s right. The entirety of Jikan has been open-sourced under MIT License. This includes the website, docs and REST API service.

This not only adds flexibility, but the code is also easier to manage and deploy. Gone are the days of patches having to wait till the next REST version. Now the RESTful service is updated as soon as a new JikanPHP version is out – this of course will vary for major feature releases as I have to set up the controllers on the REST service.

 

Usage Stats

This is the Meta feature I mentioned.

 

It works by logging requests made in Redis and increasing the respective counters for that request. Here are some interesting usage links.
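As a rough illustration of counter-based request logging, here is a JavaScript sketch that uses an in-memory Map as a stand-in for Redis. The key layout and function names are assumptions and not Jikan’s actual schema; with a real Redis client the increment would be an INCR on the same key. Weekly buckets are left out to keep it short.

```javascript
// In-memory stand-in for Redis; a real setup would issue INCR commands instead.
const counters = new Map();

function incr(key) {
  const value = (counters.get(key) || 0) + 1;
  counters.set(key, value);
  return value;
}

// Hypothetical key layout: requests:<period>:<bucket>:<path>
function logRequest(path, now = new Date()) {
  const day = now.toISOString().slice(0, 10); // UTC date, YYYY-MM-DD
  const month = day.slice(0, 7);              // YYYY-MM
  incr(`requests:daily:${day}:${path}`);
  incr(`requests:monthly:${month}:${path}`);
}

const when = new Date("2018-05-20T12:00:00Z");
logRequest("/anime/1", when);
logRequest("/anime/1", when);
logRequest("/manga/1", when);
```

Reading "most requested links" for a period then comes down to listing the keys for that bucket and sorting by count.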

You can read more about the further usability.

 

Late 2018 Roadmap (REST 2.3)

So here are a few things that will definitely be completed before the end of 2018. Perhaps in the upcoming months.

  • Top Characters/People
  • Anime/Manga Extended Data – Reviews & Recommendations
  • User Data – Profile, Watch History, Friends

 

Early 2019

This is given that MyAnimeList’s new API hasn’t been publicly released yet or people haven’t started ditching Jikan.

  • JikanPHP (Core) – Rewrite. This will introduce JikanPHP 2.X.
    • Separation of the parser as an abstraction class for Requests & RegEx parsing
    • Faster Parsing – Rework Extended Requests.
  • Jikan REST 3.0 – Needed given the crazy amount of requests we’ve been getting. The main problem is rate limiting from MyAnimeList since we’re making all these requests from one server, i.e. one IP address.
    • Rework Redis Database data caching
    • API Keys. Note: This won’t replace free, unmonitored GET requests. The current limit of 5,000 will be lowered to encourage app/project developers to get an API key that will support higher rate limits.
    • Rework Extended Requests as separate API calls. This is a bottleneck right now as extended requests make 2 requests instead of one to merge the data for you into 1 response.
  • Relational data – Expand to other sites (maybe)

Jikan API – Vision 2018 🎆 [Unofficial MyAnimeList API]

So it’s 2018 and Jikan is now 1 year old! MyAnimeList announced in late 2017 that they’ll be working on fixing up their API, but until then I’ll have Jikan running around. I have some plans for Jikan that need to be done, hopefully by mid-2018 or earlier, depending on college.

 


 

There are some things I’m still interested in scraping off of MAL; here’s the list.

 

User Profile

Taking an example of my own profile;

 

There’s a lot of data available per user profile. The best part here would be their favorite characters, people, anime, manga and basic stats. The hardest part to extract here would be the user-written “About Me”, which is highly customizable. So this one I might consider parsing, since MAL’s HTML source is already terrible enough.

 

Top Anime/Manga/People/Characters

These pages give you access to a paginated list of anime/manga/people/characters ranked by their popularity/favoritism by the community from #1 to the last ranking available. Tis a gold mine entry.

 

Anime/Manga/Person/Character Search!

The official MAL API already has this feature but it only returns the first page of results! It only allows simple string queries and requires user authentication for the API call to work, which is what Jikan is meant to overcome. This has been a requested feature, so I’ll most likely be working on a parser for this in the months to come.

 

 

Extended Data for Anime/Manga

This has been in the prospect of Jikan since the beginning, but I’d held off on any extended parsing other than characters/staff and episodes until recently, when I began making scrapers for Pictures, Videos & News related to the item. This trend will continue as there are more pages that consist of interesting data regarding an anime or manga. Especially the reviews page, since this has the best data for sentiment analysis and averaging of any show or manga.

 

I’ll be focusing on these 4 for this year! It takes time to mine pure data since scraping HTML off MAL means a lot of weird and roundabout ways of doing things!

Jikan REST 2.0 – Developers Preview + November Update

A basic app utilizing Jikan would require data on any anime or manga, and furthermore on the characters and staff members. These 4 types of data are essential to any app for the masses and Jikan can now robustly cover any app developer in those areas.

 

Tl;Dr: https://jikan.me/api/v2.0-dev/

Note: There is no documentation available for this endpoint as of yet; you’ll have to go by the data responses.

 

without further ado

It’s been a year since I started on Jikan and half a year since the REST API went up. To get this out of the way – I’m immensely excited to announce that a complete rewrite of the Jikan PHP API has been completed. The rewrite makes the API:

  • Friendlier to developers for contribution
  • Cleaner in its responses, with fewer bugs
  • Easier to install
  • More robust
  • PSR-4/autoloading compliant

 

Now what’s left is the rewrite of the REST API. I’ve selected Lumen as the micro-framework to handle Jikan REST requests. That’s currently in the works as I wrap my head around the features of this framework.

But my excitement could not be held back and I really wanted to see the new API in action – spitting out nicely formatted JSON without any malformed data. I quickly set up a new endpoint using the old REST API code – producing a developers endpoint.

And I hereby present: https://jikan.me/api/v2.0-dev/

You’ll notice a massive difference from the v1.1 or v1.2 REST version as this version of the API is equipped with the latest Jikan PHP commits. Now let me show you the possible type of requests.

 

  1. http://jikan.me/api/v2.0-dev/anime/1
  2. http://jikan.me/api/v2.0-dev/anime/1/characters_staff
  3. http://jikan.me/api/v2.0-dev/anime/1/episodes
  4. http://jikan.me/api/v2.0-dev/anime/21/episodes/1 – Episode pages are now paginated if there are more than 100 episodes; a key named episode_last_page will tell you how many pages the episodes are split into.
  5. http://jikan.me/api/v2.0-dev/anime/21/episodes/2
  6. http://jikan.me/api/v2.0-dev/manga/1
  7. http://jikan.me/api/v2.0-dev/manga/1/characters
  8. http://jikan.me/api/v2.0-dev/person/1
  9. http://jikan.me/api/v2.0-dev/character/1
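As a small sketch of how a client might walk the paginated episode pages, the JavaScript below builds one URL per page from the episode_last_page key. The helper name and the trimmed response object are made up for illustration; only the URL shape and the episode_last_page key come from the list above, and nothing is actually fetched.

```javascript
// Build the URL for every episode page, given the page count
// reported by `episode_last_page` on the first page.
function episodePageUrls(animeId, lastPage, base = "http://jikan.me/api/v2.0-dev") {
  const urls = [];
  for (let page = 1; page <= lastPage; page++) {
    urls.push(`${base}/anime/${animeId}/episodes/${page}`);
  }
  return urls;
}

// A trimmed, made-up response body; a real one carries the episode data too.
const firstPage = { episode_last_page: 2 };
const urls = episodePageUrls(21, firstPage.episode_last_page);
```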

 

With these core prospects for the API being stable and robust, it’s time to focus on implementing more endpoints for scraping more data out of an anime, or the most required function – the search endpoint.

 

the success of this project

I’ve been contacted by a plethora of developers regarding the usage/feedback/etc. of this project. Everyone’s happy – I’m happy. There’s a working, easy-to-use API that can tell you anything about your favorite Japanese cartoon and I think that’s what matters the most.

Currently there’s a popular and active Android app, namely AnYme, that’s utilizing Jikan for their data. You can check them out here: https://github.com/zunjae/anYme

The usage of Jikan has been very successful – there are thousands of requests spread across hundreds of clients daily. Here’s a small chart on the usage since we kicked off back in May.

[Image: jikan stats chart]

 

what’s in store next?

The next foremost thing to be accomplished is REST v2.0. This will be based on the Lumen framework and a much faster server – thanks to a friend of mine. The base endpoint would be api.jikan.me, instead of what we have now.

After that – I’ll see what’s next on the agenda.

 

oh by the way

Did I mention that Jikan is now available on packagist.org/composer? You can install it as a dependency in your PHP project as simply as: composer require jikan-me/jikan 

Using jQuery events on dynamically appended HTML

 

jQuery has become one of our foundations for processing requests while staying on the same page. No refreshes, simple requests to your backend PHP scripts via AJAX and updating the DOM or modals.

But oftentimes, we’re met with a minor setback that leaves us pondering for hours. This is the second time that it’s happened to me and I decided to do something about it – write a blog post (how convenient, am I right?).

Let’s get started with your jQuery event code.

$('.some_class').on('click', function() { … });

This may work for the DOM that was given when your page loaded but it doesn’t do well for appended content. The fact that you’re possibly using .on for dynamically appended content is half of your solution, but there’s still one more step left.

[Image: undynamically]

The jQuery event isn’t firing when I click on edit, unlike for the DOM that’s added on page load.

Fortunately, there’s a really quick fix for this. You need to rewrite it as:

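$(document).on('click', '.some_class', function() { … });

This works because of event delegation. The first form attaches the handler directly to the elements that match .some_class at the moment the line runs, so elements appended afterwards never get one. The second form attaches a single handler to the document; clicks bubble up to it, and jQuery checks the selector against the clicked element at click time, so dynamically appended elements are covered too.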
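The difference between the two forms can be seen in a toy model of click bubbling. This is plain JavaScript with a made-up El class standing in for DOM elements, not jQuery’s implementation; real delegation also matches ancestors of the clicked element, which is skipped here.

```javascript
// Toy element: a class name, a parent pointer and a list of click handlers.
class El {
  constructor(className, parent = null) {
    this.className = className;
    this.parent = parent;
    this.handlers = [];
  }
  // Dispatch a click that bubbles from this element up to the root.
  click() {
    for (let el = this; el; el = el.parent) {
      el.handlers.forEach((handler) => handler(this));
    }
  }
}

const doc = new El("document");
const existing = new El("some_class", doc);
const log = [];

// Direct binding: attaches only to the elements that match right now.
[existing].forEach((el) => el.handlers.push(() => log.push("direct")));

// Delegated binding: one handler on the document, filtered at click time.
doc.handlers.push((target) => {
  if (target.className === "some_class") log.push("delegated");
});

// Appended later, after both bindings were set up.
const appended = new El("some_class", doc);

existing.click(); // fires both handlers
appended.click(); // fires only the delegated handler
```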

 


CS2D – Released Stats:Extract 0.3

I haven’t talked about what Stats:Extract is here before, so I’ll introduce it first. S:E, in short, is a PHP library that parses CS2D’s serverstats.html file and decodes and parses CS2D’s userstats.dat file.

CS2D (http://cs2d.com) servers generate a statistics HTML page which has the design equivalent of a potato, so S:E parses the data from it so that any web developer can implement it in their own design. The servers also have a userstats.dat which holds player rankings in an endian-encoded binary format, so it decodes that as well. But that’s not all! It even has the ability to get any server’s real-time data.

The recent update comes with bug fixes, script optimizations and changes. There is nothing new, though. It’s been about half a year since the last update and I have a few more plans for the script till I put it in its final state.

The download: http://www.unrealsoftware.de/files_show.php?file=16081
The Github: https://github.com/irfan-dahir/stats-extract

And finally, a demo to summarize its potential: http://irfandahir.com/stats-extract/

the beginning.

I’ve finally got my own domain registered. It’s irfandahir.com. I’m still using freehost, just linked the name-servers. I wouldn’t need more for my own portfolio so a freehost works well.

I’ve done some updates. The website is now more interactive. Every single effect, apart from the fade-ins in the intro panel at the top, is done with pure CSS3, especially the new Project Viewer. CSS3 animation transitions are epic and I’m currently working my way around them.

[Image: header2]
For example, this is the header on hover. Those are two borders wrapped around the “r” and “f”. And the “” expand on hover as well. There are some alpha transitions as well. I did experiment around and this was the final result that looked appealing. It’s only available on the desktop mode however.

[Images: projects1, projects3]
Here is the new projects viewer. The background of the boxes is the average color extracted from the image itself. On hover it displays the preview and download links (if there are any). The transition effect itself is pretty slick. I’ve been working with rotated, transformed boxes for it.
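One simple way to get an average color is to average each RGB channel over all pixels. The sketch below does that on raw RGBA bytes (the layout a canvas getImageData call returns); it is an assumption about the approach, since the post doesn’t say how or where the color is actually extracted.

```javascript
// Average the RGB channels of RGBA pixel data (4 bytes per pixel),
// ignoring alpha, and return the result as a hex color string.
function averageColor(rgba) {
  let r = 0, g = 0, b = 0;
  const pixels = rgba.length / 4;
  for (let i = 0; i < rgba.length; i += 4) {
    r += rgba[i];
    g += rgba[i + 1];
    b += rgba[i + 2];
  }
  const toHex = (sum) => Math.round(sum / pixels).toString(16).padStart(2, "0");
  return `#${toHex(r)}${toHex(g)}${toHex(b)}`;
}

// Two pixels: pure red and pure blue.
const color = averageColor(Uint8ClampedArray.from([255, 0, 0, 255, 0, 0, 255, 255]));
// color is "#800080", a mid purple
```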
[Image: projects2]
The old project viewer is still available as well. You can choose which one you’d wish to use from the little button on the top right of the project viewer. It was a bit of a hassle but I managed to get it done in pure CSS3. The new project viewer is default on desktops. However the old project viewer is the default and only one available on tablets. It’s a bit different for phones tho. The phone version has a horizontal slider for it. It’s a minimized version of the new project viewer.

[Image: who-i-am]
I’ve minimalised the “about me” content a bit more. I haven’t shown most of my skillsets here, but I’m planning to replace it with doughnut charts instead. It’d look more appealing. I’ll do that with CSS3 transforms too.

[Images: message1, message2]
The contact section has minor changes. I’ve changed the “You can only contact me once per day” text to the following, with a link to my email if you press it. The only reason for this is that, since I’m hosted on a free account, the limit for directly emailing someone is around 15 per day, and abusing it results in an account lockdown. Therefore I had to resort to putting up a database for it. After doing so I was met by bot advertisement spam, so I decided to lock it down further. And I haven’t gotten a single bot message since then. So I’ve got that thing going for me. o.x

That’s pretty much it for now. I’m working on a few other client projects so I’ll update on those when I’m done with them.

the #hype for my own domain. yaaaaayyyyyyyyy. ;0

irfandahir.com