Skraypar: Pattern parsing with Iterators and Look Aheads

You’ll often be told not to parse HTML with RegEx – but what if you’re a rebel?

WHY YOU SHOULDN’T PARSE HTML WITH REGEX

Clicky.

WHY YOU COULD PARSE WITH REGEX

Parsing static templates with RegEx is pretty easy. The basic course of action is matching a line against the pattern you want and either adding capture groups to the RegEx or getting your hands dirty and polishing the data out of that abhorrent line of HTML.


I made a successful RESTful service, Jikan.moe, using nothing but RegEx. This didn’t require any extra dependencies, libraries, yadda yadda. Nor was speed a concern, since the parsing was pretty quick.


 

 

What am I going on about?

Enter;Skraypar

Terrible choice of name aside, I set out to simplify the repetitive tasks involved in parsing HTML with RegEx: pattern matching, loops, and so on.

Skraypar is an abstract PHP class which parses via pattern matching, Iterators, and Look Aheads.

Parsing splits into two tasks:

  • (Inception) Pattern matching & a callback on the line of the match – Iterators
  • Additional pattern matching and callbacks within Iterators for dynamically located HTML – Look Aheads

 

Think of it as one Iterator matching a table, another Iterator matching the rows, and the Look Aheads parsing the cells.

This is a pretty abstract and experimental project, so I won’t blame you if you think I’ve gone mad. But heck – finding new ways to do things is one thing I like to do.

 

How does it work?

1 – File Loading

Skraypar uses Guzzle as a dependency to fetch the HTML, or, if it’s a local file, it simply loads it. The file is loaded into an array, one line per index.
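As a rough sketch of this step (the function name is my own, and PHP’s built-in file_get_contents stands in for Guzzle so the snippet stays dependency-free):

```php
<?php
// Sketch of the loading step: read the source (the real project uses
// Guzzle for remote fetches; file_get_contents stands in here), then
// split it so that each line gets its own array index.
function loadLines(string $source): array
{
    $html = file_get_contents($source); // works for both URLs and local paths

    // Split on any common line ending: one line per index
    return preg_split('/\r\n|\r|\n/', $html);
}
```

From here on, every rule, Iterator, and Look Ahead works against that array of lines.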

1B – Accepting & Rejecting

Fetching from the interwebs means you get to tell Skraypar which HTTP responses to allow and which ones to throw an exception at. By default, 200 (OK) and 303 (See Other) are accepted HTTP responses.
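The accept/reject idea boils down to checking the status code against an allow-list before parsing. A minimal sketch (the function name and shape are my assumptions, not the actual Skraypar API):

```php
<?php
// Throw if the HTTP status code isn't on the accepted list.
// 200 and 303 mirror Skraypar's documented defaults.
function assertAcceptable(int $status, array $accepted = [200, 303]): void
{
    if (!in_array($status, $accepted, true)) {
        throw new RuntimeException("Unacceptable HTTP response: $status");
    }
}
```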

2 – Rules

When you extend the Skraypar class, you have to define a method, loadRules, which adds the rules for Skraypar to remember while parsing.


Rules are patterns paired with callback functions that run on a pattern match. Each rule is checked against every line of the file, and once a rule matches and its callback executes, that particular rule is disabled.
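To make that concrete, here’s a hypothetical subclass sketch. Everything except the loadRules name (which the post does mention) is my own invention, not the real Skraypar interface:

```php
<?php
// Hypothetical sketch of a Skraypar-style subclass: rules are
// pattern/callback pairs, and a rule is disabled once it fires.
class AnimePageParser /* extends Skraypar */
{
    public array $rules = [];
    public array $result = [];

    public function loadRules(): void
    {
        // A rule = a RegEx pattern plus a callback to run on the match
        $this->rules[] = [
            'pattern'  => '~<h1 class="title">(.*?)</h1>~',
            'callback' => function (array $match) {
                $this->result['title'] = trim($match[1]);
            },
        ];
    }

    public function parse(array $lines): void
    {
        $this->loadRules();
        foreach ($lines as $line) {
            foreach ($this->rules as $i => $rule) {
                if (preg_match($rule['pattern'], $line, $m)) {
                    ($rule['callback'])($m);
                    unset($this->rules[$i]); // rule is disabled after it fires
                }
            }
        }
    }
}
```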


3 – Iterators

Iterators are used inside Rule callbacks. Given a breakpoint pattern and a callback, the Iterator loops over each line, executing pattern matches or Look Aheads, until the breakpoint pattern is reached.

If the breakpoint pattern is never found, Skraypar throws an exception saying the parser failed, by pointing at an unset offset in the array of lines from the file (since the counter keeps incrementing).

There can be Iterators within Iterators.
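The behaviour described above can be sketched as a single function (names and signature are my assumptions; the real project wraps this in an Iterator class):

```php
<?php
// Walk lines from an offset, firing callbacks on matches, until the
// breakpoint pattern is hit. If the counter runs past the end of the
// file, fail loudly - mirroring Skraypar's "unset offset" exception.
function iterate(array $lines, int &$i, string $breakpoint, array $rules): void
{
    while (true) {
        if (!isset($lines[$i])) {
            throw new RuntimeException("Parser failed: breakpoint '$breakpoint' never matched");
        }
        if (preg_match($breakpoint, $lines[$i])) {
            return; // breakpoint reached, hand control back to the rule
        }
        foreach ($rules as $pattern => $callback) {
            if (preg_match($pattern, $lines[$i], $m)) {
                $callback($m, $i);
            }
        }
        $i++;
    }
}
```

Nesting another iterate() call inside one of the callbacks gives you the “Iterators within Iterators” case, e.g. a table Iterator whose row callback starts a cell Iterator.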

4 – Look Aheads

Look Aheads are used inside Iterators. Usually, given a pattern match on a line, you could simply access the data on the next line by incrementing the iterator count by 1. But in some cases the data isn’t on the next line; it might be at an offset of 2 lines or more. The location of the data being parsed is dynamic, so the Look Ahead method searches forward for the pattern of that dynamically located data and parses it with a function callback.
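A minimal sketch of that search-forward idea (again, the name and signature are mine, not the real Skraypar Look Ahead method):

```php
<?php
// From the current line, scan forward up to $maxAhead lines for a
// pattern at a dynamic offset; run the callback on the first match.
function lookAhead(array $lines, int $from, string $pattern, callable $callback, int $maxAhead = 10)
{
    for ($offset = 1; $offset <= $maxAhead; $offset++) {
        if (!isset($lines[$from + $offset])) {
            break; // ran off the end of the file
        }
        if (preg_match($pattern, $lines[$from + $offset], $m)) {
            return $callback($m);
        }
    }
    return null; // data not found within the look-ahead window
}
```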

5 – References

Everything is passed, controlled, and set by reference within the Iterator callables. You can pass a reference to the Iterator itself into its own callable, to set responses, use the Look Ahead method of the Iterator class, or manually set the iterator count property to an offset.
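The by-reference part is plain PHP. A tiny demonstration (the variable names are illustrative, not Skraypar’s):

```php
<?php
// The callable receives the iterator's line counter by reference,
// so it can jump the parse position forward itself.
$lineNo = 10;

$callback = function (int &$lineNo): void {
    // e.g. we know the data sits two lines below the match
    $lineNo += 2;
};

$callback($lineNo);
// the change made inside the callable is visible outside it
```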


That’s pretty much it. This project is in development and is to be used as a dependency for the next major Jikan release. It’s not limited to Jikan, though – it can be used on any website or file.

 

No documentation is available at the moment.


Links


Jikan – The Unofficial MyAnimeList REST API

Jikan is a REST-based API that fulfills the requests missing from the official MyAnimeList API for developers. https://jikan.me


Documentation: https://jikan.me/docs

Source: https://github.com/irfan-dahir/jikan

 

Introduction

As the idea of creating my own Anime Database sparked within me, I set out to parse data from an existing website, MyAnimeList, since I utilize it a lot for managing the content I parse through my mind.

Read: Making My Own Anime Database – Part 1 – Making My Own Anime Database – Part 2

I was dumbfounded when I realized that the official API did not support fetching anime or manga details. There was a way to do it via the official API, but it was totally roundabout: you had to use one of their search endpoints, and it would return a list of similar anime/manga with their details.

I could have used AniList’s API, but I was already familiar with scraping data; I’ve done it in a lot of former projects. And so I set out to develop Jikan to fulfill my parent goal of making my own anime database – and it became a project of its own.

History

Jikan was uploaded to GitHub on January the 11th with a single function of scraping anime data.

It wasn’t even called ‘Jikan’ back then, it was called the ‘Unofficial MAL API’. Quite generic, I know.

I settled on the name ‘Jikan’ as it was the only name available for the .me TLD, and it’s a commonly used word in Japanese – ‘time’. The ‘Plan A’ name was ‘Shiro’, but unfortunately everyone seemed to have hogged all the TLDs for it.

With this API, I guess you could say I’d be saving developers some… Jikan – Heh.

 


 

Enter;Jikan

Sounds like a title from the Steins;Gate multiverse.

Anyway, Jikan can provide these details from MAL simply by their MAL ID:

  • Anime
  • Manga
  • Character
  • Person

These are the implemented functions as of now. There are some further planned features.

Search Results

The official API does support this. However:

  1. The response is in XML
  2. It only shows the results of the first page

Jikan will change that by showing results for whatever page you select. And oh – it returns JSON.

Is that it?

Mostly, yes. The reason this API was developed was to give developers very easy access to data which isn’t supported by the official API. And there you go.

 

Get Started

So, what are you waiting for?

Head over to the documentation and get started!

https://jikan.me

Project.Extract Cloud (Alpha) is live

There have been delays, but it’s here. The Alpha version of the CS2D log data extractor, Project.Extract Cloud, is up and running. There’s still some stuff left to do – I’ll explain in a second.

Other than that, you can only extract one file at a time; I might as well keep that as the limit. I’m grateful to be hosted for free by BroHosting as a test of their hosting services, and so far there have been absolutely no critical problems.

 

 


WHAT YOU SHOULD CHECK OUT!

That would be the server statistics functionality – the core of the application lies there. Feel free to drop in whatever log you choose and get as much information out of it as possible!

 

TODO!

Text Searching

The text searching page right now is the bare minimum – it’s simply 5% done. It’ll end up more polished and organized, like the ‘server statistics’ page.

User Database

The user database will be an offline-only feature of PE4; it’ll automatically store player information in a database for you to access easily.

Server Statistics Polishing

As complete as it looks, it’s still a fair way from done. First off, the map graph you see is a complete dummy – it’s not implemented at all. Secondly, there’s some design polishing I need to do. Apart from that, I want to see if I can fit more data and graphs in there.

Usage Statistics

You’ve probably noticed a blank space in the black bar at the top after you click it. What’s meant to go there is a graph of your usage statistics for the browser app. The core functionality of this is complete, but I’m planning to add the graphs and such at the end.

 

PRIVACY

Some of you might be wondering about the log files you’re uploading to the server. I’ll let you know beforehand that these log files are stored. The reason is that they’re cached in case you reload the page. A JSON version of the extracted contents is stored as well.

When I release the beta, what I said will still apply to the offline version of Project.Extract, but the cloud version won’t store anything. Nothing will be cached.

 

That’s it for now, until the beta phase.

Migrating towards Ubuntu

Well, it has been quite a while since my last post, which was in fact about me working on a CMS. To be quite frank, I’ve sort of ditched that project for now. Lots has happened; one thing was ditching Windows XP (well, it’s still installed side by side with Ubuntu) and fetching a more developer-oriented OS. All I can say is that Ubuntu is perfect.

It’s 5:21 in the morning and I’m still tinkering with new additions. So, what’s this post about? Well, first off, it’s a fangirling post, and second off, some new stuff I had to adapt to as a web developer.


This is one of the projects I’ve been working on. It is quite a bit bigger than my CMS project. Rather than storing the theme links and information in configuration files and parsing them from there, this is done directly from the database via MySQLi.

One thing I love about Ubuntu is the sheer number of virtual desktops you can create. See all those tabs in the task bar? They’re all OPEN – just on different desktops. I can switch desktops easily with Alt + (left/right or mouse scroll).

Another great thing is the lightweight LXQt desktop. It’s still in alpha as of this writing, but it’s already modifiable and I’ll probably be using it forever. Note: I’m running on an old potato with 512 MB of RAM with all these features, and I’m quite amazed. This is probably the most normal reaction for a Windows -> Ubuntu migrating individual.

Instead of using phpMyAdmin, I’ve taken a step forward and am doing mostly everything from the bash command line. It’s easy to adapt to and quite fun.

One thing I had to adapt to was file paths and permissions. They’re easy once you get to understand them, but one catch with requiring/including files in PHP is that you have to do it from the base.

You can’t include a file like this: <?php require 'core/framework.php' ?>

It’s more like this: <?php require '/var/www/project_folder/core/framework.php' ?>

Well, I’m sure there are other ways and hacks for this but I’ll probably learn them later. o.o
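For what it’s worth, one standard way around hard-coding absolute paths (this is a fact about PHP itself, not something from my setup at the time) is the __DIR__ magic constant:

```php
<?php
// __DIR__ expands to the directory of the file containing it, so paths
// built from it keep working no matter which working directory the
// script was launched from.
$frameworkPath = __DIR__ . '/core/framework.php';
// require $frameworkPath; // same effect as the absolute /var/www/... path
```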

Well, I suppose that’s it for now. I’ll be writing soon about the project I’m working on and my new achievements with it sooner or later (it’s huge ;-;).

The Path To A CMS (Part 3)

Alright! So I managed to complete the article module as well as the theme module! The theme module will be able to load any theme, given that it’s properly configured as I described in my previous post. Other than that, I’ve made a default template for it which is even responsive (yay!). I’ll be completing the template alongside the rest of the CMS. What’s left is the design, the admin panel, and the articles page. The articles page shouldn’t take long; I’ll need to implement a URL parser for it – which reminds me! I managed to implement pretty URLs for the pages after a thirty-minute struggle over how to fix the broken style sheets and index links.

Unlike my prototype articles class, this one loads directly from a database using MySQLi.
It stores the articles that are to be shown on the page in a public array, which can be read from the index page after instantiating the Articles class. Doing it this way means you can use it in any theme you add, wherever you like.

Your CMS’s design looks quite familiar to the one you’re using on WordPress.
hahaha, what? Nonsense. >.>

Portfolio Update

Well, it has been a while since my last update due to some unforeseen circumstances (laptop got #rekt lol). This shall be my first update since then. My portfolio, as it was, was the epitome of a mess: fonts weighing in at over 3 MB of download size, killing the user’s bandwidth, some parts of the design a bit off, and of course, some backend work left undone.

In the space of a few hours I managed to re-gift the portfolio a sense of design and patched up some loose ends. I still haven’t worked on the administration panel, but who cares about that – it’s not like anyone other than me requires it.

Some design changes involved the navigation area as well as the fixed avatar of myself I had. That, and the introduction area: I removed it and added my awesome name with a crap ton of padding, plus a slide-down link towards the portfolio.


The portfolio section was already almost finished last time. This time I just added a better download backend to keep track of download hits, plus a counter showing them on the button. The icons are now to the left of the text.


The ‘About Me’ section is pretty much the same, except I increased the percentage in my visual skill set levels. lawl

Other than that, I pondered a good 20 minutes over the submit button for the contact area. The old button style was making me cringe, so I suppose this one might as well be ‘The One’ currently… hopefully.

I suppose that’s it.
Oh, did I mention I got a new theme for Sublime Text? I swear this makes everything I write significantly even more beautiful. *~*

The Path To A CMS (part 2)

So far I’d had no clue how to manage themes, but after a little pondering I managed to come up with a solution. I was too lazy to google it anyway, and I believe this solution fits me best because I created it. No, I do not care whether it is an established method; I manifested it within my own ideologies and therefore it is mein.

Anyhow, this is pretty much how it will manage the templates.
It will load themes as theme1, theme2, theme3 and so on using the swagging configuration parser. Then, using the values of those keys, I’ll take control of the explode() function with the ‘.’ delimiter to get the correct values.

[Screenshot: the themes class]

 

 

[Screenshot: the themes config]

This is a prototype of what the themes ‘database’ would look like. If you’re wondering what the ‘.pb’ extension is, that’s the CMS’s own configuration extension, abbreviated from ‘project blog’.
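Since the screenshots don’t reproduce here, a rough sketch of the idea: keys like theme1, theme2, … hold ‘.’-delimited values that get split back apart with explode(). The field order and names in this sketch are my guesses, not the actual .pb format:

```php
<?php
// Hypothetical theme 'database': each key maps to '.'-delimited fields.
$config = [
    'theme1' => 'Default.default_theme.Irfan',
    'theme2' => 'Darkness.dark_theme.Irfan',
];

$themes = [];
foreach ($config as $key => $value) {
    // explode() with the '.' delimiter pulls the individual values back out
    [$name, $folder, $author] = explode('.', $value);
    $themes[$key] = ['name' => $name, 'folder' => $folder, 'author' => $author];
}
```

One caveat of this scheme is that none of the fields themselves may contain a ‘.’, since explode() would split them too.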

So yeah, I suppose this concludes the theory behind the themes management module. I’ll post updates on how it goes in upcoming posts. They might be about another module, as I lack the ability to finish one module at a time and I usually end up doing bits and pieces of everything all together. Oh well.