Skraypar: Pattern parsing with Iterators and Look Aheads

You’ll often be told not to parse HTML with RegEx – but what if you’re a rebel?

WHY YOU SHOULDN’T PARSE HTML WITH REGEX

Clicky.

WHY YOU COULD PARSE WITH REGEX

Parsing static templates with RegEx is pretty easy. The basic course of action is to match the line you want, then either add capture groups to the RegEx or get your hands dirty and polish the data out of that abhorrent line of HTML.


I made a successful RESTful service, Jikan.moe, using nothing but RegEx. This didn’t require any extra dependencies, libraries, yadda yadda. Speed wasn’t a concern either, since the parse was pretty quick.


 

 

What am I going on about?

Enter;Skraypar

With a terrible choice of a name, I began to simplify the repetitive tasks of parsing HTML with RegEx: pattern matching, loops, and so on.

Skraypar is an abstract PHP class that parses by pattern matching, using Iterators and Look Aheads.

The parsing tasks split into 2.

  • (Inception) Pattern matching & callback on the line of match – Iterators
  • Additional pattern matching and callbacks within Iterators for dynamic HTML location – Look Aheads

 

Think of it as one Iterator matching a table, another Iterator matching the rows, and the Look Aheads parsing the cells.

This is a pretty abstract and experimental project, I won’t blame you if you think I’ve gone mad. But heck – finding new ways to do things is one thing I like to do.

 

How does it work

1 – File Loading

Skraypar uses Guzzle as a dependency to fetch the HTML; if it’s a local file, it simply loads it. The file is loaded into an array, one line per index.

1B – Accepting & Rejecting

Fetching from the interwebs means you get to tell Skraypar which HTTP responses to allow and which ones to throw an exception at. By default, 200 (OK) and 303 (See Other) are accepted HTTP responses.
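Here’s a rough sketch of steps 1 and 1B in Python. Skraypar itself is PHP and uses Guzzle, so none of these names (check_status, to_lines, load_local, RejectedResponse) come from its code; they’re made up for illustration. Only the default accepted statuses are taken from the description above.

```python
from pathlib import Path

# Default accepted statuses, as named in the post.
ACCEPTED_STATUSES = frozenset({200, 303})


class RejectedResponse(Exception):
    """Raised when a response status is not in the accepted set (hypothetical name)."""


def check_status(status, accepted=ACCEPTED_STATUSES):
    # Accept the response or throw, mirroring the accept/reject step.
    if status not in accepted:
        raise RejectedResponse(f"HTTP {status} is not an accepted response")
    return status


def to_lines(html):
    # One array index per line of the file.
    return html.splitlines()


def load_local(path):
    # Local files skip the HTTP step entirely and are simply loaded.
    return to_lines(Path(path).read_text(encoding="utf-8"))
```

Passing a different set as accepted is how you’d allow or reject other responses in this sketch.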

2 – Rules

When you extend Skraypar with your own class, you have to define a method named loadRules, which adds the rules for Skraypar to remember when parsing.


Rules are patterns paired with callback functions for that pattern match. Every rule is checked against every line of the file; once a rule matches and its callback executes, that particular rule is disabled.
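To make the rule idea concrete, here’s a minimal Python sketch. It is not Skraypar’s API (that’s PHP, and undocumented for now); Rule and run_rules are hypothetical names, and the only behaviour borrowed from the description is "pattern plus callback, disabled after the first match".

```python
import re


class Rule:
    """A pattern plus the callback to run on the first line that matches it."""

    def __init__(self, pattern, callback):
        self.pattern = re.compile(pattern)
        self.callback = callback
        self.enabled = True


def run_rules(lines, rules):
    # Check every enabled rule against every line.
    for index, line in enumerate(lines):
        for rule in rules:
            if not rule.enabled:
                continue
            match = rule.pattern.search(line)
            if match:
                rule.callback(match, index)
                rule.enabled = False  # a rule fires once, then is disabled


result = {}
lines = ["<html>", "<title>Cowboy Bebop</title>", "<title>ignored</title>"]
rules = [Rule(r"<title>(.*?)</title>", lambda m, i: result.update(title=m.group(1)))]
run_rules(lines, rules)
```

The second title line never reaches the callback because the rule was disabled after its first match.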


3 – Iterators

Iterators are used inside Rule callbacks. You set a breakpoint pattern and a callback; the Iterator then loops over each line, executing a pattern match or Look Aheads, until the breakpoint pattern is reached.

If the breakpoint pattern is never found, the Iterator keeps incrementing past the end of the file, and Skraypar throws an exception that the parser failed, pointing to an unset offset in the array of lines.

There can be Iterators within Iterators.
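A hedged sketch of the Iterator behaviour in Python (again, not the real PHP class; iterate and ParserError are names I’m inventing here). It walks the lines from a starting offset, calls the callback on each one, stops at the breakpoint pattern, and fails loudly if it runs off the end of the array.

```python
import re


class ParserError(Exception):
    """Raised when the breakpoint pattern is never reached (hypothetical name)."""


def iterate(lines, start, breakpoint_pattern, callback):
    """Call callback(lines, i) for each line from `start` until the breakpoint
    pattern matches. Returns the index of the breakpoint line."""
    stop = re.compile(breakpoint_pattern)
    i = start
    while True:
        if i >= len(lines):
            raise ParserError("breakpoint pattern not found; ran past the last line")
        if stop.search(lines[i]):
            return i
        callback(lines, i)
        i += 1


rows = []
lines = ["<table>", "<tr>A</tr>", "<tr>B</tr>", "</table>", "<p>after</p>"]


def collect_row(lines, i):
    match = re.search(r"<tr>(.*?)</tr>", lines[i])
    if match:
        rows.append(match.group(1))


end = iterate(lines, 1, r"</table>", collect_row)
```

Since the callback receives the lines and the current index, it can start another iterate call from there, which is the Iterators-within-Iterators case.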

4 – Look Aheads

Look Aheads are used inside Iterators. Usually, once a line matches a pattern, you could simply access the data on the next line by incrementing the iterator count by 1. But in some cases the data isn’t on the next line; it might be 2 lines down instead. The location of the data is dynamic, so a Look Ahead searches the following lines for a pattern matching that data and parses it with a callback function.
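Roughly, a Look Ahead could look like this in Python. This is a sketch under assumptions: look_ahead is a hypothetical function, and the max_lines cap is my own addition, since the post doesn’t say how far Skraypar looks.

```python
import re


def look_ahead(lines, start, pattern, callback, max_lines=5):
    """Search the lines after `start` for `pattern` and run `callback` on the
    first match. Returns the callback's result, or None if nothing matched."""
    compiled = re.compile(pattern)
    for offset in range(1, max_lines + 1):
        index = start + offset
        if index >= len(lines):
            break
        match = compiled.search(lines[index])
        if match:
            return callback(match)
    return None


# The value sits 2 lines below the label here, but it could just as well be 1.
lines = ['<span class="label">Score:</span>', "", "  <span>8.81</span>"]
score = look_ahead(lines, 0, r"<span>([\d.]+)</span>", lambda m: float(m.group(1)))
```

The caller doesn’t need to know the exact offset of the data, only what the data looks like.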

5 – References

Everything is passed, controlled and set by reference within the Iterator callables. You can pass a reference to the Iterator itself into its own callable to set responses, use the Look Ahead method of the Iterator class, or manually set the iterator count property to an offset.


That’s pretty much it. This project is in development and is to be used as a dependency for the next major Jikan release. It’s not limited to Jikan, it can be used on any website or file.

 

No documentation is available at the moment.



Jikan API – Vision 2018 🎆 [Unofficial MyAnimeList API]

So it’s 2018 and Jikan is now 1 year old! MyAnimeList announced in late 2017 that they’ll be working on fixing up their API, but until then I’ll have Jikan running around. I have some plans for Jikan that need to be done, hopefully by mid-2018 or earlier, depending on college.

 

READ

 

There are some things I’m still interested in scraping off of MAL, here’s the list.

 

User Profile

Taking an example of my own profile;

 

There’s a lot of data available per user profile. The best part here would be their favorite characters, people, anime, manga and basic stats. The hardest part to extract here would be the user based “About Me” which is highly customizable. So this, I might consider parsing since MAL’s HTML source is already terrible enough.

 

Top Anime/Manga/People/Characters

These pages give you access to a paginated list of anime/manga/people/characters ranked by their popularity/favoritism by the community from #1 to the last ranking available. Tis a gold mine entry.

 

Anime/Manga/Person/Character Search!

The official MAL API already has this feature but it only returns the first page of results! It only allows simple string queries and requires user authentication for the API call to work, which is what Jikan is meant to overcome. This has been a requested feature, so I’ll most likely be working on a parser for this in the months to come.

 

 

Extended Data for Anime/Manga

This has been on the cards for Jikan since the beginning, but I held off on any extended parsing other than characters/staff and episodes until recently, when I began making scrapers for Pictures, Videos & News related to the item. This trend will continue as there are more pages with interesting data about an anime or manga. Especially the reviews page, since this has the best data for sentiment analysis and averaging of any show or manga.

 

Will be focusing on these 4 for this year! It takes time to mine pure data since scraping HTML off MAL means a lot of weird and round-about ways of doing things!

Jikan REST 2.0 – Developers Preview + November Update

A basic app utilizing Jikan would require data on any anime or manga, and furthermore on the characters and staff members. These 4 types of data are essential to any app for the masses and Jikan can now robustly cover any app developer in those areas.

 

Tl;Dr: https://jikan.me/api/v2.0-dev/

Note: There is no doc available for this endpoint as of yet, you’ll have to play by the data responses.

 

without further ado

It’s been a year since I started on Jikan and half a year since the REST API went up. To get this out of the way – I’m immensely excited to announce that a complete rewrite of the Jikan PHP API has been completed. Making the API more:

  • Friendly to developers for contribution
  • Cleaner responses + fewer bugs
  • Easier installation
  • More Robust
  • PSR-4/Autoloading

 

Now what’s left is the rewrite of the REST API. I’ve selected Lumen as the micro-framework to handle Jikan REST requests. And that’s currently in the works as I wrap my head around the features of this framework.

But my excitement could not be held back and I really wanted to see the new API in action – spitting out nicely formatted JSON without any malformed sorts of data. I quickly set up a new endpoint using the old REST API code – producing a developers endpoint.

And I hereby present: https://jikan.me/api/v2.0-dev/

You’ll notice a massive difference from the v1.1 or v1.2 REST version as this version of the API is equipped with the latest Jikan PHP commits. Now let me show you the possible type of requests.

 

  1. http://jikan.me/api/v2.0-dev/anime/1
  2. http://jikan.me/api/v2.0-dev/anime/1/characters_staff
  3. http://jikan.me/api/v2.0-dev/anime/1/episodes
  4. http://jikan.me/api/v2.0-dev/anime/21/episodes/1 – Episode pages are now paginated if there are more than 100 episodes; a key named episode_last_page tells you how many pages the episodes are split into.
  5. http://jikan.me/api/v2.0-dev/anime/21/episodes/2
  6. http://jikan.me/api/v2.0-dev/manga/1
  7. http://jikan.me/api/v2.0-dev/manga/1/characters
  8. http://jikan.me/api/v2.0-dev/person/1
  9. http://jikan.me/api/v2.0-dev/character/1
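As a hedged example of how a client might walk the paginated episodes endpoint, here’s a small Python sketch. The URL layout and the episode_last_page key come from the list above; the "episode" key holding the list of episodes is an assumption on my part (there’s no doc for this endpoint yet), and build_url, fetch_all_episodes and fetch_json are hypothetical names. fetch_json is whatever function you use to GET a URL and decode the JSON.

```python
BASE_URL = "https://jikan.me/api/v2.0-dev"


def build_url(kind, item_id, *extra):
    # e.g. build_url("anime", 21, "episodes", 2)
    return "/".join([BASE_URL, kind, str(item_id), *map(str, extra)])


def fetch_all_episodes(anime_id, fetch_json):
    """Collect episodes across every page.

    fetch_json: any callable that takes a URL and returns the decoded JSON dict.
    """
    first = fetch_json(build_url("anime", anime_id, "episodes", 1))
    episodes = list(first.get("episode", []))  # "episode" key is an assumption
    last_page = int(first.get("episode_last_page", 1))
    for page in range(2, last_page + 1):
        data = fetch_json(build_url("anime", anime_id, "episodes", page))
        episodes.extend(data.get("episode", []))
    return episodes
```

Keeping the HTTP call outside the function means you can plug in any client, or a fake one for testing.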

 

With these core prospects for the API being stable and robust, it’s time to focus on implementing more endpoints for scraping more data out of an anime, or the most required function – the search endpoint.

 

the success of this project

I’ve been contacted by a plethora of developers regarding the usage/feedback/etc of this project. Everyone’s happy – I’m happy. There’s a working, easy to use API that can tell you anything about your favorite Japanese cartoon and I think that’s what matters the most.

Currently there’s a popular and active Android app, namely AnYme, that’s utilizing Jikan for its data. You can check it out here: https://github.com/zunjae/anYme

The usage of Jikan has been very successful – there are thousands of requests spread across hundreds of clients daily. Here’s a small chart on the usage since we kicked off back in May.

jikan stats chart

 

what’s in store next?

The next thing to be accomplished is going to be REST v2.0. This will be based on the Lumen framework and a much faster server – thanks to a friend of mine. The base endpoint would be api.jikan.me, instead of what we have now.

After that – I’ll see what’s next on the agenda.

 

oh by the way

Did I mention that Jikan is now available on packagist.org/composer? You can install it as a dependency in your PHP project as simply as: composer require jikan-me/jikan 

Jikan – The Unofficial MyAnimeList REST API

Jikan is a REST based API that fulfills the lacking requests of the official MyAnimeList API for developers. https://jikan.me


Documentation: https://jikan.me/docs

Source: https://github.com/irfan-dahir/jikan

 

Introduction

As the idea of creating my own Anime Database sparked within me, I set out to parse data from an existing website, MyAnimeList, since I utilize it a lot for managing the content I parse through my mind.

Read: Making My Own Anime Database – Part 1 – Making My Own Anime Database – Part 2

I was dumbfounded when I realized that the official API did not support fetching anime or manga details. There was a way to do this via the official API but it was totally round-about. You had to use one of their API endpoints where you searched for a query and it would return a list of similar anime/manga with their details.

I could have used AniList’s API but I was already familiar with scraping data. I’ve done this before in a lot of former projects. And so I set out to develop Jikan to fulfill my parent goal: to make my own anime database. And so it became a project of its own.

History

Jikan was uploaded to GitHub on January the 11th with a single function of scraping anime data.

It wasn’t even called ‘Jikan’ back then, it was called the ‘Unofficial MAL API’. Quite generic, I know.

I came to terms with the name ‘Jikan’ as it was the only domain name available for the .me TLD and it’s a commonly used word in Japanese – ‘Time’. The ‘Plan A’ name was ‘Shiro’, but unfortunately everyone seemed to have hogged all the TLDs for it.

With this API, I guess you could say I’d be saving developers some … Jikan – Heh.

 


 

Enter;Jikan

Sounds like a title from the Steins;Gate multiverse.

Anyways, Jikan can provide these details from MAL simply by their ID on MAL:

  • Anime
  • Manga
  • Character
  • Person

These are the implemented functions as of now. There are some further planned features.

Search Results

The official API does support this. However;

  1. The response is in XML
  2. It only shows the results of the first page

Jikan will change that by showing results for whatever page you select. And oh – it returns in JSON.

Is that it?

Mostly, yes. This API was developed to give developers very easy access to data which isn’t supported by the official API. And there you go.

 

Get Started

So, what are you waiting for?

Head over to the documentation and get started!

https://jikan.me

Making my own Anime Database – Part 2

More than 5 months have passed since I posted about making my own Anime Database, yet it does not age. It’s time to get back. Anime Database!

Read Part 1: https://irfandahir.wordpress.com/2016/12/21/making-my-own-anime-database-part-1/

Apart from that terrible reference, it is indeed time to tell you where the anime database stands right now. But first of all, I thought it would be best to clear up what this is all really about since my former related post was just really me typing at 200wpm while breathing heavily as the idea held a cast over me.

What is this sh*t?

There’s a bunch of anime databases out there apart from MyAnimeList, such as Anime Planet, AniDB and Anime News Network to name a few. Websites like these contain anime/manga/novel entries which detail the item. It can be compared to IMDB which does the same – except for movies. Sometimes, it’s useful to integrate a RESTful API which can allow developers to fetch these item details from your databases and add them to their own applications. Because the last thing we want to do is input all the anime/manga data into our own databases using traditional methods. Why not let the computer do it for us, amirite?

rest_api

via https://codeplanet.io/principles-good-restful-api-design/

Now, back to MyAnimeList. MAL has an API but it’s very lacking. You can’t fetch anime, manga, people or even character details directly. Furthermore, the output is in XML rather than JSON. 😦

Okay, what now?

So what do we do? We create our own. Let’s say that now we have an API that can fetch any anime or manga data via their link through means of Scraping.

Let’s talk about Scraping. Scraping is a method that fetches the web page and goes through all the nicely written /s HTML code using an algorithm that extracts the information you need from that web page. When there’s no API, this is the only solution. This or we use another service that provides an API but I really wanted to see how far I could go with this project – so why not?

What’s left?

We now have code that scrapes the web page and returns juicy data that you can cache/save/add/whatever. This requires you to provide the algorithm a link to the page you want to be scraped, but there are hundreds of thousands of anime and manga out there. It would be ridiculous to leave that to human hands. This is where the Crawler comes in.

The Crawler

What a ‘Crawler’ generally does is start at some page and scan that page for other links. Those other links get saved and then it visits those links, and this recursively keeps on going and going and going.


Now as the crawler is doing its job, the scraper is going through the cache of links that is being populated and gets the data from those. This is basically how search engines index pages.

But we’re making a really specific crawler. What I’m looking for are links to anime entries within MAL, as I mentioned before. They fall under this pattern: https://myanimelist.net/anime/{anime id}

The crawler looks for links with this pattern and saves them, and then we have the scraper go through them and we get an indexed database!
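Picking those links out of a page is a one-regex job. Here’s a small Python sketch of that step only (the actual project is PHP; ANIME_LINK and extract_anime_ids are hypothetical names, and the only thing taken from above is the link pattern itself).

```python
import re

# https://myanimelist.net/anime/{anime id}
ANIME_LINK = re.compile(r"https://myanimelist\.net/anime/(\d+)")


def extract_anime_ids(html):
    """Return the unique anime IDs linked from a page, in order of first appearance."""
    ids = (int(match.group(1)) for match in ANIME_LINK.finditer(html))
    return list(dict.fromkeys(ids))  # dict keys keep insertion order and drop repeats


page = """
<a href="https://myanimelist.net/anime/1/Cowboy_Bebop">Cowboy Bebop</a>
<a href="https://myanimelist.net/manga/2/Berserk">Berserk</a>
<a href="https://myanimelist.net/anime/5114">FMA:B</a>
<a href="https://myanimelist.net/anime/1/Cowboy_Bebop/reviews">Reviews</a>
"""
anime_ids = extract_anime_ids(page)
```

Sub-pages like /reviews collapse into the same ID, so each anime only lands in the links cache once.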


What’s new?

Due to busy college life and other projects, I’ve been unable to give this my complete attention and finish it. However, as summer approaches, I find myself once again with a lot of time on my hands.

Realizing that MyAnimeList was lacking a simple API to fetch anime or manga details, I decided to create my own. I teased a few screenshots at the end of the previous related post as well. I basically decided to create an unofficial API that lets you simply do what you can’t do from the official API.

Meet ‘Jikan’ – The Unofficial MyAnimeList API

Github: https://github.com/irfan-dahir/jikan

This is the Scraper I’ve been talking about, it’s written in PHP and OOP. So far it can fetch Anime, Manga and some Character details. It’s going to be a lot more, very soon.

Hell, I even got a domain for it: http://jikan.me, although there is nothing to be seen there at the moment. For now, I plan on hosting the API there once it’s finished, so others can utilize it with ease as well. Jikan returns data in JSON format with a simple, RESTful GET request.

It seems I’ve gotten quite side tracked. Right now I have a solid algorithm to fetch the details required to make an Anime database. The next obvious step would be to make a robust crawler, right?

 

No.

That would double bandwidth and processing power. Each page will be required to be downloaded and scanned twice. Once for the crawler, once for the scraper. I do realize that I previously used the crawler method and got a list of quite a few anime with their details but it was not until a few days later I realized that MAL had a sitemap.

According to this and this we have two less time-consuming methods. The first is a sitemap of anime listings meant for crawlers/search engines. The second is a method to download a huge list of entries using wildcards in the search. Personally, I have a terrible internet speed and just want to confirm that this works by testing my API against the data it scrapes. The sitemap goes up to 33,000 anime IDs whereas the wildcard search yields more than 107,000 anime IDs! I’ll go with the former, which covers 30~ish % of the entries.

You can also get the sitemap of manga, characters, people, news, featured articles, etc from https://myanimelist.net/sitemap/index.xml too. Pretty useful.

So we not only saved time – we’re also less prone to break MAL terms and conditions. >.>

A-Anyways. We’re down to downloading and populating our personal database.

The Process

  1. Create a links file from that XML file
  2. Write a basic script to load that file and use our API to fetch the data from those links
  3. That’s pretty much it.
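Step 1 can be sketched in a few lines of Python. This is not the script I used; it assumes the sitemap follows the standard sitemaps.org format (a urlset of url/loc elements), and links_from_sitemap and write_links_file are made-up names.

```python
import re
import xml.etree.ElementTree as ET

# Standard sitemap protocol namespace (assumed to be what the MAL sitemap uses).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
ANIME_URL = re.compile(r"^https://myanimelist\.net/anime/\d+")


def links_from_sitemap(xml_text):
    """Return every anime link listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    links = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if ANIME_URL.match(url):
            links.append(url)
    return links


def write_links_file(links, path):
    # One link per line, ready for the scraping script to load.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(links) + "\n")
```

Filtering on the anime URL pattern keeps manga, news and other entries out of the links file if they happen to share a sitemap.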

1 – Making the list of links

We created a links file from the XML and ended up with 12,096 links. Against IDs that go up to 33,000, this pretty much shows how many anime IDs don’t point to an actual entry.

2 – Using our API to go through these links and scrape the data

I’ll be using the power of my shitty internet and laptop to do this, therefore no VPS will be used to induce a DoS attack through these requests.

 

 

Of course, it’s not that fast. I just commented out the scraping part before running it. It will, however, look like that.

Here’s the code that was used: https://gist.github.com/irfan-dahir/70a51ba26a03161db6d451d855944e47

 

 

3 – That’s pretty much it!

Anime details get stored in a JSON file and I’m able to load them whenever needed. There is no user interface to show it to you but I could dump the JSON to my GitHub once I get enough data.
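For a feel of steps 2 and 3 together, here’s a minimal Python sketch. It is not the gist linked above (that one is PHP and uses Jikan); scrape_links and save_json are hypothetical names, scrape stands in for whatever function turns a link into a dict of anime details, and the one-second pause between requests is my own assumption for keeping things polite.

```python
import json
import time


def scrape_links(links, scrape, delay_seconds=1.0, sleep=time.sleep):
    """Call scrape(link) for every link, pausing between requests.

    A failing link is recorded with its error message instead of stopping the run.
    """
    results = {}
    for position, link in enumerate(links):
        if position:
            sleep(delay_seconds)  # no pause before the very first request
        try:
            results[link] = scrape(link)
        except Exception as error:
            results[link] = {"error": str(error)}
    return results


def save_json(results, path):
    # Everything goes into one JSON file that can be loaded back whenever needed.
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(results, handle, indent=2)
```

With twelve thousand links, recording failures and carrying on beats restarting the whole run because one page hiccuped.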

 

This concludes my own Anime Database. But there’s more to it. The interest in having an offline version of an anime database led to me developing a MAL API. And there are upcoming updates for that!

I’ll be sure to post some stats when the scraping completes.

Making My Own Anime Database (part 1)

WHY?

Simply, I wanted to build a recursive web scraper/crawler, and an updated anime database parsed in JSON was lacking on GitHub. And I’m doing so! So what exactly are the steps to make your own anime database?

First off, you can’t be doing manual data entries. You need a web crawler. And I’m targeting MyAnimeList. Not in any bad sense, love the site. o.o

MAL has its own API but it’s terrible. You cannot retrieve anime info without 3rd party APIs and wrappers. I’ve made Stats::Extract, which extracts data from an HTML file, so this shouldn’t be too advanced for me.

THE STEPS

  1. Make the Crawler (done!)
  2. Make the Wrapper (working on it!)
  3. Make the Scraper (not even a single line)

In this post, I’ll focus on

MAKING THE CRAWLER

The crawler is a script that requires an entry point, a link if you will, to the web page and then from there it searches for whatever you’re looking for. In my case, I’m targeting the Anime (will do the mangas too).

The entry point is: https://myanimelist.net

What I’m looking for: https://myanimelist.net/anime/{anime id}

So, after crawling into the entry point, it looks for anime page links and adds them to the “queue pool”. But it doesn’t end there. It does its job as a crawler and iterates through the queue pool, loading each and every page and adding more links extracted from those pages to the pool! Now, this is a long process.
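The queue pool loop is basically a breadth-first crawl. Below is a hedged Python sketch of it, not my actual PHP crawler: crawl, fetch and extract_links are hypothetical names, and the max_pages cap is an assumption I added so the loop can’t run forever.

```python
from collections import deque


def crawl(entry_point, fetch, extract_links, max_pages=100):
    """Breadth-first crawl starting at `entry_point`.

    fetch(url) returns the page HTML; extract_links(html) returns the links
    worth following. Returns the URLs in the order they were visited.
    """
    queue = deque([entry_point])  # the "queue pool"
    seen = {entry_point}          # never queue the same link twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in extract_links(fetch(url)):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

The seen set is what keeps two pages that link to each other from bouncing the crawler back and forth.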

If you understood what I’m having it do, you might ask why in the world I don’t extract the anime info using a wrapper since I’m already on its web page?! Well, you see, by the time I was done, I realized that the process was so slow. I’ve started researching multithreading/forking in PHP so I can utilize that on the Scraper instead. Furthermore, I had the scraper only go through 2000~ anime listings until I got tired of it. It proved my point, it was working. I could use it for anime that get newly added in the database or something.

I got the rest of the animes from users on MAL which had the most watched anime entries.

The crawler is completely CLI (command line interface). The Wrapper will be a PHP Library and the Scraper will be CLI too.

I’ll release the source code on GitHub when it’s in a presentable state (soon).

THE PLAN

  1. Make a basic wrapper which fetches anime information (such as name, episodes, studios, producers, ratings, date aired, genre, etc). This would be a simple wrapper for the database which doesn’t need all the information stored on MAL anime pages.
  2. Make a scraper with multithreading/forking that goes through the MAL links I’ve collected so far, fetches their data and builds my database.
  3. Re-write the wrapper as a complete NON-AUTHENTICATION API to fetch each and everything about anime, manga, people, character, etc. Basically a complete wrapper for the whole site. And release it on GitHub because MAL’s own API is lackluster.
  4. Re-write the scraper with the crawler and the wrapper as its main components. So this time, asynchronously, the scraper will add anime links to the pool and extract the anime information on those pages directly. This could probably be the ultimate MAL Scraper.

That is, if I get it done.

Oh, and a sneak peek at the wrapper.

wrapper

 

Part 2: https://irfandahir.wordpress.com/2017/05/13/making-my-own-anime-database-part-2/

Project.Extract Cloud (Alpha) is live

There’s been delays but it’s here. The Alpha version of the CS2D log data extractor, Project.Extract Cloud, is up and running. There’s still some stuff left to do. I’ll explain this in a second.

Other than that you can only extract 1 file. I might as well set this as the limit. I’m gratified to be hosted for free by BroHosting as a test of their hosting services, and so far there have been absolutely no critical problems.

 

 


WHAT YOU SHOULD CHECK OUT!

That would be the server statistics functionality. The core of the application lies within there. Feel free to drop in whatever log of your choice and get as much information out of it as possible!

 

TODO!

Text Searching

The text searching page right now is bare minimum; it’s only 5% done. It’ll look more polished and organized, like the ‘server statistics’ page.

User Database

User database will be an offline-only feature of PE4; it’ll automatically store player information as a database for you to easily access.

Server Statistics Polishing

As complete as it looks, it’s still a bit far from done. First off, the map graph you see is a complete dummy. It’s not implemented at all. Secondly, there is some design polishing I need to do. Apart from that I want to see if I can fit more data and graphs in there.

Usage Statistics

You’ve probably noticed a blank space in the black bar at the top after you click it. What’s meant to be stored there is a graph of your usage statistics of the browser app. The core functionality of this is complete but I’m planning to add the graphs and such at the end.

 

PRIVACY

Some of you might be wondering about the log files that you’re uploading to the server. I’ll let you know beforehand that these log files are stored. The reason for this is that they’re cached in case you reload the page. A JSON version of the extracted contents is stored as well.

When I release beta, what I said will still be applicable to your offline version of Project.Extract but the cloud version won’t store anything. Nothing will be cached.

 

That’s it for now, until the beta phase.