December 2016 – Irfan Dahir

WHY?

Simply put, I wanted to build a recursive web scraper/crawler, and an up-to-date anime database in JSON was lacking on GitHub. So I’m building one! So what exactly are the steps to make your own anime database?

First off, you can’t be entering data manually. You need a web crawler, and I’m targeting MyAnimeList. Not in any bad sense, I love the site. o.o

MAL has its own API, but it’s terrible. You cannot retrieve anime info without third-party APIs and wrappers. I’ve made Stats::Extract, which extracts data from an HTML file, so this shouldn’t be too advanced for me.

THE STEPS

Make the Crawler (done!)
Make the Wrapper (working on it!)
Make the Scraper (not even a single line)

In this post, I’ll focus on

MAKING THE CRAWLER

The crawler is a script that requires an entry point (a link, if you will) to a web page, and from there it searches for whatever you’re looking for. In my case, I’m targeting the anime pages (I’ll do manga too).

The entry point is: https://myanimelist.net

What I’m looking for: https://myanimelist.net/anime/{anime id}

So, after crawling into the entry point, it looks for anime page links and adds them to the “queue pool”. But it doesn’t end there. It does its job as a crawler and iterates through the queue pool, loading each and every page and adding more links extracted from those pages to the pool! Now, this is a long process.
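To make that loop concrete, here’s a minimal sketch of the idea. It’s in Python purely for brevity and is not the actual crawler source (which isn’t shown in this post). The function name `crawl`, the injected `fetch` callback, the `max_pages` cap and the fake pages are all assumptions made up for illustration.

```python
import re
from collections import deque

# Matches anime page links such as https://myanimelist.net/anime/1
ANIME_LINK = re.compile(r"https://myanimelist\.net/anime/(\d+)")


def crawl(entry_point, fetch, max_pages=100):
    """Breadth-first crawl from entry_point, returning the anime ids found.

    fetch is any callable that takes a URL and returns that page's HTML.
    It is injected so the loop can be tried without touching the network.
    """
    queue = deque([entry_point])  # the "queue pool" of pages still to load
    seen = set()                  # anime ids already queued, to avoid loops
    pages_loaded = 0

    while queue and pages_loaded < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages_loaded += 1
        # Every new anime link found on this page goes back into the pool.
        for anime_id in ANIME_LINK.findall(html):
            if anime_id not in seen:
                seen.add(anime_id)
                queue.append(f"https://myanimelist.net/anime/{anime_id}")

    return sorted(seen, key=int)


# Tiny fake site standing in for real pages (hypothetical content).
FAKE_PAGES = {
    "https://myanimelist.net":
        '<a href="https://myanimelist.net/anime/1">A</a>',
    "https://myanimelist.net/anime/1":
        '<a href="https://myanimelist.net/anime/5">B</a>'
        '<a href="https://myanimelist.net/anime/1">A</a>',
    "https://myanimelist.net/anime/5": "no links here",
}

found = crawl("https://myanimelist.net", lambda url: FAKE_PAGES.get(url, ""))
print(found)
```

The `seen` set is what keeps the pool from growing forever: pages link back to each other constantly, so each anime id is only queued once.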

If you understood what I’m having it do, you might ask why in the world I don’t extract the anime info using a wrapper, since I’m already on its web page?! Well, you see, by the time I was done, I realized that the process was very slow. I’ve started researching multithreading/forking in PHP so I can use that in the Scraper instead. Furthermore, I only had the crawler go through ~2,000 anime listings before I got tired of it. It proved my point: it was working. I could use it for anime that get newly added to the database or something.

I got the rest of the anime from the lists of MAL users with the most watched anime entries.

The crawler is completely CLI (command line interface). The Wrapper will be a PHP Library and the Scraper will be CLI too.

I’ll release the source code on GitHub when it’s in a presentable state (soon).

THE PLAN

Make a basic wrapper which fetches anime information (such as name, episodes, studios, producers, ratings, air dates, genres, etc.). This would be a simple wrapper for the database, which doesn’t need all the information stored on MAL anime pages.
Make a scraper with multithreading/forking that goes through the MAL anime links I’ve collected so far, fetches their data and builds my database.
Rewrite the wrapper as a complete NON-AUTHENTICATION API to fetch anything and everything about anime, manga, people, characters, etc. Basically a complete wrapper for the whole site. Then release it on GitHub, because MAL’s own API is lackluster.
Rewrite the scraper with the crawler and the wrapper as its main components. This time, the scraper will asynchronously add anime links to the pool and extract the anime information from those pages directly. This could probably be the ultimate MAL scraper.
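
As a rough idea of what the second step could look like, here’s a small sketch that fetches a list of anime links in parallel and turns each page into a JSON record. Again, this is Python for brevity and not the planned scraper itself. It uses a thread pool as a stand-in for whatever multithreading/forking approach ends up being used. The names `scrape` and `parse_anime`, the `fetch` callback and the fake pages are hypothetical, and the parser only reads the `<title>` tag because the real fields (episodes, studios and so on) depend on MAL’s markup, which isn’t covered here.

```python
import json
import re
from concurrent.futures import ThreadPoolExecutor

# Grabs whatever sits inside the page's <title> tag, trimmed of whitespace.
TITLE = re.compile(r"<title>\s*(.*?)\s*</title>", re.S)


def parse_anime(url, html):
    """Pull a couple of fields out of one page (only id and title here)."""
    anime_id = int(url.rstrip("/").split("/anime/")[1].split("/")[0])
    match = TITLE.search(html)
    return {"id": anime_id, "title": match.group(1) if match else None}


def scrape(urls, fetch, workers=4):
    """Fetch and parse every URL using a small pool of worker threads."""
    def work(url):
        return parse_anime(url, fetch(url))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as urls.
        return list(pool.map(work, urls))


# Fake pages standing in for real ones (hypothetical content).
FAKE_PAGES = {
    "https://myanimelist.net/anime/1": "<title>Example A</title>",
    "https://myanimelist.net/anime/5": "<title> Example B </title>",
}

records = scrape(list(FAKE_PAGES), FAKE_PAGES.get)
print(json.dumps(records, indent=2))
```

Since almost all of the time goes into waiting for pages to load, even a handful of workers should help a lot compared to loading one page at a time.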

That is, if I get it done.

Oh, and a sneak peek at the wrapper.

Part 2: https://irfandahir.wordpress.com/2017/05/13/making-my-own-anime-database-part-2/

SVG

It’s been used a lot over the past year and I thought “why not?”. I made some really simple SVGs, right-angled triangles that give the edges of containers a padded and – er – better look? THEY LOOK GREAT, and that’s what I believe matters? Okay. In addition, these are vector graphics and are ludicrously small in size. I’m going to use them more in future projects once I get a good grip on them.

BOOTSTRAP

The only reason I’ve started using Bootstrap is its preset configuration, glyphs and responsive grids. Oh, the responsive grids – never before have I made something without spending a lot of time on the frames.

I pat myself on the back for making it look better, but I still believe it’s got a ways to go before it reaches perfection. Upgrading the current design rather than making a completely new one was really the proper choice.

WHAT’S NEXT?

I’m planning a Twitter feed and a cool little vertical ‘timeline’ under my bio which would show which programming language/feat I achieved in which year.

I also think that the SVG triangles are a little too large? Not sure, but I’m going to play around with that.

Irfan Dahir

Coding, Web design.

Month: December 2016

Making My Own Anime Database (part 1)