WHY?
Simply put, I wanted to build a recursive web scraper/crawler, and I couldn’t find an up-to-date anime database in JSON on GitHub. So I’m making one! So what exactly are the steps to make your own anime database?
First off, you can’t be doing manual data entries. You need a web crawler. And I’m targeting MyAnimeList. Not in any bad sense, love the site. o.o
MAL has its own API but it’s terrible. You cannot retrieve anime info without 3rd party APIs and wrappers. I’ve made Stats::Extract, which extracts data from an HTML file, so this shouldn’t be too advanced for me.
THE STEPS
- Make the Crawler (done!)
- Make the Wrapper (working on it!)
- Make the Scraper (not even a single line)
In this post, I’ll focus on
MAKING THE CRAWLER
The crawler is a script that requires an entry point, a link if you will, to a web page, and from there it searches for whatever you’re looking for. In my case, I’m targeting anime (I’ll do manga too).
The entry point is: https://myanimelist.net
What I’m looking for: https://myanimelist.net/anime/{anime id}
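To make the target pattern concrete, here is a minimal sketch of pulling anime ids out of a page’s HTML with a regular expression. It’s in Python rather than the PHP the crawler is written in, and it assumes the {anime id} part is numeric. The function name `extract_anime_ids` and the sample HTML are made up for illustration; this is not the crawler’s actual code.

```python
import re

# Assumption: the {anime id} in the URL is a plain number.
ANIME_LINK = re.compile(r"https://myanimelist\.net/anime/(\d+)")


def extract_anime_ids(html):
    """Return the unique anime ids linked from a page, in order of first appearance."""
    ids = []
    for match in ANIME_LINK.finditer(html):
        anime_id = int(match.group(1))
        if anime_id not in ids:  # skip links we have already seen on this page
            ids.append(anime_id)
    return ids


# Hypothetical snippet of a page, just to exercise the function.
sample_html = (
    '<a href="https://myanimelist.net/anime/1/some-title">first</a>'
    '<a href="https://myanimelist.net/anime/20">second</a>'
    '<a href="https://myanimelist.net/manga/2">not an anime link</a>'
)
print(extract_anime_ids(sample_html))
```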
So, after crawling into the entry point, it looks for anime page links and adds them to the “queue pool”. But it doesn’t end there. It does its job as a crawler and iterates through the queue pool, loading each and every page and adding more links extracted from those pages to the pool! Now, this is a long process.
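The queue pool idea can be sketched roughly like this. Again, it’s Python rather than PHP and not the real crawler: `crawl`, `fetch` and `max_pages` are hypothetical names, and the page loader is passed in as a function so the loop can be tried without touching the network.

```python
import re
from collections import deque

# Assumption: anime pages look like https://myanimelist.net/anime/<number>.
ANIME_LINK = re.compile(r"https://myanimelist\.net/anime/\d+")


def crawl(entry_point, fetch, max_pages=100):
    """Breadth-first crawl starting at entry_point.

    fetch is any function that takes a URL and returns that page's HTML.
    Returns the anime links discovered, in the order they were queued.
    """
    queue = deque([entry_point])  # the "queue pool" of pages still to load
    queued = {entry_point}        # everything ever queued, to avoid repeats
    found = []
    pages_loaded = 0

    while queue and pages_loaded < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages_loaded += 1
        for link in ANIME_LINK.findall(html):
            if link not in queued:
                queued.add(link)
                queue.append(link)  # this page gets loaded later too
                found.append(link)
    return found
```

With a real `fetch` that downloads pages, `max_pages` is what keeps the loop from running for hours, since every page loaded can add more links to the pool.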
If you understood what I’m having it do, you might ask why in the world I don’t extract the anime info using a wrapper, since I’m already on its web page?! Well, you see, by the time I was done, I realized that the process was really slow. I’ve started researching multithreading/forking in PHP so I can use that in the Scraper instead. Furthermore, I only had the crawler go through ~2000 anime listings until I got tired of it. It proved my point: it was working. I could use it for anime that get newly added to the database or something.
I got the rest of the anime from the lists of MAL users who had the most watched anime entries.
The crawler is completely CLI (command line interface). The Wrapper will be a PHP Library and the Scraper will be CLI too.
I’ll release the source code on GitHub when it’s in a presentable state (soon).
THE PLAN
- Make a basic wrapper which fetches anime information (such as name, episodes, studios, producers, ratings, date aired, genre, etc). This would be a simple wrapper for the database which doesn’t need all the information stored on MAL anime pages.
- Make a scraper with multithreading/forking that goes through the MAL links I’ve already collected, fetches each anime’s data and builds my database.
- Re-write the wrapper as a complete NON-AUTHENTICATION API to fetch anything and everything about anime, manga, people, characters, etc. Basically a complete wrapper for the whole site. And release it on GitHub because MAL’s own API is lackluster.
- Re-write the scraper with the crawler and the wrapper as its main components. So this time, asynchronously, the scraper will add anime links to the pool and extract the anime information on those pages directly. This could probably be the ultimate MAL Scraper.
That is, if I get it done.
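For the multithreaded scraper step in the plan, here is one way the idea could look, using a thread pool. This is only a sketch in Python, not the planned PHP implementation. `scrape_all` and `fake_fetch_info` are hypothetical names, and the fake fetcher stands in for the wrapper, which would really download and parse each anime page.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def scrape_all(links, fetch_info, workers=8):
    """Run fetch_info on every link using a pool of threads.

    Results come back in the same order as the input links.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_info, links))


def fake_fetch_info(link):
    # Hypothetical stand-in for the wrapper: it only reads the id off the URL.
    anime_id = int(link.rstrip("/").split("/")[-1])
    return {"id": anime_id, "url": link}


links = ["https://myanimelist.net/anime/%d" % i for i in (1, 5, 6)]
database = scrape_all(links, fake_fetch_info)
print(json.dumps(database, indent=2))  # the JSON "database"
```

Threads suit this kind of job because most of the time is spent waiting on page loads, so several requests can be in flight at once.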
Oh, and a sneak peek at the wrapper.
Part 2: https://irfandahir.wordpress.com/2017/05/13/making-my-own-anime-database-part-2/