Making My Own Anime Database (part 1)

WHY?

Simply put, I wanted to build a recursive web scraper/crawler, and an up-to-date anime database in JSON was lacking on GitHub. So I’m building one! What exactly are the steps to make your own anime database?

First off, you can’t be doing manual data entry. You need a web crawler, and I’m targeting MyAnimeList (MAL). Not in any bad sense, I love the site. o.o

MAL has its own API but it’s terrible. You cannot retrieve anime info without third-party APIs and wrappers. I’ve made Stats::Extract, which extracts data from an HTML file, so this shouldn’t be too advanced for me.

THE STEPS

  1. Make the Crawler (done!)
  2. Make the Wrapper (working on it!)
  3. Make the Scraper (not even a single line)

In this post, I’ll focus on

MAKING THE CRAWLER

The crawler is a script that requires an entry point, a link if you will, to a web page, and from there it searches for whatever you’re looking for. In my case, I’m targeting anime (I will do manga too).

The entry point is: https://myanimelist.net

What I’m looking for: https://myanimelist.net/anime/{anime id}

So, after loading the entry point, it looks for anime page links and adds them to the “queue pool”. But it doesn’t end there. It does its job as a crawler and iterates through the queue pool, loading each page and adding any new links extracted from those pages to the pool! Now, this is a long process.
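To make the queue pool idea concrete, here is a minimal sketch in Python (my own crawler is written in PHP, so this is an illustration and not my actual code). The names `extract_anime_links`, `crawl`, `fetch` and `max_pages` are made up for this example, and `fetch` is assumed to be any function that takes a URL and returns the page HTML.

```python
import re
from collections import deque

# Matches links of the form /anime/{anime id}, relative or absolute.
ANIME_LINK = re.compile(r'href="(?:https://myanimelist\.net)?/anime/(\d+)[^"]*"')


def extract_anime_links(html, base_url="https://myanimelist.net"):
    """Return canonical anime page URLs found in an HTML string."""
    return [f"{base_url}/anime/{anime_id}" for anime_id in ANIME_LINK.findall(html)]


def crawl(entry_point, fetch, max_pages=50):
    """Breadth-first crawl starting at entry_point.

    fetch is a hypothetical callable: fetch(url) -> HTML string.
    max_pages caps how many pages are loaded so the crawl terminates.
    Returns the set of anime page URLs discovered.
    """
    queue = deque([entry_point])  # the "queue pool"
    seen = {entry_point}          # URLs already queued, to avoid loops
    found = set()
    pages_loaded = 0

    while queue and pages_loaded < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages_loaded += 1
        for link in extract_anime_links(html):
            found.add(link)
            if link not in seen:
                seen.add(link)
                queue.append(link)  # newly found pages get crawled too
    return found
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client, and the `seen` set is what stops the crawler from loading the same page twice.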

If you understood what I’m having it do, you might ask why in the world I don’t extract the anime info using a wrapper since I’m already on its web page?! Well, you see, by the time I was done, I realized that the process was so slow. I’ve started researching multithreading/forking in PHP so I can utilize that in the Scraper instead. Furthermore, I only had the crawler go through ~2000 anime listings until I got tired of it. It proved my point: it was working. I could use it for anime that get newly added to the database or something.

I got the rest of the anime from the MAL users who had the most watched anime entries.

The crawler is completely CLI (command line interface). The Wrapper will be a PHP Library and the Scraper will be CLI too.

I’ll release the source code on GitHub when it’s in a presentable state (soon).

THE PLAN

  1. Make a basic wrapper which fetches anime information (such as name, episodes, studios, producers, ratings, date aired, genre, etc). This would be a simple wrapper for the database, which doesn’t need all the information stored on MAL anime pages.
  2. Make a scraper with multithreading/forking that uses the list of MAL anime links I have right now to fetch each anime’s data and build my database.
  3. Rewrite the wrapper as a complete NON-AUTHENTICATION API to fetch everything about anime, manga, people, characters, etc. Basically a complete wrapper for the whole site. And release it on GitHub because MAL’s own API is lackluster.
  4. Rewrite the scraper with the crawler and the wrapper as its main components. So this time, asynchronously, the scraper will add anime links to the pool and extract the anime information on those pages directly. This could probably be the ultimate MAL Scraper.
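As a rough idea of what step 2 could look like, here is a small Python sketch using a thread pool (again, the real thing is planned in PHP, so treat this as an illustration only). `scrape_all`, `fetch`, `parse` and `workers` are hypothetical names: `fetch(url)` stands in for the HTTP client and `parse(html)` stands in for the wrapper.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(urls, fetch, parse, workers=8):
    """Fetch and parse many pages concurrently.

    fetch(url) -> HTML string and parse(html) -> dict are hypothetical
    callables supplied by the caller.
    Returns a dict mapping each URL to its parsed record.
    """
    def work(url):
        # Each worker thread loads one page and extracts its data.
        return url, parse(fetch(url))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(pool.map(work, urls))
```

Threads suit this job because most of the time is spent waiting on the network, not on parsing. A polite scraper would also keep `workers` small and add delays between requests.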

That is, if I get it done.

Oh, and a sneak peek at the wrapper.

[Image: wrapper]

Part 2: https://irfandahir.wordpress.com/2017/05/13/making-my-own-anime-database-part-2/

6 thoughts on “Making My Own Anime Database (part 1)”

    1. Thanks! The database here can be in JSON format, since it’s quick and lightweight. There isn’t any open-source anime database available online (AFAIK), so an up-to-date one that’s easily accessible without any API usage could really be beneficial to developers. Imagine a repository holding an anime database that’s constantly being updated through these crawlers.

      An open-source database can be applied to a lot of stuff, such as your own version of MAL or AniDB as a mobile app or as a stand-alone website. All you’d have to do is fetch it from a repository available online and you’re set to go with thousands of titles. Right now I’ve managed to download ~3000 titles (and the basic details required for any anime database), and it’s just 2.8MB. That’s equivalent to going to ~3000 anime pages and taking out the juicy parts.

      There’s more to come! That was simply Day 1. I’m nearly finished with the API (wrapper) which takes the data out of the pages (and stores it). I will document that soon.

      1. Wow! Only 2.8 MB. But if my understanding is right, inserting anime covers/posters into the database would make it take up more space, wouldn’t it?

        I think this is a personal database, but if you don’t mind, I’d like to see your finished work (maybe on YouTube or some other media). :3

  1. I will include media in future posts! 🙂 I ended up cutting this post quite short because I simply needed to document it to get myself going.
    I didn’t include the cover/image, but I will be including links to the images in the final one, with the option to download them as well.

    This is what it can currently retrieve from any anime/manga page: title, synonym title, Japanese title, type (TV, movie, OVA, etc), episodes, status (completed, airing), aired (date), premiered (date), broadcast (day, e.g. on Mondays), producers, licensors, studios (White Fox, Ghibli, etc), source (adaptation source), genre, duration (per ep), rating, score, ranking, popularity, members, favorites and, last but not least, the synopsis.

    When I’m done with the API, it should be able to retrieve all of these (http://imgur.com/a/w7ph8) and probably a lot more, including characters, people and search results. It might also include a wrapper for the authentication features of MAL. (I simply have way too much time on my hands right now)

    But I’m going to be using the API to just get the basic details of any anime to accumulate them as a “database”, such as the information it’s currently able to get (minus the popularity, members, etc. But I’ll keep the ratings so I’d be able to compare them with other sources as well).
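    To give an idea of what one entry in such a JSON “database” might look like, here is a small Python sketch. The class `AnimeRecord`, its field names and the `to_json` helper are assumptions made up for this example, not the actual schema, and only some of the fields listed above are shown.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class AnimeRecord:
    """One database entry. Field names are illustrative, not a real schema."""
    mal_id: int
    title: str
    type: str = ""
    episodes: Optional[int] = None   # None when the count is unknown
    status: str = ""
    aired: str = ""
    studios: List[str] = field(default_factory=list)
    producers: List[str] = field(default_factory=list)
    genres: List[str] = field(default_factory=list)
    rating: str = ""
    score: Optional[float] = None
    synopsis: str = ""


def to_json(records):
    """Serialize records into one JSON document keyed by MAL id."""
    by_id = {str(r.mal_id): asdict(r) for r in records}
    return json.dumps(by_id, ensure_ascii=False, indent=2)
```

    Keying the document by id makes it easy to look up or update a single title without scanning the whole file.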
